RO-ViT: Region-aware pre-training for open-vocabulary object detection with vision transformers

Region-aware image-text pre-training
- There is a mismatch between how positional embeddings are used in existing contrastive pre-training approaches and how they are needed in open-vocabulary detection.
- Pre-training approaches use whole-image positional embeddings, but open-vocabulary detection requires embeddings that generalize to unseen image regions.
- We propose Cropped Positional Embedding (CPE), which randomly crops and resizes a region of the positional embeddings during pre-training instead of using the whole-image positional embedding, so the embeddings better match the region crops encountered at detection time.
- We also use a focal loss in the contrastive objective, which puts more weight on hard examples than the common softmax cross-entropy.
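The two pre-training changes above can be sketched as follows. This is a minimal, dependency-free illustration, not the paper's implementation: the function and parameter names (`up_size`, `out_size`, `gamma`) are ours, the resizing is nearest-neighbor for brevity, and the focal loss shown is the standard sigmoid focal loss applied element-wise to a logit matrix.

```python
import numpy as np

def cropped_positional_embedding(pos_emb, up_size, out_size, rng):
    """Sketch of Cropped Positional Embedding (CPE).

    pos_emb: (H, W, D) whole-image positional embedding grid.
    Steps: upsample the grid, randomly crop a region, then resize the
    crop back to the pre-training grid size. Nearest-neighbor resizing
    keeps the sketch self-contained.
    """
    H, W, D = pos_emb.shape
    # 1. Upsample via nearest-neighbor index mapping.
    ys = (np.arange(up_size) * H / up_size).astype(int)
    xs = (np.arange(up_size) * W / up_size).astype(int)
    up = pos_emb[ys][:, xs]
    # 2. Sample a random crop (crop-size distribution simplified here).
    ch = int(rng.integers(out_size, up_size + 1))
    cw = int(rng.integers(out_size, up_size + 1))
    y0 = int(rng.integers(0, up_size - ch + 1))
    x0 = int(rng.integers(0, up_size - cw + 1))
    crop = up[y0:y0 + ch, x0:x0 + cw]
    # 3. Resize the crop back to the pre-training grid size.
    ys2 = (np.arange(out_size) * ch / out_size).astype(int)
    xs2 = (np.arange(out_size) * cw / out_size).astype(int)
    return crop[ys2][:, xs2]

def sigmoid_focal_loss(logits, targets, gamma=2.0):
    """Standard sigmoid focal loss: down-weights easy (well-classified)
    examples by (1 - p_t)**gamma so training focuses on hard ones."""
    p = 1.0 / (1.0 + np.exp(-logits))
    ce = -(targets * np.log(p + 1e-9) + (1 - targets) * np.log(1 - p + 1e-9))
    pt = targets * p + (1 - targets) * (1 - p)
    return np.mean((1.0 - pt) ** gamma * ce)
```

With `gamma=0` the focal loss reduces to plain cross-entropy, which makes the down-weighting of easy examples easy to verify.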
Open-vocabulary detector fine-tuning
- An open-vocabulary detector is trained with detection labels of base categories but needs to detect both base and novel (unlabeled) categories at test time.
- At test time, we append the text embeddings of the novel categories to those of the base categories and compute detection scores over the union of both sets.
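The scoring step above can be sketched as below: each detected region's embedding is compared (by cosine similarity) against the text embeddings of the union of base and novel categories, and the similarities are normalized into per-region class scores. The function name and the temperature value are illustrative assumptions, not from the paper.

```python
import numpy as np

def open_vocab_scores(region_embs, base_text_embs, novel_text_embs,
                      temperature=0.01):
    """Sketch of open-vocabulary detection scoring.

    region_embs:     (R, D) embeddings of detected regions.
    base_text_embs:  (B, D) text embeddings of base categories.
    novel_text_embs: (N, D) text embeddings of novel categories.
    Returns (R, B + N) class probabilities per region.
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # Union of base and novel category text embeddings.
    text = l2norm(np.concatenate([base_text_embs, novel_text_embs], axis=0))
    regions = l2norm(region_embs)
    # Cosine similarities, scaled by a temperature (value is illustrative).
    logits = regions @ text.T / temperature
    # Numerically stable softmax over the category axis.
    logits -= logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)
```

Because the detector never sees labels for novel categories, only the shared image-text embedding space learned in pre-training makes these novel-category scores meaningful.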
Results
- RO-ViT outperforms state-of-the-art ViT-based and CNN-based methods on the LVIS open-vocabulary detection benchmark.
- We also evaluate zero-shot image-text retrieval on the MS COCO and Flickr30K benchmarks, comparing against dual-encoder methods.
- Visualizations of the learned positional embeddings illustrate the effectiveness of the representation learned with CPE.