RO-ViT: Region-aware pre-training for open-vocabulary object detection with vision transformers

Region-aware image-text pre-training
- There is a mismatch between how positional embeddings are used in existing contrastive pre-training approaches and how they are needed in open-vocabulary detection.
- Pre-training approaches use whole-image positional embeddings, but open-vocabulary detection requires embeddings that generalize to unseen image regions.
- We propose Cropped Positional Embedding (CPE), which randomly crops and resizes a region of the positional embeddings during pre-training instead of using the whole-image positional embedding, so the embeddings better match the region crops encountered at detection time.
- We also use a focal loss in the contrastive objective, which puts more weight on hard examples than the common softmax cross-entropy.
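The two pre-training changes above can be sketched as follows. This is a minimal, dependency-free illustration, not the paper's implementation: the function and parameter names (`up_size`, `out_size`, `gamma`) are ours, the resizing is nearest-neighbor for brevity, and the focal loss shown is the standard sigmoid focal loss applied element-wise to a logit matrix.

```python
import numpy as np

def cropped_positional_embedding(pos_emb, up_size, out_size, rng):
    """Sketch of Cropped Positional Embedding (CPE).

    pos_emb: (H, W, D) whole-image positional embedding grid.
    Steps: upsample the grid, randomly crop a region, then resize the
    crop back to the pre-training grid size. Nearest-neighbor resizing
    keeps the sketch self-contained.
    """
    H, W, D = pos_emb.shape
    # 1. Upsample via nearest-neighbor index mapping.
    ys = (np.arange(up_size) * H / up_size).astype(int)
    xs = (np.arange(up_size) * W / up_size).astype(int)
    up = pos_emb[ys][:, xs]
    # 2. Sample a random crop (crop-size distribution simplified here).
    ch = int(rng.integers(out_size, up_size + 1))
    cw = int(rng.integers(out_size, up_size + 1))
    y0 = int(rng.integers(0, up_size - ch + 1))
    x0 = int(rng.integers(0, up_size - cw + 1))
    crop = up[y0:y0 + ch, x0:x0 + cw]
    # 3. Resize the crop back to the pre-training grid size.
    ys2 = (np.arange(out_size) * ch / out_size).astype(int)
    xs2 = (np.arange(out_size) * cw / out_size).astype(int)
    return crop[ys2][:, xs2]

def sigmoid_focal_loss(logits, targets, gamma=2.0):
    """Standard sigmoid focal loss: down-weights easy (well-classified)
    examples by (1 - p_t)**gamma so training focuses on hard ones."""
    p = 1.0 / (1.0 + np.exp(-logits))
    ce = -(targets * np.log(p + 1e-9) + (1 - targets) * np.log(1 - p + 1e-9))
    pt = targets * p + (1 - targets) * (1 - p)
    return np.mean((1.0 - pt) ** gamma * ce)
```

With `gamma=0` the focal loss reduces to plain cross-entropy, which makes the down-weighting of easy examples easy to verify.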
Open-vocabulary detector fine-tuning
- An open-vocabulary detector is trained with detection labels of base categories but needs to detect both base and novel (unlabeled) categories at test time.
- At test time, we append the text embeddings of the novel categories to those of the base categories and compute detection scores over the union of both sets.
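The scoring step above can be sketched as below: each detected region's embedding is compared (by cosine similarity) against the text embeddings of the union of base and novel categories, and the similarities are normalized into per-region class scores. The function name and the temperature value are illustrative assumptions, not from the paper.

```python
import numpy as np

def open_vocab_scores(region_embs, base_text_embs, novel_text_embs,
                      temperature=0.01):
    """Sketch of open-vocabulary detection scoring.

    region_embs:     (R, D) embeddings of detected regions.
    base_text_embs:  (B, D) text embeddings of base categories.
    novel_text_embs: (N, D) text embeddings of novel categories.
    Returns (R, B + N) class probabilities per region.
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    # Union of base and novel category text embeddings.
    text = l2norm(np.concatenate([base_text_embs, novel_text_embs], axis=0))
    regions = l2norm(region_embs)
    # Cosine similarities, scaled by a temperature (value is illustrative).
    logits = regions @ text.T / temperature
    # Numerically stable softmax over the category axis.
    logits -= logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)
```

Because the detector never sees labels for novel categories, only the shared image-text embedding space learned in pre-training makes these novel-category scores meaningful.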
Results
- RO-ViT outperforms state-of-the-art ViT-based and CNN-based methods on the LVIS open-vocabulary detection benchmark.
- We also evaluate zero-shot image-text retrieval on the MS COCO and Flickr30K benchmarks, comparing against dual-encoder methods.
- Visualizations of the learned positional embeddings illustrate the effectiveness of the representation learned with CPE.