Open Vocabulary Object Detection with Pseudo Bounding-Box Labels: Towards a Universal Object Detector


AUTHORS: Chen Xing, Mingfei Gao, Donald Rose

TL;DR: Most AI object detection methods work only on limited object categories, due to the human effort required for bounding-box annotations of training data. We developed a new method that automatically generates pseudo bounding-box annotations of diverse objects from large-scale image-caption pairs, removing the bottleneck caused by the need for human labeling. Experimental results, detailed in our research paper (accepted by ECCV 2022), show that our method outperforms the SOTA open vocabulary object detector. The Big Picture result: our method’s AI-generated pseudo bounding-box labels, plus its strong generalization performance on novel datasets, bring us closer to our dream of a universal object detector.


For a review of some terms and definitions used in this blog, please see our Glossary.

Object detection is a core task in computer vision that has been considerably advanced with the adoption of deep learning, and continues to attract significant research effort. Given an image as input, a trained object detector outputs the object names and bounding boxes for objects that it is trained to recognize. (Bounding boxes are the tight and accurate boxes around the target objects.) Object detectors usually require humans to draw such accurate boxes around target objects as training data for localizing the target objects. Figure 1 shows example output of an object detector, with labeled bounding boxes for several recognized objects in the input image.

Current AI object-detection methods achieve extraordinary performance when learning a predefined set of object categories that have been annotated in a large number of training images (as in datasets such as PASCAL VOC and COCO).

Figure 1: Example output of an object detector given an input image. (Image from

Limitations of Existing AI Object-Detection Methods

Unfortunately, the success of these systems is still limited to detecting a small number of object categories (for instance, 80 categories in COCO). One reason for this: most detection methods rely on supervision in the form of instance-level bounding-box annotations, requiring expensive human labeling efforts to build training datasets. Furthermore, to detect objects from a new category, one has to annotate a large number of bounding boxes in images for that category.

While zero-shot object detection and open-vocabulary object detection (OVD) methods have been proposed to improve generalization on novel object categories, the potential of such methods is constrained by the small size of the base category set (categories with human-provided bounding-box labels) at training, due to the high cost of acquiring large-scale bounding-box annotations of diverse objects. As a result, it’s still challenging for them to generalize well to diverse objects of novel categories in practice.

In short, the requirement for human annotation of objects in training data has been a bottleneck on the road towards true OVD, and the ultimate goal of a universal object-detection system.

Ideas to Improve Existing OVD Methods

One potential avenue for improvement is to enable OVD models to utilize a larger set of base classes of diverse objects by reducing the requirement of manual annotations. Hence, we ask:

  • Can we automatically generate bounding-box annotations for objects at scale using existing resources?
  • Can we use these AI-generated annotations to improve open vocabulary detection?

The most recent progress on vision-language pre-training gives us hope. Vision-language models are pre-trained with large-scale image-caption pairs from the web. They show amazing zero-shot performance on image classification, as well as promising results on tasks related to word-region alignment (aligning text with specific regions of a given image), such as referring expressions comprehension (finding a target object in an image described by a natural language expression). This implies strong localization ability: given an image and the name of an object in that image, the vision-language models can localize where the objects are, without the help of any extra supervision.

Our Solution: OVD with Pseudo Bounding-Box Labels

Motivated by these observations, we developed an AI model that performs pseudo bounding-box open-vocabulary object detection. ("Pseudo" refers to annotations or labels that are not created by humans but rather artificially generated by an AI model.) Our work improves OVD by using pseudo bounding-box annotations generated from large-scale image-caption pairs, taking advantage of the localization ability of pre-trained vision-language models. While the pseudo bounding-box annotations generated by our model are different from traditional human-made labels, they achieve state-of-the-art (SOTA) performance results (described later), and may potentially provide substantial time and cost savings by reducing or eliminating the human effort typically required to create these labels.

Figure 2 gives a quick overview of our method (on the right) versus previous methods (on the left). The latter depend on human annotations of predefined base classes during training, and try to generalize to objects of novel classes during the inference stage. In contrast, we designed a pseudo bounding-box label generation strategy, using pre-trained vision-language models to automatically obtain box annotations of a diverse set of objects from existing image-caption datasets. Then we use these pseudo labels to train our open vocabulary object detector to make it generalizable to more diverse objects. Note the bigger cylinders on the right, indicating that our method results in accurate detection of a far greater number of novel object classes.

Figure 2. Previous methods (left) rely on human-provided box-level annotations of predefined base classes during training and attempt to generalize to objects of novel classes during inference. Our method (right) generates pseudo bounding-box annotations from large-scale image-caption pairs by leveraging the localization ability of pre-trained vision-language models. Then, we utilize the pseudo bounding-box annotations to improve our open vocabulary object detector.

Helps Solve Real-World Problems

Our method also makes it possible to achieve true OVD, a system that can recognize countless diverse real-world objects, because pseudo labels of rare objects can be automatically and efficiently generated as long as we have existing image-caption pairs that cover such objects. By removing the bottleneck of requiring human-created annotations, we can at last make great strides along the path to developing a universal object-detection system.

With such a universal object detector available as a backbone, solutions to many vision-related real-world problems can be greatly enhanced, including robot navigation, autopilot, and intelligent transportation.

Deep Dive: How Our Method Works

Our framework contains two components: a pseudo bounding-box label generator and an open vocabulary object detector. Our pseudo label generator automatically generates bounding-box labels for a diverse set of objects by leveraging a pre-trained vision-language model. We then train our detector directly with the generated pseudo labels.

Label Generation: Pseudo Bounding-Box Labels via AI

Figure 3 illustrates the overall procedure of our pseudo label generation.  Our goal is to generate pseudo bounding-box annotations for objects of interest in an image, by leveraging the implicit alignment between regions in the image and words in its corresponding caption in a pre-trained vision-language model.

Figure 3. Illustration of our pseudo bounding-box annotation generation process.

The input to the system is an image-caption pair. We use image and text encoders to extract the visual and text embeddings of the image and its corresponding caption. We then obtain multi-modal features by image-text interaction via cross-attention. We maintain objects of interest in our predefined object vocabulary. For each object of interest mentioned in the caption (for example, racket in this figure), we use Grad-CAM to visualize its activation map in the image. This map indicates the contribution of the image regions to the final representation of the object word. Finally, we determine the pseudo bounding-box label of the object by selecting the object proposal that has the largest overlap with the activation map.
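
To make the final selection step concrete, here is a minimal sketch in Python with NumPy. It assumes the Grad-CAM activation map has already been resized to image resolution and that object proposals come from some external proposal generator. The function name `select_pseudo_box` and the overlap score (activation summed inside the box, divided by the square root of the box area) are illustrative assumptions, not necessarily the exact scoring used in our paper.

```python
import numpy as np


def select_pseudo_box(activation, proposals):
    """Return the proposal that best covers the activation map.

    activation: 2-D array (H, W) of non-negative activation values.
    proposals:  iterable of (x1, y1, x2, y2) integer pixel boxes,
                with x2 and y2 exclusive.
    """
    best_box, best_score = None, -np.inf
    for x1, y1, x2, y2 in proposals:
        area = (x2 - x1) * (y2 - y1)
        if area <= 0:
            continue  # skip degenerate boxes
        covered = activation[y1:y2, x1:x2].sum()
        # Assumed score: reward covered activation, penalize oversized boxes.
        score = covered / np.sqrt(area)
        if score > best_score:
            best_box, best_score = (x1, y1, x2, y2), score
    return best_box, float(best_score)


# Toy example: one 3x3 activated region and three candidate proposals.
activation = np.zeros((8, 8))
activation[2:5, 3:6] = 1.0
proposals = [(0, 0, 8, 8), (3, 2, 6, 5), (0, 0, 2, 2)]
box, score = select_pseudo_box(activation, proposals)
```

In this toy case the tight proposal wins over the full-image proposal, because the area normalization penalizes boxes that cover the activation only by being large.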

Figure 4 shows some examples of the activation maps, which demonstrate that the activated regions correspond well with the relevant regions. The generated bounding boxes are of good quality. When they are directly used to train an open vocabulary object detector, the object detector significantly outperforms the current SOTA open-vocabulary/zero-shot object detectors.

Figure 4. Visualization of some activation maps. Colorful blocks indicate values of Grad-CAM activation maps in the corresponding regions. We zero out blocks with values smaller than half of the max value in the map, so the main focus is highlighted. Black boxes indicate object proposals and red boxes indicate the final selected pseudo bounding-box labels.
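
The thresholding described in the Figure 4 caption can be sketched in a few lines of Python with NumPy. The function name `keep_strong_activations` and the `ratio` parameter are hypothetical names used only for illustration.

```python
import numpy as np


def keep_strong_activations(activation, ratio=0.5):
    """Zero out entries smaller than `ratio` times the map's maximum."""
    activation = np.asarray(activation, dtype=float)
    threshold = ratio * activation.max()
    return np.where(activation >= threshold, activation, 0.0)


# Toy 2x2 activation map: only values of at least half the max survive.
cam = np.array([[0.1, 0.4],
                [0.5, 1.0]])
masked = keep_strong_activations(cam)
```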

Detector Training: OVD Learning with Pseudo Labels

After we get pseudo bounding-box labels, we can use them to train an open vocabulary object detector. Since our pseudo-label generation is disentangled from the detector training process, our framework can accommodate detectors with any architecture. In this work, we focus on the open vocabulary scenario where a detector aims at detecting arbitrary objects during inference.

In our detector, an image is processed by a feature extractor followed by a region proposal network. Region-based features are then calculated by applying RoI pooling/RoI align over region proposals, and the corresponding visual embeddings are obtained. During training, we encourage the visual and text embeddings of the same object to be similar.
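
One common way to encourage this kind of similarity is to score each region embedding against the text embedding of every vocabulary word and apply a softmax cross-entropy loss. The sketch below shows that idea in Python with NumPy. It is an assumption-laden illustration rather than our exact training objective: the function names, the `temperature` value, and the absence of a background class are all simplifications.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def region_text_loss(region_emb, text_emb, labels, temperature=0.1):
    """Cross-entropy over cosine similarities between regions and class names.

    region_emb: (num_regions, dim) visual embeddings of region proposals.
    text_emb:   (num_classes, dim) text embeddings of vocabulary words.
    labels:     (num_regions,) integer index of each region's class.
    """
    logits = l2_normalize(region_emb) @ l2_normalize(text_emb).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(len(labels)), labels].mean()
    return float(loss), logits.argmax(axis=1)


# Toy example: two classes, two regions that each lean toward one class.
text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
region_emb = np.array([[0.9, 0.1], [0.2, 0.8]])
labels = np.array([0, 1])
loss, pred = region_text_loss(region_emb, text_emb, labels)
```

Because classification happens by comparing against text embeddings, new categories can be added at inference time simply by embedding their names, which is what makes this style of detector open vocabulary.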

Figure 5. Illustration of our detector.

Performance Results

Outperforms Existing OVD Methods, Improves SOTA

Since our method for generating pseudo bounding-box labels is fully automated, with no manual intervention, the size and diversity of the training data (including the number of training object categories) can be greatly increased. This enables our approach to outperform existing zero-shot/open vocabulary detection methods that are trained with a limited set of base categories.

We evaluate the effectiveness of our method by comparing it with the SOTA zero-shot and open vocabulary object detectors on four widely used datasets: COCO, PASCAL VOC, Objects365, and LVIS. Experimental results show that our method outperforms the best open vocabulary detection method by 8% AP (Average Precision, or accuracy, when detecting objects) on novel objects on COCO, when both of the methods are fine-tuned with COCO base categories. This means that significantly more objects are accurately detected. Surprisingly, we also found that even when not fine-tuned with COCO base categories, our method can still outperform the fine-tuned SOTA baseline by 3% AP.

We also evaluate the generalization performance of our method on datasets other than COCO (results to be presented at ECCV 2022). Experimental results show that under this setting, our method outperforms existing approaches by 6.3%, 2.3%, and 2.8% on the PASCAL VOC, Objects365, and LVIS datasets, respectively.

A Closer Look

To get a more precise picture of our method’s improvements, we quantitatively compared it with existing baselines. We measured AP -- Average Precision (accuracy) when detecting objects -- for our method versus baselines on validation datasets. A validation dataset is usually used to provide an unbiased evaluation of a model fit on the training dataset. In our evaluation, since the formal test set is not available, we follow previous work to evaluate our method on the COCO validation dataset.
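
For readers unfamiliar with AP, the sketch below shows a simplified version of the computation in Python with NumPy, for a single class in a single image. It is only an illustration: the official COCO evaluation aggregates over all images, uses interpolated precision, and applies its own matching rules, so this code will not reproduce benchmark numbers. The function names and the default `iou_thresh` value are assumptions for this example.

```python
import numpy as np


def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, gt_boxes, iou_thresh=0.5):
    """Non-interpolated AP for a list of (score, box) detections."""
    if not gt_boxes:
        return 0.0
    matched, hits = set(), []
    # Visit detections from most to least confident.
    for _, box in sorted(detections, key=lambda d: -d[0]):
        ious = [box_iou(box, g) for g in gt_boxes]
        best = int(np.argmax(ious))
        if ious[best] >= iou_thresh and best not in matched:
            matched.add(best)
            hits.append(1)  # true positive
        else:
            hits.append(0)  # false positive
    hits = np.array(hits)
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision at each rank where a ground-truth box is found.
    return float((precision * hits).sum() / len(gt_boxes))


# Toy example: two ground-truth boxes, three detections (one is a miss).
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 30))]
ap = average_precision(dets, gt)
```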

Table 1 shows our model’s performance when fine-tuned – and not fine-tuned – with COCO base categories. Fine-tuning with COCO base categories means that after our model is trained with our pseudo bounding-boxes, we further train our detector with the human-provided bounding box labels of COCO base categories, which is the same procedure as our baselines.

After our model is fine-tuned using COCO base categories, our method pre-trained with pseudo labels outperforms our strongest baseline (Zareian et al.) by 8% AP on novel categories. When not fine-tuned using COCO base categories and only trained with generated pseudo labels, our method achieves 25.8% AP on novel categories, which still outperforms the SOTA method (Zareian et al.) by 3% AP.

The ability to generalize to a wide range of datasets is also important for an open vocabulary object detector, since it makes a detector directly usable as an out-of-the-box method in the wild. Table 2 shows the generalization performance of detectors on different datasets, where neither our method nor our baseline is trained using these datasets. Since Objects365 and LVIS have a large set of diverse object categories, evaluation results on these datasets are more representative of generalization ability.

Results show that our method achieves better performance than Zareian et al. on all three datasets when both of the methods are fine-tuned with COCO base categories. When not fine-tuned with COCO base categories, our method still outperforms Zareian et al. (fine-tuned with COCO base categories) on the Objects365 and LVIS datasets.

Beyond the quantitative results, we also present a case study in Figure 6 to show some predicted bounding boxes provided by our model.

Figure 6. Some example results from our open vocabulary detector. The categories shown here are from novel categories in COCO.

The Big Picture: Research and Societal Impacts

We believe our work will positively influence OVD research going forward. By automating the process of creating bounding-box annotations of training data, we have taken an important step towards reducing – and perhaps eventually eliminating – the need for human time and effort during the annotation process, and the expense that entails.

While we feel its overall impact will be strongly positive, our method does present some potential for negative impact. Since our pseudo label generator mines annotations of objects from input captions without human intervention, our pseudo labels might be biased, due to any bias that may be embedded in the vision-language model and/or the image-caption pairs. Manually filtering out biased image-caption data samples and vocabulary object names are two possible solutions for this potential issue. However, because human annotators will likely introduce their own inherent biases during the annotation process, automating the annotation process does not necessarily lead to more biased results. Future work could evaluate the bias of automated versus manual annotations.

The Bottom Line


Key takeaways and results:

  • New approach to bounding-box label generation: We developed a new framework that trains an open vocabulary object detector with pseudo bounding-box labels that are automatically generated from large-scale image-caption pairs.
  • Reduced need for expensive human labeling: By generating these pseudo bounding-box labels, our method reduces the need for (expensive) human-labeling efforts.
  • Expanding the label-space: In effect, we expanded the space of potential labels to include either "real" (human-made) annotations or "pseudo" (AI-made) annotations.
  • Proof of concept that’s generalizable: Our method not only shows that it’s possible to automatically and efficiently generate pseudo bounding-box labels of good quality for various objects, but also exhibits strong generalization performance on novel objects.
  • Optimal OVD: The excellent performance results we found (beating existing SOTA methods) indicate that our approach could make it possible to achieve true OVD – a system that can recognize diverse and countless real-world objects, without the limits faced by other OVD methods that are dependent on human annotations or lack generalization power.

Future directions and goals:

  • Better pre-training: We intend to further improve the quality of pseudo bounding-box labels by designing better pre-training strategies for vision-language models to promote their localization capabilities.
  • Towards a universal object detector + real-world benefits: With its AI-generated pseudo bounding-box labels (removing the bottleneck of requiring human-created annotations), and its solid generalization performance, our method brings us closer to our dream of a universal object detector. In future work, we will aim to train a universal object detector with pseudo labels to detect a broader range of real-world objects, and apply it to improve solutions for several vision-related real-world problems.

Explore More

Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post (links below). Connect with us on social media and our website to get regular updates on this and other research projects.

About the Authors

Chen Xing is a Research Scientist at Salesforce Research in Palo Alto, CA. She received her Ph.D. degree from Nankai University, China. Her work lies in the domain of “smart applications” of large pre-trained models: she is passionate about effectively pre-training large language or vision-language models in an unsupervised manner, and applying them intelligently to benefit downstream tasks.

During this work, Mingfei Gao was a Senior Research Scientist at Salesforce Research in Palo Alto, CA. She received her Ph.D. degree from the Computer Science Department at the University of Maryland College Park. Her research interests include 3D scene understanding, object detection, vision and language, action recognition, weakly/self-supervised learning, and multimodal learning. Mingfei is now a Senior Applied Scientist at Apple in Sunnyvale, CA.

Donald Rose is a Technical Writer at Salesforce AI Research. He earned his Ph.D. degree in Information and Computer Science at the University of California, Irvine. Specializing in technical content creation and editing, Dr. Rose works on multiple projects — including blog posts, video scripts, newsletters, media/PR material, social media, and writing workshops. His passions include helping researchers transform their work into publications geared towards a wider audience, leveraging existing content in multiple media modes, and writing think pieces about AI.

Glossary


  • Annotation: A short description (for example, a concept name) assigned to data. (Like the word “dog” assigned to an image of a dog.) AI models use annotated data during training to learn how to recognize similar patterns when presented with new data. Another term for Label.
  • Bounding box: A tight and accurate box around a target object; a box drawn in an image that’s intended to capture the target concept. It bounds (surrounds) the target concept (object). A bounding box can be input data for training, when an AI model is trying to learn that concept – or an output prediction made by the AI model.
  • Bounding-box label/annotation: A label created and assigned (typically by a human) to a bounding box.
  • Ground-truth bounding box: A bounding box that’s labeled by hand on data that will be used for training and testing; hence, it is considered an optimally-correct (true) box that bounds (surrounds) the target concept (object) with perfect or near-perfect accuracy.
  • Label: Another term for Annotation (see “Annotation” definition).
  • Localization: Determining where an object is located in an image. Given an image and the name of an object in that image, vision-language models can localize the object without the help of any extra supervision.
  • Open-vocabulary object detection or Open-vocabulary detection (OVD): In its ultimate realization, one could say this is a “Holy Grail” of AI object detection: the ability to go beyond a limited set of object categories (learned during training) when detecting new objects in real-world images (test data). A truly universal object detector would be fully open, able to recognize any object (correctly assign any label) in any image – as opposed to most current systems, which are “closed” (limited in how many types of objects they can detect).
  • Pseudo bounding-box label/annotation: A label/annotation created and assigned by an AI model (instead of a human) to a bounding box.
  • Referring expressions comprehension: Finding a target object in an image, given a natural language description – in other words, a task that involves localizing the region within an image that is described or referred to by a natural language expression.
  • Validation dataset: Typically used to provide an unbiased evaluation of a model fit on the training dataset. In our evaluation, since the formal test set is not available, we follow previous work to evaluate our method on the COCO validation dataset.