Authors: Chen Xing, Mingfei Gao, Donald Rose
TL;DR: Most AI object-detection methods work on only a limited set of object categories, due to the human effort required for bounding-box annotations of training data. We developed a new method that automatically generates pseudo bounding-box annotations of diverse objects from large-scale image-caption pairs, removing the bottleneck caused by the need for human labeling. Experimental results, detailed in our research paper (accepted by ECCV 2022), show that our method outperforms the SOTA open vocabulary object detector. The Big Picture result: our method’s AI-generated pseudo bounding-box labels, plus its strong generalization performance on novel datasets, bring us closer to our dream of a universal object detector.
For a review of some terms and definitions used in this blog, please see our Glossary.
Object detection is a core task in computer vision that has advanced considerably with the adoption of deep learning, and it continues to attract significant research effort. Given an image as input, a trained object detector outputs the names and bounding boxes (tight, accurate boxes around the target objects) of the objects it is trained to recognize. To learn this localization, object detectors usually require humans to draw such boxes around target objects in the training data. Figure 1 shows example output of an object detector, with labeled bounding boxes for several recognized objects in the input image.
Current AI object-detection methods achieve extraordinary performance when learning a predefined set of object categories that have been annotated in a large number of training images (such as PASCAL VOC and COCO).
Figure 1: Example output of an object detector given an input image. (Image from https://dagshub.com/blog/yolov6/.)
Unfortunately, the success of these systems is still limited to detecting a small number of object categories (for instance, 80 categories in COCO). One reason is that most detection methods rely on supervision in the form of instance-level bounding-box annotations, which require expensive human labeling effort to build training datasets. Furthermore, to detect objects from a new category, one has to annotate a large number of bounding boxes in images for that category.
While zero-shot object detection and open-vocabulary object detection (OVD) methods have been proposed to improve generalization on novel object categories, the potential of such methods is constrained by the small size of the base category set (categories with human-provided bounding-box labels) at training, due to the high cost of acquiring large-scale bounding-box annotations of diverse objects. As a result, it’s still challenging for them to generalize well to diverse objects of novel categories in practice.
In short, the requirement for human annotation of objects in training data has been a bottleneck on the road towards true OVD, and the ultimate goal of a universal object-detection system.
One potential avenue for improvement is to enable OVD models to utilize a larger set of base classes of diverse objects by reducing the need for manual annotation. Hence, we ask: can bounding-box annotations for a diverse set of objects be generated automatically, without human labeling effort?
Recent progress on vision-language pre-training gives us hope. Vision-language models are pre-trained with large-scale image-caption pairs from the web. They show amazing zero-shot performance on image classification, as well as promising results on tasks related to word-region alignment (aligning text with specific regions of a given image), such as referring expression comprehension (finding a target object in an image described by a natural language expression). This implies strong localization ability: given an image and the name of an object in that image, a vision-language model can localize where that object is, without the help of any extra supervision.
Motivated by these observations, we developed an AI model that performs pseudo bounding-box open-vocabulary object detection. ("Pseudo" refers to annotations or labels that are not created by humans but rather artificially generated by an AI model.) Our work improves OVD by using pseudo bounding-box annotations generated from large-scale image-caption pairs, taking advantage of the localization ability of pre-trained vision-language models. While the pseudo bounding-box annotations generated by our model are different from traditional human-made labels, they achieve state-of-the-art (SOTA) performance results (described later), and may potentially provide substantial time and cost savings by reducing or eliminating the human effort typically required to create these labels.
Figure 2 gives a quick overview of our method (on the right) versus previous methods (on the left). The latter depend on human annotations of predefined base classes during training, and try to generalize to objects of novel classes during the inference stage. In contrast, we designed a pseudo bounding-box label generation strategy, using pre-trained vision-language models to automatically obtain box annotations of a diverse set of objects from existing image-caption datasets. Then we use these pseudo labels to train our open vocabulary object detector to make it generalizable to more diverse objects. Note the bigger cylinders on the right, indicating that our method results in accurate detection of a far greater number of novel object classes.
Figure 2. Previous methods (left) rely on human-provided box-level annotations of predefined base classes during training and attempt to generalize to objects of novel classes during inference. Our method (right) generates pseudo bounding-box annotations from large-scale image-caption pairs by leveraging the localization ability of pre-trained vision-language models. Then, we utilize the pseudo bounding-box annotations to improve our open vocabulary object detector.
Our method also opens a path to true OVD: a detector that can recognize the diverse, countless objects of the real world. In our method, pseudo labels for rare objects can be generated automatically and efficiently, as long as existing image-caption pairs cover such objects. By removing the bottleneck of requiring human-created annotations, we can at last make great strides along the path to a universal object-detection system.
With such a universal object detector available as a backbone, solutions to many vision-related real-world problems can be greatly enhanced, including robot navigation, autonomous driving, and intelligent transportation.
Our framework contains two components: a pseudo bounding-box label generator and an open vocabulary object detector. Our pseudo label generator automatically generates bounding-box labels for a diverse set of objects by leveraging a pre-trained vision-language model. We then train our detector directly with the generated pseudo labels.
Figure 3 illustrates the overall procedure of our pseudo label generation. Our goal is to generate pseudo bounding-box annotations for objects of interest in an image, by leveraging the implicit alignment between regions in the image and words in its corresponding caption in a pre-trained vision-language model.
Figure 3. Illustration of our pseudo bounding-box annotation generation process.
The input to the system is an image-caption pair. We use image and text encoders to extract visual and text embeddings of the image and its corresponding caption, and then obtain multi-modal features through image-text interaction via cross-attention. We maintain a predefined vocabulary of objects of interest. For each object of interest mentioned in the caption (for example, racket in this figure), we use Grad-CAM to visualize its activation map in the image; this map indicates how much each image region contributes to the final representation of the object word. Finally, we determine the pseudo bounding-box label of the object by selecting the object proposal that has the largest overlap with the activation map.
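To make the final selection step concrete, here is a minimal NumPy sketch. It is a simplified stand-in, not our exact implementation: the activation map and proposal boxes are toy values, and overlap is measured as IoU between each box and the binarized activation region (values at or above half of the map's maximum, mirroring the thresholding shown in Figure 4).

```python
import numpy as np

def select_pseudo_box(activation, proposals):
    """Pick the proposal that best overlaps the Grad-CAM activation.

    activation: (H, W) array of Grad-CAM values for one object word.
    proposals:  list of (x1, y1, x2, y2) candidate boxes.
    Returns the proposal with the highest IoU against the binarized
    activation region (a simplified stand-in for the paper's scoring).
    """
    # Zero out weak regions: keep values >= half of the map's maximum.
    mask = activation >= 0.5 * activation.max()
    best_box, best_iou = None, -1.0
    for (x1, y1, x2, y2) in proposals:
        box_mask = np.zeros_like(mask)
        box_mask[y1:y2, x1:x2] = True
        inter = np.logical_and(mask, box_mask).sum()
        union = np.logical_or(mask, box_mask).sum()
        iou = inter / union if union else 0.0
        if iou > best_iou:
            best_box, best_iou = (x1, y1, x2, y2), iou
    return best_box

# Toy activation: a bright patch covering rows 2-5, columns 3-6.
act = np.zeros((10, 10))
act[2:6, 3:7] = 1.0
boxes = [(0, 0, 5, 5), (3, 2, 7, 6), (6, 6, 9, 9)]
print(select_pseudo_box(act, boxes))  # (3, 2, 7, 6): the box on the patch
```

In practice the proposals come from a pre-trained proposal generator and the activation map from Grad-CAM over the cross-attention layers, but the selection logic reduces to this kind of overlap-and-argmax step.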
Figure 4 shows some examples of the activation maps, which demonstrate that the activated regions correspond well with the relevant regions. The generated bounding boxes are of good quality. When they are directly used to train an open vocabulary object detector, the object detector significantly outperforms the current SOTA open-vocabulary/zero-shot object detectors.
Figure 4. Visualization of some activation maps. Colored blocks indicate values of Grad-CAM activation maps in the corresponding regions. We zero out blocks with values smaller than half of the map's maximum, so the main focus is highlighted. Black boxes indicate object proposals, and red boxes indicate the final selected pseudo bounding-box labels.
After we get pseudo bounding-box labels, we can use them to train an open vocabulary object detector. Since our pseudo-label generation is disentangled from the detector training process, our framework can accommodate detectors with any architecture. In this work, we focus on the open vocabulary scenario where a detector aims at detecting arbitrary objects during inference.
In our detector, an image is processed by a feature extractor followed by a region proposal network. Region-based visual embeddings are then computed by applying RoI pooling or RoI align over the region proposals. During training, the visual and text embeddings of the same object are encouraged to be similar.
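One common way to encourage this similarity is a softmax over region-word cosine similarities, trained with cross-entropy so that a region's embedding moves toward its matched word's embedding and away from the rest of the vocabulary. The sketch below illustrates that idea; the function name, temperature value, and NumPy formulation are illustrative assumptions, not our exact training loss:

```python
import numpy as np

def region_word_loss(region_emb, word_embs, target_idx, tau=0.1):
    """Cross-entropy over region-word cosine similarities.

    region_emb: (D,) visual embedding of one region proposal.
    word_embs:  (V, D) text embeddings of the object vocabulary.
    target_idx: index of the word this region is labeled with.
    """
    r = region_emb / np.linalg.norm(region_emb)
    w = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    logits = w @ r / tau                 # cosine similarities / temperature
    logits -= logits.max()               # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target_idx])

# Toy check: a region embedding identical to its matched word's embedding
# should incur a lower loss than one matched to a different word.
rng = np.random.default_rng(0)
vocab = rng.normal(size=(5, 8))          # 5 vocabulary words, 8-dim embeddings
print(region_word_loss(vocab[2], vocab, target_idx=2)
      < region_word_loss(vocab[0], vocab, target_idx=2))
```

Because the class scores come from similarity to text embeddings rather than a fixed classification head, the detector can score any category name at inference time, which is what makes the open-vocabulary setting possible.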
Figure 5. Illustration of our detector.
Since our method for generating pseudo bounding-box labels is fully automated, with no manual intervention, the size and diversity of the training data (including the number of training object categories) can be greatly increased. This enables our approach to outperform existing zero-shot/open vocabulary detection methods that are trained with a limited set of base categories.
We evaluate the effectiveness of our method by comparing it with the SOTA zero-shot and open vocabulary object detectors on four widely used datasets: COCO, PASCAL VOC, Objects365, and LVIS. Experimental results show that our method outperforms the best open vocabulary detection method by 8% AP (Average Precision, or accuracy, when detecting objects) on novel objects on COCO, when both of the methods are fine-tuned with COCO base categories. This means that significantly more objects are accurately detected. Surprisingly, we also found that even when not fine-tuned with COCO base categories, our method can still outperform the fine-tuned SOTA baseline by 3% AP.
We also evaluate the generalization performance of our method on datasets other than COCO, to be presented at ECCV 2022. Under this setting, our method outperforms existing approaches by 6.3%, 2.3%, and 2.8% AP on the PASCAL VOC, Objects365, and LVIS datasets, respectively.
To get a more precise picture of our method’s improvements, we quantitatively compared it with existing baselines, measuring AP for our method versus the baselines on validation datasets. A validation dataset is usually used to provide an unbiased evaluation of a model fit on the training dataset. Since the formal test set is not available, we follow previous work and evaluate our method on the COCO validation dataset.
Table 1 shows our model’s performance when fine-tuned – and not fine-tuned – with COCO base categories. Fine-tuning with COCO base categories means that after our model is trained with our pseudo bounding-boxes, we further train our detector with the human-provided bounding box labels of COCO base categories, which is the same procedure as our baselines.
After our model is fine-tuned using COCO base categories, our method pre-trained with pseudo labels outperforms our strongest baseline (Zareian et al.) by 8% AP on novel categories. When not fine-tuned using COCO base categories and only trained with generated pseudo labels, our method achieves 25.8% AP on novel categories, which still outperforms the SOTA method (Zareian et al.) by 3% AP.
Generalization to a wide range of datasets is also important for an open vocabulary object detector, since it makes the detector directly usable as an out-of-the-box method in the wild. Table 2 shows the generalization performance of detectors on different datasets, where neither our method nor our baseline is trained on these datasets. Since Objects365 and LVIS have large sets of diverse object categories, evaluation results on these datasets are more representative of generalization ability.
Results show that our method achieves better performance than Zareian et al. on all three datasets when both of the methods are fine-tuned with COCO base categories. When not fine-tuned with COCO base categories, our method still outperforms Zareian et al. (fine-tuned with COCO base categories) on the Objects365 and LVIS datasets.
Beyond the quantitative results, we also present a case study in Figure 6 to show some predicted bounding boxes provided by our model.
Figure 6. Some example results from our open vocabulary detector. The categories shown here are from novel categories in COCO.
We believe our work will positively influence OVD research going forward. By automating the process of creating bounding-box annotations of training data, we have taken an important step towards reducing – and perhaps eventually eliminating – the need for human time and effort during the annotation process, and the expense that entails.
While we feel its overall impact will be strongly positive, our method does present some potential for negative impact. Since our pseudo label generator mines annotations of objects from input captions without human intervention, our pseudo labels might be biased, due to any bias that may be embedded in the vision-language model and/or the image-caption pairs. Manually filtering out biased image-caption data samples or vocabulary object names could be two effective solutions for this potential issue. However, because human annotators will likely introduce their own inherent biases during the annotation process, automating the annotation process does not necessarily lead to more biased results. Future work could evaluate the bias of automated versus manual annotations.
Key takeaways and results:
Future directions and goals:
Salesforce AI Research invites you to dive deeper into the concepts discussed in this blog post (links below). Connect with us on social media and our website to get regular updates on this and other research projects.
Chen Xing is a Research Scientist at Salesforce Research in Palo Alto, CA. She received her Ph.D. degree from Nankai University, China. Her work lies in the domain of “smart applications” of large pre-trained models: she is passionate about effectively pre-training large language or vision-language models in an unsupervised manner, and applying them intelligently to benefit downstream tasks.
During this work, Mingfei Gao was a Senior Research Scientist at Salesforce Research in Palo Alto, CA. She received her Ph.D. degree from the Computer Science Department at the University of Maryland College Park. Her research interests include 3D scene understanding, object detection, vision and language, action recognition, weakly/self-supervised learning, and multimodal learning. Mingfei is now a Senior Applied Scientist at Apple in Sunnyvale, CA.
Donald Rose is a Technical Writer at Salesforce AI Research. He earned his Ph.D. degree in Information and Computer Science at the University of California, Irvine. Specializing in technical content creation and editing, Dr. Rose works on multiple projects — including blog posts, video scripts, newsletters, media/PR material, social media, and writing workshops. His passions include helping researchers transform their work into publications geared towards a wider audience, leveraging existing content in multiple media modes, and writing think pieces about AI.