Video Annotator: Building Video Classifiers Using Vision-Language Models and Active Learning
Introduction
Video Annotator (VA) is a framework that leverages active learning techniques and the zero-shot capabilities of large vision-language models to guide users toward progressively harder examples, improving sample efficiency and keeping annotation costs low. It supports a continuous annotation process, allowing for rapid model deployment, quality monitoring, and swift fixes on edge cases.
Problem
High-quality, consistent annotations are crucial for developing robust machine learning models. A lengthy cycle of annotation, model training, review, and deployment slows iteration and can lead to model drift, eroding the model's usefulness and stakeholder trust.
Video Classification
Video classification involves assigning labels to video clips, often accompanied by probability scores. VA helps build binary video classifiers that enable scalable scoring and retrieval across a vast content catalog.
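As an illustration of this setup, the sketch below scores precomputed clip embeddings with a lightweight binary classifier; the embedding dimensionality, classifier choice, and threshold are assumptions for the example, not details of VA itself.

```python
# Illustrative sketch: scoring precomputed clip embeddings with a binary
# classifier. Embedding size, model choice, and threshold are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in corpus: 1,000 clips represented by 512-d embeddings (e.g. from a
# vision-language model), with a handful of human labels for one binary task.
clip_embeddings = rng.normal(size=(1000, 512)).astype(np.float32)
labeled_idx = rng.choice(1000, size=40, replace=False)
labels = rng.integers(0, 2, size=40)  # 1 = label applies, 0 = it does not

# Fit a lightweight classifier on the labeled subset.
clf = LogisticRegression(max_iter=1000)
clf.fit(clip_embeddings[labeled_idx], labels)

# Score the whole corpus; probabilities support both thresholded
# classification and ranked retrieval.
scores = clf.predict_proba(clip_embeddings)[:, 1]
top_clips = np.argsort(-scores)[:20]       # retrieval: highest-scoring clips
positives = np.flatnonzero(scores >= 0.5)  # classification at a fixed threshold
print(f"{len(positives)} clips scored above 0.5; top clip id: {top_clips[0]}")
```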
Video Annotator (VA)
Step 1 - Search
Users search a large corpus to select initial examples for annotation. VA then builds a binary classifier over video embeddings and presents examples for further annotation and refinement.
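One way to picture this search step, assuming CLIP-style joint text/video embeddings, is to rank clips by cosine similarity against a free-text query; the embeddings below are synthetic placeholders standing in for vision-language model outputs.

```python
# Illustrative sketch of the search step: rank clips against a free-text query
# using cosine similarity in a shared text/video embedding space. The
# embeddings here are placeholders; in practice both sides would come from a
# vision-language model.
import numpy as np

def cosine_similarity(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and every corpus row."""
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return corpus @ query

rng = np.random.default_rng(1)
clip_embeddings = rng.normal(size=(1000, 512)).astype(np.float32)  # video side
query_embedding = rng.normal(size=512).astype(np.float32)          # text side, e.g. a query phrase

# Surface the most similar clips so the user can pick initial positives
# (and obvious negatives) to seed the first classifier.
similarities = cosine_similarity(query_embedding, clip_embeddings)
seed_candidates = np.argsort(-similarities)[:24]
print("candidate clip ids for initial annotation:", seed_candidates[:5])
```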
Step 2 - Annotation
Annotators label examples served in several feeds, including a random feed that keeps the labeled set diverse. The process can be iterated as needed to improve the classifier.
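The sketch below shows one common way such feeds might be assembled from classifier scores; the feed names, sizes, and the uncertainty-sampling rule for the borderline feed are illustrative assumptions rather than VA's exact design.

```python
# Illustrative sketch of building annotation feeds from classifier scores.
# The specific feeds and their sizes are assumptions; the random feed mirrors
# the one described above, and the "borderline" feed is classic uncertainty
# sampling (scores closest to 0.5).
import numpy as np

def build_feeds(scores: np.ndarray, already_labeled: set[int],
                feed_size: int = 12, seed: int = 0) -> dict[str, np.ndarray]:
    rng = np.random.default_rng(seed)
    unlabeled = np.array([i for i in range(len(scores)) if i not in already_labeled])

    by_score = unlabeled[np.argsort(-scores[unlabeled])]
    by_uncertainty = unlabeled[np.argsort(np.abs(scores[unlabeled] - 0.5))]

    return {
        "top_scoring": by_score[:feed_size],       # likely positives to confirm
        "borderline": by_uncertainty[:feed_size],  # where the model is least sure
        "random": rng.choice(unlabeled, size=feed_size, replace=False),  # keeps coverage diverse
    }

# After each round, annotators label clips from these feeds, the classifier is
# retrained on the enlarged labeled set, and the feeds are rebuilt.
scores = np.random.default_rng(2).uniform(size=1000)
feeds = build_feeds(scores, already_labeled={3, 17, 256})
print({name: ids[:3].tolist() for name, ids in feeds.items()})
```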
Step 3 - Review
The annotated clips are presented to the user for review.
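A minimal sketch of what such a review pass could look like, assuming the user wants to see each annotated clip alongside the current model's score; the disagreement-based ordering is an assumption for illustration, not VA's documented behavior.

```python
# Illustrative sketch of a review pass: show each annotated clip with the
# current model's score so the user can spot disagreements before the next
# iteration. The "most suspicious first" ordering is an assumption.
import numpy as np

def review_queue(labels: dict[int, int], scores: np.ndarray) -> list[tuple[int, int, float]]:
    """Return (clip_id, human_label, model_score), most suspicious first."""
    rows = [(clip_id, label, float(scores[clip_id])) for clip_id, label in labels.items()]
    # A clip looks suspicious when the model's score contradicts its label.
    return sorted(rows, key=lambda r: abs(r[1] - r[2]), reverse=True)

scores = np.random.default_rng(3).uniform(size=1000)
labels = {3: 1, 17: 0, 256: 1, 512: 0}
for clip_id, label, score in review_queue(labels, scores):
    print(f"clip {clip_id}: label={label}, model score={score:.2f}")
```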
Experiments
VA was evaluated using a diverse set of 56 labels annotated by three video experts on a corpus of 500k shots. VA outperformed baseline methods in creating high-quality video classifiers.
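The sketch below shows how classifiers built with different annotation strategies could be compared on a held-out labeled set; Average Precision is used here as a standard metric for ranked binary tasks, and the labels and scores are synthetic placeholders, not results from the evaluation.

```python
# Illustrative sketch: comparing classifiers on a held-out labeled set using
# Average Precision. The labels and scores below are synthetic placeholders,
# not experimental results.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(4)
held_out_labels = rng.integers(0, 2, size=500)  # human labels for one task
classifier_scores = {
    "baseline": rng.uniform(size=500),          # stand-in score vectors
    "video_annotator": rng.uniform(size=500),
}

for name, scores in classifier_scores.items():
    ap = average_precision_score(held_out_labels, scores)
    print(f"{name}: AP = {ap:.3f}")
```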
Conclusion
VA is an interactive framework that addresses challenges in training machine learning classifiers. It empowers domain experts to make improvements independently, fostering trust in the system. A dataset with 153k labels across 56 tasks and code to replicate experiments are released.