Video Annotator: Building Video Classifiers Using Vision-Language Models and Active Learning
Introduction
Video Annotator (VA) is a framework that leverages active learning techniques and the zero-shot capabilities of large vision-language models to guide users toward progressively harder examples, improving sample efficiency and keeping annotation costs low. It supports a continuous annotation process, allowing for rapid model deployment, quality monitoring, and swift fixes on edge cases.
Problem
High-quality, consistent annotations are crucial for developing robust machine learning models. A lengthy cycle of annotation, model training, review, and deployment slows iteration and can lead to model drift, eroding the model's usefulness and stakeholder trust.
Video Classification
Video classification involves assigning labels to video clips, often accompanied by probability scores. VA helps build binary video classifiers that enable scalable scoring and retrieval across a vast content catalog.
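As an illustration of this setup, the sketch below scores precomputed clip embeddings with a lightweight binary classifier; the embedding dimensionality, classifier choice, and threshold are assumptions for the example, not details of VA itself.

```python
# Illustrative sketch: scoring precomputed clip embeddings with a binary
# classifier. Embedding size, model choice, and threshold are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in corpus: 1,000 clips represented by 512-d embeddings (e.g. from a
# vision-language model), with a handful of human labels for one binary task.
clip_embeddings = rng.normal(size=(1000, 512)).astype(np.float32)
labeled_idx = rng.choice(1000, size=40, replace=False)
labels = rng.integers(0, 2, size=40)  # 1 = label applies, 0 = it does not

# Fit a lightweight classifier on the labeled subset.
clf = LogisticRegression(max_iter=1000)
clf.fit(clip_embeddings[labeled_idx], labels)

# Score the whole corpus; probabilities support both thresholded
# classification and ranked retrieval.
scores = clf.predict_proba(clip_embeddings)[:, 1]
top_clips = np.argsort(-scores)[:20]       # retrieval: highest-scoring clips
positives = np.flatnonzero(scores >= 0.5)  # classification at a fixed threshold
print(f"{len(positives)} clips scored above 0.5; top clip id: {top_clips[0]}")
```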
Video Annotator (VA)
Step 1 - Search
Users search a large corpus to select initial examples for annotation. VA then builds a binary classifier over video embeddings and presents examples for further annotation and refinement.
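One way to picture this search step, assuming CLIP-style joint text/video embeddings, is to rank clips by cosine similarity against a free-text query; the embeddings below are synthetic placeholders standing in for vision-language model outputs.

```python
# Illustrative sketch of the search step: rank clips against a free-text query
# using cosine similarity in a shared text/video embedding space. The
# embeddings here are placeholders; in practice both sides would come from a
# vision-language model.
import numpy as np

def cosine_similarity(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and every corpus row."""
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return corpus @ query

rng = np.random.default_rng(1)
clip_embeddings = rng.normal(size=(1000, 512)).astype(np.float32)  # video side
query_embedding = rng.normal(size=512).astype(np.float32)          # text side, e.g. a query phrase

# Surface the most similar clips so the user can pick initial positives
# (and obvious negatives) to seed the first classifier.
similarities = cosine_similarity(query_embedding, clip_embeddings)
seed_candidates = np.argsort(-similarities)[:24]
print("candidate clip ids for initial annotation:", seed_candidates[:5])
```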
Step 2 - Annotation
Annotators label examples served in several feeds, including a random feed that keeps the labeled set diverse. The process can be iterated as needed to improve the classifier.
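The sketch below shows one common way such feeds might be assembled from classifier scores; the feed names, sizes, and the uncertainty-sampling rule for the borderline feed are illustrative assumptions rather than VA's exact design.

```python
# Illustrative sketch of building annotation feeds from classifier scores.
# The specific feeds and their sizes are assumptions; the random feed mirrors
# the one described above, and the "borderline" feed is classic uncertainty
# sampling (scores closest to 0.5).
import numpy as np

def build_feeds(scores: np.ndarray, already_labeled: set[int],
                feed_size: int = 12, seed: int = 0) -> dict[str, np.ndarray]:
    rng = np.random.default_rng(seed)
    unlabeled = np.array([i for i in range(len(scores)) if i not in already_labeled])

    by_score = unlabeled[np.argsort(-scores[unlabeled])]
    by_uncertainty = unlabeled[np.argsort(np.abs(scores[unlabeled] - 0.5))]

    return {
        "top_scoring": by_score[:feed_size],       # likely positives to confirm
        "borderline": by_uncertainty[:feed_size],  # where the model is least sure
        "random": rng.choice(unlabeled, size=feed_size, replace=False),  # keeps coverage diverse
    }

# After each round, annotators label clips from these feeds, the classifier is
# retrained on the enlarged labeled set, and the feeds are rebuilt.
scores = np.random.default_rng(2).uniform(size=1000)
feeds = build_feeds(scores, already_labeled={3, 17, 256})
print({name: ids[:3].tolist() for name, ids in feeds.items()})
```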
Step 3 - Review
The annotated clips are presented to the user for review.
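A minimal sketch of what such a review pass could look like, assuming the user wants to see each annotated clip alongside the current model's score; the disagreement-based ordering is an assumption for illustration, not VA's documented behavior.

```python
# Illustrative sketch of a review pass: show each annotated clip with the
# current model's score so the user can spot disagreements before the next
# iteration. The "most suspicious first" ordering is an assumption.
import numpy as np

def review_queue(labels: dict[int, int], scores: np.ndarray) -> list[tuple[int, int, float]]:
    """Return (clip_id, human_label, model_score), most suspicious first."""
    rows = [(clip_id, label, float(scores[clip_id])) for clip_id, label in labels.items()]
    # A clip looks suspicious when the model's score contradicts its label.
    return sorted(rows, key=lambda r: abs(r[1] - r[2]), reverse=True)

scores = np.random.default_rng(3).uniform(size=1000)
labels = {3: 1, 17: 0, 256: 1, 512: 0}
for clip_id, label, score in review_queue(labels, scores):
    print(f"clip {clip_id}: label={label}, model score={score:.2f}")
```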
Experiments
VA was evaluated using a diverse set of 56 labels annotated by three video experts on a corpus of 500k shots. VA outperformed baseline methods in creating high-quality video classifiers.
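The sketch below shows how classifiers built with different annotation strategies could be compared on a held-out labeled set; Average Precision is used here as a standard metric for ranked binary tasks, and the labels and scores are synthetic placeholders, not results from the evaluation.

```python
# Illustrative sketch: comparing classifiers on a held-out labeled set using
# Average Precision. The labels and scores below are synthetic placeholders,
# not experimental results.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(4)
held_out_labels = rng.integers(0, 2, size=500)  # human labels for one task
classifier_scores = {
    "baseline": rng.uniform(size=500),          # stand-in score vectors
    "video_annotator": rng.uniform(size=500),
}

for name, scores in classifier_scores.items():
    ap = average_precision_score(held_out_labels, scores)
    print(f"{name}: AP = {ap:.3f}")
```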
Conclusion
VA is an interactive framework that addresses challenges in training machine learning classifiers. It empowers domain experts to make improvements independently, fostering trust in the system. A dataset with 153k labels across 56 tasks and code to replicate experiments are released.