Visual captions: Using large language models to augment video conferences with dynamic visuals
Introduction
In this research, we present Visual Captions, a system that uses verbal cues to augment synchronous video communication with real-time visuals. We fine-tuned a large language model to proactively suggest relevant visuals in open-vocabulary conversations, using a dataset (VC1.5K) that we curated for this purpose.
Design Space for Augmenting Verbal Communication with Dynamic Visuals
We invited 10 internal participants with varied backgrounds to discuss their needs for a real-time visual augmentation service. Participants proposed different levels of system proactivity, which we paired with varying levels of user interaction in the design space.
Visual Intent Prediction Model
To predict visuals that could supplement a conversation, we fine-tuned a large language model into a visual intent prediction model using the VC1.5K dataset. Given an open-vocabulary conversation, the model predicts the relevant visual content (what to show), visual source (where to get it), and visual type (how to show it).
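As a concrete illustration, a fine-tuning example pairs a transcribed sentence with a structured visual intent. The Python sketch below shows one plausible prompt/completion encoding; the field names and the "&lt;type&gt; of &lt;content&gt; from &lt;source&gt;" output template are illustrative assumptions, not the exact VC1.5K schema.

```python
import json

# A hedged sketch of one fine-tuning example. The prompt/completion fields and
# the "<type> of <content> from <source>" template are illustrative assumptions.

def make_training_example(last_sentence: str, content: str,
                          source: str, vtype: str) -> dict:
    """Pair a transcribed sentence with its visual intent annotation."""
    return {
        "prompt": f"Sentence: {last_sentence}\nVisual intent:",
        "completion": f" {vtype} of {content} from {source}",
    }

example = make_training_example(
    last_sentence="I went hiking near Mount Rainier last weekend.",
    content="Mount Rainier",
    source="online image search",
    vtype="photo",
)
print(json.dumps(example, indent=2))
```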
System Workflow
Visual Captions captures the user's speech, retrieves the last sentences of the transcript, feeds them into the visual intent prediction model every 100 ms, retrieves relevant visuals, and suggests them in real time; the user decides when and which visuals to display.
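Below is a minimal Python sketch of this loop under stated assumptions: the helper functions are hypothetical stand-ins for the speech recognizer, the fine-tuned intent model, the retrieval backend, and the suggestion UI.

```python
import time
from dataclasses import dataclass

@dataclass
class VisualIntent:
    content: str  # what to show, e.g. "Mount Rainier"
    source: str   # where to get it, e.g. "online image search"
    vtype: str    # how to show it, e.g. "photo"

def get_recent_transcript() -> str:
    """Stub: return the most recently transcribed sentence(s)."""
    return "I went hiking near Mount Rainier last weekend."

def predict_visual_intent(text: str) -> VisualIntent:
    """Stub: query the fine-tuned model for a visual intent."""
    return VisualIntent("Mount Rainier", "online image search", "photo")

def retrieve_visuals(intent: VisualIntent) -> list[str]:
    """Stub: fetch candidate visuals (here, placeholder URLs)."""
    return [f"https://example.com/search?q={intent.content}"]

def suggestion_loop(interval_s: float = 0.1) -> None:
    """Poll the transcript every 100 ms and surface new suggestions.
    Runs until the call is interrupted; the user decides what to display."""
    last_text = None
    while True:
        text = get_recent_transcript()
        if text and text != last_text:
            visuals = retrieve_visuals(predict_visual_intent(text))
            print("Suggested visuals:", visuals)
            last_text = text
        time.sleep(interval_s)
```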
User Study and Results
Participants found the suggested visuals informative, high-quality, and relevant to the original speech, and judged the predicted visual type and visual source to be accurate given the context of the conversation. They also had varying preferences for interacting with the system in situ and preferred different levels of proactivity in different social scenarios.
Conclusions and Future Directions
Visual Captions is a system for real-time visual augmentation of verbal communication; its visual intent prediction model was fine-tuned on the VC1.5K dataset of 1,595 visual intents across 15 topic categories. It has been deployed in ARChat, which facilitates video conferences in Google Meet by transcribing meetings and augmenting the camera video streams. Future directions include exploring more sophisticated interaction models and personalizing visual preferences.