Google AI Blog

Modular visual question answering via code generation


CodeVQA is a framework for visual question answering that uses code generation. Given a question and one or more images, it generates a Python program and executes that program to determine the answer. The generated program composes visual functions that process the images, such as object counting and object localization. By providing prompts that include descriptions of these functions along with in-context examples, CodeVQA guides a large language model (LLM) to generate the appropriate Python program.
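To make the idea concrete, here is a minimal sketch of the kind of program the LLM might generate for a spatial reasoning question such as "Is the umbrella to the left of the person?". The primitive names (`query`, `get_pos`) and their stub implementations are assumptions for illustration; in the real system these primitives would invoke pretrained vision models.

```python
# Illustrative sketch of a CodeVQA-style generated program.
# The visual primitives below are stubs with canned outputs; in the real
# system they would call pretrained vision models. Names are assumptions.

def query(image, question):
    """Answer a simple yes/no question about an image (stubbed)."""
    canned = {"Is there an umbrella?": "yes"}
    return canned.get(question, "no")

def get_pos(image, object_name):
    """Return the (x, y) center of an object in the image (stubbed)."""
    positions = {"umbrella": (40, 60), "person": (120, 70)}
    return positions[object_name]

# A program the LLM might generate for the question:
# "Is the umbrella to the left of the person?"
def generated_program(image):
    umbrella_x, _ = get_pos(image, "umbrella")
    person_x, _ = get_pos(image, "person")
    # Smaller x-coordinate means further to the left.
    return "yes" if umbrella_x < person_x else "no"

answer = generated_program(image=None)
print(answer)  # -> yes (umbrella at x=40 is left of person at x=120)
```

Because the answer is computed by explicit program logic rather than a single end-to-end model call, multi-step questions decompose into simple primitive calls that each vision model handles well.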

The accuracy of CodeVQA was evaluated on three visual reasoning datasets: GQA, COVR, and NLVR2. Compared to the few-shot Plug-and-Play VQA baseline, CodeVQA consistently improved performance across all three datasets. On GQA, CodeVQA outperformed the baseline by approximately 30% on spatial reasoning questions, 4% on "and" questions, and 3% on "or" questions. On COVR, CodeVQA's improvement over the baseline grew as the number of input images increased. Overall, these results demonstrate the effectiveness of code generation for few-shot visual question answering.