Few-shot learning capabilities, which allow machine learning models to adapt to new tasks from just a few demonstrations, are widely regarded as a key ingredient of next-generation artificial intelligence systems. While few-shot learning has become a popular line of research in recent years, it remains particularly challenging in multimodal settings such as those addressed by visual language models (VLMs).

In the new paper Flamingo: a Visual Language Model for Few-Shot Learning, a DeepMind research team introduces Flamingo, a new family of visual language models (VLMs) that can handle multimodal tasks such as captioning, visual dialogue, classification, and visual question answering when given just a few input/output examples.

The team summarizes the main contributions of their proposed Flamingo framework as follows:

  1. A novel architecture that accepts arbitrarily interleaved text and visual data as input and generates free-form text as output.
  2. Architectural innovations and training strategies that effectively leverage large pre-trained vision-only and language-only models, preserving the benefits of these initial models while effectively fusing the modalities. Starting from Chinchilla, a 70B state-of-the-art LM (Hoffmann et al., 2022), the team trains Flamingo, an 80B parameter VLM.
  3. Efficient ways to adapt to visual inputs of different sizes, making Flamingo applicable to both images and videos (see the sketch after this list).
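To make the third point concrete, here is a minimal PyTorch sketch (our own illustration under assumed shapes and module names, not code from the paper) of how an image and a video of different lengths can both be reduced to the same fixed number of visual tokens: the image is treated as a one-frame video, each frame is encoded independently, and an attention-based resampler with learned latent queries always emits the same number of output tokens.

```python
import torch
import torch.nn as nn

class ToyFrameEncoder(nn.Module):
    """Stand-in for the frozen vision encoder (illustrative only)."""
    def __init__(self, dim=512, patches=16):
        super().__init__()
        self.dim, self.patches = dim, patches
        self.proj = nn.Linear(3 * 32 * 32, patches * dim)  # toy per-frame patch features

    def forward(self, frames):                              # frames: (B, T, 3, 32, 32)
        b, t = frames.shape[:2]
        feats = self.proj(frames.flatten(2))                # (B, T, patches * dim)
        return feats.view(b, t * self.patches, self.dim)    # flatten space and time

class ToyResampler(nn.Module):
    """Attention-based resampler: variable-length features -> fixed visual tokens."""
    def __init__(self, dim=512, num_latents=64, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))  # learned queries
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                               # feats: (B, N, dim), any N
        q = self.latents.unsqueeze(0).expand(feats.size(0), -1, -1)
        tokens, _ = self.attn(q, feats, feats)              # always (B, num_latents, dim)
        return tokens

encoder, resampler = ToyFrameEncoder(), ToyResampler()
image = torch.randn(1, 1, 3, 32, 32)    # an image is treated as a 1-frame "video"
video = torch.randn(1, 8, 3, 32, 32)    # an 8-frame video
print(resampler(encoder(image)).shape)  # torch.Size([1, 64, 512])
print(resampler(encoder(video)).shape)  # torch.Size([1, 64, 512])
```

Because the resampler always outputs a fixed number of tokens, the rest of the model never needs to know whether the input was a single image or a long video.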

Flamingo takes text interleaved with images/videos as input and outputs free-form text. It is expressive enough to tackle both open-ended tasks that require text generation (e.g. visual question answering and captioning) and closed-ended classification tasks (e.g. choosing the best category or answer from a fixed set).
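This few-shot interface can be pictured with a small, self-contained sketch (our own illustration; the exact special-token names are assumptions): support examples and the query are laid out as interleaved (visual input, text) chunks, with placeholder tags marking where the visual tokens will be injected.

```python
from typing import List, Optional, Tuple

def build_prompt(examples: List[Tuple[Optional[str], str]],
                 query_image: str) -> Tuple[str, List[str]]:
    """examples: list of (image reference or None, text) support pairs."""
    chunks, images = [], []
    for image, text in examples:
        if image is not None:
            images.append(image)
            chunks.append(f"<image>{text}<EOC>")  # <EOC>: illustrative end-of-chunk marker
        else:
            chunks.append(f"{text}<EOC>")
    images.append(query_image)
    chunks.append("<image>Output:")               # the query chunk to be completed
    return "".join(chunks), images

prompt, images = build_prompt(
    examples=[("cat.jpg", "Output: a cat wearing sunglasses."),
              ("dog.jpg", "Output: a dog chasing a ball.")],
    query_image="flamingo.jpg",
)
print(prompt)
# <image>Output: a cat wearing sunglasses.<EOC><image>Output: a dog chasing a ball.<EOC><image>Output:
print(images)  # ['cat.jpg', 'dog.jpg', 'flamingo.jpg']
```

For closed-ended tasks, each candidate answer can be appended to such a prompt in turn and scored by its likelihood under the model, with the highest-scoring candidate taken as the prediction.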

For the visual processing side of Flamingo, the team pre-trains a vision encoder with a CLIP-style contrastive text-image approach (Radford et al., 2021), which extracts relevant semantic spatial features (colour, shape, nature, object positions, etc.) from visual data. The language side of the model, meanwhile, relies on a pre-trained autoregressive language model (LM) to equip Flamingo with strong generative language capabilities and provide access to the rich knowledge stored in the LM weights.
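The contrastive pre-training objective is essentially the symmetric InfoNCE loss popularized by CLIP; a generic PyTorch sketch (not DeepMind's training code) looks like this: matched image/text pairs in a batch are pulled together while mismatched pairs are pushed apart.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random 256-d embeddings for a batch of 8 pairs
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```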

The researchers also introduce two learnable architecture components, a Perceiver Resampler and cross-attention layers, to seamlessly bridge the pre-trained vision and language models. The Perceiver Resampler takes spatiotemporal features from the vision encoder and produces a fixed set of visual tokens. These visual tokens are then used to condition the frozen LM via freshly initialized cross-attention layers interleaved between the pre-trained LM layers, allowing the model to incorporate visual information into the next-token prediction task.
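The bridging layers can be sketched as follows (a simplified PyTorch illustration of the idea, not the paper's implementation): a cross-attention block whose output is scaled by tanh gates initialized at zero is inserted alongside the frozen LM blocks, so text hidden states can attend to the resampler's visual tokens while the pre-trained LM's behaviour is left untouched at initialization.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Sketch of a gated cross-attention layer placed between frozen LM blocks."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))
        # tanh gates start at 0, so the frozen LM is unchanged at initialization
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, visual_tokens):
        # text hidden states attend to the visual tokens from the resampler
        attended, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        text_hidden = text_hidden + torch.tanh(self.attn_gate) * attended
        text_hidden = text_hidden + torch.tanh(self.ff_gate) * self.ff(text_hidden)
        return text_hidden

# Toy usage: a sequence of 10 text tokens conditioned on 64 visual tokens
block = GatedCrossAttentionBlock()
text = torch.randn(1, 10, 512)
visual = torch.randn(1, 64, 512)
out = block(text, visual)
print(torch.allclose(out, text))  # True: zero gates leave the LM stream untouched
```

Because the gates start at zero, training can blend in visual information gradually without disrupting the language model's pre-trained capabilities.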

In their empirical study, the team evaluated the few-shot learning performance of the Flamingo models on 16 multimodal language and image/video understanding tasks.

In the evaluations, the proposed Flamingo models outperformed state-of-the-art approaches such as CLIP and Florence on 6 of the 16 tasks while using only 32 task-specific examples, representing approximately 1000 times less task-specific training data than the baselines. When a larger annotation budget was provided, the fine-tuned Flamingo also achieved new state-of-the-art results on five additional benchmarks: VQAv2, VATEX, VizWiz, MSRVTTQA, and HatefulMemes.

A Flamingo PyTorch implementation is available on GitHub. The paper Flamingo: a Visual Language Model for Few-Shot Learning is on arXiv.


Author: Hecate He | Editor: Michael Sarazen


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.