Because video touches on so many aspects of people’s daily lives, it has become a dominant form of media. Real-world video applications, including video captioning, video content analysis, and video question answering (VideoQA), have grown in popularity alongside this abundance of video content. They rely on models that can associate text or spoken words with video sequences. VideoQA is particularly difficult because it requires understanding both temporal information, i.e., how things move and interact over time, and semantic information, such as the objects in a scene. In addition, processing every frame of a video to learn spatiotemporal information is computationally expensive, since videos contain large numbers of frames.

In their recent work titled “Video Question Answering with Iterative Video-Text Co-Tokenization”, researchers at Google have developed a new method of learning from video and text. This iterative co-tokenization approach efficiently combines spatial, temporal, and linguistic information for VideoQA. It uses multiple streams, each with its own backbone model operating at a different video scale, to produce video representations that capture different properties, such as high spatial resolution or long temporal duration. The model then uses a co-tokenization module to build efficient representations by fusing the video streams with the text. Compared with previous methods, the model is highly efficient, using at least 50% less compute while delivering better performance.
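To make the multi-stream idea concrete, below is a minimal sketch, assuming a PyTorch setup, of how a clip might be sampled at two different space-time scales and encoded by separate backbones. The `TinyVideoEncoder` class, the specific sampling rates, and the `multi_stream_features` helper are illustrative assumptions, not the authors’ implementation.

```python
# Illustrative sketch (not the paper's code): encode the same clip at two
# space-time scales so one stream keeps spatial detail and the other keeps
# temporal coverage.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyVideoEncoder(nn.Module):
    """Stand-in for a real video backbone (e.g., a 3D CNN); returns token features."""

    def __init__(self, in_ch=3, dim=256):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, dim, kernel_size=3, stride=2, padding=1)

    def forward(self, clip):                        # clip: (B, C, T, H, W)
        feat = self.conv(clip)                      # (B, dim, T', H', W')
        return feat.flatten(2).transpose(1, 2)      # (B, T'*H'*W', dim) tokens


def multi_stream_features(video, enc_hi, enc_lo):
    """video: (B, C, T, H, W) -> two token sets with complementary trade-offs."""
    # Stream 1: fewer frames, full spatial resolution (favors spatial detail).
    hi_res = video[:, :, ::4]
    # Stream 2: all frames, halved spatial resolution (favors temporal coverage).
    lo_res = F.interpolate(video, scale_factor=(1.0, 0.5, 0.5), mode="trilinear")
    return enc_hi(hi_res), enc_lo(lo_res)


if __name__ == "__main__":
    video = torch.randn(2, 3, 32, 64, 64)           # a small batch of RGB clips
    enc_hi, enc_lo = TinyVideoEncoder(), TinyVideoEncoder()
    tokens_hi, tokens_lo = multi_stream_features(video, enc_hi, enc_lo)
    print(tokens_hi.shape, tokens_lo.shape)
```

The per-stream token sets produced this way are what the co-tokenization module then fuses with the question text, as described next.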

The main goal of the model is to produce features from text and video that allow the two inputs to interact. A second goal is to do this efficiently, which is crucial for videos because they contain tens to hundreds of input frames. The model fuses the joint video-language input into a smaller set of tokens that jointly and efficiently represent both modalities. During tokenization, both modalities are used to create a combined, compact representation, which is then fed to a transformer layer to produce a higher-level representation. A challenge here is that the visual features generally do not correspond directly to the text features. The researchers address this by adding two learnable linear layers that unify the dimensions of the visual and textual features prior to tokenization. Because a single tokenization step would prevent further interaction between the two modalities, the resulting feature representation is used to interact with the video input features again and produce another set of tokenized features, which are passed to the next transformer layer. This iterative process creates new features, or tokens, representing a continual refinement of the joint representation of the two modalities. Finally, the features are fed to a decoder, which generates the output text.
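The following sketch, again a rough illustration rather than the authors’ code, shows how such an iterative co-tokenization loop could be wired up with standard PyTorch modules: two learnable linear projections unify the modality dimensions, a small set of learned tokens repeatedly gathers information from the joint video-text features via cross-attention, and each new token set is refined by a transformer layer before being handed to a text decoder. All dimensions, the choice of cross-attention for the tokenization step, and all names here are assumptions.

```python
# Hypothetical sketch of an iterative co-tokenization loop; not the authors' code.
import torch
import torch.nn as nn


class IterativeCoTokenizer(nn.Module):
    def __init__(self, video_dim, text_dim, dim=256, num_tokens=16, iters=3):
        super().__init__()
        # Two learnable linear layers that bring both modalities to one dimension.
        self.video_proj = nn.Linear(video_dim, dim)
        self.text_proj = nn.Linear(text_dim, dim)
        # A small set of learned tokens that will summarize the joint input.
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))
        # One tokenization (cross-attention) + transformer layer per iteration.
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, 4, batch_first=True) for _ in range(iters)]
        )
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, 4, batch_first=True) for _ in range(iters)]
        )

    def forward(self, video_feats, text_feats):
        # video_feats: (B, Nv, video_dim), text_feats: (B, Nt, text_dim)
        joint = torch.cat(
            [self.video_proj(video_feats), self.text_proj(text_feats)], dim=1
        )
        tokens = self.tokens.unsqueeze(0).expand(joint.shape[0], -1, -1)
        for attn, layer in zip(self.cross_attn, self.layers):
            # Tokenize: the current summary tokens gather information from the
            # joint video-text features, producing a new compact token set.
            tokens, _ = attn(tokens, joint, joint)
            # Refine the combined representation with a transformer layer.
            tokens = layer(tokens)
        return tokens  # (B, num_tokens, dim), to be fed to a text decoder


if __name__ == "__main__":
    model = IterativeCoTokenizer(video_dim=512, text_dim=300)
    out = model(torch.randn(2, 196, 512), torch.randn(2, 20, 300))
    print(out.shape)  # torch.Size([2, 16, 256])
```

In this sketch, the small number of summary tokens (16 here) is the efficiency lever: the transformer layers operate on that compact set rather than on every video feature, which is where the computational savings described above come from.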

The model is pre-trained on several datasets before being fine-tuned. Pre-training used the HowTo100M dataset, which consists of videos that can be automatically annotated with text via speech recognition. The model surpasses the previous state of the art: on the MSRVTT-QA, MSVD-QA, and IVQA benchmarks, the video-language iterative co-tokenization model outperformed other leading models while remaining relatively small. Moreover, learning the co-tokenization iteratively yields significant computational savings on video-text learning tasks. This method of learning from video and text emphasizes joint learning across modalities and tackles the important and difficult problem of video question answering. It produces modest model sizes and can be improved further with larger models and more data. Google Research hopes this work will spur further research in vision-language learning, enabling more fluid interaction with vision-based media.

This article is written as a research summary article by Marktechpost staff based on the research paper 'Video Question Answering with Iterative Video-Text Co-Tokenization'. All credit for this research goes to the researchers on this project. Check out the paper and reference article.

Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Goa. She is passionate about the fields of machine learning, natural language processing, and web development, and enjoys learning more about the technical field by participating in various challenges.