TimeSformer: State-of-the-Art for video classification

A pure and simple attention-based solution for reaching SOTA on video classification.

Parth Chokhra
5 min read · Apr 17, 2021
Photo by Jakob Owens on Unsplash

The field of machine learning never ceases to amaze me. We have gone from classifying the digits 0–9 with CNNs to understanding language with Transformers (the post-BERT era). So what is the next big thing in AI 🤔?! In my opinion, it will be conquering speech (audio data) first, which started with the release of wav2vec 2.0 by Facebook AI, then moving on to video data with TimeSformer by Facebook AI, and finally tackling multimodal problems that combine both audio and video data. Seems pretty easy, huh! 😅

In this blog, we are going to briefly discuss the TimeSformer paper “Is Space-Time Attention All You Need for Video Understanding?” by Facebook AI. As per the official Facebook AI blog post, it is the first video architecture based purely on Transformers. It achieves SOTA performance on several video recognition benchmarks, including Kinetics-400, easily surpassing modern 3D convolutional neural networks (CNNs) while being roughly 3 times faster to train and 10 times faster at inference.

ALERT: When I say video, I am referring only to the visual part (a single modality), not the speech or audio data.

Figure: Comparing TimeSformer with other models (source: Facebook AI).

So let's dive deep into the paper and learn how it works. I will try not to throw random jargon at you and keep the blog in plain English. 😇

TimeSformer

The authors explain the name “TimeSformer”: it adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. (Spatial refers to space, temporal refers to time; spatiotemporal, or spatial-temporal, describes data collected across both space and time, for example video frames captured as time passes.)
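To make this concrete, here is a minimal sketch in PyTorch (my own illustration with toy dimensions and a hypothetical 16×16 patch size, not the authors' code) of how a clip can be decomposed into a sequence of frame-level patches before being fed to a Transformer:

```python
import torch

# Toy dimensions (assumed): batch of 2 clips, 8 frames, 3 channels, 224x224 pixels.
B, T, C, H, W = 2, 8, 3, 224, 224
patch_size = 16  # hypothetical choice; ViT-style models commonly use 16x16 patches

clip = torch.randn(B, T, C, H, W)

# Cut every frame into non-overlapping patch_size x patch_size patches.
patches = clip.unfold(3, patch_size, patch_size).unfold(4, patch_size, patch_size)
# (B, T, C, H/ps, W/ps, ps, ps) -> flatten each patch into a single vector
patches = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(B, T, -1, C * patch_size * patch_size)

print(patches.shape)  # torch.Size([2, 8, 196, 768]): 196 patches per frame, 768 values each
```

Each patch vector is then linearly projected to the model's embedding dimension and combined with positional information, just as in ViT, except that the resulting sequence now spans time (frames) as well as space (patch locations).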

The paper introduces the concept of “divided attention”, a self-attention scheme in which temporal attention and spatial attention are applied separately within each block. Among the designs the authors compared, this turned out to be the most accurate way to classify videos.

😒 Jargon!

According to Wikipedia, attention is the behavioural and cognitive process of selectively concentrating on a discrete aspect of information, whether considered subjective or objective, while ignoring other perceivable information. In other words, visual attention is the ability that allows us to focus on a certain region with “high resolution” and then adjust the focal point or draw inferences accordingly. (This is what we usually do when identifying people and things around us: we don't observe every minute detail, just a few important ones, unless you are Sherlock Holmes.)

Visual-spatial attention is a form of visual attention that involves directing attention to a location in space. Visual temporal attention is a special case of visual attention that involves directing attention to a specific instant of time.

So far, we can understand “divided attention” as capturing details by applying attention separately in space and in time (where, in our case, space is the content within each video frame and time is the order in which the frames pass by).
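To make “divided attention” a bit more tangible, here is a heavily simplified sketch in PyTorch (again my own illustration, not the authors' implementation): temporal self-attention lets each patch location attend across frames, and spatial self-attention then lets the patches within each frame attend to each other. The tensor layout, dimensions, and the omission of the classification token and MLP sub-layer are all simplifications for the example.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Simplified divided attention: temporal attention first, then spatial attention."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, patches, dim)
        B, T, P, D = x.shape

        # Temporal attention: each patch location attends across the T frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * P, T, D)
        h = self.norm1(xt)
        xt = xt + self.temporal_attn(h, h, h)[0]
        x = xt.reshape(B, P, T, D).permute(0, 2, 1, 3)

        # Spatial attention: the P patches inside each frame attend to each other.
        xs = x.reshape(B * T, P, D)
        h = self.norm2(xs)
        xs = xs + self.spatial_attn(h, h, h)[0]
        return xs.reshape(B, T, P, D)

block = DividedSpaceTimeAttention()
out = block(torch.randn(2, 8, 196, 768))  # (batch, frames, patches, dim)
print(out.shape)  # torch.Size([2, 8, 196, 768])
```

The key point of the design is cost: instead of one giant attention over all frames times all patches at once, each token attends over only T frames in the first step and only P patches in the second, which is much cheaper while still letting information flow across both space and time.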

Here is a bit about the datasets on which TimeSformer achieves state-of-the-art performance: Kinetics-400/600. These are large-scale, high-quality collections of URL links to up to 650,000 video clips covering 400/600/700 human action classes. The videos include human-object interactions, such as playing instruments, as well as human-human interactions, such as shaking hands and hugging.

Now we are going to dive a little deeper into how TimeSformer works. The authors first draw a parallel between video understanding and NLP: both are fundamentally sequential in nature. They also give a great example to showcase this. Just as the meaning of a word can often be understood only by relating it to the other words in the sentence, a small action in a video often needs to be contextualized with the rest of the video to be understood. Thus the authors conclude that long-range self-attention models from NLP should be highly effective for video modelling as well.

Earlier, CNNs were the go-to method for most computer vision models. The authors next explain their reasons for choosing Transformer models over traditional CNNs:

  • While CNNs have strong inductive biases (e.g., local connectivity and translation equivariance), these biases offer little advantage when ample data is available. Compared to CNNs, Transformers impose less restrictive inductive biases, which broadens the family of functions they can represent and makes them better suited to big-data problems.
  • Secondly, convolution kernels can only capture short-range spatiotemporal information, so anything beyond their receptive field is not captured. The self-attention mechanism, in contrast, can capture both local and global long-range dependencies, far beyond the receptive field of traditional convolutional filters.
  • Also, training deep CNNs remains costly, especially for long or high-resolution videos. The authors conclude that, under the same computational budget, Transformers enjoy a larger learning capacity.

Inspired by the above observations, the authors proposed “TimeSformer” (from Time-Space Transformer), adapted from the “Vision Transformer” (ViT) image model, which treats a video as a sequence of patches extracted from individual frames. Because videos have a space-time structure rather than a purely spatial one, ViT could not be applied to video directly. To address this, the “divided attention” architecture was designed, which separately applies temporal attention and spatial attention within each block of the network.
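If you want to play with the model yourself, the authors have open-sourced code and pretrained Kinetics checkpoints in the facebookresearch/TimeSformer repository. The sketch below is based on the usage example in that repository's README; the checkpoint path is a placeholder, and the exact arguments are worth double-checking against the README before running.

```python
import torch
from timesformer.models.vit import TimeSformer  # from facebookresearch/TimeSformer

# Build the divided space-time attention variant and load a downloaded
# Kinetics-400 checkpoint (placeholder path; see the repo's model zoo).
model = TimeSformer(
    img_size=224,
    num_classes=400,
    num_frames=8,
    attention_type='divided_space_time',
    pretrained_model='/path/to/pretrained_model.pyth',
)

# Dummy clip: (batch, channels, frames, height, width).
dummy_video = torch.randn(2, 3, 8, 224, 224)

logits = model(dummy_video)  # shape (2, 400): one score per Kinetics-400 class
print(logits.shape)
```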

So that sums up this article for now. I initially thought of writing a full paper summary, but the blog would have been too long to read and would get boring for many people. Hope you like it.

Resources you might find useful
