Detecting Events and Key Actors in Multi-Person Videos

Authors: Vignesh Ramanathan, Jonathan Huang, Sami Abu-El-Haija, Alexander Gorban, Kevin Murphy, Li Fei-Fei

Machine Perception Group, Google Research
Stanford Vision Lab, Stanford University


Abstract

Multi-person event recognition is a challenging task: many people may be active in the scene, but only a small subset contributes to the actual event. In this paper, we propose a model that learns to detect events in such videos while automatically "attending" to the people responsible for the event. Our model does not use explicit annotations of who those people are or where they are located, either during training or testing. In particular, we track people in videos and use a recurrent neural network (RNN) to represent the features of each track. We learn time-varying attention weights to combine these features at each time instant. The attended features are then processed by another RNN for event detection and classification. Since most existing multi-person video datasets contain only a small number of videos, we also collected a new basketball dataset comprising 257 basketball games with 14K event annotations spanning 11 event classes. Our model outperforms state-of-the-art methods for both event classification and detection on this new dataset. Additionally, we show that the attention mechanism consistently localizes the relevant players.



Figure 3. Our model, where each player track is first processed by the corresponding BLSTM network (shown in different colors). Pi-BLSTM corresponds to the i-th player. The BLSTM hidden states are then used by an attention model to identify the "key" player at each instant. The thickness of the BLSTM boxes indicates the attention weights, and the attended person can change over time. The variables in the model are explained in the methods section. BLSTM stands for "bidirectional long short-term memory".
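The following is a minimal sketch (in PyTorch, not the authors' released code) of the attention mechanism described above: a shared BLSTM encodes each player track, a learned scoring function produces per-player softmax attention weights at every frame, and the attended representation is fed to an event-level RNN. All layer sizes, the scalar scoring function, and the tensor shapes are illustrative assumptions; the exact formulation is given in the paper's methods section.

```python
# Minimal sketch (not the authors' implementation) of attention over
# per-player BLSTM hidden states, assuming player tracks are already
# extracted and featurized. Shapes and layer sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyPlayerAttention(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=64, num_events=11):
        super().__init__()
        # One shared BLSTM applied to every player track (Pi-BLSTM in Fig. 3).
        self.player_blstm = nn.LSTM(feat_dim, hidden_dim,
                                    batch_first=True, bidirectional=True)
        # Scores one scalar per player per frame for the softmax attention.
        self.attn_score = nn.Linear(2 * hidden_dim, 1)
        # Event-level LSTM that consumes the attended player representation.
        self.event_lstm = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_events)

    def forward(self, tracks):
        # tracks: (num_players, num_frames, feat_dim) for one video clip.
        h, _ = self.player_blstm(tracks)          # (P, T, 2*hidden_dim)
        scores = self.attn_score(h).squeeze(-1)   # (P, T)
        weights = F.softmax(scores, dim=0)        # attention over players, per frame
        attended = (weights.unsqueeze(-1) * h).sum(dim=0, keepdim=True)  # (1, T, 2*H)
        out, _ = self.event_lstm(attended)        # (1, T, hidden_dim)
        logits = self.classifier(out[:, -1])      # classify from the final state
        return logits, weights                    # weights localize the "key" player
```

A real implementation would likely condition the attention scores on the event-level state as well; this sketch keeps them a function of the player states only for brevity.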

Figure 2. We densely annotate every instance of 11 different basketball events in long basketball videos. As shown here, we collected both event time-stamps and event labels through an Amazon Mechanical Turk (AMT) task.

Figure 4. We highlight (in cyan) the "attended" player at the beginning of different events. The position of the ball in each frame is shown in yellow. Each column shows a different event. In these videos, the model attends to the person making the shot at the beginning of the event.

Figure 5. We visualize the distribution of attention over different positions of a basketball court as the event progresses, shown for 3 different events. These heatmaps were obtained by first transforming all videos to a canonical view of the court (shown in the background of each heatmap). The top row shows sample frames that contributed to the "free-throw success" heatmaps. It is interesting to note that the model focuses on the location of the shooter at the beginning of an event, and the attention later disperses to other locations.
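As a rough illustration of how such heatmaps can be aggregated, the sketch below bins court positions weighted by attention, assuming per-frame attention weights and player positions already mapped to the canonical court view are available. The function name, the 94x50 ft court dimensions, and the bin counts are assumptions for illustration, not details taken from the paper.

```python
# Sketch of aggregating attention into a court heatmap (as in Figure 5),
# assuming attention weights and canonical-court player positions are given.
import numpy as np
import matplotlib.pyplot as plt

def attention_heatmap(court_xy, attn_weights, court_size=(94, 50), bins=(47, 25)):
    """court_xy: (N, 2) player positions in canonical court coordinates (ft).
    attn_weights: (N,) attention weight of each position."""
    heat, _, _ = np.histogram2d(
        court_xy[:, 0], court_xy[:, 1],
        bins=bins, range=[[0, court_size[0]], [0, court_size[1]]],
        weights=attn_weights)
    return heat / max(heat.sum(), 1e-8)  # normalize to a distribution

# Example with random placeholder data (real inputs would come from the model).
xy = np.random.rand(1000, 2) * [94, 50]
w = np.random.rand(1000)
plt.imshow(attention_heatmap(xy, w).T, origin="lower", extent=[0, 94, 0, 50])
plt.xlabel("court length (ft)"); plt.ylabel("court width (ft)")
plt.show()
```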


Dataset

Download our dataset as a CSV: Latest Version

The CSV file contains the following fields:

You can also view our dataset using our Dataset Browser.
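As a quick sanity check after downloading, the snippet below loads the events CSV with pandas and prints its fields. The local file name is a placeholder, and no column names are assumed.

```python
# Inspect the downloaded events CSV; "basketball_events.csv" is a
# placeholder for whatever local file name you saved it under.
import pandas as pd

events = pd.read_csv("basketball_events.csv")
print(events.columns.tolist())   # list the annotation fields
print(events.head())             # first few event annotations
print(len(events), "events")     # should be on the order of 14K rows
```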

Bounding Boxes and Tracks

We also release the player detection bounding boxes and tracks: Bounding Boxes and Tracks

The CSV file has one row per player bounding box / basketball position, with these 7 columns:
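For illustration only, the sketch below shows one way the bounding-box rows could be grouped into per-player tracks with pandas. The seven column names used here are hypothetical placeholders, not the released schema; replace them with the actual column names of the file.

```python
# Illustrative sketch: group bounding-box rows into per-player tracks.
# The column names below are HYPOTHETICAL placeholders; substitute the
# real header of the released CSV (and drop names=/header= if the file
# already contains a header row).
import pandas as pd

COLS = ["video_id", "frame", "track_id", "x1", "y1", "x2", "y2"]  # hypothetical

boxes = pd.read_csv("bounding_boxes.csv", names=COLS, header=None)
tracks = {
    key: group.sort_values("frame")[["frame", "x1", "y1", "x2", "y2"]].to_numpy()
    for key, group in boxes.groupby(["video_id", "track_id"])
}
print(len(tracks), "tracks")
```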