Machine Perception Group, Google Research
Stanford Vision Lab, Stanford
Detecting Events and Key Actors in Multi-Person Videos
Multi-person event recognition is a challenging task: many people may be active in the scene, but only a small subset contribute to an actual event. In this paper, we propose a model that learns to detect events in such videos while automatically "attending" to the people responsible for the event. Our model does not use explicit annotations of who those people are or where they are, during either training or testing. In particular, we track people in videos and use a recurrent neural network (RNN) to represent the track features. We learn time-varying attention weights to combine these features at each time instant. The attended features are then processed by another RNN for event detection/classification. Since most video datasets with multiple people are restricted to a small number of videos, we also collected a new basketball dataset comprising 257 basketball games with 14K event annotations corresponding to 11 event classes. Our model outperforms state-of-the-art methods for both event classification and detection on this new dataset. Additionally, we show that the attention mechanism consistently localizes the relevant players.
[ArXiv]
Figure 3. Our model, where each player track is first processed by the corresponding BLSTM network (shown in different colors). Pi-BLSTM corresponds to the i-th player. The BLSTM hidden states are then used by an attention model to identify the "key" player at each instant. The thickness of the BLSTM boxes indicates the attention weights, and the attended person can change over time. The variables in the model are explained in the methods section. BLSTM stands for "bidirectional long short-term memory".
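Below is a minimal, hypothetical sketch (in PyTorch) of the pipeline described above: per-player track features pass through a BLSTM, a learned score produces per-player attention weights at every time step, and the attended features feed a second recurrent network that classifies the event. The layer sizes, the weight sharing across players, and other details are illustrative assumptions, not the exact configuration used in the paper.

```python
# Illustrative sketch only; sizes and weight sharing are assumptions, not the paper's exact model.
import torch
import torch.nn as nn


class AttendedEventClassifier(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=64, num_classes=11):
        super().__init__()
        # One BLSTM applied to every player track (weights shared across players).
        self.player_blstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                                    bidirectional=True)
        # Scores each player's BLSTM state to produce per-frame attention weights.
        self.attn_score = nn.Linear(2 * hidden_dim, 1)
        # Event-level recurrence over the attended per-frame features.
        self.event_lstm = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, tracks):
        # tracks: (batch, num_players, time, feat_dim)
        b, p, t, d = tracks.shape
        h, _ = self.player_blstm(tracks.reshape(b * p, t, d))        # (b*p, t, 2H)
        h = h.reshape(b, p, t, -1)
        # Soft attention over players, computed separately at every time step.
        alpha = torch.softmax(self.attn_score(h).squeeze(-1), dim=1)  # (b, p, t)
        attended = (alpha.unsqueeze(-1) * h).sum(dim=1)               # (b, t, 2H)
        out, _ = self.event_lstm(attended)
        return self.classifier(out[:, -1]), alpha  # class logits, attention weights
```

The returned attention weights are what the visualizations in Figures 4 and 5 are based on: for each frame, the player with the largest weight is the "attended" player.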
Figure 2. We densely annotate every instance of 11 different basketball events in long basketball videos. As shown here, we collected both event time-stamps and event labels through an AMT task.
Figure 4. We highlight (in cyan) the "attended" player at the beginning of different events. The position of the ball in each frame is shown in yellow. Each column shows a different event. In these videos, the model attends to the person making the shot at the beginning of the event.
Figure 5. We visualize the distribution of attention over different positions on a basketball court as the event progresses, shown for 3 different events. These heatmaps were obtained by first transforming all videos to a canonical view of the court (shown in the background of each heatmap). The top row shows sample frames which contributed to the "free-throw success" heatmaps. It is interesting to note that the model focuses on the location of the shooter at the beginning of an event, and later the attention disperses to other locations.
Download our dataset as a CSV: Latest Version
The CSV file contains the following fields (a small loading sketch follows the list):
- #YoutubeId: The YouTube video ID. You can watch the video by prefixing the ID with the URL http://youtube.com/watch?v=
- VideoWidth and VideoHeight: The dimensions of the video (in pixels).
- ClipStartTime and ClipEndTime: The timestamps (in microseconds) of the segment that we ran training and inference on. We split the video files into clips at these timestamps and ran our bidirectional LSTM within them for both training and inference. In practice, you may ignore these timestamps and concatenate all the events for the video; however, if you want to match our results, you should use them.
- EventLabel: One of 3-pointer success, 3-pointer failure, free-throw success, free-throw failure, layup success, layup failure, other 2-pointer success, other 2-pointer failure, slam dunk success, slam dunk failure, or steal success.
- EventEndTime: The timestamp (in microseconds) at which the event finished: for example, the moment the ball enters the hoop (for success events) or bounces off the ring (for failure events).
- EventStartTime, EventStartBallX and EventStartBallY: The event start time (in microseconds), and the ball position measured as a fraction of the video width and height, respectively, from the top-left corner. Annotators were instructed to click the "ball" as it is leaving the player's hands. These start annotations were not used during training; they were only used for evaluating the attention (Section 5.4 in the paper). During training, we assumed a fixed 4-second event duration. So far, we have collected EventStart* for only 77% of the events. 16.4% of the events are steal success, for which we do not collect EventStart*, as their EventStartTime should equal EventEndTime; the remaining 6.6% are tasks that Mechanical Turk workers did not finish. We could collect them if there is demand. Regardless, feel free to use EventStart* during training.
- TrainValOrTest: One of "train", "val", or "test", indicating the dataset partition that the video belongs to.
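As referenced above, here is a small, hypothetical loading sketch for the events CSV. It assumes the field names listed above appear in a header row (possibly prefixed with "#", as in "#YoutubeId"), falls back to the fixed 4-second event duration when EventStartTime is empty or missing, and uses a placeholder file name.

```python
# Hypothetical loader sketch; header handling and file name are assumptions.
import csv

EVENT_DURATION_US = 4_000_000  # fixed 4-second event duration assumed during training


def load_events(path="events.csv"):
    events = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # Strip a possible leading '#' from header names, e.g. "#YoutubeId" -> "YoutubeId".
        reader.fieldnames = [name.lstrip("#") for name in reader.fieldnames]
        for row in reader:
            end_us = int(row["EventEndTime"])
            start = row.get("EventStartTime", "")
            # EventStart* exists for only part of the events; fall back to a 4 s window.
            start_us = int(start) if start else end_us - EVENT_DURATION_US
            events.append({
                "url": "http://youtube.com/watch?v=" + row["YoutubeId"],
                "label": row["EventLabel"],
                "split": row["TrainValOrTest"],
                "start_s": start_us / 1e6,
                "end_s": end_us / 1e6,
            })
    return events


# Example: count events per label in the training split.
if __name__ == "__main__":
    from collections import Counter
    train = [e for e in load_events() if e["split"] == "train"]
    print(Counter(e["label"] for e in train))
```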
You can also view our dataset using our Dataset Browser
Bounding Boxes and Tracks
We also release the player detection bounding boxes and tracks: Bounding Boxes and Tracks
The CSV file has one row per player bounding box / basketball position, with these 7 columns (a small parsing sketch follows the list):
- Column 1: YouTube ID
- Column 2: Time corresponding to the video frame, in microseconds
- Column 3: Top-left x-coordinate of the player/basketball bounding box, relative to the frame width
- Column 4: Top-left y-coordinate of the player/basketball bounding box, relative to the frame height
- Column 5: Width of the bounding box, relative to the frame width
- Column 6: Height of the bounding box, relative to the frame height
- Column 7: A player-id such that boxes with the same player-id correspond to the same player. For basketball positions, this id is set to "basketball".
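As referenced above, here is an illustrative parser for the detections CSV. It reads the 7 positional columns listed above (assuming no header row), groups boxes into tracks keyed by (YouTube ID, player ID), and keeps coordinates relative; multiply by the VideoWidth/VideoHeight fields from the events CSV to recover pixels. The file name is a placeholder.

```python
# Illustrative parser sketch; "no header row" and the file name are assumptions.
import csv
from collections import defaultdict


def load_tracks(path="detections.csv"):
    tracks = defaultdict(list)  # (youtube_id, player_id) -> list of boxes
    with open(path, newline="") as f:
        for yt_id, time_us, x, y, w, h, player_id in csv.reader(f):
            tracks[(yt_id, player_id)].append({
                "time_s": int(time_us) / 1e6,
                # (x, y, w, h), all relative to the frame dimensions.
                "box_rel": (float(x), float(y), float(w), float(h)),
            })
    # Sort each track by time so it can be fed to a sequence model directly.
    for boxes in tracks.values():
        boxes.sort(key=lambda b: b["time_s"])
    return tracks
```

Rows whose player ID is "basketball" form the ball track; all other IDs are player tracks.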