Deep Network for Group Discovery and Activity Recognition

Full paper link... Thesis download link... Code@Github...

A hierarchy arises naturally in crowd videos because people tend to interact with each other and form groups. The spatio-temporal interactions among individuals give rise to group activities, and these group activities, together with the scene context, influence the scene activity.

The figure below illustrates the hierarchy in an activity video: there are six standing individuals, and the interactions among them form two talking groups in the scene. Since the dominant activity is talking, the scene activity is also talking.

Illustration of hierarchy present in a video. There are 6 standing individuals forming two talking groups. The overall scene activity is talking.

In this work, we extend existing methods by adding a layer that discovers the groups (or clusters) present in a scene and recognizes their activities. We then use these group activities, together with the scene context, to recognize the scene activity. To discover the groups, we propose a min-max criterion within the framework for learning the pairwise similarity between any two individuals; a clustering algorithm then uses this similarity to form the groups. The group activity is captured by an LSTM module, whereas the individual and scene activities are captured by CNN-LSTM based modules.
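As a rough illustration of how a pairwise similarity matrix can drive group discovery, the sketch below links two individuals whenever their similarity reaches a threshold and treats the connected components of the resulting graph as groups. This is only a stand-in: the clustering algorithm, the threshold value, the function name `cluster_by_similarity`, and the similarity scores are all assumptions made for illustration, not the method or values used in this work.

```python
from typing import List


def cluster_by_similarity(similarity: List[List[float]],
                          threshold: float = 0.5) -> List[int]:
    """Assign a group label to each individual.

    Assumption: two individuals are linked when their pairwise similarity
    is at least `threshold`, and groups are the connected components of
    that graph. This is a simple stand-in for a clustering algorithm.
    """
    n = len(similarity)
    labels = [-1] * n  # -1 marks an individual not yet assigned to a group
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # Depth-first search from an unassigned individual.
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and similarity[i][j] >= threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Hypothetical, hand-written similarity scores for six individuals
# (symmetric; in the real model these would be learned).
similarity = [
    [1.0, 0.9, 0.8, 0.1, 0.2, 0.1],
    [0.9, 1.0, 0.7, 0.2, 0.1, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.1, 0.2],
    [0.1, 0.2, 0.1, 1.0, 0.9, 0.8],
    [0.2, 0.1, 0.1, 0.9, 1.0, 0.7],
    [0.1, 0.1, 0.2, 0.8, 0.7, 1.0],
]

groups = cluster_by_similarity(similarity)
print(groups)  # [0, 0, 0, 1, 1, 1]
```

With these made-up scores, the six individuals split into two groups of three, mirroring the two talking groups in the illustration. Any clustering method that accepts a similarity (or distance) matrix could be substituted for the connected-components step.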

Results of our model on the Collective Activity Dataset [1].