Center for Research in Computer Vision



Recognizing Realistic Actions from Videos "in the Wild"



Introduction

In this paper, we present a systematic framework for recognizing realistic actions from videos “in the wild.” Such unconstrained videos are abundant in personal collections as well as on the web. Recognizing actions from such videos has not been addressed extensively, primarily due to the tremendous variations caused by camera motion, background clutter, and changes in object appearance and scale. The main challenge is extracting reliable and informative features from unconstrained videos. We extract both motion and static features from the videos. Since the raw features of both types are dense yet noisy, we propose strategies to prune them: motion statistics are used to acquire stable motion features and to clean static features, and PageRank is used to mine the most informative static features. To further construct compact yet discriminative visual vocabularies, a divisive information-theoretic algorithm is employed to group semantically related features. Finally, AdaBoost is chosen to integrate all the heterogeneous yet complementary features for recognition. We have tested the framework on the KTH dataset and on our own dataset, which consists of 11 categories of actions collected from YouTube and personal videos, and have obtained impressive results for both action recognition and action localization.
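As a concrete illustration of the final fusion step described above, the following is a minimal discrete AdaBoost with one-dimensional threshold stumps over per-video histogram features. The stump learner, the exhaustive threshold search, and the toy data are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def adaboost_train(X, y, n_rounds=20):
    """Discrete AdaBoost with threshold decision stumps.

    X: (n, d) features (e.g., motion- and static-vocabulary histograms
    concatenated per video); y: labels in {-1, +1}.
    Returns a list of (feature, threshold, polarity, alpha) stumps.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    stumps = []
    for _ in range(n_rounds):
        best = None
        # Exhaustive stump search (fine for toy data; slow in general).
        for f in range(d):
            for t in np.unique(X[:, f]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, f] - t) > 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, f, t, pol, pred)
        err, f, t, pol, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # Reweight: misclassified samples gain weight for the next round.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append((f, t, pol, alpha))
    return stumps

def adaboost_predict(stumps, X):
    score = np.zeros(X.shape[0])
    for f, t, pol, alpha in stumps:
        score += alpha * np.where(pol * (X[:, f] - t) > 0, 1, -1)
    return np.sign(score)

# Toy example: the label is determined by the first feature channel.
X = np.array([[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]])
y = np.array([-1, -1, 1, 1])
clf = adaboost_train(X, y, n_rounds=5)
pred = adaboost_predict(clf, X)
```

In a multi-class setting like the 11-category YouTube dataset, such a binary booster would be applied one-vs-rest per action class.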

Contributions


YouTube Action Dataset

We collected 11 realistic action categories from YouTube with about 1,600 videos in total. For the details of this dataset, please click here.


The flowchart of our system



Motion feature pruning
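The paper prunes dense but noisy motion features using motion statistics. As a minimal sketch of the idea only, the snippet below keeps features whose motion magnitude stands out against the clip's background motion level; the mean + k·std threshold is an assumption for illustration, not the paper's actual pruning rule.

```python
import numpy as np

def prune_by_motion_statistics(features, motion_mag, k=1.0):
    """Illustrative pruning: keep features whose motion magnitude is
    well above the clip's overall motion statistics.

    features: (n, d) motion descriptors; motion_mag: (n,) per-feature
    motion magnitude (e.g., mean optical-flow norm in each feature's
    spatiotemporal support -- a hypothetical measurement).
    """
    thresh = motion_mag.mean() + k * motion_mag.std()
    keep = motion_mag > thresh
    return features[keep], keep

# Toy example: three near-static features and one strongly moving one.
feats = np.ones((4, 10))
mags = np.array([0.1, 0.1, 0.1, 5.0])
kept, keep = prune_by_motion_statistics(feats, mags)
```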


Static feature pruning


Learning semantic visual vocabularies
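The compact vocabularies are built by grouping semantically related visual words with a divisive information-theoretic algorithm. The sketch below clusters words whose class-conditional distributions p(class | word) are close in KL divergence, in the spirit of that family of methods; the dominant-class initialization and the plain k-means-style loop are simplifying assumptions.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (smoothed)."""
    p = p + 1e-12
    q = q + 1e-12
    return float(np.sum(p * np.log(p / q)))

def divisive_cluster(p_c_given_w, priors, k, n_iter=20):
    """Group visual words with similar class-conditional distributions.

    p_c_given_w: (n_words, n_classes) rows of p(class | word)
    priors: (n_words,) word occurrence probabilities p(word)
    Returns one cluster label per word.
    """
    n, c = p_c_given_w.shape
    # Seed clusters by each word's dominant class (an assumed init).
    labels = p_c_given_w.argmax(axis=1) % k
    for _ in range(n_iter):
        # Centroid = prior-weighted mean of member distributions;
        # an emptied cluster falls back to the uniform distribution.
        centroids = np.full((k, c), 1.0 / c)
        for j in range(k):
            mask = labels == j
            if mask.any():
                w = priors[mask] / priors[mask].sum()
                centroids[j] = w @ p_c_given_w[mask]
        # Reassign each word to the centroid with smallest KL divergence.
        new = np.array([min(range(k), key=lambda j: kl(p, centroids[j]))
                        for p in p_c_given_w])
        if (new == labels).all():
            break
        labels = new
    return labels

# Toy example: words 0-2 fire mostly for class 0, words 3-5 for class 1.
P = np.array([[0.9, 0.1], [0.85, 0.15], [0.8, 0.2],
              [0.1, 0.9], [0.15, 0.85], [0.2, 0.8]])
labels = divisive_cluster(P, np.full(6, 1.0 / 6.0), k=2)
```

Each resulting cluster of words becomes one entry of the compact semantic vocabulary.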


Experiments on KTH dataset

The KTH dataset is a widely used action dataset containing 6 actions in almost 600 videos performed by 25 people.


We verify the effect of combining motion and static features on the KTH dataset.

Experiments on YouTube action dataset

1. The effect of motion feature pruning


2. The effect of static feature pruning


3. The effect of combining motion and static features


4. Some recognition results with localization


"M","S" and "H" in the images means the following judgements are made on the "motion", "static", "hybrid of motion and static" features, respectively.

Related Publication

Jingen Liu, Jiebo Luo, and Mubarak Shah, Recognizing Realistic Actions from Videos "in the Wild", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

Back to Human Action and Activity Recognition Projects