Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision

Peng Wu

Jing Liu

Yujia Shi

Yujia Sun

Fangtao Shao

Zhaoyang Wu

Zhiwei Yang

Xidian University

European Conference on Computer Vision (ECCV) 2020, Poster Presentation

Sample videos from the XD-Violence dataset.

Abstract

Violence detection has been studied in computer vision for years. However, previous work is either superficial, e.g., classification of short clips in a single scenario, or undersupplied, e.g., relying on a single modality or on multimodality built from hand-crafted features. To address this problem, we first release a large-scale, multi-scene dataset named XD-Violence with a total duration of 217 hours, containing 4754 untrimmed videos with audio signals and weak labels. We then propose a neural network with three parallel branches that capture different relations among video snippets and integrate features: the holistic branch captures long-range dependencies using a similarity prior, the localized branch captures local positional relations using a proximity prior, and the score branch dynamically captures the closeness of predicted scores. Our method also includes an approximator to meet the needs of online detection. It outperforms other state-of-the-art methods on our released dataset and on another existing benchmark. Extensive experimental results also show the positive effect of multimodal input and of modeling relationships.

Representation Video

Presentation

Violence examples

Abuse, Car Accident, Explosion, Fighting, Riot, and Shooting

Normal Activities

Dataset Statistics

Dataset Statistics. (a) Distribution of the number of videos belonging to each category according to multi-label. (b) Distribution of the number of videos belonging to each category according to the first label.

Dataset Statistics. (a) Distribution of videos according to length (minutes). (b) Distribution of violent videos according to percentage of violence (in each video) in test set.

Comparisons of different violence datasets. ∗ means quite a few videos are silent or only contain background music.

Traits

1) large scale, which is beneficial for training generalizable violence detection methods;
2) diversity of scenarios, so that violence detection methods can cope with complicated and diverse environments and are more robust;
3) audio signals, which allow algorithms to leverage multimodal information and make more confident predictions;
4) multiple labels: we assign between one and three violent labels (1 ≤ #labels ≤ 3) to each violent video owing to the co-occurrence of violent events. The labels of each video are ordered by the importance of the violent events occurring in it.
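The multi-label trait can be illustrated with a small data structure. This is only a sketch: the class name `VideoLabels` and its fields are hypothetical and are not part of the released annotation format, which is described in the ReadMe rather than on this page. It only encodes what is stated above: one to three labels per violent video, ordered by importance, drawn from the six categories.

```python
from dataclasses import dataclass
from typing import Tuple

# The six violence categories listed on this page.
VIOLENCE_CLASSES = (
    "Abuse", "Car Accident", "Explosion", "Fighting", "Riot", "Shooting",
)


@dataclass(frozen=True)
class VideoLabels:
    """Hypothetical container for the video-level labels of one violent video."""

    labels: Tuple[str, ...]  # ordered by importance, most important first

    def __post_init__(self) -> None:
        # 1 <= #labels <= 3, as stated in the dataset traits.
        if not 1 <= len(self.labels) <= 3:
            raise ValueError("a violent video carries between 1 and 3 labels")
        if len(set(self.labels)) != len(self.labels):
            raise ValueError("labels must not repeat")
        unknown = [name for name in self.labels if name not in VIOLENCE_CLASSES]
        if unknown:
            raise ValueError(f"unknown categories: {unknown}")

    @property
    def first_label(self) -> str:
        """The most important label, i.e. the one used for first-label statistics."""
        return self.labels[0]
```

For example, `VideoLabels(("Riot", "Fighting")).first_label` gives `"Riot"`, while an empty tuple or a tuple of four labels is rejected.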

Methodology

The pipeline of our proposed method.
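To make the two priors named in the abstract more concrete, the sketch below builds one relation matrix from feature similarity (holistic branch) and one from temporal proximity (localized branch), then uses a matrix to aggregate snippet features. It is an illustration under assumptions, not the released implementation: the cosine similarity, the threshold, the exponential kernel, the row normalization, and all function names and default values are choices made here for illustration. The score branch and the online approximator are omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def row_normalize(matrix: Matrix) -> Matrix:
    """Scale each row to sum to 1; all-zero rows are left unchanged."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([x / total for x in row] if total > 0 else row[:])
    return out


def similarity_adjacency(features: Matrix, threshold: float = 0.5) -> Matrix:
    """Similarity prior: link snippets whose features are alike, at any distance.

    The threshold value is an assumption made for this sketch.
    """
    n = len(features)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sim = cosine(features[i], features[j])
            adj[i][j] = sim if sim > threshold else 0.0
    return row_normalize(adj)


def proximity_adjacency(n: int, sigma: float = 1.0) -> Matrix:
    """Proximity prior: weight decays with temporal distance |i - j|.

    The exponential kernel and sigma are assumptions made for this sketch.
    """
    adj = [[math.exp(-abs(i - j) / sigma) for j in range(n)] for i in range(n)]
    return row_normalize(adj)


def aggregate(adj: Matrix, features: Matrix) -> Matrix:
    """Replace each snippet feature by the adjacency-weighted mix of all snippets."""
    dim = len(features[0])
    return [
        [sum(adj[i][k] * features[k][d] for k in range(len(features)))
         for d in range(dim)]
        for i in range(len(adj))
    ]
```

With three snippets whose features are `[1, 0]`, `[1, 0]` and `[0, 1]`, the similarity matrix links the first two snippets and isolates the third, whereas the proximity matrix depends only on the snippet positions.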

Code

Github

Download

V1.0 Videos

Baidu Netdisk
Training Videos [keyword:1ltx] (disabled)
Test Videos [keyword:exye]
Test Annotations
ReadMe

AliyunDrive
Training Videos

OneDrive
Training Videos_0001-1004
Training Videos_1005-2004
Training Videos_2005-2804
Training Videos_2805-3319
Training Videos_3320-3954
Test Videos
Test Annotations

V1.0 Features

Baidu Netdisk
audio features (VGGish) [keyword:i9h2]
visual features (I3D RGB&Flow) [keyword:ou1n]
ReadMe

OneDrive
audio features (VGGish)
visual features (I3D RGB&Flow)

V1.0 PRC Data

precision and recall values of PRC
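Precision and recall values like these can be summarized as the area under the precision-recall curve. The sketch below assumes the values have already been loaded into two equal-length sequences; the file format of the released PRC data is not described on this page, so loading is left out, and the function name `area_under_pr` is hypothetical. Trapezoidal integration is only one convention, and step-wise average-precision definitions can give slightly different numbers.

```python
from typing import Sequence


def area_under_pr(recall: Sequence[float], precision: Sequence[float]) -> float:
    """Trapezoidal area under a precision-recall curve.

    Points are sorted by recall first, so the input order does not matter.
    """
    if len(recall) != len(precision):
        raise ValueError("recall and precision must have the same length")
    if len(recall) < 2:
        raise ValueError("at least two points are needed")
    points = sorted(zip(recall, precision))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        # Width of the recall interval times the mean precision over it.
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area
```

For a toy curve with recall `[0.0, 0.5, 1.0]` and precision `[1.0, 1.0, 0.5]`, the two segments contribute 0.5 and 0.375.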

Paper

Peng Wu et al.
Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision
In ECCV, 2020.
(Paper)

(Supplementary materials)

[arXiv]

Cite

@inproceedings{Wu2020not,
title={Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision},
author={Wu, Peng and Liu, Jing and Shi, Yujia and Sun, Yujia and Shao, Fangtao and Wu, Zhaoyang and Yang, Zhiwei},
booktitle={European Conference on Computer Vision (ECCV)},
year={2020}
}

Acknowledgements

We sincerely thank Wang Chao, Ying Chaolong, Yuan Shihao and Yuan Kaixin for their excellent annotation work. This work was supported in part by the Key Project of Science and Technology Innovation 2030 supported by the Ministry of Science and Technology of China under Grant 2018AAA0101302 and in part by the General Program of National Natural Science Foundation of China (NSFC) under Grant 61773300. The template of this webpage is borrowed from FineGym.

Contact

For further questions and suggestions, please contact Peng Wu (xdwupeng@gmail.com).