Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision
Peng Wu
Jing Liu
Yujia Shi
Yujia Sun
Fangtao Shao
Zhaoyang Wu
Zhiwei Yang
Xidian University
European Conference on Computer Vision (ECCV) 2020, Poster Presentation

Sample videos from the XD-Violence dataset.


Violence detection has been studied in computer vision for years. However, previous work is either superficial, e.g., classification of short clips and single scenarios, or undersupplied, e.g., a single modality or multimodality based on hand-crafted features. To address this problem, we first release a large-scale, multi-scene dataset named XD-Violence with a total duration of 217 hours, containing 4754 untrimmed videos with audio signals and weak labels. We then propose a neural network containing three parallel branches to capture different relations among video snippets and integrate features: the holistic branch captures long-range dependencies using a similarity prior, the localized branch captures local positional relations using a proximity prior, and the score branch dynamically captures the closeness of predicted scores. In addition, our method includes an approximator to meet the needs of online detection. Our method outperforms other state-of-the-art methods on our released dataset and another existing benchmark. Moreover, extensive experimental results show the positive effect of multimodal input and of modeling relationships.
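The similarity and proximity priors mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the released implementation: the function names, the cosine-similarity and exponential-decay forms, the `threshold` and `sigma` parameters, and the random features are all assumptions chosen for illustration, and the score branch and approximator are omitted.

```python
import numpy as np

def similarity_adjacency(x, threshold=0.0):
    """Row-normalized adjacency built from pairwise cosine similarity (assumed form)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.maximum(norms, 1e-8)            # guard against zero-length features
    sim = unit @ unit.T                            # (T, T) cosine similarities
    sim = np.where(sim > threshold, sim, 0.0)      # zero out weak relations
    e = np.exp(sim - sim.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)        # softmax over each row

def proximity_adjacency(num_snippets, sigma=1.0):
    """Adjacency that decays with temporal distance |i - j| (assumed form)."""
    idx = np.arange(num_snippets)
    dist = np.abs(idx[:, None] - idx[None, :])
    return np.exp(-dist / sigma)

def graph_layer(adj, x, weight):
    """One graph-convolution-style step: aggregate neighbours, project, apply ReLU."""
    return np.maximum(adj @ x @ weight, 0.0)

# Hypothetical input: T snippets, each with a D-dimensional feature vector.
rng = np.random.default_rng(0)
T, D, H = 6, 8, 4
snippets = rng.normal(size=(T, D))
w_holistic = rng.normal(size=(D, H))
w_localized = rng.normal(size=(D, H))

holistic = graph_layer(similarity_adjacency(snippets), snippets, w_holistic)
localized = graph_layer(proximity_adjacency(T), snippets, w_localized)
fused = np.concatenate([holistic, localized], axis=1)  # (T, 2 * H) integrated features
```

The two branches differ only in how the adjacency matrix is built: one depends on feature content, the other only on snippet position, which is why they can run in parallel on the same input.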

Representation Video


Violence examples

Abuse, Car Accident, Explosion, Fighting, Riot, and Shooting
Car Accident
Normal Activities

Dataset Statistics

Dataset Statistics. (a) Distribution of the number of videos belonging to each category according to multi-label. (b) Distribution of the number of videos belonging to each category according to the first label.

Dataset Statistics. (a) Distribution of videos according to length (minutes). (b) Distribution of violent videos according to percentage of violence (in each video) in test set.

Comparisons of different violence datasets. ∗ means quite a few videos are silent or only contain background music.


1) large scale, which is beneficial for training generalizable methods for violence detection;
2) diversity of scenarios, so that violence detection methods actively respond to complicated and diverse environments and are more robust;
3) containing audio signals, allowing algorithms to leverage multimodal information and predict with more confidence;
4) multiple labels: we assign multiple violent labels (1 ≤ #labels ≤ 3) to each violent video owing to the co-occurrence of violent events. The labels of each video are ordered by the importance of the violent events occurring in the video.
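As a rough illustration of the ordered multi-label scheme, the sketch below counts categories in two ways: once over all labels of a video and once over the first (most important) label only. The video names and label lists are made up for this example and are not taken from the dataset annotations, and `check_labels` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical annotations: each violent video carries 1 to 3 labels,
# ordered by the importance of the violent events in the video.
annotations = {
    "video_a": ["Fighting", "Shooting"],
    "video_b": ["Riot", "Fighting", "Explosion"],
    "video_c": ["Car Accident"],
}

def check_labels(labels):
    """Enforce the 1 <= #labels <= 3 constraint and reject repeated labels."""
    if not 1 <= len(labels) <= 3:
        raise ValueError("each violent video must have between 1 and 3 labels")
    if len(set(labels)) != len(labels):
        raise ValueError("labels within one video must be distinct")

for labels in annotations.values():
    check_labels(labels)

# Every label of every video contributes one count.
multi_label_counts = Counter(
    label for labels in annotations.values() for label in labels
)
# Only the first (most important) label of each video contributes.
first_label_counts = Counter(labels[0] for labels in annotations.values())
```

Under the multi-label count a video can contribute to several categories, so the totals exceed the number of videos; under the first-label count each video contributes exactly once.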


The pipeline of our proposed method.




V1.0 Videos

Baidu Netdisk
Training Videos [keyword:1ltx]
Test Videos [keyword:exye]
Test Annotations

Training Videos_0001-1004
Training Videos_1005-2004
Training Videos_2005-2804
Training Videos_2805-3319
Training Videos_3320-3954
Test Videos
Test Annotations

V1.0 Features

Baidu Netdisk
audio features (VGGish) [keyword:i9h2]
visual features (I3D RGB&Flow) [keyword:ou1n]

audio features (VGGish)
visual features (I3D RGB&Flow)

V1.0 PRC Data

precision and recall values of PRC
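Given paired precision and recall values such as those in the PRC data, the area under the precision-recall curve can be approximated with the trapezoidal rule. The sketch below is an assumption-laden example: the format of the released file is not described here, so the values are hypothetical placeholders, and `area_under_pr_curve` is an illustrative helper rather than the evaluation code used in the paper.

```python
import numpy as np

def area_under_pr_curve(precision, recall):
    """Trapezoidal area under a precision-recall curve, integrated over recall."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    order = np.argsort(recall, kind="stable")      # integrate left to right
    p, r = precision[order], recall[order]
    widths = r[1:] - r[:-1]                        # recall step between points
    heights = (p[1:] + p[:-1]) / 2.0               # mean precision on each step
    return float(np.sum(widths * heights))

# Hypothetical values for illustration only.
recall = [0.0, 0.5, 1.0]
precision = [1.0, 0.8, 0.6]
area = area_under_pr_curve(precision, recall)
```

Note that the trapezoidal rule interpolates linearly between points, so its result can differ slightly from step-wise average precision computed by some evaluation libraries.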


Peng Wu et al.
Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision
In ECCV, 2020.
(Supplementary materials)



title={Not only Look, but also Listen: Learning Multimodal Violence Detection under 
Weak Supervision},
author={Wu, Peng and Liu, Jing and Shi, Yujia and Sun, Yujia and Shao, Fangtao 
and Wu, Zhaoyang and Yang, Zhiwei},
booktitle={European Conference on Computer Vision (ECCV)},


We sincerely thank Wang Chao, Ying Chaolong, Yuan Shihao and Yuan Kaixin for their excellent annotation work. This work was supported in part by the Key Project of Science and Technology Innovation 2030 supported by the Ministry of Science and Technology of China under Grant 2018AAA0101302 and in part by the General Program of National Natural Science Foundation of China (NSFC) under Grant 61773300. The template of this webpage is borrowed from FineGym.


For further questions and suggestions, please contact Peng Wu (