Commonly Used Datasets in Video Understanding

Original article on Zhihu: https://zhuanlan.zhihu.com/p/573405333



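1. Introduction

Video understanding is an important task in computer vision and has advanced rapidly in recent years, and high-quality datasets are essential to that research. This article summarizes datasets commonly used in video understanding, covering action recognition, action segmentation, temporal localization, audio-visual understanding, and other tasks, and attaches the corresponding links so that readers can go directly to each dataset's website for details. Some datasets can be used for more than one task, so the task categories overlap to some extent.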

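The datasets covered in this article and the tasks they are matched to are as follows:

| Main task | Common datasets |
| --- | --- |
| Action recognition / classification | HMDB51, UCF101, ActivityNet1.3, Kinetics400, Kinetics-Sounds, VGGSound, EPIC-KITCHENS-100, THUMOS'14, etc. |
| Temporal localization | ActivityNet1.3, THUMOS'14, Charades, AVE, LLP, EPIC-KITCHENS-100, etc. |
| Audio-visual understanding | AVE, LLP, AVSBench, MUSIC-AVQA, Kinetics-Sounds, EPIC-KITCHENS-100, VGGSound, etc. |
| Action segmentation | GTEA, Breakfast, 50Salads, etc. |
| Egocentric (first-person) video | EPIC-KITCHENS-100, EGTEA Gaze++, Ego4D, etc. |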

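Basic information about each dataset is as follows:

| No. | Dataset | Primary task | Number of classes | Total size | Average duration (s) | Total duration (h) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | HMDB51 [1] | Action recognition | 51 | 6,714 | 3-10 | N/A |
| 2 | UCF101 [2] | Action recognition | 101 | 13,320 | 7.21 | 26.67 |
| 3 | ActivityNet1.3 [3] | Action recognition, etc. | 200 | 20,000 | 180 | 700 |
| 4 | Charades [4] | Action recognition | 157 | 9,848 | N/A | N/A |
| 5 | Kinetics400 [5] | Action recognition | 400 | 236,532 | 10 | 657 |
| 6 | Kinetics-Sounds [6] | Action recognition | 31 | 18,716 | 10 | 51 |
| 7 | EPIC-KITCHENS-100 [7] | Action recognition | 97 verbs, 300 nouns | 89,977 | 3.1 | 100 |
| 8 | THUMOS'14 [8] | Temporal localization | 20 | 413 | 68.86 | 7.56 |
| 9 | AVE [9] | Audio-visual localization | 28 | 4,143 | 10 | 11 |
| 10 | LLP [10] | Audio-visual localization | 25 | 11,849 | 10 | 33 |
| 11 | AVSBench [11] | Audio-visual segmentation | 23 | 4,932 | 5 | 6.85 |
| 12 | VGGSound [12] | Action recognition | 309 | 185,229 | 10 | 514 |
| 13 | MUSIC-AVQA [13] | Audio-visual question answering | 22 | 9,288 | 60 | 150 |
| 14 | Breakfast [14] | Action segmentation | 1712 | 1989 | 139.37 | 77 |
| 15 | 50Salads [15] | Action segmentation | 17 | 50 | 384 | 5.33 |
| 16 | GTEA [16] | Action segmentation | 7 | 28 | 74.34 | 0.58 |
| 17 | EGTEA Gaze++ [17] | Temporal localization, etc. | 106 | 86 | 1214 | 29 |
| 18 | Ego4D [18] | Temporal localization, etc. | N/A | N/A | N/A | 3670 |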

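Note: the datasets listed here are those commonly used in video understanding, and they are relatively easy for researchers at universities and similar research institutions to get started with.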
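The last three columns of the table are related by simple arithmetic. The sketch below assumes that total duration is just the number of clips multiplied by the average clip duration; the article does not state this, so treat it as a sanity check rather than the author's method. The helper name `total_hours` is hypothetical, and the figures are copied from a few rows of the table.

```python
# Sanity check: estimate total duration (hours) from clip count and
# average clip duration (seconds). Assumes total = count * average,
# which is an assumption and not something the article states.

# (number of clips, average duration in seconds), copied from the table
DATASETS = {
    "UCF101": (13_320, 7.21),
    "Kinetics400": (236_532, 10),
    "AVSBench": (4_932, 5),
    "50Salads": (50, 384),
    "GTEA": (28, 74.34),
    "EGTEA Gaze++": (86, 1214),
}


def total_hours(num_clips: int, avg_seconds: float) -> float:
    """Return the estimated total duration in hours."""
    return num_clips * avg_seconds / 3600.0


if __name__ == "__main__":
    for name, (num_clips, avg_seconds) in DATASETS.items():
        print(f"{name}: {total_hours(num_clips, avg_seconds):.2f} h")
```

For the rows chosen here, the estimates land within about 0.05 hours of the totals listed in the table, which suggests those entries are internally consistent.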


2. Dataset Overview
2.1. HMDB51
2.2. UCF101
2.3. ActivityNet1.3
2.4. Charades
2.5. Kinetics400
2.6. Kinetics-Sounds
2.7. EPIC-KITCHENS-100
2.8. THUMOS'14
2.9. AVE
2.10. LLP
2.11. AVSBench
2.12. VGGSound
2.13. MUSIC-AVQA
2.14. Breakfast
2.15. 50Salads
2.16. GTEA
2.17. EGTEA Gaze++
2.18. Ego4D

3. Summary

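As compute and other hardware improve, data-driven (ultra-)large-scale datasets keep emerging, and models trained on them can fairly easily break through the performance bottlenecks seen on earlier small and medium-sized datasets, so they hold great promise. However, the author works at a university, where compute and other hardware cannot compare with what companies have, so exploration based on classic datasets remains very meaningful for researchers at universities and similar institutions. Although video understanding has a great many datasets and new ones are proposed all the time, a number of benchmark datasets are still widely accepted. This article is a summary organized around the author's own research directions (action recognition/classification/segmentation, temporal localization, audio-visual understanding, etc.), and it also lists the authors and teams behind these datasets, who have often worked in the field for many years and are worth following. Because of time constraints there may be omissions or slips; please point them out. The article will continue to be updated.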


References

[1] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: A large video database for human motion recognition,” in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 2556–2563.

[2] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.

[3] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, “ActivityNet: A large-scale video benchmark for human activity understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 961–970.

[4] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta, “Hollywood in homes: Crowdsourcing data collection for activity understanding,” in European Conference on Computer Vision. Springer, 2016, pp. 510–526.

[5] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., “The Kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.

[6] R. Arandjelovic and A. Zisserman, “Look, listen and learn,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 609–617.

[7] D. Damen, H. Doughty, G. M. Farinella, A. Furnari, J. Ma, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray, “Rescaling egocentric vision: Collection, pipeline and challenges for EPIC-KITCHENS-100,” International Journal of Computer Vision, vol. 130, pp. 33–55, 2022.

[8] H. Idrees, A. R. Zamir, Y. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah, “The THUMOS challenge on action recognition for videos ‘in the wild’,” Computer Vision and Image Understanding, vol. 155, pp. 1–23, 2017.

[9] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, “Audio-visual event localization in unconstrained videos,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 247–263.

[10] Y. Tian, D. Li, and C. Xu, “Unified multisensory perception: Weakly-supervised audio-visual video parsing,” in European Conference on Computer Vision. Springer, 2020, pp. 436–454.

[11] J. Zhou, J. Wang, J. Zhang, W. Sun, J. Zhang, S. Birchfield, D. Guo, L. Kong, M. Wang, and Y. Zhong, “Audio-visual segmentation,” in European Conference on Computer Vision, 2022.

[12] H. Chen, W. Xie, A. Vedaldi, and A. Zisserman, “VGGSound: A large-scale audio-visual dataset,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2020, pp. 721–725.

[13] G. Li, Y. Wei, Y. Tian, C. Xu, J.-R. Wen, and D. Hu, “Learning to answer questions in dynamic audio-visual scenarios,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19108–19118.

[14] H. Kuehne, A. Arslan, and T. Serre, “The language of actions: Recovering the syntax and semantics of goal-directed human activities,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 780–787.

[15] S. Stein and S. J. McKenna, “Combining embedded accelerometers with computer vision for recognizing food preparation activities,” in Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 2013, pp. 729–738.

[16] A. Fathi, X. Ren, and J. M. Rehg, “Learning to recognize objects in egocentric activities,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE, 2011, pp. 3281–3288.

[17] Y. Li, M. Liu, and J. M. Rehg, “In the eye of beholder: Joint learning of gaze and actions in first person video,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 619–635.

[18] K. Grauman, A. Westbury, E. Byrne, Z. Chavis, A. Furnari, R. Girdhar, J. Hamburger, H. Jiang, M. Liu, X. Liu et al., “Ego4D: Around the world in 3,000 hours of egocentric video,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18995–19012.