Multi-scale Deep Learning for Gesture Detection and Localization

Authors: Natalia Neverova (16) (17)
Christian Wolf (16) (17)
Graham W. Taylor (18)
Florian Nebout (19)

16. Université de Lyon ; CNRS ; Lyon ; France
17. INSA-Lyon ; LIRIS ; UMR5205 ; 69621 ; Villeurbanne cedex ; France
18. University of Guelph ; Guelph ; Canada
19. Awabot ; Lyon ; France
Keywords: Gesture recognition ; Multi-modal systems ; Deep learning
Publication title: Lecture Notes in Computer Science
Publication year: 2015
Volume: 8925
Issue: 1
Pages: 474-490
Full-text size: 320 KB
Book title: Computer Vision - ECCV 2014 Workshops
ISBN: 978-3-319-16177-8
Publication category: Computer Science
Publication subjects: Artificial Intelligence and Robotics
Computer Communication Networks
Software Engineering
Data Encryption
Database Management
Computation by Abstract Devices
Algorithm Analysis and Problem Complexity
Publisher: Springer Berlin / Heidelberg
ISSN: 1611-3349

Abstract

We present a method for gesture detection and localization based on multi-scale and multi-modal deep learning. Each visual modality captures spatial information at a particular spatial scale (such as motion of the upper body or a hand), and the whole system operates at two temporal scales. Key to our technique is a training strategy that exploits i) careful initialization of individual modalities; and ii) gradual fusion of modalities from strongest to weakest cross-modality structure. We present experiments on the ChaLearn 2014 Looking at People Challenge gesture recognition track, in which we placed first out of 17 teams.
