Multi-scale Deep Learning for Gesture Detection and Localization
  • Authors: Natalia Neverova (16) (17)
    Christian Wolf (16) (17)
    Graham W. Taylor (18)
    Florian Nebout (19)

    16. Université de Lyon, CNRS, Lyon, France
    17. INSA-Lyon, LIRIS, UMR5205, 69621 Villeurbanne cedex, France
    18. University of Guelph, Guelph, Canada
    19. Awabot, Lyon, France
  • Keywords: Gesture recognition; Multi-modal systems; Deep learning
  • Publication title: Lecture Notes in Computer Science
  • Publication year: 2015
  • Volume: 8925
  • Issue: 1
  • Pages: 474-490
  • Full-text size: 320 KB
  • Book title: Computer Vision - ECCV 2014 Workshops
  • ISBN: 978-3-319-16177-8
  • Subject category: Computer Science
  • Subject topics: Artificial Intelligence and Robotics
    Computer Communication Networks
    Software Engineering
    Data Encryption
    Database Management
    Computation by Abstract Devices
    Algorithm Analysis and Problem Complexity
  • Publisher: Springer Berlin / Heidelberg
  • ISSN: 1611-3349
Abstract
We present a method for gesture detection and localization based on multi-scale and multi-modal deep learning. Each visual modality captures spatial information at a particular spatial scale (such as motion of the upper body or a hand), and the whole system operates at two temporal scales. Key to our technique is a training strategy which exploits i) careful initialization of individual modalities; and ii) gradual fusion of modalities from strongest to weakest cross-modality structure. We present experiments on the ChaLearn 2014 Looking at People Challenge gesture recognition track, in which we placed first out of 17 teams.
