Methods for Human Action Analysis and Recognition in Image Sequences
Abstract
Human action analysis and recognition is a research focus in computer vision and pattern recognition, with broad application prospects in intelligent surveillance, virtual reality and motion analysis. This thesis studies human action analysis and recognition in image sequences; along the three threads of action feature extraction, feature representation, and action recognition and modeling, it investigates hand tracking and dynamic gesture recognition, single-person action recognition, and two-person interaction recognition.
     This thesis proposes a 3D hand tracking algorithm in a hierarchical latent variable space. Unlike other tracking methods based on manifold learning, it partitions the hand state space into subspaces for the individual hand parts and uses a Hierarchical Gaussian Process Latent Variable Model to learn a tree-structured low-dimensional manifold that better reflects the intrinsic nature of hand motion. Particle filters track the hand and each hand part in this low-dimensional space, which reduces the number of particles the filter needs to track the hand effectively; radial basis function interpolation constructs the nonlinear mapping from the low-dimensional manifold to the image space, so low-dimensional particles can be projected directly into the image space for measurement. Experiments show that the method tracks the articulated hand robustly with a smaller tracking error. The thesis further proposes a Hierarchical Conditional Random Field (Hierarchical CRF) to model dynamic gestures; the model predicts an action label for every frame and can therefore recognize continuous dynamic gestures. Experimental results demonstrate its effectiveness.
     Most existing single-person action recognition methods are based on motion features of the whole body. This thesis proposes a single-person action recognition method in a hierarchical latent variable space: following the physiological structure of the human body, it builds a hierarchical latent variable space of human motion and extracts the motion patterns of each body part by clustering in that space. A cascade Conditional Random Field (cascade CRF) models the probabilistic mapping from input data to motion patterns, and a discriminative classifier estimates the final action label. Recognition results on motion capture data demonstrate the effectiveness of the method, and results on synthetic images verify its robustness.
     This thesis also studies the recognition and modeling of two-person interactions and proposes a recognition method based on spatio-temporal words. The method extracts dense spatio-temporal interest points from action videos, assigns them to the two bodies using connectivity analysis of the body silhouettes and the history of the interest points, and clusters the interest-point samples into a spatio-temporal codebook. For a given set of interest points, voting produces the spatio-temporal words that represent each person's atomic actions. Conditional random fields model the single-person atomic actions; to model the semantics of the two-person interaction, a first-order-logic knowledge base encoding domain knowledge is built by hand, and a Markov Logic Network is trained to infer the interaction. Experimental results on a two-person interaction dataset demonstrate the effectiveness of the method.
Human action analysis and recognition is a hot topic in computer vision and pattern recognition, with promising applications in intelligent surveillance, virtual reality and motion analysis. The key problems in this task are feature extraction, feature representation and action recognition. In this thesis, we focus on human action analysis and recognition from image sequences and investigate hand tracking, gesture recognition, and human action and interaction recognition.
     This thesis proposes an algorithm for 3D hand tracking in a learned hierarchical latent variable space. It employs a Hierarchical Gaussian Process Latent Variable Model (HGPLVM) to learn the hierarchical latent space of hand motion and, simultaneously, the nonlinear mapping from this latent space to the pose space. Nonlinear mappings from the hierarchical latent space to the space of hand images are constructed by radial basis function interpolation; with these mappings, particles can be projected into hand images and measured in the image space directly. Particle filters with fewer particles are then sufficient to track the hand in the learned low-dimensional space. A Hierarchical Conditional Random Field (Hierarchical CRF), which captures extrinsic class dynamics and simultaneously learns the relationship between the motions of hand parts and different gestures, is presented to model continuous hand gestures. Experimental results show that the proposed method tracks the articulated hand robustly, and promising recognition performance is achieved on a user-defined hand gesture dataset.
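     To make the tracking loop concrete, the following is a minimal Python sketch of one particle-filter update in the learned latent space. It is an illustration under stated assumptions, not the thesis implementation: the names track_step and rbf_map are hypothetical, rbf_map stands in for a pre-trained regressor from latent points to image features (in the thesis, the RBF interpolant learned with the HGPLVM), and a simple random-walk dynamic model with a Gaussian likelihood is assumed.

    import numpy as np

    def track_step(particles, weights, rbf_map, observed, noise=0.05):
        # One particle-filter update in the learned latent space.
        # particles: (N, d) latent points; observed: image feature vector.
        # Diffuse particles with a random-walk dynamic model.
        particles = particles + noise * np.random.randn(*particles.shape)
        # Project each particle into the image space via the RBF mapping
        # and score it against the observed features (Gaussian likelihood).
        predicted = np.array([rbf_map(z) for z in particles])
        err = np.linalg.norm(predicted - observed, axis=1)
        weights = np.exp(-0.5 * (err / (err.std() + 1e-8)) ** 2)
        weights /= weights.sum()
        # Systematic resampling keeps the particle set from degenerating.
        u = (np.arange(len(weights)) + np.random.rand()) / len(weights)
        particles = particles[np.searchsorted(np.cumsum(weights), u)]
        weights = np.full(len(weights), 1.0 / len(weights))
        # The pose estimate is the mean latent particle; the full system
        # maps it back to the pose space through the HGPLVM.
        return particles, weights, particles.mean(axis=0)

Because the diffusion and resampling happen in the low-dimensional latent space, far fewer particles are needed than in the original high-dimensional pose space; only the measurement step touches the image space.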
     Most research on human action recognition is based on features of whole-body motion. This thesis presents a hierarchical discriminative approach to recognizing human actions from limb motion. The approach consists of feature extraction with mutual motion pattern analysis and discriminative action modeling in a hierarchical manifold space. An HGPLVM is employed to learn the hierarchical manifold space in which motion patterns are extracted. A cascade CRF estimates the motion patterns in the corresponding manifold subspaces, and a trained SVM classifier predicts the action label for the current observation. Results on motion capture data demonstrate the significance of analyzing the motion of individual body parts, and results on synthetic image sequences demonstrate the robustness of the proposed algorithm.
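     A sketch of this two-stage pipeline follows; it is illustrative only. The helper names fit_motion_patterns, frame_descriptor and train_action_classifier are hypothetical, and plain k-means stands in for the pattern-extraction step that the thesis smooths with a cascade CRF before the SVM.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def fit_motion_patterns(part_latents, n_patterns=8):
        # Cluster each body part's latent trajectory into motion patterns.
        # part_latents maps a part name to an (n_frames, d) array taken
        # from that part's subspace of the hierarchical manifold.
        return {part: KMeans(n_clusters=n_patterns, n_init=10).fit(Z)
                for part, Z in part_latents.items()}

    def frame_descriptor(patterns, part_latents, t):
        # Concatenate each part's distances to its pattern centers at
        # frame t; this per-frame descriptor feeds the final classifier.
        return np.concatenate([patterns[p].transform(Z[t:t + 1]).ravel()
                               for p, Z in part_latents.items()])

    def train_action_classifier(X, y):
        # A discriminative classifier (here a linear SVM) predicts the
        # action label from the stacked frame descriptors.
        return SVC(kernel="linear").fit(X, y)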
     This thesis also explores a hierarchical approach to recognizing person-to-person interactions in an indoor scenario from a single view. It detects dense space-time interest points in action videos and divides them exclusively into two sets according to the history information and the connectivity of the two silhouettes. K-means clustering is then performed on the combined interest points of all training interactions to learn a spatio-temporal codebook. For a given set of interest points, a spatio-temporal word is built by letting each point vote softly into the few centers nearest to it and accumulating the scores of all points. A CRF whose inputs are the spatio-temporal words models the primitive actions of each person. Domain knowledge and weighted first-order-logic production rules are used to define the structure and learn the parameters of a Markov Logic Network (MLN); the MLN naturally integrates common-sense reasoning with uncertainty analysis and can absorb the uncertainty produced by the CRF. Experimental results on our interaction dataset demonstrate the effectiveness and robustness of the approach.
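     A minimal sketch of the codebook and soft-voting steps follows, assuming interest-point descriptors have already been extracted and assigned to one person; build_codebook and spatio_temporal_word are hypothetical names, and scikit-learn's KMeans stands in for the clustering step. The resulting histograms are the per-person inputs to the CRF.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_codebook(train_descriptors, size=100):
        # K-means over interest-point descriptors pooled from all
        # training interactions yields the spatio-temporal codebook.
        return KMeans(n_clusters=size, n_init=10).fit(train_descriptors)

    def spatio_temporal_word(codebook, descriptors, k_nearest=3):
        # Soft voting: each interest point votes into its few nearest
        # centers, and the scores of all points are accumulated.
        word = np.zeros(codebook.n_clusters)
        for d in codebook.transform(descriptors):  # distances to centers
            nearest = np.argsort(d)[:k_nearest]
            votes = np.exp(-d[nearest] / (d[nearest].mean() + 1e-8))
            word[nearest] += votes / votes.sum()
        return word / max(len(descriptors), 1)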
