基于约束动态时间弯曲距离的时间序列相似性匹配

英文题名：The Similarity Measurement Based on Constraint Dynamic Time Warping
作者：高成锴
论文级别：硕士
学科专业名称：计算机应用技术
中文关键词：DTW ; 过匹配 ; FDTW ; LBM ; 数据流
英文关键词：DTW ; through matching ; FDTW ; LBM ; data flow
学位年度：2010
导师：冯林
学科代码：081203
学位授予单位：大连理工大学
论文提交日期：2010-11-01

摘要

时间序列相似性匹配是数据挖掘中的重要工具。本文首先对不同时间序列的处理技术进行详细的概括,在前人基础上分析现有挖掘工具；之后说明其中存在的问题,包括时间序列度量和数据流处理,并辨析各个方法的优缺点。最终提出了针对动态时间弯曲的过匹配问题,在对这个问题进行详细分析的基础上提出应用于不同领域的两种算法,即Flex Dynamic Time Warping (FDTW)和Lining Bound Mine (LBM)算法,通过给出的证明和实验验证算法的准确性。
     FDTW是一种全新的度量方式,主要解决DTW算法存在的过匹配问题,它使用三个迭代矩阵对路径选择严格限制,若超出限制范围则无法成为路径的选择点。文章不仅给出算法,并且证明算法的正确性。
     LBM算法是应用在时间序列数据流中支持弹性形变的子序列匹配算法。由于数据流相对于传统静态时间序列的特殊性,无法支持重复性查找,所以数据流子序列算法必须通过一次扫描找到区域最优解。前人曾提出spring算法,但spring算法存在冗余计算以及过匹配现象。LBM算法使用全局约束和判断减少过匹配以及冗余计算,经过实验验证,相对于spring算法不仅速度得到提升,并且不会出现过匹配现象。
Time series similarity matching is an important tool in data mining. Firstly, time series of different treatment technologies detailed summary of the basis of the previous analysis of existing mining tools; after the description of the existing problems, including time-series measurements and data stream processing, and Analysis of advantages and disadvantages of each method. Eventually made over for dynamic time warping matching problem, a detailed analysis of this issue based on the proposed application of the two algorithms in different fields, that Flex Dynamic Time Warping (FDTW) and Lining Bound Mine (LBM) algorithm, algorithm proof and experimental validation of their accuracy.
     FDTW DTW algorithm is mainly to solve the existing problems have made matching metric, which uses three iterations of the routing matrices strictly limited, if out of the path matching region cannot be the choice of points.
     LBM algorithm is applied in the time series data stream support for elastic deformation of the sub-sequence matching algorithm. As opposed to traditional static data flow time series are unique, cannot support the repetitive look, so the data flow sequence algorithm must be found through a scan area optimal solution. Algorithms have been proposed previous spring, but the existence of redundant computing algorithm and the spring had mismatch. LBM algorithm uses global constraints and determine the reduction of redundant computing over match and, after the experiment, the speed relative to the spring algorithm not only be enhanced, and will not match the phenomenon occurred.

引文

[1]詹艳艳,陈晓云,徐荣聪.基于时间序列的模式表示挖掘频繁子模式[J].计算机工程与应用,2006,21(41)：24-27.
    [2]郑诚,舒坚.多尺度时间序列异常事件检测[J].计算机工程与应用.2006,31(17)：86-89.
    [3]郝井华,刘民,吴澄,陈少卿.一种基于LLM的高维时间序列数据异常检测方法[J].控制工程.2005,12(3)：14-18.
    [4]Chan K, Fu W. Efficient time series matching by wavelets[C]. In:Proceedings of the 15th IEEE International Conference on Data Engineering, Sydney,1999.
    [5]Agrawal R, Psaila G, Wimmers El et al. Querying Shapes of Histories[C]. In:Proc of the 21st Intl Conf On Very Large Data Bases, San Francisco, CA, USA,1995.
    [6]Sanghyun Park, Wesley W Chu, Jeehee Yoon et al. Efficient searches for similar subsequences of different lengths in sequence databases[C]. In:Proceedings of the 16th International Conference on Data Engineering, Washington, IEEE Computer Society,2000.
    [7]Keogh E, Pazzani M. A simple dimensionality reduction technique for fast similarity search in large time series databases[C]. In:the Fourth Pacific-Asia Conference on Knowledge Discoveryand Data Mining, Kyoto, Japan,2000.
    [8]Pratt K B, Fink E. Search for patterns in compressed time series[J]. International Journal of Image and Graphics,2002 2(1):89～106.
    [9]Keogh E, Chu S, Hart D. Segmenting Time Series:A Survey and Novel Approach[M]. Data Mining in Time Series Databases, World Scientific Publishing Company,2003.
    [10]R. Bellmann. Adaptive Control Processes:A Guided Tour[M]. Princeton:Princeton University Press,1961.
    [11]Roger Weber, Stephen Blott. An Approximation-Based Data Structure for Similarity Search[DB]. http://citeseer.ist.psu.edu/238155.html.1997.
    [12]Kira Kononerco I. Estimating attributes:analysis and extension of relief [C]. Proc of European Conf on Machine Learning.1994:171-182.
    [13]Li Y, Wu ZF, Liu JM, et al. Efficient feature selection for high-dimensional data using two-level f ilter[C]. Proc of Int Conf on Machine Learning and Cybernetics, IEEE CNF,2004.
    [14]Mitra P, Murthy C A, Pal S K. Unsupervised feature selction using feature similarity [J]. IEEE Trans on Pattern Recognition and Machine Intelligence,2002,24(3):301-312.
    [15]Agrwal R, Faloutus C, Swami A. Efficient similarity search in sequence databases[C]. Proc of the 4th International Conference on Foundations of Data Organization and Algorithms, London,Springer-Verlag,1993.
    [16]Faloutus C, Ranganathan M, Manolopoulos Y. Fast subsequence matching in time-series databases[C]. Proc of the ACM SIGMOD International Conference on Management of Data. Mineapolis, Tokyo, ACM Press,1994.
    [17]Guttman A. R-trees:A dynamic index structure for spatial searching[C]. Proc:ACM SIGMOD Int. Conf. on Management of Data, Taiwan,1984.
    [18]Jessica L, Keogh E, Stefano L, Pranav P. Finding Motifs in Time Series[M]. SigKDD 2002.
    [19]Bill C, Keogh E, Stefano L. Probabilistic Discovery of Time Series Motifs[C]. Conference on Knowledge Discovery in Data Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining,2003.
    [20]Yoshiki T, Kazuhisa I,Kuniaki U. Discovery of Time-Series Motif from Multi-Dimensional Data Based on MDL Principle[J]. Machine Learning,2005,58: 269-300.
    [21]Kyle L, Jensen, Mark P, Styczynski, Isidore R, Gregory N. A generic motif discovery algorithm for sequential data[J]. Bioinformatics,2006,22(1):21-28.
    [22]Kovar, L., and Gleicher M. Automated Extraction and Parameterization of Motions in Large Data Sets[J]. ACM Transactions on Graphics,2004,14(5):58-65.
    [23]Meinard M, Tido R,Michael C. Efficient Content-Based Retrieval of Motion Capture Data[J]. ACM Transactions on Graphics.2005,24(3):677-685.
    [24]Wang M. Y, Chan C. Y, Chao S. P, Yang S. N, Lin H. C. Content-Based Retrieval for Human Motion Data[C].16th IPPR Conference on Computer Vision, Graphics and Image Processing,2003:605-612.
    [25]Clifford K, George B. Entropy-based motion extraction for motion capture animation[J]. Computer Animation and Virtual Worlds,2005,16(3):225-235.
    [26]Keogh, Palpanas T, Gunpulos D,Cardle M. Indexing Large Human-Motion Databases[C]. In Proc.30th VLDB Conf., Toronto,2004.
    [27]Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom. Modelsand Issues in Data Stream Systems[C]. In:Proceedings of the Twentyfirst ACM IGACTSIGMOD SIGART Symposium on Principles of Database Systems, Madison, Wisconsin, USA. ACM Press,2002.
    [28]Eamonn J. Keogh, Shruti Kasetty. On the need for time series data mining benchmarks: a survey and empirical demonstration[C]. In:Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada.2002.
    [29]Yasushi Sakurai, Christos Faloutsos, Masashi Yamamuro. Stream Monitoring under the Time Warping Distance[C]. In:Proceedings of the 23rd International Conference on Data Engineering(ICDE), The Marmara Hotel, Istanbul, Turkey. IEEE Computer Society,2007.
    [30]Pedro Domingos, Geoff Hulten. Mining highspeed data streams[C]. In:Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, Boston, MA, USA.2000.
    [31]Geoff Hulten, Laurie Spencer, Pedro Domingos. Mining time changing data streams[C]. In:Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovcry and data mining, San Francisco, CA, USA.2001.
    [32]Maleq Khan, Qin Ding, William Perrizo. K-nearest Neighbor Classification on Spatial DataStreams Using P-trees[C]. In:Proceedings of the 6th Pacific-Asia Conference of Advances in Knowledge Discovery and Data Mining(PAKDD), Taipei, Taiwan. SpringeL 2002, v01.2336 of Lecture Notes in Computer Science,517-518.
    [33]Charu C. Aggarwal, Philip S. Yu. LOCUST:An Online Analytical Processing Framework for High Dimensional Classification of Data Streams [C]. In:Proceedings of the 24th International Conference on Data Engineering(ICDE), Cancan, Maxico. IEEE,2008, 426-435.
    [34]Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time[M]. Next Generation Data Mining.2003.
    [35]Jiawei Han, Jian Pei, Yiwen Yin. Mining Frequent Patterns without Candidate Generation[C]. In:Proceedings of International Conference on Management of Data(SIGMOD), Texas, USA.2000,1-12
    [36]Gurmeet Singh Manku, Rajeev Motwani. Approximate Frequency Counts over Data Streams[C]. In:Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), Hong Kong, China. Morgan Kaufmann,2002,346-357.
    [37]Joong Hyuk Chang, Won Suk Lee. Finding recent frequent itemsets adaptively over online data streams[C]. In:Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA.2003, 487-492.
    [38]Cheqing Jin, Weining Qian, Chaofeng Sha, Jeffrey Xu Yu, Aoying Zhou. Dynamically maintaining frequent items over a data stream[C]. In:Proceedings of ACM CIKM International Conference on Information and Knowledge Management, New Orleans, Louisiana, USA.2003,287-294.
    [39]LinFeng, Chengkai Gao, Tao Sun,Hao Wu. A Neighborhood Selection Algorithm for Manifold Learning[C].2010 international conference on computer design and applications.2010. Qinhuangdao, Hebei, China.
    [40]Myers C S, Rabiner L R. A comparative study of several dynamic time-warping algorithms for connected word recognition[J]. The Bell System Technical Journal, 1981 60(7):1389-1409.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700