Research on Time Series Similarity Clustering Algorithms
Abstract
With the growing use of computers in industry, the aluminum electrolysis industry now relies widely on computer monitoring systems for the automatic control of electrolytic cells. These systems automatically collect many kinds of cell data, so aluminum producers have accumulated a large volume of historical data. However, existing data systems are poorly shared and integrated. They support only simple data entry, query, statistics, and other transaction processing, and cannot uncover the rules and patterns hidden in the data that could guide production and enterprise management. Decision-makers urgently need to extract information and knowledge from this data in order to improve the management of electrolytic cells and raise production efficiency. To make full use of time series data and discover hidden knowledge in large databases, this thesis carries out a series of studies on time series similarity clustering. The main work and contributions are as follows:
     1. Based on a comprehensive analysis of recent literature on time series data mining, the thesis reviews the field in terms of time series segmentation, similarity measurement, and time series clustering, and outlines likely future trends. The review gives researchers a reference on current work, new techniques, and development trends in similarity-based time series clustering.
     2. To address the drawback of equal-length segmentation in SAX (Symbolic Aggregate approXimation), a symbolic representation method for time series based on segmentation patterns (SMSAX) is proposed. The algorithm divides a time series into segments of unequal length according to the features of the series, and introduces a fluctuation rate to eliminate the influence of singular points. Experiments on standard data sets and on aluminum electrolysis data indicate that SMSAX obtains more accurate results than SAX and effectively overcomes the equal-length segmentation drawback.
     3. To address the loss of sub-segment length information in the angle-distance similarity measure for time series, a weighted angle-distance similarity measure is proposed. The method describes the original time series by a set of vectors, each composed of the angle between two adjacent line segments and the proportion of length taken up by those adjacent sub-segments, and uses that proportion as the weight when measuring similarity. Experiments on standard data and on aluminum electrolysis data indicate that the method effectively avoids the loss of sub-sequence length information and measures time series similarity relatively accurately.
     4. Building on a study of the k-means clustering algorithm, the thesis extracts segmentation patterns based on the overall similarity of the series to segment time series linearly, takes the upper and lower bounds of the series features into account, and proposes a k-edge-based time series similarity clustering algorithm. Experiments on judging the condition of aluminum electrolytic cells indicate that the algorithm performs better than k-means in both clustering efficiency and clustering accuracy.
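As background for contribution 2, the equal-length baseline that SMSAX is meant to improve on can be sketched as follows. This is a minimal sketch of standard SAX only (z-normalization, piecewise aggregate approximation over nearly equal-length segments, then symbol assignment using breakpoints that split the standard normal distribution into equiprobable regions). It is not the thesis's SMSAX algorithm, whose unequal-length segmentation and fluctuation-rate step are not specified in this abstract. The function names `znormalize`, `paa`, and `sax` are illustrative assumptions.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev
import string


def znormalize(series):
    """Rescale a series to zero mean and unit variance (all zeros if constant)."""
    mu = fmean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def paa(series, n_segments):
    """Piecewise aggregate approximation: the mean of each (nearly) equal-length segment."""
    n = len(series)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    means = []
    for i in range(n_segments):
        start = i * n // n_segments
        end = (i + 1) * n // n_segments
        means.append(fmean(series[start:end]))
    return means


def sax(series, n_segments, alphabet_size):
    """Return the SAX word of a series as a string of lowercase letters."""
    if not 2 <= alphabet_size <= 26:
        raise ValueError("alphabet_size must be between 2 and 26")
    # Breakpoints cut the standard normal distribution into equiprobable regions.
    std_normal = NormalDist()
    breakpoints = [std_normal.inv_cdf(k / alphabet_size)
                   for k in range(1, alphabet_size)]
    symbols = string.ascii_lowercase[:alphabet_size]
    means = paa(znormalize(series), n_segments)
    # Each segment mean maps to the letter of the region it falls in.
    return "".join(symbols[bisect_right(breakpoints, m)] for m in means)


if __name__ == "__main__":
    print(sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4, alphabet_size=4))  # abcd
```

Because every segment covers the same number of points, a sharp local change and a long flat stretch are summarized at the same resolution, which is the limitation the feature-driven, unequal-length segmentation in contribution 2 is described as addressing.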
