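Rough Clustering of Financial Time Series in Data Mining

Original title: 数据挖掘中金融时间序列的粗糙聚类分析
Author: Wu Xiaobin (吴晓彬)
Degree level: Master's
Discipline: Statistics
Keywords (original): 数据挖掘 ; 时间序列 ; 相似性度量 ; 小波分析 ; 粗糙聚类
Keywords: Data Mining ; Time Series ; Similarity Measurement ; Wavelet Analysis ; Rough Clustering
Degree year: 2008
Advisor: Zhu Jianping (朱建平)
Discipline code: 020208
Degree-granting institution: Xiamen University (厦门大学)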

Abstract

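Traditional statistical analysis and modern financial econometrics study time series mainly by building statistical models on rigorous mathematical derivation and then carrying out parameter estimation and hypothesis testing; a fairly mature body of theory has been established along these lines. This approach, however, depends on restrictive assumptions and requires all of the data to conform to one fixed mathematical model, which is often too rigid. Data mining takes a different route: it builds models driven directly by the data and so overcomes these shortcomings.
     Time series data mining is currently an active research area and has produced many results, but the key problem of measuring similarity between time series has not yet been solved satisfactorily. Many time series mining methods are built on similarity, so the similarity measure directly affects their results. This thesis therefore first studies this fundamental problem and then discusses the application of the proposed measure in series mining. Because data mining comprises many methods and the thesis cannot cover them all, it concentrates on cluster analysis. Cluster analysis is an important part of data mining as well as an important method of multivariate statistical analysis, and it is widely used in practice. The thesis sets aside hard clustering, for which many mature methods already exist, and studies in depth a soft clustering method, rough clustering, and its application to time series mining; this also indirectly demonstrates the practicality of the proposed similarity measure. The main work and contributions are summarized as follows.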
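     First, drawing on the ideas of wavelet analysis, the thesis proposes a time series similarity measure based on the multi-scale wavelet transform. A case study on financial time series shows that the method takes into account the various factors that affect similarity measurement and effectively overcomes the inability of earlier methods to capture both the overall shape of a series and its differences in detail.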
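The idea of comparing series at several wavelet scales can be illustrated with a minimal sketch. It is not the measure defined in the thesis, whose exact form is not given in this abstract. It assumes the orthonormal Haar wavelet, equal-length series whose length is divisible by 2**levels, and a hypothetical weight `shape_weight` that balances the coarse approximation (overall shape) against the detail coefficients (local differences). The function names are illustrative only.

```python
import math


def haar_decompose(series, levels):
    """Orthonormal Haar decomposition.

    Returns (approximation, details), where details[0] holds the finest-scale
    coefficients and details[-1] the coarsest-scale ones.
    """
    approx = list(series)
    details = []
    for _ in range(levels):
        if len(approx) < 2 or len(approx) % 2:
            raise ValueError("series length must be divisible by 2**levels")
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx, details


def euclid(u, v):
    """Euclidean distance between two equal-length coefficient vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def multiscale_distance(x, y, levels=2, shape_weight=0.5):
    """Hypothetical distance: weighted mix of a shape term and a detail term."""
    if levels < 1:
        raise ValueError("levels must be at least 1")
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    approx_x, details_x = haar_decompose(x, levels)
    approx_y, details_y = haar_decompose(y, levels)
    shape_term = euclid(approx_x, approx_y)  # overall outline
    detail_term = sum(euclid(u, v) for u, v in zip(details_x, details_y)) / levels
    return shape_weight * shape_term + (1 - shape_weight) * detail_term
```

Setting `shape_weight` close to 1 makes the distance depend mostly on the coarse outline of the two series, while a value close to 0 emphasizes their fine-scale differences.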
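     Second, building on the similarity measure, the thesis studies rough clustering of series and demonstrates its advantages through an empirical study on financial data. Three issues are examined in depth: (1) constructing quality indices for rough clustering and studying how different threshold parameters affect the clustering results; (2) combining rough clustering with hierarchical clustering to draw on the strengths of each; (3) converting soft clustering into hard clustering by simplifying the rough clustering results with an iterative elimination method, and comparing the outcome with the earlier clustering results to show its feasibility.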
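How a threshold can turn pairwise distances into overlapping (soft) clusters is sketched below. This is a simplified illustration in the rough-set spirit, not the algorithm developed in the thesis. It assumes a symmetric distance matrix with a zero diagonal and a non-negative threshold. It takes each object's neighborhood within the threshold as a candidate cluster and keeps only the maximal neighborhoods. It then treats objects that fall in exactly one cluster as that cluster's certain members (a lower-approximation analogue) and objects shared by several clusters as boundary members. The function names are hypothetical.

```python
def rough_clusters(dist, threshold):
    """Overlapping clusters from threshold neighborhoods.

    dist is a symmetric distance matrix (list of lists) with a zero diagonal.
    Neighborhoods strictly contained in another neighborhood are dropped.
    """
    n = len(dist)
    neighborhoods = {
        frozenset(j for j in range(n) if dist[i][j] <= threshold)
        for i in range(n)
    }
    maximal = [s for s in neighborhoods if not any(s < t for t in neighborhoods)]
    return sorted(maximal, key=sorted)


def lower_and_boundary(clusters):
    """Split each cluster into (certain members, boundary members).

    An object is a certain member if it appears in exactly one cluster and a
    boundary member if it appears in more than one.
    """
    counts = {}
    for cluster in clusters:
        for i in cluster:
            counts[i] = counts.get(i, 0) + 1
    return [
        ({i for i in c if counts[i] == 1}, {i for i in c if counts[i] > 1})
        for c in clusters
    ]


# Toy data: five objects placed on a line stand in for five series.
points = [0, 1, 2, 3, 10]
dist = [[abs(a - b) for b in points] for a in points]
clusters = rough_clusters(dist, threshold=1)
approximations = lower_and_boundary(clusters)
```

Raising the threshold merges neighborhoods and reduces the number of clusters, which is the kind of sensitivity that issue (1) above examines with quality indices.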
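     Finally, since no existing software module implements the models and methods of this thesis, reference programs for a concrete Matlab implementation are also provided; combined with the empirical study, they achieve good results.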
Traditional statistics and modern financial econometrics study time series by establishing statistical models based on strict mathematical derivation and then conducting parameter estimation and inference; theoretical frameworks for this approach have been built up over many years. However, such methods seem ill-suited because they depend on strict hypotheses and force all the data in a series to meet the modeling requirements. Data mining techniques overcome this shortcoming by establishing models driven by the data.
     Time series data mining is popular today, and many achievements have been made. However, an appropriate solution for measuring similarity, which lays the foundation for many series mining methods, is still lacking. Similarity measurement in time series clearly affects mining results. This dissertation addresses this pivotal issue as well as its applications in series mining, in particular clustering analysis. Instead of hard clustering, it introduces a soft clustering method, rough clustering, which also reflects the practicability of the new similarity measure for time series. The main work and innovations of this dissertation are summarized as follows.
     Firstly, a method for measuring the similarity of time series based on the multi-scale wavelet transform is presented, drawing on the ideas of wavelet analysis. A case study of financial time series shows that this method considers the factors affecting similarity measurement and effectively overcomes the shortcoming of existing methods, which fail to balance the overall outline and the detail differences of series.
     Secondly, the dissertation discusses rough clustering of sequences and shows its advantages through a financial case study. Three further issues are analyzed: (1) establishing quality indicators for rough clustering and discussing the impact of threshold parameters on clustering results; (2) integrating rough clustering and hierarchical clustering so as to make the most of their advantages; (3) transforming soft clustering into hard clustering by condensing the results of rough clustering with an iterative removal method, and showing its feasibility by comparison with the original results.
     Finally, the algorithms used in these methods are discussed, and Matlab program code is provided. The results of the empirical research are convincing.

