面向数据流的关联规则挖掘精确度研究

英文题名：Research on Association Rules Mining Precision over Data Stream
作者：郝广浩
论文级别：硕士
学科专业名称：电子商务与信息技术
中文关键词：频繁项集 ; 最大误差 ; 最小支持度阈值 ; 精确度
英文关键词：Frequent Patterns ; Maximum Error ; Minimum Support Threshold ; Mining Precision
学位年度：2011
导师：郑永清
学科代码：081202
学位授予单位：山东大学
论文提交日期：2011-04-20

摘要

当今时代是信息的时代是数字化的时代,随着通信、互联网的发展,社会各行各业存储的数据越来越庞大,在这种背景下,一种新的数据形式——数据流引起了计算机从业人员的关注。如何在海洋一样广阔的数据流中准确的挖掘有价值的信息成为了数据挖掘研究工作新的挑战。面向数据流的关联规则挖掘是数据挖掘的新形式,近些年来的研究热点,被广泛的应用于网络通信、设备维护、证券交易等领域,对于社会的生产和日常生活有着重要的意义。
     数据挖掘工作者对数据流的关联规则挖掘工作展开了大量的相关研究,对数据挖掘的思想、流程、算法提出了许多新的设计。然而这些方法大都把主要的研究工作放在了挖掘的过程、数据处理等方面,忽略了对于关联规则挖掘结果精确度的关注,同时在对挖掘过程的设计中,对于数据准确性的研究也比较有限。数据挖掘的目的是获得可信的、准确的、有价值的信息,由于在数据流环境下的挖掘只能够得到近似的挖掘结果,因此,挖掘结果的精确度将是评价挖掘的关键指标。
     本文围绕着提高挖掘结果精确度的目的,提出了面向数据流的关联规则挖掘的方法,在数据流的获取、处理以及信息的发现等挖掘流程的设计过程中,从处理细节入手,将提高挖掘精确度的思想贯穿其中。本文对数据流关联规则挖掘的工作主要分为三个部分的研究成果：数据获取部分、数据存储部分和数据挖掘部分,围绕着如何提高挖掘精确度,对每一个部分的设计进行了详细的描述。
     首先,在数据获取部分提出了使用滑动时间窗口模型获取数据,并按照每个窗口将数据流分割成为事务形式,这个模型既符合了数据流的特点,又满足了频繁项集挖掘对数据的要求。
     其次,数据存储模型由数据存储结构、数据更新算法和最大误差的选取三部分组成。通过对经典算法FP-growth算法中FP树的改进,本文提出了一个新的数据存储结构FP-Atree,这个存储结构符合了只能一遍读取数据的数据环境要求,节省了数据存储空间,简化了数据逻辑,压缩了存储体积。数据更新算法把整个数据存储时间划分为多个时间框,在时间框结束时对FP-Atree进行剪枝,删除支持度小于最大误差的项集,从而保证了有限的空间资源的到充分的利用,避免了因为数据流的无边界性而导致的存储数据的无限扩张。
     第三部分为最大误差的选取,研究中使用了多项式近似的策略,发现了最大误差与环境资源参数之间的关系,既有效地控制了空间资源,又尽量避免了有效信息的丢失,为提高挖掘结果精确度提供了保障。数据挖掘模型的主要意义在于提高了挖掘精确度,在这一模型中本文提出了基于滑动时间窗口的新阈值,最小支持度阈值S经过修正,每个项集都有适用于本身的阈值。这个模型保证了所有真实支持度大于S的项集都能成为频繁项集,减少了结果项集中非频繁项集的比例,提高了挖掘结果的精确度。
Nowadays, at the digital age, with the development of telecommunication and World Wide Web, the volume of data is increasing extremely. The data stream comes up. Discovering the useful information and knowledge in the data, just like mining the precious in the huge ocean, is a challenge that we face up to. Mining the frequent patterns in the data stream is a new task in recent years, it is meaningful to the social production and our daily life, it can be widely used in telecommunication, facilities maintenance, security exchange and etc.
     The data mining works and researchers make great effort on the data stream mininig and advance a lot of new design on the mininig procedures and algorithms. However most of the researches only put the attention on the mining process, lack of the work on the mining result. The aim of the data mining is the precise, credible and useful information, and as the reverse of our expect, the result of association rules mining on the data stream can only be approximative. So the precision of the mining result should be the parameter key of the association rules mininig.
     In the essay, it illustrate a new method on mining frequent patterns in data stream and algorithms ensuring the precise result. It modify the details of the mining method on the obtaining data, data storage and information discovering. Our research consist of three part:obtaining data, data storage and knowledge discovering, it works for ensuring the precision of the mining result on every method details
     Firstly the sliding time windows divide the data stream to itemsets, drop the items which appears more than twice, then sort the itemset as the sequence of the first layer child node in the FP-Atree from the left to right. After that the itemset can be seen as transactions.
     Secondly the data storage consist of storage structure, data update
     algorithm and computing the maximum error. Our researcher advanced a new data storage structure named FP-Atree, different the FP-tree, it consist of a prefix tree, without the head-node list and the head-node point. The data update algorithm divide the whole time to time frames, after each frame, the node which support less than the maximum error should be pruned.
     Finally it proposes the polynomial strategy to estimate the value of the maximum error, and in Chapter 4 the minimum support threshold has been modified. The proper value of the maximum error and modified minimum support threshold are two parameter keys for increasing the precision of the mining result.

引文

[1]孙秀杰,宋喜莲基于数据挖掘的电信行业分析型CRM系统研究中国管理信息化 2010-02-01
    [2]李艳,杨永健,李树秋基于数据集市的电信经营分析系统模型山东大学学报42卷11期2007-11
    [3]戴东波+,赵杠,孙圣力,基于概率数据流的有效聚类算法,软件学报,20卷,5期,2009-5,pp.1313-1328
    [4］李国徽,陈辉+,挖掘数据流任意滑动时间窗口内频繁模式,软件学报,19卷,10期,2008-10,pp.2585-2596
    [5]Jiawei Han,Micheline Kamber范明,孟小峰译数据挖掘概念与技术机械工业出版社2007-3第1版
    [6]邵峰晶,于忠清,王金龙,孙仁诚数据挖掘原理与算法科学出版社2009-8第二版
    [7]魏莉,杨科华,唐丽华基于数据仓库的电信经营分析和决策支持系统湖南大学2008-10-20
    [8]常建龙,闫莺,宫学庆,戴岱,周傲英基于数据流技术的电信网络流量监控系统山东大学学报2007-11 42卷11期
    [9]刘学军徐宏炳董逸生钱江波王永利,基于滑动窗口的数据流闭合频繁模式的挖掘,计算机研究与发展,43卷,10期,2006,pp.1738-1743
    [10]黄磊,流数据挖掘综述,软件学报,34卷,1期,2007,pp.1-5
    [11]王伟平,李建中,张冬冬,郭龙江,一种有效的挖掘数据流近似频繁项算法,软件学报,18卷,4期,2007,pp.884-892
    [12]Hui Chen, Mining Frequent Patterns in the Recent Time Window over Data Streams The 10th IEEE International Conference on High Performance Computing and Communications,2008, pp.586-693
    [13]Hao Guanghao, Zheng Yongqing, Cui Lizhen, Computing Maximum Error and Reduced Threshold of Mining Frequent Patterns in Data Stream, Information Engineering and Computer Science,2009, pp.1345-1350
    [14]Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu, Mining Frequent Patterns in Data Streams at Multiple Time Granularities, Data Mining:Next Generation Challenges and Future Directions, AAAI/MIT,2003, pp.191-212.
    [15]Shichao Zhang, Xindong Wu, Chengqi Zhang, Jingli Lu, Computing the minimum-support for mining frequent patterns, Knowl Inf Syst, Springer Publisher, Berlin,2008, pp.233-257.
    [16]Nan Jiang, Le Gruenwald, Research Issues in Data Stream Association Rule Mining, ACM SIGMOD Record,2006, pp.14-19.
    [17]Mohamed Medhat Gaber, Shonali Krishnaswamy, Arkady Zaslavsky; Adaptive Mining Techniques for Data Streams Using Algorithm Output Granularity; The Australasian Data Mining Workshop, December 2003.
    [18]Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali rishnaswamy; Resource-Aware Knowledge Discovery in Data Streams, Int'l Workshop on Knowledge Discovery in Data Streams, September 2004.
    [19]Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Aoying Zhou; False Positive or False Negative:Mining Frequent Itemsets from High Speed transactional Data Streams, Int'l Conf. on Very Large Databases,2004,pp.204-215.
    [20]Liu Xuejun, Xu Hongbing, Dong Yisheng, Qian Jiangbo, Wang Yongli, Mining Frequent Closed Patterns from a Sliding Windows over Data Streams, in Chinese, Journal of Computer Research and Development, Science Press, Beijing,2006, pp. 1738-1743.
    [21]Yun Chi, Haixun Wang, Philip S. Yu, Richard R., Moment:Maintaining Closed Frequent Itemsets over a Stream Sliding Window, IEEE Int'l Conf. on Data Mining, November 2004.
    [22]Xingzhi Sun, Maria E. Orlowska, Xue Li, Finding Frequent Itemsets in High-Speed Data Streams, SI AM International Conference on Data Mining, Maryland, USA,20-22 April,2006,pp.1-6.
    [23]G.S.Manku, R.Motwani, Approximate Frequency Counts over Data Streams, In Proc. Of 28th Intl, Conf. On Very Large Data Base,2002,pp.346-357.
    [24]Sudipto Guha, Nick Koudas, Kyuseok Shim, Data Streams and Histograms, ACM, Symposium on Theory of Computing,2001, pp.471-475.
    [25]Nan Jiang, Discovering Association Rules in Data Streams Based On Closed Pattern Mining. IDAR2007, Beijing, June 2007, pp.83-84.
    [26]Han J., Pei J., Yin Y., Mining Frequent Patterns without Candidate Generation, In Proc,2000 ACM-SIGMOD Int. Conf., Management of Data(SIGMOD'00), pp.1-12.
    [27]Rakesh Agrawal, Ramakrishnan Srikant; Fast Algorithms for Mining Association Rules; Int'l Conf. on Very Large Databases; September 1994.
    [28]Dario Bruzzese, Paolo Buono; Combining Visual Techniques for Association Rules Exploration; The Working Conf. on Advanced Visual Interfaces; May 2004.
    [29]Y. Dora Cai, Greg Pape, Jiawei Han, Michael Welge, Loretta Auvil; MAIDS: Mining Alarming Incidents from Data Streams; Int'l Conf. on Management of Data; June 2004.
    [30]Joong Hyuk Chang, Won Suk Lee, Aoying Zhou; Finding Recent Frequent Itemsets Adaptively over Online Data Streams; ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining; August 2003.
    [31]Joong Hyuk Chang, Won Suk Lee; A Sliding Window Method for Finding Recently Frequent Itemsets over Online Data Streams; Journal of Information Science and Engineering; July 2004.
    [32]Moses Charikar, Kevin Chen, Martin Farach-Colton; Finding Frequent Items in Data Streams; Theoretical Computer Science; January 2004.
    [33]David W. Cheung, Jiawei Han, Vincent T. Ng, C.Y. Wong; Maintenance of Discovered Association Rules in Large Databases:An Incremental Updating Technique; IEEE Int'l Conf. on Data Mining; November 1996.
    [34]David W. Cheung, S.D. Lee, Benjamin Kao; A General Incremental Technique for Maintaining Discovered Association Rules; Int'l Conf. on Database Systems for Advanced Applications; 1997.
    [35]Graham Cormode, S.Muthukrishnan; What's Hot and What's Not:Tracking Most Frequent Items Dynamically; ACM Transactions on Database Systems; March 2005.
    [36]Mohamed Medhat Gaber, Arkady Zaslavsky, Shonali Krishnaswamy; Mining Data Streams:A Review; ACM SIGMOD Record Vol.34, No.2; June 2005.
    [37]Amol Ghoting, Srinivasan Parthasarathy; Facilitating Interactive Distributed Data Stream Processing and Mining; IEEE Int'l Symposium on Parallel and Distributed Processing Systems; April 2004.
    [38]Jiawei Han, Guozhu Dong, Yiwen Yin; Efficient mining of partial periodic patterns in time series database; IEEE Int'l Conf. on Data Mining; March 1999.
    [39]Mihail Halatchev and Le Gruenwald; Estimating Missing Values in Related Sensor Data Streams; Int'l Conf. on Management of Data; January 2005.
    [40]Heike Hofmann, Arno P. J. M. Siebes, Adalbert F. X. Wilhelm; Visualizing Association Rules with Interactive Mosaic Plots; ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining; August 2000.
    [41]Hao Huang, Xindong Wu, Richard Relue; Association Analysis with One Scan of Databases; IEEE Int'l Conf. on Data Mining; December 2002.
    [42]Cheqing Jin, Weining Qian, Chaofeng Sha, Jeffrey X. Yu, Aoying Zhou; Dynamically Maintaining Frequent Items over a Data Stream; Int'l Conf. on Information and Knowledge Management; 2003.
    [43]Hillol Kargupta, Ruchita Bhargava, Kun Liu, Michael Powers, Patrick Blair, Samuel Bushra, James Dull, Kakali Sarkar, Martin Klein, Mitesh Vasa, David Handy; VEDAS:A Mobile and Distributed Data Stream Mining System for Real-Time Vehicle Monitoring; SIAM Int'l Conf. on Data Mining; 2004.
    [44]Richard M. Karp, Scott Shenker; A Simple Algorithm for Finding Frequent Elements in Streams and Bags; ACM Transactions on Database Systems; March 2003.
    [45]S.D. Lee, David W. Cheung; Maintenance of Discovered Association Rules:When to update?; Research Issues on Data Mining and Knowledge Discovery; 1997.
    [46]Hua-Fu Li, Suh-Yin Lee, and Man-Kwan Shan; An Efficient Algorithm for Mining Frequent Itemsets over the Entire History of Data Streams; Int'l Workshop on Knowledge Discovery in Data Streams; Sept.2004.
    [47]Chih-Hsiang Lin, Ding-Ying Chiu, Yi-Hung Wu, Arbee L. P. Chen; Mining Frequent Itemsets from Data Streams with a Time-Sensitive Sliding Window; SIAM Int'l Conf. on Data Mining; April 2005.
    [48]Guojun Mao, Xindong Wu, Chunnian Liu, Xingquan Zhu, Gong Chen, Yue Sun, Xu Liu; Online Mining of Maximal Frequent Itemsequences from Data Streams; University of Vermont. Computer Science Technical Report CS-05-07; June 2005.
    [49]Matthew Eric Otey, Chao Wang, Srinivasan Parthasarathy, Adriano Veloso, Wagner Meira Jr.; Mining Frequent Itemsets in Distributed and Dynamic Databases; IEEE Int'l Conf. on Data Mining; 2003.
    [50]Matthew Eric Otey, Srinivasan Parthasarathy, Chao Wang, Adriano Veloso, Wagner Meira Jr.; Parallel and Distributed Methods for Incremental Frequent Itemset Mining; IEEE Transactions on Systems, Man and Cybernetics; December 2004.
    [51]S.Parthasarathy, M. J. Zaki, M. Ogihara, S. Dwarkadas; Incremental and interactive sequence mining; Int'l Conf. on Information and Knowledge Management; 1999.
    [52]Helen Pinto, Jiawei Han, Jian Pei, Ke Wang, Qiming Chen, Umeshwar Dayal; Multi-Dimensional Sequential Pattern Mining; Int'l Conf. on Information and Knowledge Management; 2001.
    [53]Richard Relue, Xindong Wu, Hao Huang; Efficient runtime generation of association rules; Int'l Conf. on Information and Knowledge Management; October 2001.
    [54]Assaf Schuster, Ran Wolff, and Dan Trock; Distributed Algorithm for Mining Association Rules; IEEE Int'l Conf. on Data Mining; November 2003.
    [55]Wei-Guang Teng, Ming-Syan Chen, and Philip S. Yu; Resource-Aware Mining with Variable Granularities in Data Streams; SIAM Int'l Conf. on Data Mining; 2004.
    [56]Adriano Veloso, Wagner Meira Jr., Marcio Carvalho, Srini Parthasarathy, Mohammed J.Zaki; Parallel, Incremental and Interactive Mining for Frequent Itemsets in Evolving Databases; Int'l Workshop on High Performance Data Mining:Pervasive and Data Stream Mining; May 2003.
    [57]Adriano Veloso, Matthew Eric Otey, Srinivasan Parthasarathy, Wagner Meira Jr.; Parallel and Distributed Frequent Itemset Mining on Dynamic Datasets; Int'l Conf. on High Performance Computing; 2003.
    [58]Haixun Wang, Wei Fan, Philip S. Yu, Jiawei Han; Mining Concept-Drifting Data Streams using Ensemble Classifiers; ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining; August 2003.
    [59]Ran Wolff, Assaf Schuster; Association Rule Mining in Peer-to-Peer Systems; IEEE Transactions on Systems, Man and Cybernetics, Part B, Vol.34, Issue 6; December 2004.
    [60]Li Yang, Mustafa Sanver; Mining Short Association Rules with One Database Scan; Int'l Conf. on Information and Knowledge Engineering; June 2004.
    [61]Chung-Ching Yu, Yen-Liang Chen; Mining Sequential Patterns from Multidimensional Sequence Data; IEEE Transactions on Knowledge and Data Engineering; January 2005.
    [62]Qingguo Zheng, Ke Xu, Shilong Ma; When to Update the Sequential Patterns of Stream Data; Pacific-Asia Conf. on Knowledge Discovery and Data Mining; 2003.
    [63]Yunyue Zhu, Dennis Shasha; StatStream:Statistical Monitoring of Thousands of Data Streams in Real Time; Int'l Conf. on Very Large Data Bases; 2002.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700