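RFID Path Data Clustering Analysis and Frequent Pattern Mining

Original title (Chinese): RFID路径数据聚类分析与频繁模式挖掘
Author: Lin Guosheng (林国省)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords: RFID; Data Mining; Path Similarity; Clustering Analysis; Frequent Pattern Mining
Degree year: 2010
Advisor: Deng Huifang (邓辉舫)
Discipline code: 081202
Degree-granting institution: South China University of Technology (华南理工大学)
Thesis submission date: 2010-06-01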

Abstract

As RFID technologies and applications develop, the volume of data they generate is growing rapidly, and extracting useful information and knowledge from that data has become an important research topic. RFID path data are the path records produced as RFID-tagged items move from place to place. Drawing on real RFID applications, this thesis analyzes the characteristics of RFID path data and proposes several clustering analysis and frequent pattern mining methods suited to such data. In practice, these methods can support business decision-making and help optimize or improve business scheduling and planning.
     In clustering analysis, the similarity measure between path objects is the foundation of any path clustering algorithm. Borrowing from sequence alignment research in bioinformatics, this thesis discusses two ways to compute path similarity, one based on global similarity and one based on local similarity, both intended to reflect the true similarity between paths. Traditional clustering algorithms cannot process RFID path data, so the thesis proposes DBPC, a density-based path clustering algorithm that adopts a new cluster-construction method tailored to the characteristics of path data. It also proposes PHC, a hierarchical clustering algorithm for path data that uses a cluster-member weighting scheme to compute the similarity between clusters. Finally, it discusses methods for detecting anomalous (outlier) paths and feasible approaches to clustering complex path data.
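To make the distinction between global and local path similarity concrete, here is a minimal Python sketch. It is not the thesis's own formulation: it assumes a path is an ordered list of location identifiers and applies the textbook Needleman-Wunsch (global) and Smith-Waterman (local) alignment scores. The scoring values (match +1, mismatch -1, gap -1), the function names, and the sample location names are all illustrative assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: both paths are aligned end to end.

    The scoring values are illustrative assumptions, not taken from the thesis.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: the best-matching pair of sub-paths."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so an alignment can restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Hypothetical paths: sequences of reader locations visited by two tagged items.
p1 = ["dock", "shelf_A", "packing", "gate"]
p2 = ["truck", "dock", "shelf_A", "packing", "store"]

print(global_alignment_score(p1, p2))  # 1: the differing ends are penalized
print(local_alignment_score(p1, p2))   # 3: only the shared middle segment counts
```

The global score penalizes the differing start and end of the two paths, while the local score rewards only their shared middle segment. Either score could be normalized and passed to a clustering algorithm as a pairwise similarity.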
     In frequent pattern mining, the thesis modifies and simplifies the closed frequent sequence mining algorithm CloSpan so that it can be applied to path data. Building on the modified CloSpan, it proposes CFPM, a frequent path mining algorithm that exploits the characteristics of path data through a node-count-based pruning method; this improves pruning efficiency and yields better mining efficiency than the modified CloSpan. Finally, the thesis discusses feasible approaches to frequent pattern mining on complex path data.
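As an illustration of what closed frequent path patterns are, the Python sketch below mines them by brute force. It is neither CloSpan nor CFPM and uses none of their pruning techniques. It assumes a pattern is an ordered, not necessarily contiguous, subsequence of locations and that support is the number of paths containing it. The function names and sample data are hypothetical.

```python
def is_subsequence(pattern, path):
    """True if the pattern's nodes occur in the path in the same order."""
    it = iter(path)
    return all(node in it for node in pattern)


def frequent_patterns(paths, min_support):
    """Depth-first pattern growth: extend each frequent prefix by one node."""
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    nodes = sorted({node for path in paths for node in path})
    result = {}

    def grow(prefix):
        for node in nodes:
            candidate = prefix + (node,)
            support = sum(is_subsequence(candidate, p) for p in paths)
            # An infrequent candidate cannot have frequent extensions.
            if support >= min_support:
                result[candidate] = support
                grow(candidate)

    grow(())
    return result


def closed_patterns(frequent):
    """Keep patterns that have no longer super-pattern with the same support."""
    return {
        pat: sup
        for pat, sup in frequent.items()
        if not any(
            len(other) > len(pat) and s == sup and is_subsequence(pat, other)
            for other, s in frequent.items()
        )
    }


# Hypothetical supply-chain paths.
paths = [
    ("factory", "warehouse", "store"),
    ("factory", "warehouse", "store"),
    ("factory", "depot", "store"),
]

frequent = frequent_patterns(paths, min_support=2)
print(closed_patterns(frequent))
# {('factory', 'store'): 3, ('factory', 'warehouse', 'store'): 2}
```

In this example seven patterns are frequent but only two are closed. That reduction is why closed-pattern miners such as CloSpan report a much smaller result set without losing support information.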
     The thesis also presents an experimental system for RFID path data mining. The system provides path visualization, so the distribution of path data and the mining results can be viewed intuitively. Experiments show that the clustering and frequent pattern mining methods discussed here are applicable to RFID path data: DBPC and PHC form high-quality path clusters, PHC merges clusters efficiently, and CFPM mines frequent path patterns more efficiently than the traditional CloSpan algorithm.

