Research on Predictive Analysis and Feature Extraction Methods for Train Ticket Data

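Author: Lv Xiaoyan (吕晓艳)
Degree level: Master's
Discipline: Computer Software and Theory
Keywords (Chinese): data mining; predictive mining; descriptive mining; decision tree induction; rough set; railway ticket marketing; railway passenger transport
Keywords (English): data mining; descriptive mining tasks; predictive mining tasks; decision tree induction; rough set; train ticket analysis; train traffic
Degree year: 2004
Advisor: Ye Yangdong (叶阳东)
Discipline code: 081202
Degree-granting institution: Zhengzhou University
Thesis submission date: 2004-05-01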

Abstract

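With the development of railway information technology, the ticket marketing system, a subsystem of the railway information system, has accumulated a large volume of data. How to make sound use of the existing ticket information, at low cost in labor and technology, to obtain valuable decision information has become an increasingly urgent need for railway decision-making departments and a focus of work for the railway ticket marketing and information technology departments. The rapid development of data mining provides a good theoretical foundation for in-depth analysis of railway ticket marketing. However, existing data mining tools have limitations when faced with ticket data at massive storage scale and with application requirements specific to the railway setting; they cannot be used directly and must be adapted to those requirements.
     This thesis addresses the analysis of railway ticket marketing requirements. With railway passenger transport as its background and the characteristics of ticket data in view, it studies in depth how to build effective data analysis models for railway ticket data and reports a large number of applied experiments. The theoretical starting points are decision tree induction, a data mining classification method, and concept description in data mining; the aim is to establish sound data analysis methods for ticket data. Different decision tree classification algorithms, in particular ID3, SLIQ, and SPRINT, are studied in detail. Based on this analysis, and in response to the shortcomings of the prediction methods in the current railway ticket marketing system, an improved decision tree method, TTDTPA, is proposed. The method is not bound by main-memory limits, yields quantitative rules that describe the distribution of the main classes, and is easy to parallelize, so it meets the needs of railway passenger marketing analysis more effectively. The study also tries naive Bayes and a method based on equivalence class partitioning to model the ticket data, with the aim of improving the overall performance of the analysis. The latter in particular can extract the features of small-class data in the dataset, which effectively compensates for the limitation of TTDTPA in this respect. Finally, by summarizing the results of applying these methods and considering their different characteristics, the thesis gives a method for building data analysis models when analyzing real ticket data.
     Through this research, we have accumulated some experience in applying mining techniques to ticket data, laying a good foundation and providing some theoretical guidance for further research. In addition, applying effective data mining techniques to railway ticket marketing analysis and building sound predictive models provides railway departments with accurate decision information and advanced means of prediction for allocating transport capacity sensibly and organizing management scientifically.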
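The decision tree algorithms named above (ID3, SLIQ, SPRINT) all grow a tree by repeatedly choosing the attribute whose split best separates the classes. As a rough illustration of that step only, the sketch below computes ID3-style information gain on a handful of made-up ticket records. It is not TTDTPA, whose details the abstract does not give; the attribute names (`season`, `seat`, `load`) and the records in `TICKETS` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    groups = defaultdict(list)
    for r in rows:
        groups[r[attribute]].append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest information gain (the ID3 criterion)."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy records; not taken from any real ticket dataset.
TICKETS = [
    {"season": "holiday", "seat": "hard", "load": "high"},
    {"season": "holiday", "seat": "soft", "load": "high"},
    {"season": "normal", "seat": "hard", "load": "low"},
    {"season": "normal", "seat": "soft", "load": "low"},
]

if __name__ == "__main__":
    print(best_attribute(TICKETS, ["season", "seat"], "load"))
```

On this toy data `season` separates the two load classes perfectly while `seat` carries no information, so `season` would be chosen as the root split. SLIQ and SPRINT use a different split measure and disk-friendly data structures, which this sketch does not attempt to show.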
With the development of information technology in China's railways, a wealth of ticket data has been collected in the China Railway Train Ticket System (CRTTS), a subsystem of the China Railway information system. How to efficiently extract valuable decision information from this huge volume of ticket data at low human and technical cost is becoming an urgent need for railway decision-making departments and a key concern for railway information departments. The rapid development of data mining techniques provides a solid theoretical foundation for further research on railway ticketing analysis, but existing data mining methods have limitations when applied to huge datasets in the railway setting. Generic methods must therefore be improved to fit the application's needs.
    Taking railway passenger traffic as the background and focusing on train ticketing requirements, we study in depth, and carry out many applied experiments on, how to build efficient data analysis models on the ticket datasets in CRTTS. Decision tree induction and concept description in data mining are the theoretical starting points of the study, which aims to build sound and efficient models for analyzing train ticket datasets. First, after a detailed study of current classification algorithms, in particular ID3, SLIQ, and SPRINT, and in light of the requirements of decision analysis and the limitations of the current prediction methods in CRTTS, a new method based on decision tree induction, TTDTPA, is presented. TTDTPA is not restricted by memory size, can extract instructive rules that combine the advantages of prediction and statistics, and is easy to parallelize. It is therefore suitable for supporting decision-makers' multi-level requirements for predictive analysis in CRTTS. Second, to improve the integrated analysis, this research also applies two other data analysis methods to the train ticket data: naive Bayes, and a new method based on the indiscernibility relation. Application experiments show that the latter can effectively extract the characteristics of minority categories within the main class, which compensates for TTDTPA's limitation in this respect. Finally, based on an inductive analysis of these methods and considering the application background, a guiding method for building analysis models on train ticket data is given at the end of the thesis.
    This study explores the application of data mining techniques and provides a good groundwork for further research on data analysis in CRTTS. The improved methods can build efficient predictive models that help decision-makers understand the railway transportation situation and obtain multi-aspect, multi-level analyses of train ticket data.
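The method based on the indiscernibility relation groups records that share the same values on a chosen set of attributes into equivalence classes. A minimal sketch of that idea, under assumptions: records are dictionaries, the attribute names and the data in `RECORDS` are hypothetical, and a class counts as a rule only when all of its members carry the same label (in rough set terms, it lies in the lower approximation of that label). This is not the thesis's algorithm, which the abstract does not specify, but it suggests why even a single-record class for a rare label can surface as a rule.

```python
from collections import defaultdict


def equivalence_classes(rows, attributes):
    """Partition row indices by their values on the given attributes."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in attributes)
        classes[key].append(i)
    return dict(classes)


def certain_rules(rows, attributes, target):
    """Return (attribute values, label, support) for each class with a single label."""
    rules = []
    for key, members in equivalence_classes(rows, attributes).items():
        labels = {rows[i][target] for i in members}
        if len(labels) == 1:  # consistent class: every member agrees on the label
            rules.append((key, labels.pop(), len(members)))
    return rules


# Hypothetical toy records; "low" is the minority label.
RECORDS = [
    {"route": "trunk", "season": "normal", "load": "high"},
    {"route": "trunk", "season": "normal", "load": "high"},
    {"route": "trunk", "season": "holiday", "load": "high"},
    {"route": "branch", "season": "holiday", "load": "low"},
    {"route": "trunk", "season": "holiday", "load": "low"},
]

if __name__ == "__main__":
    for values, label, support in certain_rules(RECORDS, ["route", "season"], "load"):
        print(values, "->", label, "(support", support, ")")
```

Here the class `("trunk", "holiday")` mixes both labels and yields no rule, while the single `("branch", "holiday")` record yields a rule for the minority label `low`.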

