Research on Data Mining Execution Process Model in Grid Environments

Original title (Chinese): 网格环境中数据挖掘执行过程模型的研究
Author: Zhang Yan (张燕)
Degree level: Doctoral
Discipline: Computer Science and Technology
Keywords: Grid ; distributed data mining ; data mining operator ; execution process model ; optimization ; execution engine ; interface specification
Degree year: 2012
Supervisor: Meng Luoming (孟洛明)
Discipline code: 0812
Degree-granting institution: Beijing Jiaotong University (北京交通大学)
Submission date: 2012-01-01
Chair of the defense committee: Qian Depei (钱德沛)

Abstract

With the development of information technology, large amounts of data are produced by a wide range of applications and are stored and accumulated at geographically distributed locations. Extracting useful, hidden knowledge patterns from this accumulated, distributed data is a highly challenging problem. Grid technology enables collaboration and sharing among distributed, heterogeneous resources, so applying data mining on a Grid platform offers an effective way to extract knowledge from large-scale distributed data. However, data mining is a complex process involving many operations and large volumes of data, and combining it with a Grid platform inevitably adds to that complexity. In existing research, a data mining algorithm is treated as an independent whole that appears in applications as a black box, so its execution process is invisible to users and to the execution environment. As a result, algorithms designed for centralized environments cannot be transformed dynamically into distributed data mining processes that reflect the characteristics of the distributed environment, and users cannot flexibly control the execution. In addition, the interfaces for accessing data mining services are independent of those for accessing Grid services, which makes data mining services in a Grid inconvenient to use. Together, these factors prevent data mining from working effectively on Grid platforms. A practical instance of the problem arises in railway freight application systems: given the Railway Freight Grid platform, how can distributed computational resources be fully used to mine, effectively and in depth, the freight data distributed across the railway bureaus in order to support decision making?
     In the approach proposed in this thesis, data mining algorithms are decomposed into execution process models composed of fine-grained data mining operators. These models are then optimized according to the distribution of data and computational resources in the Grid, yielding distributed data mining execution process models that can be executed in the Grid. An execution engine schedules the models onto Grid nodes for execution, and the results are finally delivered to users through unified, Grid-compliant interfaces. On a Grid platform, the approach is used to decompose, optimize and execute the execution process models of typical data mining algorithms: association rule mining, sequential pattern mining, the decision tree (CART) classifier and the naive Bayesian classifier.
     The main work and contributions of this thesis are:
     ·A data mining execution process model composed of fine-grained data mining operators is proposed to describe the execution process of data mining algorithms, turning the algorithms from black boxes into white boxes. Through the model, users, applications and execution environments can clearly see the intermediate steps of an algorithm and the intermediate results each step produces. The operators in the model are evaluated experimentally on simulated data in a centralized environment; the results show that the model exposes the execution of every step of a data mining algorithm.
     ·An optimization algorithm is designed for the Grid environment to transform centralized execution process models into distributed ones that can execute in parallel on multiple Grid nodes. Working layer by layer from the parts to the whole, the algorithm divides optimization into three sub-processes (data localization, global optimization and local optimization); in each sub-process, operators are optimized in turn according to their type and the distribution of the data. Experiments on simulated data on a Grid platform indicate that the distributed models achieve shorter response times and more balanced resource usage than centralized processing.
     ·A data mining execution process model engine (the DMEP engine) is designed to provide the runtime environment for distributed data mining execution process models on the Grid platform. The engine comprises (a) a Grid-based scheduling algorithm that assigns the models to Grid nodes in units of flow chains, and (b) a WSRF-based model execution service and process control service that let users control the execution of a flow. The response time of each flow chain scheduled by the engine is analyzed in experiments on simulated data in the Grid; in an application example deployed on the Railway Freight Grid test bed, a CART decision tree classifier is applied to real freight waybill data to predict major railway freight clients.
     ·An interface specification, WS-DAI-DM, is designed for accessing data mining services in the Grid. Its aim is to integrate data mining services seamlessly with Grid platforms based on the OGSA architecture, so that users can access data mining services in the same way as any other service the Grid provides. An example illustrates how to use WS-DAI-DM, and the specification has been submitted to the Open Grid Forum.
     The thesis closes with a summary and an outlook on future work.
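
To make the idea of decomposing an algorithm into fine-grained operators more concrete, the following is a minimal sketch of frequent itemset mining (the core of association rule mining) written as a chain of small operators. It is an illustration only and not the operator set, model or optimizer defined in the thesis: the function names (`count_support`, `merge_counts`, `filter_frequent`, `generate_candidates`, `frequent_itemsets`) and the sample data are hypothetical. It assumes that each data partition stands for the data held at one Grid node, that support counting can run locally on each partition, and that only the partial counts are merged. The `trace` list records the intermediate result of every level, which is the kind of visibility a white-box execution process is meant to give.

```python
from collections import Counter
from itertools import combinations


def count_support(transactions, candidates):
    """Operator: count, within one partition, the transactions containing each candidate."""
    counts = Counter()
    for transaction in transactions:
        for candidate in candidates:
            if candidate <= transaction:
                counts[candidate] += 1
    return counts


def merge_counts(partial_counts):
    """Operator: sum the partial counts produced by the individual partitions."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def filter_frequent(counts, min_count):
    """Operator: keep the itemsets whose global count reaches the threshold."""
    return {itemset for itemset, n in counts.items() if n >= min_count}


def generate_candidates(frequent, k):
    """Operator: build k-itemsets whose (k-1)-subsets are all frequent."""
    items = sorted({item for itemset in frequent for item in itemset})
    return [
        frozenset(combo)
        for combo in combinations(items, k)
        if all(frozenset(sub) in frequent for sub in combinations(combo, k - 1))
    ]


def frequent_itemsets(partitions, min_count):
    """Run the operator chain level by level and return (itemsets, trace)."""
    items = sorted({i for part in partitions for t in part for i in t})
    candidates = [frozenset([i]) for i in items]
    result, trace, k = {}, [], 1
    while candidates:
        # In a distributed setting, each call below would run where the data lives.
        partials = [count_support(part, candidates) for part in partitions]
        counts = merge_counts(partials)
        frequent = filter_frequent(counts, min_count)
        # Intermediate result of this level: (level, candidates, frequent itemsets).
        trace.append((k, len(candidates), len(frequent)))
        result.update({itemset: counts[itemset] for itemset in frequent})
        k += 1
        candidates = generate_candidates(frequent, k)
    return result, trace


# Hypothetical data held at two nodes.
node_a = [{"a", "b", "c"}, {"a", "b"}]
node_b = [{"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
itemsets, trace = frequent_itemsets([node_a, node_b], min_count=3)
for level, n_candidates, n_frequent in trace:
    print(f"level {level}: {n_candidates} candidates, {n_frequent} frequent")
```

Passing a single partition that holds all transactions gives the centralized execution of the same chain, so the two modes can be compared directly. Shipping partial counts rather than raw transactions is only one possible way to distribute the counting step; the thesis's own optimization sub-processes are not reproduced here.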

引文

[OGSA]I. Foster (Ed), H. Kishimoto (Ed), A. Savva (Ed), D. Berry, A. Djaoui, A. Grimshaw, B.Horn, F. Maciel, R. Subramaniam, J. Treadwell, J. Von Reich. The Open Grid Services Architecture, Version 1.0. Global Grid Forum. GFD-I.030.29 January 2005. http://www.ggf.org/documents/GFD.30.pdf.
    [RFC2119]S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, Internet Engineering Task Force, RFC 2119, http://www.ietf.org/rfc/rfc2119.txt March 1997.
    [WS-DAI]M. Antonioletti, M. Atkinson, S. Laws, S. Malaika, N. W. Paton D. Pearson and G.Riccardi. Web Services Data Access and Integration-The Core (WS-DAI) Specification,Version 1.0. Draft, Global Grid Forum,2006.
    [WS-Security] OASIS Web Services Security 1.0 (WS-Security 2004) standard as of April 6th 2004, http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wss.
    [OGSA Glossary] J. Treadwell, Open Grid Services Architecture Glossary of Terms, GFD-I.044, January 25th 2005. http://www.ggf.org/documents/GFD.44.pdf.
    [XPath]J. Clark and S. DeRose. XML Path Language (XPath), Version 1.0 W3C Rec-ommendation 16 November 1999. See:http://www.w3.org/TR/xpath.
    [XQuery]S. Boag, D. Chamberlin, M. F. Fernandez, D. Florescu, J. Robie and J. Simeon. XQuery 1.0:An XML Query Language, W3C Candidate Recommen-dation 8 June 2006. See:http://www.w3.org/TR/xquery/.
    [XUpdate]A. Laux and L. Martin. XUpdate Working Draft, last release September 14, 2000. See:http://xmldb-org.sourceforge.net/xupdate/xupdate-wd.html.
    [PMML]Data Mining Group (DMG), Predictive Model Markup Language (PMML) standard, version 4.0, See:http://www.dmg.org/v4-0/GeneralStructure.html.
    [SQL]http://www.wiscorp.com/SQLStandards.html
    [CIM]Distributed Management Task Force, Inc. Common Information Model (CIM) Standards, version 2.23.0,22 October 2009. See: http://www.dmtf.org/standards/cim/.
    [1]Gantz J F, Reinsel D, Chute C, et al. The Diverse and Exploding Digital Uni-verse:An Updated Forecast of Worldwide Information Growth Through 2011. Tech-nical report, IDC, March,2008. http://www.emcgrandprix.com/collateral/analyst-reports/ diverse-exploding-digital-universe.pdf.
    [2]Han J, Kamber M. Data Mining:Concepts and Techniques. San Francisco, USA:Morgan Kaufmann Publishers,2006.
    [3]Fayyad U, Piatetsky-Shapiro G, Smyth P. From Data Mining to Knowledge Discovery in Databases. AI magazine,1996,17(3):37.
    [4]Dubitzky W. Data Mining Techniques in Grid Computing Environments. New Yoik, USA: John Wiley & Sons, Ltd, January,2009.
    [5]Foster I, Kesselman C. The Grid:Blueprint for a New Computing Infrastructure. San Fran-cisco, USA:Morgan Kaufmann,2004.
    [6]Ozsu M T, Valduriez P. Principles of Distributed Database Systems (2nd Edition). New Jersey, USA:Prentice Hall,1999.
    [7]铁道部信息技术中心.中国铁路TMIS工程.北京：中国铁道出版社,2005.
    [8]左建丽,陈莉.铁路货票信息综合应用的研究.铁路计算机应用,2005,14(B07)：9-12.
    [9]Atkinson M, Brezany P, Corcho O, et al. ADMIRE White Paper:Motivation, Strategy, Overview and Impact. Technical report, EPCC, University of Edinburgh,2009. www. admire-project.eu/docs/ADMIRE-WhitePaper.pdf.
    [10]中华人民共和国铁道部.铁路信息化总体规划.中华人民共和国铁道部,2005.
    [11]Zhang Y, Wohrer A, Brezany P, et al. Towards China's Railway Freight Transportation In-formation Grid. Proceedings of International Conference on Grid and Visualisation Systems, Croatia,2009.
    [12]邢智明,李红辉,戴钢,et al.基于网格的铁路货运信息综合应用系统研究.华中科技大学学报(自然科学版),2010,38(S1)：115-119.
    [13]铁路货运网格项目组.基于网格的铁路货运信息综合应用系统技术手册.Technical report,2010.
    [14]邢智明,李晓林,李红辉,et al.铁路货运网格数据平台关键技术研究.华中科技大学学报(自然科学版),2011,39(S1)：193-197.
    [15]李晓林.一种松耦合的信息网格体系结构及全生命周期评价[博士学位论文].北京：中国科学院计算技术研究所,2005.
    [16]徐志伟,李晓林,游赣梅。织女星信息网格的体系结构研究.计算机研究与发展,2002,39(8)：948-951.
    [17]Li X, Huang W, Yan B. Vega Information Grid:a suite of toolkit for building information sharing scenario. Proceedings of Grid and Cooperative Computing Workshops 2006. IEEE, 2006.536-542.
    [18]Li X, Huang W, Zha L. The Architecture and Implementation of the Vega Information Grid. International Journal of Web and Grid Services,2007,3(4):462-479.
    [19]铁路货运网格项目组.基于网格的铁路货运信息综合应用系统需求分析文档. Technical report,2010.
    [20]ADMIRE-Architecture for Data Intensive Research. http://www.admire-project.eu/.
    [21]Galea M, Atkinson M, Liew C, et al. ADMIRE Final Report on the ADMIRE Architec-ture. Technical report, EPCC, University of Edinburgh,2011. www.admire-project.eu/docs/ ADMIRE-D2.9-FinalReport.pdf.
    [22]Choinski M, Baxter R, Ostrowski R, et al. ADMIRE Use Case Report. Technical report, EPCC, University of Edinburgh,2009. www.admire-project.eu/docs/ADMIRE-Use_Case_Report.pdf.
    [23]Tan P N, Steinbach M, Kumar V. Introduction to Data Mining. Pearson Addison Wesley Boston,2006.
    [24]Witten I H, Frank E. Data Mining:Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers,2005.
    [25]Hall M, Frank E, Holmes G, et al. The WEKA Data Mining Software:An Update. ACM SIGKDD Explorations Newsletter,2009,11(1):10-18.
    [26]Weka 3-Data Mining with Open Source Machine Learning Software in Java. http://www.cs.waikato.ac.nz/ml/weka/.
    [27]Mierswa I, Wurst M, Klinkenberg R, et al. YALE:Rapid Prototyping for Complex Data Min-ing Tasks. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM,2006.935-940.
    [28]Rapid-I:RapidMiner. http://rapid-i.com/content/view/181/190/.
    [29]Wohrer A, Zhang Y, Dar E H, et al. Unboxing Data Mining via Decomposition in Operators: Towards Macro Optimization and Distribution. Proceedings of International Conference on Knowledge Discovery and Information Retrieval, Madeira, Portugal,2009.
    [30]Zhang Y, Li H, Wohrer A, et al. Decomposing Data Mining by a Process-oriented Execution Plan. Proceedings of 2010 International Conference on Artificial Intelligence and Computa-tional Intelligence:Part I. Springer,2010.97-106.
    [31]Zhang Y, Meng L, Liu F, et al. Decomposition of Association Rules Mining Process and Analysis of the Intermediate Results. Journal of Computational Information Systems,2011, 7(10):3606-3613.
    [32]Zaniolo C. Mining Databases and Data Streams with Query Languages and Rules. Proceedings of International Workshop on Knowledge Discovery in Inductive Databases. Springer,2005. 24-37.
    [33]De Raedt L. A Perspective on Inductive Databases. ACM SIGKDD Explorations Newsletter, 2002,4(2):69-77.
    [34]Han J, Fu Y, Wang W, et al. DMQL:A Data Mining Query Language for Relational Databases. Proceedings of Workshop on Data Mining and Knowledge Discovery,1996.27-33.
    [35]Meo R, Psaila G, Ceri S. An Extension to SQL for Mining Association Rules. Data Mining and Knowledge Discovery,1998,2(2):195-224.
    [36]Imielinski T, Virmani A. MSQL:A Query Language for Database Mining. Data Mining and Knowledge Discovery,1999,3(4):373-408.
    [37]Braga D, Campi A, Klemettinen M, et al. Mining Association Rules from XML Data. Data Warehousing and Knowledge Discovery,2002.133-156.
    [38]Netz A, Chaudhuri S, Fayyad U, et al. Integrating Data Mining with SQL Databases:OLE DB for Data Mining. Proceedings of the 17th International Conference on Data Engineering. IEEE,2002.379-387.
    [39]Romei A, Ruggieri S, Turini F. KDDML:A Middleware Language and System for Knowledge Discovery in Databases. Data & Knowledge Engineering,2006,57(2):179-220.
    [40]Bueti G, Congiusta A, Talia D. Developing Distributed Data Dining Applications in Knowl-edge Grid Framework. High Performance Computing for Computational Science,2005.156-169.
    [41]Agrawal R, Imieliriski T, Swami A. Mining Association Rules Between Sets of Items in Large Databases. Proceedings of ACM SIGMOD International Conference on Management of Data, New York, USA,1993.207-216.
    [42]Agrawal R, Srikant R. Fast Algorithms for Mining Association Rules. Proceedings of the 20th International Conference on Very Large Data Bases, volume 1215. Citeseer,1994.487-499.
    [43]Yuan X. Data mining query language design and implementation[M]. Hong Kong:The Chinese University of Hong Kong,2003.
    [44]Agrawal R, Srikant R. Mining Sequential Patterns. Proceedings of the 11 th International Conference on Data Engineering, Washington, DC, USA:IEEE,1995.3-14.
    [45]Srikant R, Agrawal R. Mining Sequential Patterns:Generalizations and Performance Improve-ments. Advances in Database Technology,1996.1-17.
    [46]Wu X, Kumar V, Ross Quinlan J, et al. Top 10 Algorithms in Data Mining. Knowledge and Information Systems,2008,14(1):1-37.
    [47]Breiman L. Classification and Regression Trees. Chapman & Hall/CRC,1984.
    [48]Lewis R J. An Introduction to Classification and Regression Tree (CART) Analysis. Proceed-ings of 2000 Annual Meeting of the Society for Academic Emergency Medicine, Francisco, California. Citeseer.
    [49]魏宗舒.概率论与数理统计教程.高等教育出版社,2008.
    [50]Domingos P, Pazzani M. On The Optimality of The Simple Bayesian Classifier Under Zero-One Loss. Machine learning,1997,29(2):103-130.
    [51]John G, Langley P. Estimating continuous distributions in Bayesian classifiers. Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, volume 1. Citeseer,1995. 338-345.
    [52]刘明亮,李雄飞,孙涛,et al.数据挖掘技术标准综述.计算机科学,2008,35(6)：5-10.
    [53]Anand S, Grobelnik M, Herrmann F, et al. Knowledge Discovery Standards. Artificial Intelli-gence Review,2007,27(1):21-56.
    [54]Grossman R. KDD-2003 Workshop on Data Mining Standards, Services and Platforms (DM-SSP 03). ACM SIGKDD Explorations Newsletter,2003,5(2):197-197.
    [55]Kadav A, Kawale J, Mitra P. Data Mining Standards. Technical report, Indian Institute of Technology Kanpur,2003. http://www.datamininggrid.org/cgi-bin/works/Show?standard01.
    [56]CRISP-DM-CRoss Industry Standard Process for Data Mining, http://www.crisp-dm.org/.
    [57]Chapman P, Clinton J, Kerber R, et al. CRISP-DM:Step-by-Step Data Mining Guide. Techni-cal report, CRISP-DM Consortium,2000. http://www.crisp-dm.org/CRISPWP-0800.pdf.
    [58]Shearer C. The CRISP-DM model:The New Blueprint for Data Mining. Journal of Data Warehousing,2000,5(4):13-22.
    [59]DMG-Data Mining Group, http://www.dmg.org/.
    [60]Grossman R L, Hornick M F, Meyer G. Data Mining Standards Initiatives. Communications of the ACM,2002,45(8):59-61.
    [61]Guazzelli A, Zeller M, Lin W C, et al. PMML:An Open Standard for Sharing Models. The R Journal,2009,1(1):60-65.
    [62]Pechter R. Data Mining Standards, Services and Platforms 2005 Workshop Report. SIGKDD Explorations Newsletter,2005,7:137-138.
    [63]Hornick M F, Marcade E, Venkayala S. Java Data Mining:Strategy, Standard, and Practice:A Practical Guide for architecture, design, and implementation. Morgan Kaufmann,2006.
    [64]Java Specification Request 73:Java Data Mining (JDM) 1.0, July,2004. http://www.jcp.org/ en/jsr/detail?id=73.
    [65]Java Specification Request 247:Java Data Mining (JDM) 2.0, September,2006. http://www. jcp.org/en/jsr/detail?id=247.
    [66]ODM-Oracle Data Mining, http://www.oracle.com/technetwork/database/opticns/odm/index.html.
    [67]OLE DB for Data Mining Specification 1.0 Final. Technical report, Microsoft Corporation. http://www.microsoft.com/download/en/details.aspx?id= 17438.
    [68]SQL Server Analysis Services (SSAS). http://msdn.microsoft.ccm/en-us/library/ ms175609(v=sq1.90).aspx.
    [69]Melomed E, Gorbach I, Berger A, et al. Microsoft SQL Server 2005 Analysis Services. Sams, 2006.
    [70]Foster I, Kesselman C, Nick J M, et al. The Physiology of the Grid:An Open Grid Services Architecture for Distributed Systems Integration. Proceedings of Open Grid Service Infras-tructure WG, Global Grid Forum, volume 22. Edinburgh,2002.1-5.
    [71]Foster I. What Is the Grid? A Three Point Checklist. GRID today,2002,1 (6):32-36.
    [72]Chervenak A, Foster I, Kesselman C, et al. The Data Grid:Towards an Architecture for The Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications,2000,23(3):187-200.
    [73]Hoschek W, Jaen-Martinez J, Samar A, et al. Data Management in an International Data Grid Project. Grid Computing-GRID 2000,2000.333-361.
    [74]Krauter K, Buyya R, Maheswaran M. A Taxonomy and Survey of Grid Resource Management Systems for Distributed Computing. Software:Practice and Experience,2002,32(2):135-164.
    [75]Weissman J, et al. The Service Grid:Supporting Scalable Heterogeneous Services in Wide-area Networks. Proceedings of the 2001 Symposium on Applications and the Internet. IEEE, 2001.95.
    [76]Foster I, Kesselman C, Tuecke S. The Anatomy of the Grid:Enabling Scalable Virtual Organi-zations. International Journal of High Performance Computing Applications,2001,15(3):200.
    [77]樊宁.网格体系结构概述. Technical report, IBM中国软件开发中心,2006. http://www.ibm. com/developerworks/cn/grid/gr-fann/index.html.
    [78]都志辉,陈渝,刘鹏,et al.以服务为中心的网格体系结构OGSA.计算机科学,2003,30(007)：26-29.
    [79]Foster I, Kesselman C, Nick J M, et al. Grid Services for Distributed System Integration. Computer,2002.37-46.
    [80]The Globus Alliance, http://www.globus.org/.
    [81]Foster I. Globus Toolkit version 4:Software for Service-Oriented Systems. Journal of Com-puter Science and Technology,2006,21 (4):513-520.
    [82]Antonioletti M, Atkinson M, Baxter R, et al. The Design and Implementation of Grid Database Services in OGSA-DAI. Concurrency and Computation:Practice and Experience,2005,17(2-4):357-376.
    [83]OGSA-DAI:Open Grid Services Architecture-Database Access and Integration. http://www. ogsadai.org.uk/.
    [84]Karasavvas K, Antonioletti M, Atkinson M, et al. Introduction to OGSA-DAI Services. Sci-entific Applications of Grid Computing,2005.1-12.
    [85]中国国家网格软件CNGrid GOS.中国科技成果,2009,10(11).
    [86]Workflow Management Coalition http://www.wfmc.org.
    [87]Workflow Management Coalition Terminology and Glossary. Technical report, Workflow Man-agement Coalition,1999. http://www.wfmc.org/standards/docs/TC-1011_term_glossary_v3. pdf.
    [88]Deelman E, Gil Y. Workshop on the Challenges of Scientific Workflows. Technical report, Information Sciences Institute, University of Southern California,2006. https://confluence. pegasus.isi.edu/display/workshop06/Home.
    [89]Gil Y, Deelman E, Ellisman M, et al. Examining the Challenges of Scientific Workflows. Computer,2007,40(12):24-32.
    [90]Barker A, Van Hemert J. Scientific Workflow:A Survey and Research Directions. Proceedings of the 7th International Conference on Parallel Processing and Applied Mathematics. Springer, 2007.746-753.
    [91]OASIS-Organization for the Advancement of Structured Information Standards. http://www. oasis-open.org/.
    [92]Leymann F, et al. Web Services Flow Language (WSFL 1.0),2001.
    [93]Thatte S. XLANG:Web Services for Business Process Design,2001.
    [94]Van Der Aalst W, Ter Hofstede A. YAWL:Yet Another Workflow Language. Information Systems,2005,30(4):245-275.
    [95]Aalst W, Aldred L, Dumas M, et al. Design and Implementation of the YAWL System. Pro-ceedings of Advanced Information Systems Engineering. Springer,2004.281-305.
    [96]Kavantzas N, Burdett D, Ritzinger G, et al. Web Services Choreography Descrip-tion Language Version 1.0. Technical report, W3C,2004. http://www.w3.org/TR/2004/ WD-ws-cdl-10-20041217/.
    [97]Ross-Talbot S, Fletcher T. Web Services Choreography Description Language:Primer. Tech-nical report, W3C,2006. http://www.w3.org/TR/ws-cdl-10-primer/.
    [98]Barros A, Dumas M, Oaks P. A Critical Overview of the Web Services Choreography Descrip-tion Language. BPTrends Newsletter,2005,3.
    [99]Oinn T, Addis M, Ferris J, et al. Taverna:a Tool for the Composition and Enactment of Bioinformatics Workflows. Bioinformatics,2004,20(17):3045.
    [100]Oinn T, Addis M, Ferris J, et al. Delivering Web Service Coordination Capability to Users. Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters. ACM,2004.438-439.
    [101]Aalst W. Patterns and XPDL:A Critical Evaluation of the XML Process Definition Language. BPM Center Report BPM-03-09, BPMcenter. org,2003..
    [102]XML Process Definition Language (XPDL). http://www.wfmc.org/xpdl.html.
    [103]Taylor I. Workflows for e-Science:Scientific Workflows for Grids. Springer,2007.
    [104]The Kepler Project. https://kepler-project.org/.
    [105]Ludascher B, Altintas I, Bowers S, et al. Scientific Process Automation and Workflow Man-agement. Scientific Data Management:Challenges, Existing Technology, and Deployment, Computational Science Series,2009.476-508.
    [106]Ludascher B, Altintas I, Berkley C, et al. Scientific Workflow Management and the Kepler System. Concurrency and Computation:Practice and Experience,2006,18(10):1039-1065.
    [107]Triana-Open Source Problem Solving Environment. http://www.trianacode.org/.
    [108]Taylor I, Shields M, Wang I, et al. Distributed P2P Computing within Triana:A Galaxy Visu-alization Test Case. Proceedings of International Parallel and Distributed Processing Sympo-sium. IEEE,2003.
    [109]Taylor I, Shields M, Wang I, et al. The Triana Workflow Environment:Architecture and Ap-plications. Workflows for e-Scicnce,2007.320-339.
    [110]Majithia S, Shields M, Taylor I, et al. Triana:A Graphical Web Service Composition and Execution Toolkit. Proceedings of International Conference on Web Services. IEEE,2004. 514-521.
    [111]Dcelman E, Singh G, Su M, et al. Pegasus:A Framework for Mapping Complex Scientific Workflows onto Distributed Systems. Scientific Programming,2005,13(3):219-237.
    [112]Hunt C, Ferner C, Brown J. JXPL:An XML-based Scripting Language for Workflow Execu-tion in a Grid Environment. Proceedings of IEEE SoutheastCon 2005. IEEE.345-350.
    [113]Brown J, Ferner C, Hudson T, et al. Gridnexus:A Grid Services Scientific Workflow System. International Journal of Computer Information Science,2005,6(2):72-82.
    [114]Syed J, Ghanem M, Guo Y. Discovery Processes:Representation and Re-use. Proceedings of UK e-Science All Hands Meeting,2002.
    [115]Rowe A, Kalaitzopoulos D, Osmond M, et al. The Discovery Net System for High Throughput Bioinformatics. Bioinformatics,2003,19(S1):225.
    [116]Janciak I, Kloner C, Brezany P. Workflow Enactment Engine for WSRF-compliant Services Orchestration. Proceedings of th 9th International Conference on Grid Computing. IEEE,2008. 1-8.
    [117]Park B H, Kargupta H. Distributed Data Mining:Algorithms, Systems, and Applications. Proceedings of Data Mining Handbook,2002.341-358.
    [118]Zaki M J. Parallel and Distributed Data Mining:An Introduction. Large-Scale Parallel Data Mining,2000.804-804.
    [119]DataMiningGrid:Data Mining Tools and Services for Grid Computing Environments, http: //www.datamininggrid.org/.
    [120]Stankovski V, Swain M, Kravtsov V, et al. Grid-enabling Data Mining Applications with DataMiningGrid:An Architectural Perspective. Future Generation Computer Systems,2008, 24(4):259-279.
    [121]Cannataro M, Talia D. The Knowledge Grid. Communications of the ACM,2003,46(1):89-93.
    [122]Cannataro M, Congiusta A, Mastroianni C, et al. Grid-Based Data Mining and Knowledge Discovery. Intelligent Technologies for Information Analysis,2004.19.
    [123]GridMiner-Intelligent Grid Solutions. http://www.gridminer.org/.
    [124]Hofer J, Brezany P. Digidt:Distributed Classifier Construction in the Grid Data Mining Frame-work GridMiner-Core. Proceedings of ICDM Workshop on Data Mining and the Grid,2004.
    [125]Shafer J, Agrawal R, Mehta M. SPRINT:A Scalable Parallel Classifier for Data Mining. Proceedings of the 22nd International Conference on Very Large Data Bases. Citeseer,1996. 544-555.
    [126]Algorithm Development and Mining System. http://datamining.itsc.uah.edu/adam/.
    [127]Hinke T H, Novotny J. Data Mining on NASA's Information Power Grid. Proceedings of The 9th International Symposium on High-Performance Distributed Computing. IEEE,2000. 292-293.
    [128]Weka for Web Service. http://weka4ws.wordpress.com/.
    [129]Talia D, Trunfio P, Verta O. Weka4WS:a WSRF-enabled Weka Toolkit for Distributed Data Mining on Grids. Knowledge Discovery in Databases (PKDD 2005),2005.309-320.
    [130]Perez M S, Sanchez A, Herrero P, et al. Adapting The Weka Data Mining Toolkit to a Grid Based Environment. Advances in Web Intelligence,2005. 492-497.
    [131]The Knowledge Grid. http://www.knowledgegrid.net/.
    [132]Zhuge H. China's e-Science Knowledge Grid Environment. Intelligent Systems,2005, 19(1):13-17.
    [133]Zhuge H. The Knowledge Grid Environment. Intelligent Systems,2008,23(6):63-71.
    [134]Llora X, Acs B, Auvil L S, et al. Meandre:Semantic-driven Data-Intensive Flows in the Clouds. Proceedings of the 4th International Conference on e-Science. IEEE,2008.238-245.
    [135]Watson P, Lord P, Gibson F, et al. Cloud Computing for e-Science with CARMEN. Proceedings of the 2nd Conference on Iberian Grid Infrastructure. Netbiblo,2008.3-14.
    [136]Grossman R, Gu Y. Data Mining Using High Performance Data Clouds:Experimental Studies Using Sector and Sphere. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM,2008.920-927.
    [137]Ioannidis Y E. Query Optimization. ACM Computing Surveys (CSUR),1996,28(1):121-123.
    [138]Johnson T, Lakshmanan L V S, Ng R T. The 3W Model and Algebra for Unified Data Min-ing. Proceedings of the 26th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc.,2000.21-32.
    [139]Geist I, Sattler K U. Towards Data Mining Operators in Database Systems:Algebra and Im-plementation. Proceedings of University of Manchester, Department of Computer Science. Citeseer,2004.
    [140]RapidMiner 4.4-User Guide, Operator Reference, Developer Tutorial. Technical report, Rapid-Ⅰ GmbH,2009. http://rapid-i.com/content/view/26/84/lang,en/.
    [141]Gopalan R P, Nuruddin T, Sucahyo Y G. Algebraic Specification of Association Rule Queries. Proceedings of the 4th Conference on Data Mining and Knowledge Discovery:Theory, Tools, and Technology,2003.
    [142]Houtsma M, Swami A. Set-Oriented Mining for Association Rules in Relational Databases. Proceedings of the 11th International Conference on Data Engineering,1995.25-33.
    [143]Meo R, Psaila G, Ccri S. An Extension to SQL for Mining Association Rule?. Data Mining Knowledge Discovery,1998,2(2):195-224.
    [144]Botta M, Boulicaut J F, Masson C, et al. Query Languages Supporting Descriptive Rule Min-ing:a Comparative Study. Database Support for Data Mining Applications,2004.24-51.
    [145]Kusiak A. Decomposition in Data Mining:an Industrial Case Study. IEEE Transactions on Electronics Packaging Manufacturing,2000,23(4):345-353.
    [146]Maimon O, Rokach L. Decomposition Methodology for Knowledge Discevery and Data Min-ing. Data Mining and Knowledge Discovery Handbook,2005.981-1003.
    [147]Bernstein A, Provost F, Hill S. Toward Intelligent Assistance for a Data Mining Process:An Ontology-based Approach for Cost-sensitive Classification. IEEE Transactions on Knowledge and Data Engineering,2005.503-518.
    [148]Tsangaris M, Kakaletris G, Kllapi H, et al. Dataflow Processing and Optimization on Grid and Cloud Infrastructures. Data Engineering,2009.67.
    [149]Isard M, Budiu M, Yu Y, et al. Dryad:Distributed Data-Parallel Programs from Sequential Building Blocks. Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems. ACM,2007.59-72.
    [150]Dean J, Ghemawat S. MapReduce:Simplified Data Processing on Large Clusters. Communi-cations of the ACM,2008,51 (1):107-113.
    [151]Yang X, Liu Z, Fu Y. MapReduce as a Programming Model for Association Rules Algorithm on Hadoop. Proceedings of the 3rd International Conference on Information Sciences and Interaction Sciences. IEEE,2010.99-102.
    [152]Ordonez C, Garcfa-Garcfa J. Database Systems Rescarch on Data Mining. Proceedings of ACM SIGMOD International Conference on Management of Data. ACM,2010.1253-1254.
    [153]Rajamani K, Cox A, Iyer B, et al. Efficient Mining for Association Rules with Relational Database Systems. Proceedings of International Symposium on Database Engineering and Applications. IEEE,1999.148-155.
    [154]王珊,萨师煊.数据库系统概论(第四版),中国北京：高等教育出版社,2008.
    [155]Han J, Pei J, Yin Y, et al. Mining Frequent Patterns Without Candidate Generation:A Frequent-Pattern Tree Approach. Data Mining and Knowledge Discovery,2004,8(1):53-87.
    [156]Pei J, Han J, Mortazavi-Asl B, et al. PrefixSpan:Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. Proceedings of Conference on Computer Communications and Networks. IEEE,2001.215.
    [157]Agrawal R, Shafer J. Parallel Mining of Association Rules:Design, Implementation, and Experience. IEEE Transactions on Knowledge and Data Engineering,1996,8:962-969.
    [158]Mehta M, Agrawal R, Rissanen J. SLIQ:A Fast Scalable Classifier for Data Mining. Advances in Database Technology (EDBT'96),1996.18-32.
    [159]Graefe G. Encapsulation of Parallelism in the Volcano Query Processing System. ACM SIG-MOD Record,1990,19(2):102-111.
    [160]Mackert L. R* Optimizer Validation and Performance Evaluation for Distributed Queries. Proceedings of Readings in Database Systems. Morgan Kaufmann Publishers,1988.219-229.
    [161]Bernstein P, Goodman N, Wong E, et al. Query Processing in a System for Distributed Databases (SDD-1). ACM Transactions on Database Systems (TODS),1981,6(4):625.
    [162]Epstein R, Stonebraker M, Wong E. Distributed Query Processing in a Relational Database System. Proceedings of ACM SIGMOD International Conference on Management of Data. ACM,1978.180.
    [163]Gounaris A, Sakellariou R, Paton N, et al. A Novel Approach to Resource Scheduling for Parallel Query Processing on Computational Grids. Distributed and Parallel Databases,2006,19(2):87-106.
    [164]Grid Resource Allocation and Management. http://globus.org/toolkit/docs/4.0/execution/.
    [165]The Dynamically-Updated Request Online Coallocator (DUROC). http://globus.org/toolkit/docs/2.4/duroc/.
    [166]Lynden S, Mukherjee A, Hume A, et al. The Design and Implementation of OGSA-DQP:A Service-based Distributed Query Processor. Future Generation Computer Systems,2009,25(3):224-236.
    [167]Alpdemir M, Mukherjee A, Gounaris A, et al. OGSA-DQP:A Service for Distributed Querying on the Grid. Advances in Database Technology (EDBT 2004),2004.3923-3923.
    [168]Dean J, Ghemawat S. MapReduce:Simplified Data Processing on Large Clusters. Communications of the ACM,2008,51(1):107-113.
    [169]Cardona K, Secretan J, Georgiopoulos M, et al. A Grid Based System for Data Mining Using MapReduce. Technical report, Technical Report TR-2007-02, AMALTHEA,2007.
    [170]Zhang Y, Meng L, Liu F, et al. Towards the Optimization of Data Mining Execution Process in Distributed Environments. Journal of Computational Information Systems,2011,7(8):2931-2939.
    [171]The WS-Resource Framework. http://www.globus.org/wsrf/.
    [172]Modeling Stateful Resources with Web Service. Technical report, Globus Alliance,2004. http://www.ibm.com/developerworks/library/ws-resource/ws-modelingresources.pdf.
    [173]The WS-Resource Framework Version 1.0. Technical report, Globus Alliance,2004. http://www.globus.org/wsrf/specs/ws-wsrf.pdf.
    [174]应岚岚.铁路重点客户管理的分析与探讨.铁道货运,2009,11：32-36.
    [175]钟雁,郭雨松.数据挖掘技术在铁路货运客户细分中的应用.北京交通大学学报(自然科学版),2008,32(003)：25-29.
    [176]铁路货运大客户运输服务管理办法.http://www.12306.cn/mormhweb/hyfw/hygfwj/gfxwj/201001/t20100112_1455.html.
    [177]肖秋根,王成友,梁华,等.C4.5算法在列车轨道故障检测上的应用研究.计算机技术与发展,2006,16(4)：76-78.
    [178]李方.货票信息管理系统的研究.铁路计算机应用,2002,11(6)：11-14.
    [179]李方.货票信息管理系统TMIS系列之二.铁道知识,2000,(06)：24-25.
    [180]Antonioletti M, Krause A, Paton N W, et al. The WS-DAI Family of Specifications for Web Service Data Access and Integration. SIGMOD Record,2006,35(1):48-55.
    [181]Antonioletti M, Atkinson M, Krause A, et al. Web Services Data Access and Integration-The Core (WS-DAI) Specification Version 1.0, July,2006.
    [182]Antonioletti M, Collins B, Krause A, et al. Web Services Data Access and Integration-The Relational Realization (WS-DAIR) Specification Version 1.0, July,2006.
    [183]Antonioletti M, Hastings S, Krause A, et al. Web Services Data Access and Integration-The XML Realization (WS-DAIX) Specification Version 1.0, August,2006.
    [184]Antonioletti M, Aranda C B, Corcho O, et al. WS-DAI RDF(S) Realization:Introduction, Motivational Use Cases and Terminologies, October,2009.
    [185]Gutierrez M E, Gomez-Perez A. Ontology Access Provisioning in Grid Environments. Proceedings of Semantic Grid:The Convergence of Technologies, Dagstuhl, Germany:Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI),2005.
    [186]Gutierrez M E, Gomez-Perez A, Corcho O, et al. WS-DAIOnt-RDF(S):Ontology Access Provision in Grids. Proceedings of The 8th International Conference on Grid Computing, Washington, DC, USA,2007. IEEE.89-96.
    [187]Theocharopoulos E, Jackson M. OGSA-DAI WS-DAIR Version 1.0, December,2008.
    [188]Jackson M, Theocharopoulos E. OGSA-DAI WS-DAIX Version 1.0, December,2008.
    [189]Melton J, Eisenberg A. SQL Multimedia and Application Packages (SQL/MM). ACM SIGMOD Record,2001,30(4):97-102.
    [190]Talia D, Trunfio P, Verta O. The Weka4WS Framework for Distributed Data Mining in Service-oriented Grids. Concurrency and Computation:Practice and Experience,2008,20(16):1933-1951.
    [191]Zhang Y, Meng L, Li H, et al. WS-DAI-DM:An Interface Specification for Data Mining in Grid Environments. Journal of Software,2011,6(6):953-960.
    [192]Common Information Model (CIM) Standards, Version 2.23.0, October,2009.
