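Research on Knowledge Discovery Reliability in Traditional Chinese Medicine (中医药知识发现可靠性研究)

English title: Research on Knowledge Discovery Reliability in Traditional Chinese Medicine
Author: Yi Feng (封毅)
Thesis level: Doctoral
Discipline: Computer Science and Technology
Chinese keywords: 知识发现 ; 数据挖掘 ; 可靠性 ; 知识发现可靠性 ; 中医药知识发现 ; 数据质量
English keywords: Knowledge Discovery ; Data Mining ; Reliability ; Knowledge Discovery Reliability ; Knowledge Discovery in Traditional Chinese Medicine ; Data Quality
Degree year: 2008
Advisor: Zhaohui Wu (吴朝晖)
Discipline code: 081202
Degree-granting institution: Zhejiang University
Submission date: 2008-09-15
Defense committee chair: Yong Yu (俞勇)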

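Abstract

Knowledge discovery reliability is an important but easily overlooked topic in the field of knowledge discovery. With the wide application of knowledge discovery and data mining techniques, one question has gradually drawn attention: under what conditions is knowledge discovery reliable, or, put differently, under what conditions is the discovered knowledge reliable? Recent research on knowledge discovery reliability has mostly focused on reliability under one specific data mining model. For the reliability themes shared across models, such as data quality and evaluation methods, no systematic study exists so far. A staged, systematic summary of these common themes has become an urgent need in knowledge discovery reliability research.
     Among the fields where knowledge discovery is applied, one has a particular need for research on knowledge discovery reliability: Traditional Chinese Medicine (TCM). TCM, an important cultural asset and scholarly achievement of the Chinese nation, has in recent years faced challenges to its survival and development. How to turn this challenge into an opportunity, and use knowledge discovery to advance TCM by leaps, has become an important task for TCM researchers. Recent TCM informatization work has created favorable conditions for knowledge discovery. However, TCM data is strongly natural-language in character, rich in meaning, and diverse in its forms of expression, and it still faces considerable data quality problems. Knowledge discovery on data with these characteristics therefore requires more attention to reliability than in other fields.
     Against this background, this thesis centers on knowledge discovery reliability in TCM. It examines reliability factors at each stage of the knowledge discovery life cycle and proposes the knowledge discovery reliability framework PBRF-KD. Addressing the more prominent reliability problems in TCM knowledge discovery, it focuses on three kinds of factors: structural, representational, and trust-related. The work and contributions of this thesis are as follows:
     1) A process-based knowledge discovery reliability framework
     Because existing research on knowledge discovery reliability is tied to specific models, the thesis proposes PBRF-KD, a reliability framework that is independent of model and application. The framework takes a process-based approach to organizing the stages of the whole knowledge discovery workflow and their reliability factors, and identifies seven reliability-related factors. It gives knowledge discovery projects a complete reliability-related blueprint.
     2) Optimization methods for structure-related reliability factors
     The thesis analyzes the structure-related reliability factors in TCM knowledge discovery, chiefly data completeness. For completeness problems in text fields, it proposes a method for imputing missing TCM text fields based on an order-semisensitive measure. For missing category labels in TCM literature, it proposes a multi-label text categorization method based on M-Similarity.
     3) Optimization methods for representation-related reliability factors
     The thesis analyzes the representation-related reliability factors in TCM knowledge discovery, namely representation granularity and representation consistency. For representation granularity, it proposes a rule-based method for subdividing granularity. For representation consistency, it proposes an ontology-based method for making representations consistent. Together these methods help improve representation-related reliability in TCM.
     4) Optimization methods for trust-related reliability factors
     The thesis analyzes the trust-related reliability factors in TCM knowledge discovery, chiefly data trustworthiness. For the data trustworthiness problems specific to TCM, it proposes one trustworthiness measure based on acceptance in the historical literature and another based on popularity on the Internet. Building on these two measures, it proposes a weighted frequent pattern mining algorithm based on data trustworthiness, which obtained meaningful results on datasets of Xiaoke (wasting-thirst) formulas and spleen-stomach formulas. Together these methods help improve trust-related reliability in TCM.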
Reliability is a key issue in knowledge discovery, yet it has not been well explored. The wide application of knowledge discovery technology raises a significant question for the community: under which conditions is the discovery process reliable, or, equivalently, under which conditions is the discovered knowledge reliable? Most existing work considers knowledge discovery reliability (KDR) in the context of a specific data mining model. However, many reliability issues, such as data quality and evaluation methods, are common to different models. Systematic research on these common issues is therefore needed.
     Among the application areas of knowledge discovery, one field particularly needs attention to KDR: Traditional Chinese Medicine (TCM). A complete medical knowledge system that has played an indispensable role in health care for the Chinese people for several thousand years, TCM has faced great development pressure in recent years. As a methodology capable of extracting useful patterns from data, knowledge discovery is expected to help promote the development of TCM. However, TCM data is strongly natural-language in character and uses varied expression patterns, and its data quality is still unsatisfactory. Knowledge discovery on data with such features requires more careful consideration of KDR.
     This thesis studies KDR in the TCM field. It provides a systematic discussion of reliability issues across the whole life cycle of knowledge discovery, together with a process-based KDR framework named PBRF-KD. It then focuses on three important types of KDR factors in TCM practice: structural factors, representational factors, and trustworthiness-related factors. The major work and contributions of this thesis are as follows:
     First, we propose a process-based KDR framework named PBRF-KD. As the first framework to study KDR from the process perspective, PBRF-KD provides a unified view and an effective approach for analyzing and estimating KDR. Because it is model-independent, data analysts in various domains can apply it to assess KDR. Its six steps and seven main factors give a traceable way to analyze the reliability of knowledge discovery, and can serve as a blueprint for analyzing KDR throughout the knowledge discovery process.
     Second, we present the key structural factors of KDR in TCM and propose a series of methods to optimize them. Data completeness is analyzed as the major structural factor in TCM. For missing values in textual attributes of TCM data, we propose an imputation method based on an order-semisensitive similarity named M-Similarity. For missing labels in medical literature, we propose a multi-label text categorization approach based on M-Similarity.
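
As a rough illustration of imputing a missing textual field from the most similar complete record, the following is a minimal sketch rather than the thesis's method. The abstract does not define M-Similarity, so a plain Jaccard similarity over character bigrams is swapped in as a stand-in; the function names, field names, and toy records are all hypothetical.

```python
def bigrams(text):
    """Return the set of character bigrams of a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Jaccard similarity over character bigrams (stand-in, NOT M-Similarity)."""
    sa, sb = bigrams(a), bigrams(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def impute_nearest(records, key_field, target_field, similarity=jaccard):
    """Fill missing target_field values from the most similar complete record."""
    donors = [r for r in records if r.get(target_field)]
    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the caller's records are not mutated
        if not rec.get(target_field) and donors:
            best = max(donors,
                       key=lambda d: similarity(rec[key_field], d[key_field]))
            rec[target_field] = best[target_field]
        filled.append(rec)
    return filled


# Made-up toy records; the third one lacks its "efficacy" text.
records = [
    {"ingredients": "ginseng, astragalus, licorice", "efficacy": "tonify qi"},
    {"ingredients": "coptis, scutellaria, phellodendron", "efficacy": "clear heat"},
    {"ingredients": "ginseng, astragalus, atractylodes", "efficacy": None},
]
completed = impute_nearest(records, "ingredients", "efficacy")
print(completed[2]["efficacy"])
```

Passing a different `similarity` function is the only change needed to try another measure, which is where an order-aware similarity would plug in.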
     Third, we present the key representational factors of KDR in TCM and propose a series of methods to optimize them. The major representational factors in TCM are representation granularity and representation consistency. For representation granularity, we propose a rule-based method for subdividing granularity. For representation consistency, we propose an ontology-based method to resolve representation inconsistency.
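
The two representational steps can be pictured with a small sketch, under loud assumptions: the abstract gives neither the subdivision rules nor the ontology, so the delimiter rule and the tiny synonym table below are hypothetical stand-ins chosen only to show the shape of the idea.

```python
import re

# Hypothetical synonym table standing in for an ontology:
# each variant name maps to one preferred term.
PREFERRED_TERM = {
    "gan cao": "licorice root",
    "liquorice": "licorice root",
    "licorice root": "licorice root",
    "ren shen": "ginseng",
    "ginseng": "ginseng",
}


def split_terms(expression):
    """Rule-based subdivision: break a composite expression at delimiters."""
    parts = re.split(r"[,;/、，；]|\band\b", expression)
    return [p.strip().lower() for p in parts if p.strip()]


def normalize(expression):
    """Subdivide an expression, then map each part to its preferred term."""
    return [PREFERRED_TERM.get(term, term) for term in split_terms(expression)]


result = normalize("Ren Shen, Liquorice and Poria")
print(result)
```

Terms missing from the table pass through unchanged, so unknown names are kept for later review instead of being silently dropped.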
     Lastly, we present the key trustworthiness-related factors of KDR in TCM and propose a series of methods to optimize trustworthiness. For the data trustworthiness issue in the TCM field, we propose one trustworthiness evaluation method based on historical acceptance in the literature and another based on popularity on the Web. Using these two methods to generate weights for frequent pattern mining, we propose a weighted frequent pattern mining method based on data trustworthiness, and obtain meaningful results on two TCM formula datasets.
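
To make the idea of trust-weighted frequent pattern mining concrete, here is a minimal sketch, not the thesis's algorithm. It assumes one common definition of weighted support (the share of total weight carried by the transactions that contain an itemset); the abstract does not state the actual formula, and the weights, herb names, and function names below are hypothetical.

```python
from itertools import combinations


def weighted_support(itemset, transactions):
    """Share of total trust weight carried by transactions containing itemset."""
    total = sum(weight for _, weight in transactions)
    hit = sum(weight for items, weight in transactions if itemset <= items)
    return hit / total if total else 0.0


def weighted_frequent_itemsets(transactions, min_support, max_size=3):
    """Level-wise search for itemsets whose weighted support >= min_support."""
    items = sorted({item for tx_items, _ in transactions for item in tx_items})
    frequent = {}
    current = [frozenset([item]) for item in items]
    size = 1
    while current and size <= max_size:
        level = {}
        for candidate in current:
            support = weighted_support(candidate, transactions)
            if support >= min_support:
                level[candidate] = support
        frequent.update(level)
        # Next candidates: unions of surviving sets that grow by exactly one item.
        size += 1
        current = list({a | b for a, b in combinations(list(level), 2)
                        if len(a | b) == size})
    return frequent


# Made-up formulas paired with made-up trust weights.
formulas = [
    (frozenset({"ginseng", "licorice"}), 1.0),
    (frozenset({"ginseng", "poria"}), 0.25),
    (frozenset({"ginseng", "licorice", "poria"}), 0.5),
    (frozenset({"poria"}), 0.25),
]
patterns = weighted_frequent_itemsets(formulas, min_support=0.6)
for itemset, support in sorted(patterns.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

In this toy data, poria appears in three of the four formulas but mostly in low-weight ones, so its weighted support (0.5) falls below the threshold while licorice (0.75) passes. Under this definition a superset can never have higher support than its subsets, which is what allows level-wise pruning.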
