Research on Specialty Knowledge Retrieval Method Based on Web Information Extraction

Original title (Chinese): 基于Web信息抽取的专业知识获取方法研究
Author: Hu Yan (胡燕)
Degree level: Doctoral
Discipline: Computer Application Technology
Keywords: page acquisition; page cleaning; information extraction; specialty knowledge retrieval; feature extraction; text classification; information storage
Degree year: 2007
Supervisor: Zhong Luo (钟珞)
Discipline code: 081203
Degree-granting institution: Wuhan University of Technology (武汉理工大学)
Thesis submission date: 2007-04-01
Chair of the defense committee: Lu Yansheng (卢炎生)

Abstract

The rapid development of the Internet has made it an important resource for disseminating and sharing information worldwide. The volume of data on the Web keeps growing geometrically, yet finding a single useful piece of information on it is becoming harder and harder, and "information overload" has become a problem that urgently needs solving. Ideally, people could query data on the Web the way they query a database. How to extract useful information from the vast body of Web data, however, remains a problem that many research efforts hope to solve.
     The massive scale, heterogeneity, and dynamic nature of the Internet set Web information extraction apart from traditional information extraction and bring new challenges. Extraction techniques have grown richer as demand has increased, and many information extraction methods have emerged in China and abroad in recent years. Targeting the subject knowledge database that an intelligent tutoring system needs to build, this dissertation studies methods for automatically acquiring specialized knowledge in each subject from the Web according to user needs.
     The specialized knowledge acquisition method proposed here is inspired mainly by SRV, which treats information extraction as a classification problem. Combining this idea with existing Web information extraction techniques based on HTML structure, the dissertation constructs a framework for a Web specialized knowledge acquisition system built on Web information extraction and classification, and studies several key techniques within that framework:
     1. Batch acquisition of Web pages. Web-based specialized knowledge acquisition requires collecting a large number of pages on the same topic, and the services offered by current search engines cannot yet meet this need. The dissertation proposes a simple and efficient method that automatically acquires pages from the Web in batches and uses regular expression matching to pick out the pages that carry topic content.
     2. Web page preprocessing. An HTML container tag tree is built according to the meaning of the tags in the HTML document structure. Based on the characteristics of noise blocks and topic content blocks in a page, noise nodes are deleted from the tag tree and the topic content block is identified.
     3. Extraction of topic information from pages. Current information extraction methods require considerable manual intervention and prior knowledge, and different systems use different description languages. In response, the dissertation adopts an information extraction method based on XML mapping: it builds a Jtree from the DOM, automatically obtains extraction paths from the treenode nodes, and learns extraction rules, thereby automating information extraction.
     4. Chinese text feature representation and text classification algorithms. In the vector space model, the number of feature words and the size of the data search space both bear closely on the efficiency of the classification algorithm. The dissertation therefore proposes a feature word extraction method based on part of speech, which effectively reduces the dimensionality of the feature vector, and two improved KNN algorithms, one based on reducing the number of feature words and one based on data partitioning, which improve the efficiency and performance of classification.
     5. Automatic acquisition of the training set. Improving classifier performance requires a high-quality training set. Earlier studies all relied on a training set that had already been built; this dissertation proposes generating a high-quality one automatically through Web mining, further raising the degree of automation of specialized knowledge acquisition.
     6. Organization and storage of information. The extracted specialized knowledge is organized into a form that the user's application system, the intelligent tutoring system, can access directly, and the data are given an initial arrangement according to that system's requirements.
     The dissertation studies the key techniques at each stage of specialized knowledge acquisition based on Web information extraction, establishes a knowledge acquisition framework, and achieves a preliminary automation of the entire acquisition process.
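
     As a rough illustration of item 4 of the abstract, the Python sketch below filters words by part-of-speech tag before building vector space model features, then classifies with a plain k-nearest-neighbor vote. It is not the dissertation's algorithm: the abstract does not say which parts of speech are kept or how the two improved KNN variants work, so the kept tag set (KEPT_TAGS), the tag names, the similarity-weighted vote, and the toy English documents are all assumptions made for this example. The input is assumed to be already segmented and tagged.

```python
import math
from collections import Counter

# Assumption: only nouns ("n") and verbs ("v") are kept as feature words.
# The abstract does not state which parts of speech the dissertation keeps.
KEPT_TAGS = {"n", "v"}


def pos_filter(tagged_doc):
    """Keep only the words whose part-of-speech tag is in KEPT_TAGS."""
    return [word for word, tag in tagged_doc if tag in KEPT_TAGS]


def tf_vector(words):
    """Sparse term-frequency vector: word -> count."""
    return Counter(words)


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(query, training, k=3):
    """Label `query` by a similarity-weighted vote of its k nearest neighbors.

    `training` must be a non-empty list of (vector, label) pairs.
    """
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim
    return votes.most_common(1)[0][0]


# Hypothetical, pre-tagged toy documents (word, tag) with class labels.
TRAINING_DOCS = [
    ([("stack", "n"), ("push", "v"), ("the", "u"), ("queue", "n")], "data_structures"),
    ([("queue", "n"), ("pop", "v"), ("a", "u"), ("stack", "n")], "data_structures"),
    ([("matrix", "n"), ("invert", "v"), ("the", "u"), ("vector", "n")], "linear_algebra"),
    ([("vector", "n"), ("project", "v"), ("onto", "p"), ("matrix", "n")], "linear_algebra"),
]

training = [(tf_vector(pos_filter(doc)), label) for doc, label in TRAINING_DOCS]
query = tf_vector(pos_filter([("push", "v"), ("onto", "p"), ("the", "u"), ("stack", "n")]))
predicted = knn_classify(query, training, k=3)
print(predicted)
```

     In this toy run the filter removes "the" and "onto" from the query, so the query no longer shares any term with the linear_algebra documents and the vote goes to data_structures. Dropping function words this way shrinks the vocabulary, which is the dimensionality reduction the abstract refers to; how much it helps on real Chinese text depends on the tagger and the tag set chosen.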

