Intelligent Service Oriented Study and Application on Web Content Computing

Chinese title: 面向智能服务的Web内容计算研究与应用
Author: Zhang Youhua (张友华)
Degree level: Doctoral
Discipline: Pattern Recognition and Intelligent Systems
Keywords: Web Content Computing; Web Mining; Web Information Extraction; Web Text Classification; Web Intelligent Service
Degree year: 2006
Advisor: Xiong Fanlun (熊范纶)
Discipline code: 081104
Degree-granting institution: University of Science and Technology of China
Submission date: 2006-05-01

Abstract

The Web is an important channel through which people obtain information and knowledge. Its massive scale, diversity, dynamic nature, and semi-structured form make automatic processing of its information difficult, and have attracted considerable research interest. Finding the information a user cares about within a large volume of data is the subject of Internet information search research; turning the rich information on the Web into useful knowledge is the task of Web mining and Web knowledge discovery; and enabling users to obtain personalized information, so that the Web can offer more service functions, is the problem that Web intelligence needs to solve. Web data can be roughly divided into three categories: content data, usage data, and structure data. These correspond to three major research directions: Web content mining, Web usage mining, and Web structure mining. The main carrier of Web information is the Web page, whose content comprises displayed data, markup, and hyperlinks. Web content computing takes the Web page as its object and studies the problems involved in Web information extraction (IE), Web information retrieval (IR), and Web intelligent services. Building on a survey of research in Web content computing, this dissertation reports the following contributions:
     (1) An incremental mining method, iFP-Growth, is proposed, which adapts the traditional FP-Growth method to association rule mining in the dynamic Web data environment.
     Web page data is semi-structured, irregular, and dynamically updated, which makes data mining based on Web content complex. This dissertation summarizes a range of theories and methods for extracting semi-structured data from Web pages and, in view of the characteristics of Web content data, proposes the incremental mining method iFP-Growth, which adapts the traditional FP-Growth method to association rule mining in a dynamic data environment. Taking the China car market website (中国汽车市场网) as an example, the method is applied to mine consumers' purchasing preferences across different categories, models, and prices of cars, and the reported experiments show the efficiency of this association rule mining.
     (2) A text classification model based on sentence correlation (TCSC) is proposed.
     Classification and clustering of Chinese Web document collections, along with other Web information retrieval (IR) tasks, require Chinese word segmentation and must cope with word polysemy. To address both problems, a sentence-based method of text feature selection that makes use of a corpus is proposed. Category corpora are generated automatically from the training texts and updated incrementally. A text classification method based on a sentence-category correlation matrix is then given, in which the matrix is computed from the category relevance of the word elements within each sentence and from the position of the sentence. The method avoids word segmentation in the classification phase and is insensitive to word polysemy.
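The incremental idea behind contribution (1) can be illustrated with a minimal sketch. The abstract does not describe the internals of iFP-Growth, so the code below is not that algorithm. It is a simplified stand-in that stores itemset support counts in a dictionary and folds in each new batch of transactions without rescanning earlier data. The class name `IncrementalItemsetCounter`, the `max_size` cap, and the car-purchase records are all hypothetical.

```python
from collections import Counter
from itertools import combinations


class IncrementalItemsetCounter:
    """Keeps support counts for small itemsets and updates them batch by batch.

    Illustrative stand-in only; NOT the iFP-Growth algorithm of the dissertation.
    """

    def __init__(self, max_size=2):
        self.max_size = max_size      # largest itemset size tracked (assumption)
        self.counts = Counter()       # itemset (sorted tuple) -> support count
        self.n_transactions = 0

    def add_batch(self, transactions):
        """Fold a new batch into the counts without rescanning earlier batches."""
        for t in transactions:
            items = sorted(set(t))
            self.n_transactions += 1
            for k in range(1, self.max_size + 1):
                for itemset in combinations(items, k):
                    self.counts[itemset] += 1

    def frequent(self, min_support):
        """Return itemsets whose relative support is at least min_support."""
        threshold = min_support * self.n_transactions
        return {s: c for s, c in self.counts.items() if c >= threshold}

    def confidence(self, antecedent, consequent):
        """Confidence of the rule antecedent -> consequent."""
        a = tuple(sorted(antecedent))
        both = tuple(sorted(set(antecedent) | set(consequent)))
        return self.counts[both] / self.counts[a] if self.counts[a] else 0.0


# Hypothetical purchase records: (category, price band)
counter = IncrementalItemsetCounter()
counter.add_batch([("sedan", "low"), ("sedan", "mid"), ("suv", "high")])
counter.add_batch([("sedan", "low"), ("suv", "high")])   # newly crawled pages
rules = counter.frequent(min_support=0.3)
```

Enumerating every itemset up to `max_size` is only practical for short transactions. A tree-based structure such as the FP-tree exists precisely to avoid that enumeration, which is why this sketch should be read as an illustration of incremental updating rather than of FP-Growth itself.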
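Contribution (2) can likewise be sketched in a few lines. The abstract does not give the TCSC formulas, so the scoring here is an assumption throughout: category corpora are character-bigram counts built from training texts (bigrams stand in for the "word elements" so that no word segmenter is needed), each sentence receives one relevance score per category, and a hypothetical position weight doubles the first and last sentences. All function names and the toy training texts are illustrative.

```python
import re
from collections import Counter, defaultdict


def split_sentences(text):
    """Split on Chinese and Western sentence-ending punctuation."""
    return [s for s in re.split(r"[。！？.!?]", text) if s.strip()]


def bigrams(sentence):
    """Character bigrams stand in for the 'word elements' (an assumption)."""
    s = sentence.strip()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def build_category_corpus(training):
    """training: iterable of (category, text). Returns category -> bigram counts."""
    corpus = defaultdict(Counter)
    for category, text in training:
        for sentence in split_sentences(text):
            corpus[category].update(bigrams(sentence))
    return corpus


def position_weight(index, total):
    """Hypothetical weighting: first and last sentences count double."""
    return 2.0 if index in (0, total - 1) else 1.0


def correlation_matrix(text, corpus):
    """Rows are sentences, columns are categories (sorted by name)."""
    categories = sorted(corpus)
    sentences = split_sentences(text)
    matrix = []
    for i, sentence in enumerate(sentences):
        grams = bigrams(sentence)
        w = position_weight(i, len(sentences))
        row = []
        for c in categories:
            total = sum(corpus[c].values()) or 1
            # Summed corpus counts of the sentence's bigrams, normalized by corpus size
            score = sum(corpus[c][g] for g in grams) / total
            row.append(w * score)
        matrix.append(row)
    return categories, matrix


def classify(text, corpus):
    """Pick the category with the largest column sum of the matrix."""
    categories, matrix = correlation_matrix(text, corpus)
    column_sums = [sum(row[j] for row in matrix) for j in range(len(categories))]
    return categories[column_sums.index(max(column_sums))]


training = [
    ("cars", "这款轿车油耗很低。轿车价格适中。"),
    ("sports", "球队赢得了比赛。比赛非常精彩。"),
]
corpus = build_category_corpus(training)
label = classify("新轿车的价格公布了。", corpus)
```

Because the category corpus is just a set of counters, adding more training text to a category is a matter of calling `update` again, which loosely mirrors the incremental corpus update mentioned in the abstract.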

