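Research of Ontology-based Focused Crawling Technique (基于本体的主题爬行技术研究)

English title: Research of Ontology-based Focused Crawling Technique
Author: Luo Na (罗娜)
Thesis level: Doctoral
Discipline: Computer Application Technology
Chinese keywords: 主题爬行 ; 本体 ; PU分类 ; 隧道穿越 ; 用户兴趣
English keywords: Focused Crawling ; Ontology ; PU Classification ; Cross Tunneling ; User Interest
Degree year: 2009
Supervisor: Zuo Wanli (左万利)
Discipline code: 081203
Degree-granting institution: Jilin University (吉林大学)
Thesis submission date: 2009-06-01
Chair of the defense committee: Xing Zhongbao (邢忠宝)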

Abstract

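With the rapid growth of online information and the increasing complexity of the information environment, existing search engines that aim to cover all web pages face severe challenges. First, the number of web pages is growing exponentially, and search engines cannot index every page; even Google, currently the largest search engine in the world, indexes only about 40% of the Web. Second, Web information resources change dynamically, so a considerable share of the results returned to users are outdated or even unreachable pages. Third, because the information on the Internet is so vast and heterogeneous, users are often overwhelmed by the flood of information, do not know how to obtain what they need, and fall into the predicament of "information overload" and "resource disorientation".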
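     To address these problems, the author comprehensively reviews the research history of focused crawling and ontology, analyzes focused crawling algorithms and ontology principles systematically and in depth, summarizes the defects and shortcomings of existing focused crawling, and on this basis concentrates on ontology-based focused crawling and the problems that arise in implementing it.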
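     This dissertation first proposes an ontology-based focused crawling framework. Its advantage is that the crawling algorithm uses not only keywords but also high-level background knowledge, such as concepts and relations, to compare against the text of the crawled pages, which makes it easy to reach a direct topical focus. Second, it studies web page classification, one of the key techniques in focused crawling, in depth and proposes a PU classification method based on ontology feature extraction. The method traverses the documents twice to reduce dimensionality and form text vectors, and then uses Co-Training learning and the Affinity Propagation clustering algorithm to improve the performance of the PU classifier when positive examples are scarce; this is verified experimentally. Third, it uses the visual information, tag information, link information, and ontology concept information in web pages to segment page content into blocks, and proposes several heuristic rules to control the precision and granularity of the segmentation. Experiments show that block-based focused crawling can handle multi-topic pages, effectively avoids topic drift, and to some extent solves the grey tunneling problem. We also propose, for the first time, using association rules to cross black tunnels, and experiments verify the feasibility of this idea. Finally, we apply these ideas to scientific literature retrieval and propose focused crawling that models a specific user's interests based on cognitive psychology and the laws of information dissemination and forgetting. Following the user's retrieval habits and tracking the user's behavior patterns, we learn and train a model of the specific user with machine learning methods to deliver user-oriented personalized services such as recommendation and filtering.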
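     Drawing on research carried out under projects of the National Natural Science Foundation of China and the Jilin Province Science and Technology Development Plan, the author presents concrete practice. Theoretical analysis and experiments demonstrate the practicality and reliability of the methods above.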
With the rapid expansion of web page information on the World Wide Web, it is getting harder to retrieve the information and knowledge relevant to a specific domain. Therefore, focused crawling techniques for retrieving domain-specific information have received more attention and development in recent years. While crawling the World Wide Web, a focused web crawler aims to collect as many relevant web pages with respect to a predefined topic, and as few irrelevant ones, as possible. The fundamental technical difficulty of focused crawling lies in the need to predict a web page's topical relevance before downloading it.
     Ontology, as a new concept for describing the semantic hierarchy of knowledge, has been widely used in fields such as computer information processing, artificial intelligence, and knowledge engineering. Information retrieval methods combined with ontology not only bring out the advantages of knowledge-based retrieval but also handle the relationships between concepts. Although ontology research is still at an early stage, with no uniform standard or settled applications, ontology applied to the Semantic Web will certainly become a hot topic, and the application of ontology to information retrieval and the Semantic Web will be the focus of this field. An ontology can represent the meaning of information through a hierarchical structure and supports reasoning, which makes ontology-based information retrieval a promising method. An ontology includes the definitions needed to identify concepts, so that a machine can understand the concepts of a domain and the relationships between them within a unified framework. A system can then comprehend a user's query by analyzing the query expression and mapping it to information resources. Retrieval of this kind can achieve much higher performance than traditional methods.
     The main contributions and results of this dissertation are as follows:
     1. This dissertation gives a general summary of research on web information retrieval and related techniques, and analyzes its background and course of development. After introducing and analyzing the development of search engines and ontology, it presents the advantages of, and the need for, a topic-specific search engine. The future of search engines is also discussed. The basic theory and strategies of topical web crawling and text classification are introduced and analyzed as well; these are the groundwork for the subsequent research.
     2. A focused crawling algorithm loads a page and extracts its links. By rating the links based on keywords, the crawler decides which page to retrieve next, and the Web is traversed link by link. Our crawling framework builds on and extends existing work on focused document crawling. We do not only use keywords for the crawl, but also rely on high-level background knowledge, in the form of concepts and relations, which is compared with the text of the crawled page. With this ontology-based focused crawling method we can easily achieve a direct focus. The method provides the following main contributions: an ontology structure extended for the purposes of the focused crawler; several new approaches to relevance computation based on conceptual and linguistic means that reflect the underlying ontology structure; management of both the focused crawling process and the ontology; and an empirical evaluation which shows that ontology-based crawling clearly outperforms standard focused crawling techniques.
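The keyword-rated traversal described at the start of item 2 can be sketched as a best-first search. This is only an illustration of the generic technique, not the dissertation's ontology-based framework: the in-memory `TOY_WEB`, the `TOPIC_TERMS` set, and the functions `score` and `crawl` are all hypothetical names introduced here, and link rating is reduced to the fraction of topic words in the anchor text.

```python
import heapq

# Hypothetical toy web: URL -> (page text, outgoing links as (anchor text, URL)).
TOY_WEB = {
    "seed": ("frog anatomy overview", [("frog skeleton", "a"), ("car sales", "b")]),
    "a": ("amphibian skeleton and limb bones", [("toad limb", "c")]),
    "b": ("used cars for sale", [("cheap tyres", "d")]),
    "c": ("toad limb structure", []),
    "d": ("tyres and wheels", []),
}

# Assumed topic keywords; an ontology-based crawler would use concepts instead.
TOPIC_TERMS = {"frog", "toad", "amphibian", "skeleton", "limb"}


def score(text):
    """Fraction of words in `text` that are topic terms."""
    words = text.lower().split()
    return sum(w in TOPIC_TERMS for w in words) / len(words) if words else 0.0


def crawl(seed, limit):
    """Visit up to `limit` pages, always following the best-rated link first."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so scores are negated
    seen, order = {seed}, []
    while frontier and len(order) < limit:
        _, url = heapq.heappop(frontier)
        order.append(url)
        _text, links = TOY_WEB[url]  # "load" the page and extract its links
        for anchor, target in links:
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score(anchor), target))
    return order
```

With a budget of three pages, `crawl("seed", 3)` visits the seed and the two on-topic pages before any off-topic one, which is the behavior a focused crawler is after.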
     3. Evaluating the relevance of a target web page from the page's own information is an effective approach to topical web crawling. However, a common problem in building a classifier is that a large number of training examples must be labeled manually. Positive examples are easier to obtain than negative ones. Moreover, the negative examples we collect are biased by subjective factors, which degrades classifier performance. Researchers have therefore proposed building a classifier from a few positive examples and many unlabeled examples, which is known as the PU problem. This dissertation puts forward an ontology-based feature selection method for PU classification that scans the documents twice. In the first pass, we obtain the semantic meanings of the documents with WordNet. In the second pass, we filter out terms without synsets. We then reduce the dimensionality and obtain the text vectors. Combining this with Co-Training and Affinity Propagation, we show that ontology-based feature selection can greatly improve classifier performance when positive examples are few. An empirical evaluation shows that, compared with the document frequency method, our algorithm increases the F1 of the One-Class classifier by 10.183% when positive examples are fewer and by 1.941% when they are more numerous, and increases the F1 of the PEBL classifier by 2.781%.
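The two-pass feature selection in item 3 can be illustrated with a minimal sketch. It assumes a tiny hand-made dictionary, `TOY_SYNSETS`, in place of a real WordNet lookup, and the synset identifiers, `first_pass`, and `second_pass` are hypothetical names; the dissertation's actual procedure may differ in detail.

```python
from collections import Counter

# Hypothetical stand-in for a WordNet lookup: term -> synset identifier.
TOY_SYNSETS = {
    "car": "S_VEHICLE",
    "automobile": "S_VEHICLE",
    "auto": "S_VEHICLE",
    "frog": "S_FROG",
    "pond": "S_POND",
}


def first_pass(docs):
    """Pass 1: record the synset (or None) of every distinct term."""
    return {t: TOY_SYNSETS.get(t) for doc in docs for t in doc.split()}


def second_pass(docs, term_to_synset):
    """Pass 2: drop terms without a synset and count synsets per document."""
    vocab = sorted({s for s in term_to_synset.values() if s is not None})
    vectors = []
    for doc in docs:
        counts = Counter(
            term_to_synset[t] for t in doc.split() if term_to_synset[t] is not None
        )
        vectors.append([counts[s] for s in vocab])
    return vocab, vectors


docs = ["frog pond xyzzy frog", "car automobile qwerty auto"]
vocab, vectors = second_pass(docs, first_pass(docs))
```

In this toy input, seven distinct raw terms collapse to three synset dimensions, because terms without a synset are dropped and synonyms share one feature.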
     4. Due to the complexity of the web environment and the multiple topics within individual web pages, it is quite difficult to retrieve all the web pages relevant to a specific topic. An irrelevant web page may link to a relevant one, so we need to traverse irrelevant pages to reach more relevant pages. This procedure is called tunneling. There are two types of tunneling: grey tunneling and black tunneling. Our main work is a new page segmentation method and a grey tunneling system based on it. The method makes use of the visual, tag, link, and ontology information in web pages. The visual information includes background color, font size, font color, and so on; the tag information uses an ordered tag collection {, , } to segment the page recursively; the link information makes use of the "pagelet" concept and anchor text; and the ontology information provides hierarchical concepts. Finally, we put forward a number of heuristic rules to control the accuracy and granularity of the blocks when segmenting a page. For black tunneling, we use association rules to solve the problem.
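The general idea of tunneling, continuing through irrelevant pages in the hope of reaching relevant ones, can be sketched with a simple hop budget. This is a generic illustration and not the page-segmentation or association-rule methods of the dissertation; the graph `LINKS`, the label set `RELEVANT`, and the parameter `max_tunnel` are hypothetical.

```python
from collections import deque

# Hypothetical link graph and relevance labels.
LINKS = {
    "r0": ["x1", "y1"],
    "x1": ["r1"],
    "y1": ["y2"],
    "y2": ["y3"],
    "y3": ["r2"],
    "r1": [],
    "r2": [],
}
RELEVANT = {"r0", "r1", "r2"}


def crawl_with_tunnel(seed, max_tunnel):
    """Breadth-first crawl that abandons a path after more than
    `max_tunnel` consecutive irrelevant pages."""
    queue = deque([(seed, 0)])  # (url, consecutive irrelevant pages so far)
    seen, found = {seed}, []
    while queue:
        url, gap = queue.popleft()
        if url in RELEVANT:
            found.append(url)
            gap = 0  # a relevant page resets the tunnel length
        else:
            gap += 1
            if gap > max_tunnel:
                continue  # tunnel too long: do not follow this page's links
        for nxt in LINKS[url]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, gap))
    return found
```

A budget of one hop reaches `r1` behind a single irrelevant page, while `r2`, which sits behind three irrelevant pages, is found only when `max_tunnel` is at least 3. The trade-off is that a larger budget also downloads more irrelevant pages.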
     5. Respect for users and the study of their behavior and interests are fundamental to user-oriented personalized service, which better guarantees users' use of resources. User-oriented personalized service aims to satisfy users' requests and starts from users' requirements. Users can not only customize their interface but also freely select the contents of the services they need and define their own preference profiles. Information services are delivered over the network according to a specific user's interests, habits, and so on, to meet the user's individual requirements. Personalized service has become an inevitable trend in the development of search engines. Based on the focused crawling ideas proposed above, we build a focused crawling model for a specific user's interests, grounded in cognitive psychology, information dissemination, and the laws of forgetting. We follow the user's search habits and track the user's behavior patterns, and through machine learning we train a model of the specific user to realize user-oriented recommendation, filtering, and other personalized services. At the same time, we note that users with similar behavior can be grouped into user groups, within which information can be shared and disseminated. We can also identify typical users and field experts. The research is characterized by semantics, personalization, intelligence, and decision support.
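One simple way to let a user interest model "forget" is to decay each recorded interaction over time. The sketch below assumes an exponential decay with a configurable half-life; the abstract does not state the actual forgetting function used, so the decay form, the `half_life_days` parameter, and the function `interest_weights` are assumptions for illustration only.

```python
import math


def interest_weights(events, now, half_life_days=7.0):
    """Sum exponentially decayed contributions per topic.

    `events` is a list of (topic, day) pairs; an event that is one
    half-life old contributes 0.5 (assumed decay model).
    """
    decay = math.log(2) / half_life_days
    weights = {}
    for topic, day in events:
        age = now - day
        weights[topic] = weights.get(topic, 0.0) + math.exp(-decay * age)
    return weights


# Hypothetical search log: two ontology queries (day 0 and day 14), one crawler query.
events = [("ontology", 0), ("ontology", 14), ("crawler", 14)]
weights = interest_weights(events, now=14)
```

Here the two-week-old "ontology" query contributes only a quarter of a fresh one, so recent behavior dominates the profile while older interests fade rather than vanish.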
     To sum up, research on semantic information retrieval is of important theoretical value and has wide application in the search engine area. This dissertation has done some research on its modeling and application. Our further research will emphasize the application, evaluation, and deployment of ontology-based focused crawling in web search engines.

