Complex Adaptive Agriculture Vertical Search Model and Its Implementation

Original title (Chinese): 农业复杂自适应搜索模型研究及实现
Author: Huang He (黄河)
Degree level: Doctoral
Discipline: Pattern Recognition and Intelligent Systems
Keywords: Complex Adaptive System; Vertical Search Engine; Web Resource Discovery; Deep Web; User Profile; Structural Data Extraction; Formal Concept Analysis
Degree year: 2010
Advisor: Wang Rujing (王儒敬)
Discipline code: 081104
Degree-granting institution: University of Science and Technology of China
Submission date: 2010-06-01

Abstract

By the end of 2009, there were more than 30,000 agriculture-related websites on the Internet, covering agricultural technology, market information, policies and regulations, agricultural news, and other information resources. However, because this information lacks a uniform formal representation, it is heterogeneous, scattered, and redundant. The result is a set of isolated "information islands" that make it hard to exploit agricultural information resources in an integrated way. In addition, because of limited education and computer skills, farmers and other rural users find it hard to use traditional search tools to interact with, capture, and filter personalized information. Faced with a huge volume of agricultural information, they are easily overwhelmed, and "information overload" is a serious problem. It is therefore important to build specialized, personalized, and intelligent agricultural search models and corresponding search engine systems.
     To address the Internet's inherent openness, dispersion, hierarchy, evolution, and massive scale, this dissertation proposes a complex adaptive agricultural search model. The model builds an alliance of agents for agricultural information resource discovery, information acquisition, information processing, and user service. Through learning and adaptation mechanisms between agents and web resources, between agents and web page content and presentation, and between agents and users' personalized needs, the model adapts to the complex and dynamic Internet environment. The goal is to improve the recall and precision of agricultural search engines, which the dissertation treats as the core problem facing next-generation search engines.
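Precision and recall, the two measures the model aims to improve, have standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The short sketch below computes both for a single query. The document ids are hypothetical and are not taken from the dissertation's experiments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: what the engine returned vs. what is actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found.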
     To address the dynamic and highly scattered nature of agricultural web resources, this dissertation proposes AADWED (Adaptive Agriculture Deep Web Entry Discovery), an adaptive algorithm for discovering Deep Web resources in the agricultural domain. The algorithm continually learns suitable query expressions from samples and submits them to a general-purpose search engine in order to find the entry pages of domain-specific Deep Web resources efficiently. Experiments show that the algorithm substantially improves the yield (harvest rate) of Deep Web resource discovery in the agricultural domain.
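The abstract does not give the details of AADWED, so the following is only a minimal sketch of the general idea it describes: issue queries, keep the result pages that look like Deep Web entry pages, and learn the next queries from those pages. The in-memory `PAGES` list, `toy_search`, `has_form`, and the frequency-based term selection are all assumptions made for illustration. A real system would call a search engine and use a trained entry-page classifier.

```python
from collections import Counter


def harvest_rate(results, is_entry_page):
    """Fraction of returned pages that are Deep Web entry pages."""
    if not results:
        return 0.0
    return sum(1 for page in results if is_entry_page(page)) / len(results)


def discover_entries(seed_queries, search, is_entry_page, rounds=3, top_k=3):
    """Issue queries, then learn new query terms from the entry pages found."""
    queries, tried, found = list(seed_queries), set(), set()
    for _ in range(rounds):
        term_counts = Counter()
        for q in queries:
            if q in tried:
                continue
            tried.add(q)
            for page in search(q):
                if is_entry_page(page):
                    found.add(page["url"])
                    term_counts.update(page["text"].split())
        # Next queries: most frequent terms on entry pages not yet tried.
        queries = [t for t, _ in term_counts.most_common() if t not in tried][:top_k]
        if not queries:
            break
    return found


# Hypothetical stand-ins for a search engine and an entry-page classifier.
PAGES = [
    {"url": "a.example/search", "text": "wheat price query form", "form": True},
    {"url": "b.example/news", "text": "wheat harvest news", "form": False},
    {"url": "c.example/db", "text": "price database query", "form": True},
    {"url": "d.example/market", "text": "market supply database", "form": True},
]


def toy_search(query):
    return [p for p in PAGES if query in p["text"].split()]


def has_form(page):
    return page["form"]


entries = discover_entries(["wheat"], toy_search, has_form)
print(sorted(entries))
```

Starting from the single seed query `wheat`, the loop reaches entry pages that the seed alone does not return, because terms learned from earlier hits become later queries.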
     To address the diversity and dynamics of page presentation across websites, this dissertation presents an adaptive algorithm for extracting structured data from web pages. Building on the MDR (Mining Data Records) algorithm, it adds a page denoising step based on relative entropy, which improves the precision of structured data extraction.
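Relative entropy (Kullback-Leibler divergence) between two term distributions P and Q is D(P || Q) = sum over t of p(t) * log(p(t) / q(t)). The abstract does not say how it is applied to denoising, so the sketch below makes one plausible assumption: a page block whose term distribution stays close to a site-wide background distribution is treated as template noise, while a block that diverges strongly is kept as content. The block contents, the add-one smoothing, and the threshold of 0.1 are illustrative choices, not values from the dissertation.

```python
import math
from collections import Counter


def distribution(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed term distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}


def relative_entropy(p, q):
    """D(P || Q) in nats; p and q must share the same keys."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


def score_blocks(blocks, background_tokens):
    """Score each block by its divergence from the background distribution."""
    vocab = set(background_tokens)
    for block in blocks:
        vocab.update(block)
    q = distribution(background_tokens, vocab)
    return [relative_entropy(distribution(block, vocab), q) for block in blocks]


# Hypothetical data: template words seen across a site, and two page blocks.
background = "home about contact login home about contact".split()
nav_block = "home about contact login".split()
data_block = "wheat price 2.1 yuan corn price 1.8 yuan".split()

blocks = [nav_block, data_block]
scores = score_blocks(blocks, background)
kept = [b for b, s in zip(blocks, scores) if s > 0.1]  # assumed threshold
print(scores, kept)
```

The navigation block scores close to zero because it mirrors the background, so only the price block survives the filter and would be passed on to record extraction.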
     To address the inconsistent, incomplete, and redundant descriptions of agricultural data on the Internet, this dissertation focuses on automatic annotation of spatial attributes and semantics-based handling of redundant data for information such as agricultural product prices and supply and demand. This improves data quality and usability and provides a foundation for precise retrieval and visual analysis services.
     To meet the personalized needs of different web users, this dissertation proposes an algorithm based on Formal Concept Analysis (FCA) that automatically mines users' topics of interest. The mined interest topic patterns are described as a set of formal concepts, and the relationships among them are shown explicitly in a concept lattice, which makes them easier for users to understand. The dissertation also proposes a method for computing the relevance between a document and a user's topics of interest. Comparative experiments show that the method is effective.
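In FCA, a formal concept is a pair (extent, intent): the extent is a set of objects, the intent is the set of attributes they all share, and each determines the other. The sketch below enumerates the formal concepts of a tiny context in which the objects are documents a user viewed and the attributes are terms. The context and the brute-force enumeration are assumptions for illustration. The dissertation's mining and relevance methods are not specified in the abstract and are not reproduced here.

```python
from itertools import combinations


def common_attrs(objs, context, all_attrs):
    """Derivation operator: attributes shared by every object in objs."""
    attrs = set(all_attrs)
    for o in objs:
        attrs &= context[o]
    return attrs


def common_objs(attrs, context):
    """Derivation operator: objects that have every attribute in attrs."""
    return {o for o, a in context.items() if attrs <= a}


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every object subset.

    Exponential in the number of objects, so suitable for toy contexts only.
    """
    all_attrs = set().union(*context.values())
    objs = list(context)
    concepts = set()
    for r in range(len(objs) + 1):
        for subset in combinations(objs, r):
            intent = common_attrs(subset, context, all_attrs)
            extent = common_objs(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


# Hypothetical context: documents a user viewed and the terms they contain.
context = {
    "doc1": {"wheat", "price"},
    "doc2": {"wheat", "pest"},
    "doc3": {"wheat", "price", "market"},
}

concepts = formal_concepts(context)
# Larger extents are more general topics (higher in the concept lattice).
for extent, intent in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
```

Ordering concepts by extent inclusion gives the concept lattice: the concept with intent {wheat} covers all three documents, and the concept with intent {wheat, price} is a more specific topic beneath it.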
     Finally, based on the proposed complex adaptive agricultural search model, the agricultural vertical search engine "Sounong" (中国搜农) was designed and implemented. The system has begun large-scale public service and has been deployed and applied in a number of provinces and cities.
