Deep Web Entrance Recognition and Personalized Search Research & Design

Original title (Chinese): Deep Web入口识别和个性化搜索研究与设计
Author: Chen Wen (陈文)
Thesis level: Master's
Discipline: Computer Application Technology
Keywords: Deep Web; active learning; PU learning; personalized search; interest tree
Degree year: 2010
Advisor: Yan Li (晏立)
Discipline code: 081203
Degree-granting institution: Jiangsu University (江苏大学)
Submission date: 2010-05-01

Abstract

Users reach Deep Web sites mainly through query interfaces, embedded in Web pages, that offer specific query capabilities. To help users find Deep Web information simply and efficiently, a unified query interface is needed so that multiple Deep Web sites can be queried at once. Deep Web entrance recognition is an essential part of Deep Web integrated search: it supplies the information sources and is a prerequisite for all later stages, so it matters greatly to the integrated system as a whole. At the same time, the volume of Deep Web information is vast. To make the data returned by integrated search more useful and to avoid "information overload", the integrated results must be processed so that users receive a personalized Deep Web integrated search service.
     This thesis studies techniques for Deep Web entrance recognition and for displaying integrated Deep Web results. It proposes a PU active learning algorithm with incremental learning ability and applies it to Deep Web entrance recognition, proposes a personalized search method for Deep Web integration, and finally designs and implements a prototype personalized search system for Deep Web integration.
     The main work of this thesis is as follows:
     (1) Determining which pages in a continually growing set of Web pages are Deep Web entrances, and classifying them. For the case where initial positive samples are few and negative samples for the different classes are hard to obtain, a PU active learning algorithm with incremental learning ability is given. The algorithm uses three support vector machines (SVMs) for cooperative semi-supervised learning while performing unsupervised learning with a grid-based clustering method; when the classification and clustering results disagree, active learning is introduced to label the unlabeled samples. The algorithm is applied to the online recognition and classification of Deep Web entrances. Experiments show that it improves the ability to discover new classes and to handle incrementally arriving unlabeled samples.
     (2) To alleviate the information overload caused by the excessive amount of information on Deep Web integrated search result pages, a personalized search method for Deep Web integration is given. The method builds an interest tree from a Deep Web site directory and a user questionnaire, and updates user interests according to user feedback and the parameters returned by member Deep Web sites. Pages are filtered and ranked according to each user's interests to produce the final displayed page. Experimental results show that the method improves Deep Web integrated search and makes the personalized information that users care about more prominent.
     (3) A prototype personalized search system for Deep Web integration is designed and implemented, and the application of the above techniques in the system is analyzed. Practical use shows that the system achieves good results.
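
The disagreement rule in item (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: the three threshold functions stand in for the three trained SVM classifiers, each grid cell is treated as its own cluster (a simplification of grid-based clustering), and all names (`majority_vote`, `grid_cell`, `select_queries`, `CLASSIFIERS`) are hypothetical. A sample is sent for manual labeling when the label voted by the classifiers differs from the dominant label of the grid cell it falls in.

```python
from collections import Counter, defaultdict

# Stand-ins for three trained SVM classifiers (assumption: binary labels,
# 1 = Deep Web entrance, 0 = not an entrance; features are 2-D points).
def clf_a(x):
    return int(x[0] > 0.5)

def clf_b(x):
    return int(x[0] + x[1] > 1.0)

def clf_c(x):
    return int(x[1] > 0.5)

CLASSIFIERS = [clf_a, clf_b, clf_c]

def majority_vote(classifiers, x):
    """Return the label that most classifiers assign to x."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

def grid_cell(x, cell_size):
    """Map a feature vector to the index of the grid cell containing it."""
    return tuple(int(v // cell_size) for v in x)

def select_queries(classifiers, samples, cell_size=1.0):
    """Return samples whose voted label differs from their cell's dominant label."""
    voted = [majority_vote(classifiers, x) for x in samples]
    # Group the voted labels by grid cell (each cell acts as one cluster here).
    cells = defaultdict(list)
    for x, label in zip(samples, voted):
        cells[grid_cell(x, cell_size)].append(label)
    dominant = {
        cell: Counter(labels).most_common(1)[0][0]
        for cell, labels in cells.items()
    }
    # Classification and clustering disagree: ask a human to label the sample.
    return [
        x for x, label in zip(samples, voted)
        if label != dominant[grid_cell(x, cell_size)]
    ]

if __name__ == "__main__":
    unlabeled = [(0.1, 0.1), (0.2, 0.3), (0.3, 0.2), (0.9, 0.9), (1.5, 1.5)]
    print(select_queries(CLASSIFIERS, unlabeled))  # prints [(0.9, 0.9)]
```

In this toy run, four samples share one grid cell; three are voted negative and one positive, so only the positive one is queued for labeling. Newly labeled samples would then be added to the training set before the classifiers are retrained, which is where the incremental behavior described in the abstract would come in.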
