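Research of Data Source Selection of Non-cooperative Structured Deep Web

Original title: 非合作结构化深网数据源选择技术研究
Author: 邓松 (Deng Song)
Thesis level: Doctoral
Discipline: Information Management and Information Systems
Keywords: Deep Web; Data Source Selection; User Feedback; Subject Semantics; Non-cooperation; Structured
Degree year: 2013
Supervisor: 万常选 (Wan Changxuan)
Discipline code: 0812
Degree-granting institution: 江西财经大学 (Jiangxi University of Finance and Economics)
Thesis submission date: 2013-12-01
Chair of the defense committee: 李国徽 (Li Guohui)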

Abstract

As the Web keeps growing, it has become very difficult for users to locate exactly the Web data sources they need and to query them. Web data integration systems emerged to make these sources accessible. The Deep Web, the collection of resources that cannot be reached by following hyperlinks, accounts for a large share of the Web, so effective integrated retrieval over Deep Web data has been a frontier problem in information retrieval and database research in recent years. Deep Web data integration must cope with a large number of data sources that are autonomous, whose data change dynamically, and whose data are less regular than conventional data. These characteristics pose new challenges for making effective use of Deep Web data.
     Every domain offers many accessible Deep Web data sources, but their query interfaces differ, so an integrated retrieval system must first integrate those interfaces. Once a unified interface exists, it is not feasible to simply convert each user query and submit it to every underlying data source: doing so makes queries too expensive and makes the quality of the results hard to guarantee. Data source selection has therefore become a key problem in Deep Web data integration. Its goal is to obtain results that satisfy the user's query while querying only a small number of data sources.
     Deep Web data sources fall into two main types: text data sources, and structured and semi-structured data sources. A text data source can usually be viewed as a "document collection" made up of many Web pages. Structured and semi-structured data sources store real-world entities described by multiple attributes; semi-structured sources mainly store XML data. Most existing work addresses selection over these two types. For text sources, mature information retrieval techniques are applied, and the relevance of a source is judged from its term statistics and document ranking information. For structured and semi-structured sources, a source is evaluated by mining the structural feature information it contains.
     Research on text data source selection started early and has produced many promising results. In recent years the commercial Deep Web has grown rapidly, and selection over structured and semi-structured Deep Web data sources has attracted increasing attention. Overall, this research is still in its infancy, and the following problems remain open:
     (1) Selecting data sources by relevance alone ignores the quality of the sources themselves, which places a heavy burden on later data integration steps such as entity identification and data fusion.
     (2) The best existing results on structured and semi-structured Deep Web data source selection assume that data sources are cooperative, that is, that they expose their index structures and all of their data so that a source summary can be built easily. In practice this assumption rarely holds. Further research is therefore needed on how to exploit the subject semantic information contained in sampled data, namely the associations between subject headings, between subject headings and sub-subject headings, and between subject headings and feature words, to build summaries of non-cooperative structured Deep Web data sources that better serve user queries.
     (3) Deep Web data sources are updated continuously, and when the content of a source changes its summary must be adjusted accordingly. Existing studies have not yet addressed dynamic summary updating for non-cooperative structured Deep Web data sources.
     (4) Users often submit mixed-type keyword queries that contain both search-type and constraint-type keywords. Search-type keywords express the user's primary query intent; constraint-type keywords express constraints on that intent and usually take discrete values. The summaries built by existing structured Deep Web data source selection methods do not account for such queries.
     Because structured Deep Web data sources are now widely used, this thesis focuses on selecting non-cooperative structured Deep Web data sources. Addressing the four problems above, it studies the following:
     (1) Data source quality evaluation. The key is to build a suitable evaluation model. First, sets of recommended and rejected data sources are obtained from user feedback. Second, the scores of the two sets on each objective quality dimension are computed and analyzed, and a model for identifying the core quality dimensions is designed from the degree of difference and the degree of overlap between the sets. Third, a quality evaluation model is trained with a support vector machine (SVM). Finally, the method is evaluated on data from several domains.
     (2) Data source selection for search-type keyword queries. First, representative sample data are obtained from each source with an unbiased sampling method based on backtracking drill-down. A method for extracting subject headings from the sample data is designed using part of speech, term frequency, position, and coverage; subject semantic analysis then yields the feature words associated with each subject heading in the sample of each source. To serve search-type keyword queries, a source summary is built from the associations between subject headings and between subject headings and feature words, and a matching selection strategy is given. Second, a subject space selection method and a summary-based source evaluation strategy are proposed. Finally, a sampling-based dynamic summary update algorithm is designed that exploits the correlation among subject heading updates across the sources of a domain.
     (3) Data source selection for mixed-type keyword queries. Once a summary for search-type keyword queries is in place, supporting mixed-type queries requires adding information that relates feature words to the discrete values of constraint attributes. The thesis builds a mixed summary from the associations between subject headings and feature words, histograms of how the records for each feature word are distributed over the discrete values of a constraint attribute, and the associations between those histograms, so that attributes of every type in a source are summarized effectively. Based on the characteristics of the histogram associations, a method for computing a constraint correlation score between histograms and a source evaluation strategy based on the mixed summary are given.
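The core-dimension step in item (1) can be illustrated with a minimal sketch. The abstract does not define the degree of difference or the degree of overlap, so the formulas below are assumptions made only for illustration: difference is taken as the gap between the mean scores of the two sets, and overlap as the shared part of their score ranges. The dimension names, scores, and thresholds are hypothetical, and the SVM training step is left out.

```python
from statistics import mean

def difference_degree(rec, rej):
    # Assumed definition: absolute gap between the mean scores of the two sets.
    return abs(mean(rec) - mean(rej))

def overlap_degree(rec, rej):
    # Assumed definition: length of the intersection of the two score ranges
    # divided by the length of their union (0 = disjoint, 1 = identical ranges).
    lo = max(min(rec), min(rej))
    hi = min(max(rec), max(rej))
    union = max(max(rec), max(rej)) - min(min(rec), min(rej))
    if union == 0:
        return 1.0
    return max(0.0, hi - lo) / union

def core_dimensions(recommended, rejected, min_diff=0.2, max_overlap=0.5):
    # Keep the dimensions on which the two sets differ clearly and overlap little.
    # Both thresholds are hypothetical tuning parameters.
    core = []
    for dim in recommended:
        d = difference_degree(recommended[dim], rejected[dim])
        o = overlap_degree(recommended[dim], rejected[dim])
        if d >= min_diff and o <= max_overlap:
            core.append(dim)
    return core

# Hypothetical per-dimension scores in [0, 1] for a few sources in one domain.
recommended = {
    "completeness": [0.9, 0.8, 0.85],
    "response_time": [0.5, 0.6, 0.4],
}
rejected = {
    "completeness": [0.3, 0.4, 0.5],
    "response_time": [0.45, 0.55, 0.5],
}
print(core_dimensions(recommended, rejected))
```

In this toy data only `completeness` separates the two sets, so it is the only dimension kept. In the thesis the scores on the selected dimensions then feed the SVM-based quality model.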
     The main contributions of this thesis are as follows:
     (1) A domain-oriented method for selecting high-quality data sources that uses user feedback as a key instrument. Existing quality-based selection methods usually fix a uniform set of quality dimensions from experience, so selection accuracy varies considerably across domains. From the characteristics of the recommended and rejected source sets in user feedback, the thesis derives the credibility of each user's recommendations and combines it with the number of times a source was selected to determine accurately which sources belong to the recommended set and which to the rejected set. By introducing two indicators, degree of overlap and degree of difference, to analyze the quality dimension features of the recommended and rejected sources, it builds a dimension importance evaluation model that dynamically chooses different core quality dimensions for each domain, and from these it builds a domain-specific quality evaluation model.
     (2) A hierarchical, subject-semantics-based summary for non-cooperative structured Deep Web data sources, together with a sampling-based dynamic summary update method. Taking into account subject semantic information and the correlation among subject updates across sources of the same domain, the thesis builds a hierarchical summary from the associations between subject headings, between subject headings and feature words, and between subject headings and sub-subject headings. The summary both characterizes the content of a data source and reflects the query semantics of multi-keyword combinations. On top of it, a selection strategy for search-type keyword queries is given. Exploiting the correlated subject updates within a domain, a method for computing the change rate of the subject space is designed; it can detect the updated subject headings of a domain and measure how much a given subject has changed in a source, which leads to the sampling-based dynamic summary update method.
     (3) A mixed summary over multiple attribute types that supports mixed-type keyword queries. The mixed summary is built from the associations between subject headings and feature words, the associations between subject headings, and the constraint associations between the histograms of every pair of feature words on the same constraint attribute, and it characterizes the attributes of multiple types in a data source effectively. On this basis, a selection strategy for mixed-type keyword queries is given that scores a source by how well its mixed summary matches the search-type keywords and how well it satisfies the constraints expressed by the constraint-type keywords.
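As a rough illustration of how a mixed summary could rank sources for a mixed-type keyword query, the sketch below scores each source by a search-keyword match and by histogram-based constraint satisfaction. The abstract does not give the summary's data layout or its scoring formulas, so the dictionary structure, both scoring functions, and the product used to combine them are assumptions. The constraint correlation between pairs of histograms described in the thesis is not modeled here.

```python
def search_score(summary, keywords):
    # Fraction of search-type keywords found among subject headings or feature words.
    vocab = set(summary["subjects"])
    for feature_words in summary["subjects"].values():
        vocab.update(feature_words)
    if not keywords:
        return 0.0
    return sum(k in vocab for k in keywords) / len(keywords)

def constraint_score(summary, keywords, constraints):
    # Average share of records, over the feature words that have a histogram,
    # whose constraint attribute carries the requested discrete value.
    shares = []
    for word in keywords:
        for attr, value in constraints.items():
            hist = summary["histograms"].get(word, {}).get(attr)
            total = sum(hist.values()) if hist else 0
            if total:
                shares.append(hist.get(value, 0) / total)
    return sum(shares) / len(shares) if shares else 0.0

def rank_sources(summaries, keywords, constraints):
    # Assumed combination rule: product of the two scores, highest first.
    scored = {
        name: search_score(s, keywords) * constraint_score(s, keywords, constraints)
        for name, s in summaries.items()
    }
    return sorted(scored, key=scored.get, reverse=True)

# Hypothetical summaries: subject heading -> feature words, and
# feature word -> constraint attribute -> record counts per discrete value.
summaries = {
    "source_a": {
        "subjects": {"laptop": ["ultrabook", "gaming"]},
        "histograms": {"gaming": {"brand": {"acme": 8, "globex": 2}}},
    },
    "source_b": {
        "subjects": {"laptop": ["gaming"]},
        "histograms": {"gaming": {"brand": {"acme": 1, "globex": 9}}},
    },
}
print(rank_sources(summaries, ["laptop", "gaming"], {"brand": "acme"}))
```

Both hypothetical sources match the search-type keywords equally well, so the ranking is decided by the histograms: `source_a` holds far more `gaming` records with `brand = acme` and is ranked first.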

