Construction of Kernels for Text Similarity Detection and Application in Distributed Information Retrieval

Chinese title: 文本相似度计算核函数的构造及其在分布式信息检索中的应用研究
Author: Wang Xiuhong (王秀红)
Thesis level: Doctoral
Discipline: Systems Engineering
Keywords: text similarity; kernels; distributed information retrieval; resource selection; result merging
Degree year: 2012
Supervisor: Ju Shiguang (鞠时光)
Discipline code: 081103
Degree-granting institution: Jiangsu University (江苏大学)
Thesis submission date: 2012-04-01

Abstract

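With the rapid growth of the Internet, digital libraries, and other information resources, heterogeneous data items are quickly spreading across specific nodes around the world, and these interconnected nodes form distributed processing systems. Delivering the information a user needs from this ocean of information with low time cost and high precision and recall is a highly challenging problem. In the field of information retrieval (IR), retrieving data from spatially distributed data servers is known as distributed information retrieval (DIR). DIR must solve two main problems: resource selection and result merging. Text similarity computation studies how to compute or compare the similarity of two texts. It is an important topic widely studied in linguistics, psychology, and information theory; a basic problem in information retrieval, data mining, knowledge management, and artificial intelligence; a foundational technique of natural language processing; and an important part of research on copy detection, novelty detection, and information filtering. Improving precision and recall is both the starting point and the goal of research on text similarity computation methods. This thesis studies how to retrieve similar texts in a distributed environment as quickly, accurately, and completely as possible. The main work includes: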
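     (1) Resource selection in distributed information retrieval. Resource selection, also called server selection, collection selection, data set selection, or database selection, is a basic problem in DIR. Taking into account the overlap between different data resources (data sets) and building on set-covering theory, this thesis assigns each retrieval result for a query Q a weight that depends on its position in the merged ranking, and uses that weight to compute the result's contribution to the data set that contains it. A result that has already appeared in an earlier-selected data set is not counted toward the score of a later-selected one. The weighted sum gives each candidate data set's score, which determines the order of resource selection. The resulting set of resources can then be used to answer queries Q' that are of the same kind as, or similar to, Q, reducing the time spent on repeated retrieval caused by overlap between databases.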
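     (2) Construction of a hybrid kernel function suited to text similarity computation, and its application to DIR result merging. A new composite kernel (the CLA kernel) for computing text similarity is built from an improved latent semantic kernel (LSK) and the analysis of variance (ANOVA) kernel. In addition, a new DIR merging method is proposed that merges retrieval results by directly computing the relevance between each result and the query. The new composite kernel is applied to DIR result merging. Experimental results show that the merging precision and recall of the CLA kernel are only slightly below those of the LSK and ANOVA kernels, respectively, while its overall F1 measure is better than that of the other kernels. Its merging precision is 16.79%, 30.73%, 20.37%, 24.17%, 14.25%, 13.50%, and 7.53% higher than that of the classical methods Round-robin, CombMNZ, Bayesian, Borda, SDM, MEM, and regression SVM, respectively. The CLA kernel performs well in merging and is suitable for DIR result merging.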
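     (3) Construction of an entirely new kernel function for text similarity computation, and its application to DIR result merging. To further improve text similarity computation, a new kernel function, the S_Wang kernel, is constructed. Reflecting the practical setting of text similarity computation, the texts to be compared are represented as vectors, and the similarity between two vectors is described through their product and their Euclidean distance, which yields a new kernel function suited to text similarity computation. Mercer's theorem is used to prove that the constructed function is a valid kernel. Experiments on text document similarity computation show that the S_Wang kernel outperforms the Cauchy kernel, the latent semantic kernel (LSK), and the CLA composite kernel in both precision and the overall measure. The S_Wang kernel is suitable for text similarity computation.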
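     (4) Evaluation methods for distributed information retrieval. Resource selection and result merging are the two main steps in DIR. Time cost, precision, and recall are the three main measures for IR and also for DIR. Starting from the Laplace equation, this thesis proposes a multi-variable partial differential equation model that evaluates DIR resource selection and result merging methods on the three measures of time cost, precision, and recall. Experiments evaluate a range of existing resource selection and result merging methods and verify the validity of the model. TREC experiments on 50 topics show that the multi-variable PDE model performs well in DIR evaluation and is practically applicable.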
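
The overlap-aware selection described in item (1) can be illustrated with a small greedy sketch. It is not the thesis's implementation: the reciprocal-rank weights, the function name `select_resources`, and the toy data are assumptions made for illustration, since the abstract does not state the weighting scheme.

```python
from typing import Dict, List, Set


def select_resources(merged_ranking: List[str],
                     holdings: Dict[str, Set[str]],
                     k: int) -> List[str]:
    """Greedy, overlap-aware resource selection (illustrative sketch).

    merged_ranking: document ids in merged rank order for query Q, best first.
    holdings: resource name -> set of document ids that resource returned.
    k: maximum number of resources to select.
    """
    # ASSUMPTION: weight = 1 / rank. The abstract only says that the weight
    # depends on the position in the merged ranking.
    weight = {doc: 1.0 / (pos + 1) for pos, doc in enumerate(merged_ranking)}

    covered: Set[str] = set()          # documents already provided by chosen resources
    remaining = {name: set(docs) for name, docs in holdings.items()}
    order: List[str] = []

    while remaining and len(order) < k:
        # Score each candidate only on documents not yet covered.
        gains = {name: sum(weight.get(d, 0.0) for d in docs - covered)
                 for name, docs in remaining.items()}
        # Ties are broken alphabetically so the result is deterministic.
        best = max(sorted(gains), key=gains.get)
        if gains[best] <= 0.0:
            break                      # every remaining resource is fully redundant
        order.append(best)
        covered |= remaining.pop(best)

    return order


if __name__ == "__main__":
    merged = ["d1", "d2", "d3", "d4", "d5"]
    holdings = {
        "A": {"d1", "d2"},
        "B": {"d2", "d3", "d4"},
        "C": {"d1", "d2", "d5"},
    }
    print(select_resources(merged, holdings, 3))  # prints ['C', 'B']
```

In the toy data, resource A is never selected because everything it holds is already covered by C, which is the redundancy the set-covering view is meant to avoid.
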
With the rapid growth of the Internet, digital libraries, and other information sources, data items with heterogeneous structures are spreading to nodes all over the world. The connections among these nodes form distributed information systems. Presenting what a user needs from this "information ocean" quickly, with low cost, high precision, and high recall, is a challenging problem. Distributed information retrieval is a kind of information retrieval that focuses on distributed heterogeneous information systems. Within the information retrieval community, the problem of retrieving data items from a set of collections or databases (DBs) distributed across different servers is referred to as distributed information retrieval (DIR). Collection selection and result merging are the two main sub-problems in DIR. Text similarity computation computes or compares the similarity between two given texts, and is an important issue in linguistics, psychology, and information theory. It is also a basic issue in information retrieval, data mining, knowledge management, artificial intelligence, and related fields, and a basic technology in natural language processing as well as in copy detection, novelty detection, and information filtering. Improving the precision and recall of text similarity computation is a key issue. This thesis focuses on how to retrieve similar texts in DIR as quickly as possible and with the highest possible precision and recall. The main work of this thesis includes:
     (1) A set-covering-based resource selection method for DIR. Resource selection, also called server selection, collection selection, or database selection, is a fundamental problem in distributed information retrieval (DIR). This thesis introduces a set-covering-based algorithm for resource selection in DIR that takes the extent of overlap between resources into account. Each document is given a weight according to its position in the merged results for query Q. In later-selected resources, only results that have not already appeared in an earlier-selected resource are counted. The score of each resource is the total weight of the merged results it contains, and at each selection step only the resource with the maximum score is selected. The selection order is therefore the actual ranking of the selected resources, which are then used to search a query Q' that is similar to Q. The approach saves considerable search time otherwise lost to overlap between databases and, at the same time, improves recall and precision.
     (2) A combined kernel function and its application to result merging in DIR. An improved latent semantic kernel (LSK) is combined with the analysis of variance (ANOVA) kernel to calculate text similarity. To improve result merging for DIR, a new merging method is put forward that is based on the relevance between retrieved results and the query, with the combined kernel used to calculate that relevance. Experimental results show that the result merging precision of the combination of LSK and ANOVA kernels (CLA) is 16.79%, 30.73%, 20.37%, 24.17%, 14.25%, 13.50%, and 7.53% higher than that of Round-robin, CombMNZ, Bayesian, Borda, SDM, MEM, and regression SVM, respectively. The CLA kernel method performs well and is a practical method for result merging in DIR.
     (3) Construction of a new kernel function and its application to result merging in DIR. To improve the detection of similar texts, a novel kernel function named the S_Wang kernel is constructed. Based on the practical setting of text similarity computation, the S_Wang kernel is built from the Euclidean distance and the product between the vectors that represent the text documents being compared. The function is proved to be a valid kernel function according to Mercer's theorem. The performance of the kernels in text document similarity calculation is verified experimentally. The experimental results show that the S_Wang kernel is significantly better in precision and F1 than other kernels such as the Cauchy kernel, the latent semantic kernel (LSK), and the CLA kernel. The S_Wang kernel is suitable for text similarity detection.
     (4) Evaluation methods for distributed information retrieval. Collection selection and result merging are the two major sub-problems in DIR. Computing cost, retrieval precision, and retrieval recall are the three main evaluation measures in DIR. This thesis develops a multi-variable quantitative partial differential equation (PDE) model, inspired by the Laplace equation, that links collection selection and result merging methods to the cost, precision, and recall measures. Experiments were then conducted to determine the empirical and practical evaluation performance of the model. Experimental results on 50 TREC topics indicate that the multi-variable PDE model performs well for evaluation in DIR and is a practical alternative.
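
The merging idea in item (2), scoring each retrieved result directly against the query with a kernel and sorting by that score, can be sketched as follows. The abstract does not give the CLA kernel's formula, so the sketch is only a stand-in: a plain linear kernel replaces the improved LSK, the ANOVA kernel uses one commonly cited form, and the convex combination with weight `alpha` is an assumption. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence

Vector = List[float]


def linear_kernel(x: Vector, y: Vector) -> float:
    # Stand-in for the LSK: an LSK is a linear kernel taken after projecting
    # into a latent semantic space; the projection is omitted here.
    return sum(a * b for a, b in zip(x, y))


def anova_kernel(x: Vector, y: Vector, sigma: float = 1.0, degree: int = 2) -> float:
    # One commonly cited ANOVA form: sum_k exp(-sigma * (x_k - y_k)^2) ** degree.
    return sum(math.exp(-sigma * (a - b) ** 2) ** degree for a, b in zip(x, y))


def combined_kernel(x: Vector, y: Vector, alpha: float = 0.5) -> float:
    # ASSUMPTION: a convex combination. A non-negative weighted sum of valid
    # kernels is itself a valid kernel, but the thesis may combine them differently.
    return alpha * linear_kernel(x, y) + (1.0 - alpha) * anova_kernel(x, y)


def vectorize(text: str, vocab: Sequence[str]) -> Vector:
    # Simple term-frequency vector over a shared vocabulary.
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocab]


def merge_results(query: str,
                  result_lists: Sequence[Sequence[str]],
                  kernel: Callable[[Vector, Vector], float] = combined_kernel) -> List[str]:
    """Merge per-collection result lists by kernel similarity to the query."""
    docs = sorted({doc for results in result_lists for doc in results})  # drop duplicates
    vocab = sorted({t for text in docs + [query] for t in text.lower().split()})
    q = vectorize(query, vocab)
    scored = [(kernel(q, vectorize(doc, vocab)), doc) for doc in docs]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # highest similarity first
    return [doc for _, doc in scored]


if __name__ == "__main__":
    lists = [
        ["kernel methods for text similarity", "cooking pasta at home"],
        ["text similarity with kernel functions", "cooking pasta at home"],
    ]
    for doc in merge_results("kernel text similarity", lists):
        print(doc)
```

Because the merged order depends only on the kernel score against the query, the same function works no matter how many collections contributed results or how their local scores were scaled.
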

