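Research on Chinese Text Similarity Based on a Character-Count Difference Factor

English title: Study of Chinese Text Similarity Based on Number Difference Gene
Author: Chen Yongchao (陈永超)
Degree level: Master's
Discipline: Computer Application Technology
Chinese keywords: 文本相似度 (text similarity) ; 中文分词 (Chinese word segmentation) ; 字数差别因子 (character-count difference factor)
English keywords: context similarity ; Chinese cutting word ; number difference gene
Degree year: 2011
Advisor: Niu Yan (钮焱)
Discipline code: 081203
Degree-granting institution: Hubei University of Technology (湖北工业大学)
Thesis submission date: 2011-05-01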

Abstract

Text similarity computation is a foundational task in Chinese information processing. A good method must be both accurate and efficient: it should compare texts at the level of their natural-language meaning, understanding the semantics intended by the author or source well enough to produce similarity judgments close to those of a human reader, and it should run efficiently enough to save processing time when handling large volumes of text.

The spread of micro-information is a new feature of the development of information technology. Taking the characteristics of micro-information into account, and to address the semantic deviation that arises when a long text lexically covers a short one, this thesis proposes a Chinese text similarity algorithm based on character-count difference. After surveying the domestic and international literature and analyzing the current state of similarity computation, it proposes a new way to improve similarity performance: combining traditional statistics-based methods with narrowly semantic methods, so as to pair the efficiency of statistics with the accuracy of semantics. Combining the strengths of the two families also means overcoming the weaknesses of each. Taking character-count difference as the entry point and exploiting the variety in the character counts of Chinese words, the thesis combines each word's frequency and character count with its semantics, and successfully extends HowNet-based word similarity computation to similarity computation between texts.

Finally, a small self-built text collection is used as the test set, and different similarity methods are compared in a laboratory setting. The results indicate that the method based on character-count difference outperforms traditional statistics-based and semantics-based methods. The results were also checked manually for accuracy and for word segmentation speed. The work offers a new approach to Chinese text similarity computation.
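
The abstract does not give the formula, so the following Python sketch (3.9+) is only an illustration of the general idea, not the algorithm of the thesis. Everything in it is an assumption: the weight "frequency times character count", the `length_factor` penalty for a character-count gap between a long and a short text, and above all `word_similarity`, which is a simple character-overlap stand-in for a HowNet-based word similarity. Texts are assumed to be already segmented into words.

```python
from collections import Counter


def word_similarity(a: str, b: str) -> float:
    """Stand-in for a HowNet-based word similarity (assumption):
    Dice coefficient over the sets of characters in each word."""
    if a == b:
        return 1.0
    set_a, set_b = set(a), set(b)
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def directed_score(source: list[str], target: list[str]) -> float:
    """Weighted average, over source words, of the best match in target."""
    if not source or not target:
        return 0.0
    candidates = set(target)
    total_weight = 0.0
    total_score = 0.0
    for word, freq in Counter(source).items():
        weight = freq * len(word)  # assumed weight: frequency x character count
        best = max(word_similarity(word, other) for other in candidates)
        total_weight += weight
        total_score += weight * best
    return total_score / total_weight if total_weight else 0.0


def length_factor(a: list[str], b: list[str]) -> float:
    """Assumed penalty for the character-count gap between two texts."""
    chars_a = sum(len(w) for w in a)
    chars_b = sum(len(w) for w in b)
    longest = max(chars_a, chars_b)
    return min(chars_a, chars_b) / longest if longest else 0.0


def text_similarity(a: list[str], b: list[str]) -> float:
    """Symmetric score in [0, 1] for two pre-segmented texts."""
    semantic = (directed_score(a, b) + directed_score(b, a)) / 2
    return semantic * length_factor(a, b)


short_text = ["文本", "相似度"]
long_text = short_text + ["中文", "分词", "算法", "研究"]
score = text_similarity(short_text, long_text)
```

In this sketch the long text fully covers the short one, yet `score` stays well below 1 because both the reverse-direction match and the length factor pull it down. That is the kind of coverage bias the abstract says the method is meant to counter; a real implementation would replace `word_similarity` with a sememe-based measure.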
