The Research of Knowledge Acquisition Algorithm and Semantics Computation for Chinese Vocabulary and Its Applications

Original title: 中文词汇知识获取算法和语义计算研究及应用
Author: Liu Xinglin (刘兴林)
Degree level: Doctoral
Discipline: Computer Application Technology
Keywords: Vocabulary Knowledge; Compound-word; Thematic Term; Semantic Computation; Text Similarity
Degree year: 2012
Supervisor: Zheng Qilun (郑启伦)
Discipline code: 081203
Degree-granting institution: South China University of Technology (华南理工大学)
Thesis submission date: 2012-04-01
Chair of the defense committee: Yin Jian (印鉴)

Abstract

The rapid development of the Internet has made it the most important resource for disseminating and sharing information worldwide, and its data volume is growing exponentially. Extracting useful knowledge from the Web nevertheless remains very difficult: "an explosion of data but a shortage of knowledge" has become a problem that many researchers urgently need to solve.
     Most current research on knowledge acquisition starts purely from computer technology, mining and extracting knowledge at the level of grammatical and logical structure by means of rules, sentence patterns, and similar devices. However, new concepts keep emerging, and many new words are created to express them. These new words consist of several morphemes or words, and until a word segmentation system has added them to its lexicon it splits them into those component morphemes or words. As a result, existing knowledge acquisition methods cannot recognize them correctly, let alone compare them at the semantic level. This poses a new challenge for knowledge acquisition. It also means that search engines, which rely mainly on information retrieval techniques, process web pages by "non-semantic" keyword matching, so retrieval precision is low. Introducing semantic computation is expected to improve this situation.
     This thesis covers two main areas: knowledge acquisition algorithms for Chinese vocabulary and semantic computation methods for Chinese vocabulary. Building on a word segmentation system, it recognizes compound words, addressing the problem that out-of-vocabulary words cannot be recognized correctly. It builds a part-of-speech tagging model for compound words, which removes part-of-speech ambiguity, addresses the fact that existing part-of-speech tagging models cannot be applied directly to compound words, and at the same time corrects the segmentation results. On top of compound-word recognition, it extracts thematic terms from text and builds a vocabulary semantic computation model so that words can be compared with one another. Replacing traditional keyword matching with semantic computation is a fundamental route to intelligent information retrieval, and it is also key groundwork for building a vocabulary semantic knowledge base and for knowledge reasoning.
     Finally, the thesis implements a platform for Chinese vocabulary knowledge acquisition and semantic computation. Applying the algorithms above, it builds an integrated system comprising Chinese vocabulary knowledge acquisition, Chinese vocabulary semantic computation, and a vocabulary semantic knowledge base, which validates the significance of the work and the effectiveness of the proposed algorithms.
     The main contributions of this thesis are as follows:
     1. To address the difficulty of recognizing out-of-vocabulary words, a Chinese compound-word recognition algorithm, CWRWCDG, based on part-of-speech detection and a word co-occurrence directed graph is proposed. The algorithm first uses part-of-speech detection to obtain word strings from the text and then builds a word co-occurrence directed graph from those strings. Borrowing the idea of the Bellman-Ford algorithm, it searches the graph, from multiple source vertices, for the longest paths whose weights satisfy a given condition; the word strings corresponding to these paths are taken to be compound words. Experimental results show that the algorithm outperforms comparable algorithms.
     2. The main difficulty in tagging Chinese compound words is determining the part of speech. To address it, a part-of-speech tagging algorithm for Chinese compound words based on head-feature percolation theory is proposed. The theory was first put forward by Lieber in 1980, who argued that in English the part of speech of a compound word is determined by its head component. This thesis applies the theory to part-of-speech tagging of Chinese compound words and, as practical needs dictate, provides two tagging modes, explicit and implicit.
     3. Existing thematic term extraction algorithms mostly work from word frequency, for example TF/IDF values, and perform poorly on texts whose words are fairly evenly distributed. For such texts, a thematic term extraction algorithm, TTEITS, based on word position weight and incremental term set frequency is proposed. The algorithm holds that occurrences of the same word at different positions in a text contribute differently to whether that word becomes a thematic term. Moreover, when deciding whether a candidate really is a thematic term, it computes not only the weight (frequency) of the single word but also the incremental weight (frequency) that the word adds to the thematic term set as a whole: if the increment exceeds a given threshold the word is accepted as a thematic term, and otherwise the algorithm terminates. The advantage is that the most suitable thematic terms can still be extracted when all candidates occur rarely and about equally often.
     4. Applying thematic term sets to automatic summarization, a Chinese automatic summarization algorithm, CASTTS, based on the thematic term set is proposed. The algorithm first extracts the thematic terms of a text with TTEITS, then weights each sentence by the weights of the thematic terms it contains to obtain the total weight of each sentence with respect to the thematic term set, and finally, according to the summarization ratio, selects the sentences with the largest weights and outputs them in their original order as the summary. Experimental results show that the summaries are of high quality and close to the reference summaries.
     5. To address some shortcomings of existing vocabulary semantic computation and text similarity calculation, a text similarity calculation method, TSCTTS, based on the thematic term set and on HowNet is proposed, which converts text similarity calculation into similarity calculation between thematic term sets. The method first extracts the thematic terms of each text with TTEITS, then obtains the semantic distance between two words in the HowNet sememe hierarchy and converts it into a semantic similarity by a conversion formula, and finally derives the text similarity from the semantic similarity of the thematic term sets. Applied to a text classification experiment, the method showed good classification performance.
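
The selection rule of contribution 3 (TTEITS) can be illustrated with a small sketch. The abstract gives neither the position weights nor the exact definition of incremental term set frequency, so the code below makes two loud assumptions: each sentence carries a caller-supplied position weight, and the term set frequency is taken to be the weighted share of sentences that contain at least one term of the set. The function name `extract_thematic_terms` and all parameter names are hypothetical, and this is one possible reading rather than the thesis's implementation.

```python
from collections import defaultdict


def extract_thematic_terms(sentences, position_weights, threshold):
    """Greedy thematic-term selection (illustrative reading of TTEITS).

    sentences: list of token lists, one per sentence.
    position_weights: one weight per sentence (hypothetical values, e.g.
        a larger weight for a title or opening sentence).
    threshold: minimum increment in term-set frequency needed to accept a term.
    """
    total = sum(position_weights)

    # Single-word weight: position-weighted count of sentences containing the word.
    word_weight = defaultdict(float)
    for tokens, weight in zip(sentences, position_weights):
        for token in set(tokens):
            word_weight[token] += weight

    # Visit candidates from heaviest to lightest (ties broken alphabetically).
    candidates = sorted(word_weight, key=lambda t: (-word_weight[t], t))

    chosen, covered = [], set()
    for term in candidates:
        # Assumed "incremental term set frequency": the weighted share of
        # sentences that the term adds to those already covered by the set.
        newly = {i for i, tokens in enumerate(sentences)
                 if term in tokens and i not in covered}
        increment = sum(position_weights[i] for i in newly) / total
        if increment <= threshold:
            break  # per the abstract, the algorithm ends at the first rejection
        chosen.append(term)
        covered |= newly
    return chosen


sentences = [
    ["semantic", "computation"],
    ["semantic", "retrieval"],
    ["text", "retrieval"],
    ["text", "classification"],
]
print(extract_thematic_terms(sentences, [2.0, 1.0, 1.5, 1.0], threshold=0.25))
```

In this toy input no word is frequent, yet the incremental test still separates terms that extend the set's coverage from terms that merely repeat it, which is the behaviour the abstract claims for evenly distributed texts.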
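
The sentence-selection step of contribution 4 (CASTTS) is specified closely enough to sketch. The code below assumes the thematic terms and their weights are already available (for example from a TTEITS-style extractor), counts each distinct thematic term once per sentence, and rounds the number of kept sentences up. Those three choices, along with the name `summarize`, are assumptions made for illustration and are not taken from the thesis.

```python
import math


def summarize(sentences, term_weights, ratio):
    """Extractive summary by thematic-term weight (illustrative sketch).

    sentences: list of token lists, in original order.
    term_weights: dict mapping each thematic term to its weight.
    ratio: fraction of sentences to keep (the summarization ratio).
    """
    # Total weight of a sentence: sum of the weights of the distinct
    # thematic terms it contains (non-thematic tokens contribute 0).
    scores = [
        sum(term_weights.get(token, 0.0) for token in set(tokens))
        for tokens in sentences
    ]

    # Number of sentences to keep, at least one (rounding up is an assumption).
    keep = max(1, math.ceil(len(sentences) * ratio))

    # Highest-weighted sentences first; earlier position wins ties.
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:keep]

    # Emit the selected sentences in their original order.
    return [sentences[i] for i in sorted(top)]


document = [
    ["hownet", "defines", "sememes"],
    ["the", "weather", "is", "fine"],
    ["similarity", "uses", "hownet", "sememes"],
    ["similarity", "matters"],
]
weights = {"hownet": 0.5, "sememes": 0.3, "similarity": 0.4}
for sentence in summarize(document, weights, ratio=0.5):
    print(" ".join(sentence))
```

Sorting the selected indices before output is what keeps the summary in source order, as the abstract requires.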
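
The distance-to-similarity chain of contribution 5 (TSCTTS) can be sketched on a toy hierarchy. The abstract does not state its conversion formula, so the code assumes the commonly used form sim = alpha / (d + alpha) with alpha = 1.6, and it assumes that set similarity is the average best match taken in both directions. The tree in `PARENT` is a made-up stand-in, not real HowNet data, and each word is mapped directly to a single node, which is a simplification.

```python
# Hypothetical miniature hierarchy: child -> parent. Not real HowNet sememes.
PARENT = {
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
}


def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def semantic_distance(a, b, parent):
    """Number of edges on the tree path between a and b."""
    depth_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for steps_from_b, n in enumerate(path_to_root(b, parent)):
        if n in depth_from_a:  # first shared ancestor on the way up from b
            return depth_from_a[n] + steps_from_b
    raise ValueError("nodes are not in the same tree")


def word_similarity(a, b, parent, alpha=1.6):
    """Assumed conversion from distance to similarity: alpha / (d + alpha)."""
    return alpha / (semantic_distance(a, b, parent) + alpha)


def text_similarity(terms_a, terms_b, parent, alpha=1.6):
    """Assumed set similarity: mean best match, averaged over both directions."""
    def directed(src, dst):
        best = [max(word_similarity(s, d, parent, alpha) for d in dst) for s in src]
        return sum(best) / len(src)
    return (directed(terms_a, terms_b) + directed(terms_b, terms_a)) / 2


print(text_similarity(["dog"], ["cat"], PARENT))
print(text_similarity(["dog"], ["tree"], PARENT))
```

With this formula identical terms score 1 and the score falls as the tree distance grows, so two texts whose thematic terms sit close together in the hierarchy come out as more similar than texts whose terms lie in different branches.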

