Research on Automatic Extraction of New Terms in the Information Technology Field Based on the Dynamic Circulation Corpus (DCC)

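English title: Dynamic Circulation Corpus (DCC) Based Automatic Unlisted Term Extraction in the Field of Information Technology
Author: Wang Qiangjun (王强军)
Degree level: Doctoral
Discipline: Linguistics and Applied Linguistics
Keywords (translated from Chinese): dynamic language knowledge updating; dynamic circulation corpus; automatic term extraction; new terms; concatenation index; TFIDF; domain subtraction
Keywords (English, as given): Dynamic Updating of Language and Knowledge; DCC (Dynamic Circulation Corpus); Automatic Term Extraction; Unlisted Term; Concatenation Index; TFIDF; Domains Subtracting
Degree year: 2003
Advisor: Zhang Pu (张普)
Discipline code: 050102
Degree-granting institution: Beijing Language and Culture University (北京语言文化大学)
Thesis submission date: 2003-05-01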

Abstract

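Guided by the theory of dynamic language knowledge updating and taking the information technology field as its experimental object, this dissertation studies term extraction based on a large-scale dynamic circulation corpus. It proposes using a concatenation index to judge the unithood of a character string, implements a technical route and workflow for term extraction that combines "concatenation index + TFIDF + domain subtraction", and forms a preliminary system for extracting new terms in the information technology field from the dynamic circulation corpus.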
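     This dissertation introduces the theoretical system of dynamic language knowledge updating and the research framework built on the dynamic circulation corpus. It also proposes an extension scheme for building the corpus that broadens the scope and depth of research while remaining fully compatible with the existing system and reasonably scalable.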
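     A new term is first of all a term, and it has the three basic features of terms: it generally appears in only one or a few specific domains; it is a high-circulation word in its own domain; and its circulation in other domains is close to 0. On this basis, the basic approach of this dissertation is to study how existing terms are distributed in the corpus in order to determine how new terms are likely to be distributed, and to analyze the extraction results for existing terms under various threshold conditions in order to determine the best threshold conditions for extracting new terms.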
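The idea of choosing a threshold by checking how well already-known terms are recovered can be sketched as follows. This is a minimal illustration, not the dissertation's procedure: the function names, the use of F1 as the selection criterion, and the toy scores are all assumptions. Note that precision measured against a list of known terms undercounts, because genuinely new terms are scored as misses.

```python
def sweep_thresholds(scores, known_terms, thresholds):
    """For each threshold, measure how well known terms are recovered.

    scores: dict mapping candidate string -> score (hypothetical scoring).
    known_terms: set of terms already listed in a term bank.
    Returns a list of (threshold, precision, recall, f1) tuples.
    """
    rows = []
    for t in thresholds:
        extracted = {s for s, v in scores.items() if v >= t}
        hits = len(extracted & known_terms)
        precision = hits / len(extracted) if extracted else 0.0
        recall = hits / len(known_terms) if known_terms else 0.0
        f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
        rows.append((t, precision, recall, f1))
    return rows


def best_threshold(rows):
    """Pick the threshold with the highest F1 (first one wins on ties)."""
    return max(rows, key=lambda row: row[3])[0]


# Toy data: two known IT terms and two non-terms with made-up scores.
scores = {"网络": 0.9, "服务器": 0.8, "我们": 0.3, "的是": 0.1}
known = {"网络", "服务器"}
rows = sweep_thresholds(scores, known, [0.05, 0.5, 0.85])
chosen = best_threshold(rows)
```

The chosen threshold would then be reused when scoring strings that are not in the known-term list.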
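     New terms are often unlisted (out-of-vocabulary) words, so every difficulty of unlisted word recognition also arises in new term extraction. A corpus processed by traditional automatic word segmentation poses the same difficulty for new term extraction as it does for unlisted word recognition. Therefore, to retain as many new terms as possible, this dissertation preprocesses the corpus with a full segmentation method.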
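The abstract does not spell out the full segmentation procedure. One simple reading, assumed here, is that every substring up to some maximum length is kept as a candidate, so that no new term is lost to a dictionary-based segmenter. The function name and the length limits below are illustrative choices.

```python
from collections import Counter


def candidate_counts(text, min_len=2, max_len=4):
    """Count every substring of text whose length is in [min_len, max_len].

    This keeps all possible candidates instead of committing to one
    segmentation; the length limits are assumptions for illustration.
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


counts = candidate_counts("abcab", min_len=2, max_len=3)
```

In practice the text would first be split at punctuation so that candidates do not span sentence boundaries, and later filtering steps would discard most of these candidates.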
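     Two indicators determine whether a character string is a term in a given context: unithood and termhood. This dissertation proposes the concept of the concatenation index to measure the unithood of a character string. Experiments show that the concatenation index is fairly effective at judging whether a character string is a complete word.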
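The abstract does not give the formula for the concatenation index, so it is not reproduced here. To illustrate the general idea of judging unithood from how a string connects to its context, the sketch below uses a different proxy, swapped in for illustration only: the number of distinct characters seen immediately to the left and right of the string, a measure often called accessor variety in the word segmentation literature. A string that is only part of a longer word tends to have the same neighbor on one side every time. The function name and boundary markers are assumptions.

```python
def neighbor_variety(text, s):
    """Return min(distinct left neighbors, distinct right neighbors) of s.

    A stand-in unithood proxy, NOT the dissertation's concatenation index.
    Text boundaries are represented by the markers "<s>" and "</s>".
    """
    left, right = set(), set()
    start = text.find(s)
    while start != -1:
        end = start + len(s)
        left.add(text[start - 1] if start > 0 else "<s>")
        right.add(text[end] if end < len(text) else "</s>")
        start = text.find(s, start + 1)
    return min(len(left), len(right))


text = "用服务器，买服务器。服务器好"
whole = neighbor_variety(text, "服务器")  # complete word: varied context
part = neighbor_variety(text, "服务")     # fragment: always followed by "器"
```

Here the complete word scores higher than its fragment, which is the kind of contrast a unithood measure is meant to expose.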
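     For extraction, this dissertation proposes the "concatenation index + TFIDF + domain subtraction" method: the concatenation index judges a string's unithood, and "TFIDF + domain subtraction" judges its termhood. The method was tested on part of the Dynamic Circulation Corpus (DCC), with a target corpus of 17 million characters and a contrast corpus of 600 million characters. The results show that, in automatic term extraction from a large-scale corpus, the corpus processing method and extraction technique adopted here are notably effective at discovering new terms: with relatively little manual intervention they extract a relatively large number of new terms, partly accomplishing a task that traditional word segmentation methods struggle with.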
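How a "TFIDF + domain subtraction" termhood test might be put together is sketched below. The abstract does not state the exact formulas, so everything here is an assumption: TF is taken as the raw count in the target corpus, IDF as the log of total documents over documents containing the string, and domain subtraction as the difference between the string's relative frequency in the target corpus and in the contrast corpus. The function names, default thresholds, and toy documents are hypothetical.

```python
import math


def tfidf(term, target_docs, all_docs):
    """TF in the target corpus times IDF over all documents (assumed form)."""
    tf = sum(doc.count(term) for doc in target_docs)
    df = sum(1 for doc in all_docs if term in doc)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(all_docs) / df)


def domain_difference(term, target_docs, contrast_docs):
    """Relative frequency in the target corpus minus that in the contrast corpus."""
    def rel_freq(docs):
        total_chars = sum(len(doc) for doc in docs)
        if total_chars == 0:
            return 0.0
        return sum(doc.count(term) for doc in docs) / total_chars
    return rel_freq(target_docs) - rel_freq(contrast_docs)


def is_term_candidate(term, target_docs, contrast_docs,
                      tfidf_min=1.0, diff_min=0.0):
    """Keep a string only if it passes both tests (thresholds are made up)."""
    all_docs = target_docs + contrast_docs
    return (tfidf(term, target_docs, all_docs) >= tfidf_min
            and domain_difference(term, target_docs, contrast_docs) > diff_min)


target = ["服务器连接网络", "网络服务器的配置"]    # toy IT-domain documents
contrast = ["今天的天气很好", "我们的生活", "他的书"]  # toy general documents
keeps_server = is_term_candidate("服务器", target, contrast)
keeps_de = is_term_candidate("的", target, contrast)
```

With these toy documents the domain word passes both tests, while the common function word fails because it is spread across documents and is relatively more frequent in the contrast corpus.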
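     In addition, this dissertation discusses two working modes for term extraction, the "file + index + statistics" mode and the "file + database" mode, analyzes their advantages and disadvantages, and points out that the latter is a good application of dynamic language knowledge updating to language monitoring.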
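     In summary, the innovations of this dissertation are as follows:
     1. It proposes the concept of the concatenation index.
     2. It uses the concatenation index to measure the unithood of a character string.
     3. For term extraction, it proposes the "concatenation index + TFIDF + domain subtraction" method.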
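     The preliminary term extraction system formed in this research can serve as a prototype and reference for term extraction in specialized domains and for building the dynamic circulation corpus.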
This dissertation studies the automatic extraction of unlisted terms in the field of information technology based on the large-scale DCC (Dynamic Circulation Corpus), under the theory of Dynamic Updating of Language and Knowledge. It proposes the concept of the Concatenation Index to decide whether a character string is a word or phrase, and presents a method named "Concatenation Index + TFIDF + Domains Subtracting" for extracting unlisted terms. The IT domain was chosen as the experimental object in order to establish a preliminary research workflow based on the theory of Dynamic Updating of Language and Knowledge.
    This research introduces the framework of Dynamic Updating of Language and Knowledge and suggests a scheme to improve the Dynamic Circulation Corpus (DCC). The scheme makes it possible to enlarge the DCC in both content and structure while remaining compatible with the existing system.
    Terms have three basic characteristics: a term usually appears in only one or a few specialized domains; it has a high degree of circulation in its own domain; and its circulation in other domains is near 0. Unlisted terms are terms, so they share these three characteristics. On this basis, the approach of this research is to ascertain the likely distribution of unlisted terms in the corpus by examining the distribution of listed terms, and to set the best threshold for extracting unlisted terms by analyzing the extraction results for listed terms under different thresholds.
    Unlisted terms are usually unlisted words, so the difficulties of recognizing unlisted words also arise in extracting unlisted terms. A corpus processed by traditional word segmentation poses the same difficulty for extracting unlisted terms as for recognizing unlisted words. Therefore, to retain as many unlisted terms as possible, this research preprocesses the corpus with a full (traversing) word segmentation method.
    Two indices indicate whether a character string can be a term in a given context: unithood and termhood. This research proposes the Concatenation Index as a measure of the unithood of a character string. Experiments show that the Concatenation Index is fairly effective in determining whether a character string is a complete word or phrase.
    This research also presents a method named "Concatenation Index + TFIDF + Domains Subtracting" for extracting unlisted terms. The Concatenation Index decides the unithood of a character string, and "TFIDF + Domains Subtracting" decides its termhood. The method was tested on the DCC. The results show that the methods and techniques adopted in this research are notably effective in processing the corpus and extracting unlisted terms: with less human intervention, more unlisted terms are extracted. As a result, the method partly accomplishes a task that traditional word segmentation finds difficult.
    The dissertation also discusses two processing modes for extracting unlisted terms, the "text-index-statistics" mode and the "text-database" mode, along with their strengths and weaknesses. It points out that the "text-database" mode is the better application of Dynamic Updating of Language and Knowledge to language monitoring.
    In summary, the main innovations of this research are as follows:
    (1) It proposes the concept of the Concatenation Index;
    (2) It applies the Concatenation Index to measuring the unithood of a character string;
    (3) It presents a method named "Concatenation Index + TFIDF + Domains Subtracting" for extracting unlisted terms.
    This research established a preliminary research workflow based on the theory of Dynamic Updating of Language and Knowledge. It can serve as a prototype and reference for extracting unlisted terms in other domains and for building and updating the DCC.
