Research on Chinese Comma Classification (中文逗号分类方法研究)

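English title: Research on Chinese Comma Classification
Author: Xu Shengqin (徐生芹)
Degree level: Master's
Discipline: Computer Application Technology
Keywords: Chinese sentence segmentation; elementary discourse unit recognition; maximum entropy model; syntactic parse tree; integer linear programming
Degree year: 2013
Advisors: Qian Peide (钱培德); Li Peifeng (李培峰)
Discipline code: 081203
Degree-granting institution: Soochow University (苏州大学)
Thesis submission date: 2013-05-01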

Abstract

Punctuation is receiving growing attention in natural language processing, and the comma is the most frequently used punctuation mark in Chinese and in most other languages. Because the comma serves many purposes and is used flexibly, its function is hard to pin down. This thesis studies the roles the Chinese comma plays within a sentence, focusing on comma classification methods for Chinese sentence segmentation and elementary discourse unit (EDU) recognition. The main contributions are as follows:
     (1) Based on detailed statistics and analysis of CTB 6.0 (Penn Chinese Treebank), two comma classification standards are summarized. The first treats the comma as a potential sentence boundary and divides commas into two classes: EOS (End Of a Sentence), commas that can split a sentence, and Non-EOS (Not the End Of a Sentence), commas that cannot. The second divides commas into seven classes based on syntactic patterns, treating the comma as an EDU boundary while also labeling the discourse relation between the units it separates.
     (2) A hierarchical method for Chinese sentence segmentation is proposed. A first-layer classifier labels each comma using a set of effective syntactic features; commas that the first layer classifies with low confidence are then reclassified by a second-layer classifier using new features. Experiments show that the hierarchical model outperforms the baseline.
     (3) A comma-based method for Chinese EDU recognition, together with its optimization, is proposed. First, effective syntactic features for EDU recognition are selected, and a Maximum Entropy model and a Conditional Random Field model are each used for recognition. Second, global and local information is captured from these two classifiers, and sequence labeling is combined with statistical (probabilistic) classification to improve recognition performance. Experiments show that this method also outperforms the baseline.
     In sum, this thesis proposes a comma-based approach to Chinese sentence segmentation and EDU recognition. The experimental results demonstrate its effectiveness, and the approach should benefit natural language processing techniques that rely on discourse analysis.
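
The two-layer idea in contribution (2) can be illustrated with a minimal sketch. This is not the thesis's implementation: the confidence threshold, the feature names (`left_is_clause`, `right_is_clause`, `right_has_subject`), and both toy scorers are assumptions made up for illustration, standing in for trained classifiers that estimate P(EOS) for each comma.

```python
from typing import Callable, Dict, List

Features = Dict[str, bool]
# A scorer maps one comma's features to an estimated P(EOS).
# Both scorers below are hypothetical stand-ins, not trained models.
Scorer = Callable[[Features], float]


def cascade_classify(
    commas: List[Features],
    first_layer: Scorer,
    second_layer: Scorer,
    min_confidence: float = 0.8,  # assumed threshold, not taken from the thesis
) -> List[str]:
    """Label each comma EOS or Non-EOS using a two-layer cascade."""
    labels = []
    for feats in commas:
        p_eos = first_layer(feats)
        confidence = max(p_eos, 1.0 - p_eos)
        if confidence < min_confidence:
            # Low-confidence sample: defer to the second layer,
            # which may look at different (new) features.
            p_eos = second_layer(feats)
        labels.append("EOS" if p_eos >= 0.5 else "Non-EOS")
    return labels


def toy_first_layer(feats: Features) -> float:
    # Hypothetical rule: confident only when both sides agree.
    if feats["left_is_clause"] and feats["right_is_clause"]:
        return 0.95
    if not feats["left_is_clause"] and not feats["right_is_clause"]:
        return 0.05
    return 0.55  # unsure


def toy_second_layer(feats: Features) -> float:
    # Hypothetical extra feature used only for the hard cases.
    return 0.9 if feats["right_has_subject"] else 0.1


commas = [
    {"left_is_clause": True, "right_is_clause": True, "right_has_subject": False},
    {"left_is_clause": False, "right_is_clause": False, "right_has_subject": True},
    {"left_is_clause": True, "right_is_clause": False, "right_has_subject": True},
    {"left_is_clause": False, "right_is_clause": True, "right_has_subject": False},
]
labels = cascade_classify(commas, toy_first_layer, toy_second_layer)
print(labels)
```

In this sketch the first two commas are settled by the first layer, while the last two fall below the threshold and are decided by the second layer. Lowering `min_confidence` to 0.5 would leave every decision to the first layer.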
