Extraction and Investigation of Diachronic Steady-State Words in Modern Chinese Based on 70 Years of Newspaper Corpora

English title: Extraction and Investigation of State Steady Words from 70 Years Newspapers
Authors: 饶高琦; 李宇明
English authors: RAO Gaoqi; LI Yuming; Center for Studies of Chinese as a Second Language, Beijing Language and Culture University; Institute for Chinese Language Policies and Standards, Beijing Language and Culture University
Keywords: steady-state words; diachronic corpus; language monitoring
English keywords: steady-state word; diachronic corpus; language monitoring
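Chinese journal title: MESS
English journal title: Journal of Chinese Information Processing
Institutions: Center for Studies of Chinese as a Second Language, Beijing Language and Culture University; Institute for Chinese Language Policies and Standards, Beijing Language and Culture University
Publication date: 2016-11-15
Publisher: 中文信息学报 (Journal of Chinese Information Processing)
Year: 2016
Issue: v.30
Funding: National Social Science Fund of China (12&ZD173); National Social Science Fund of China (16AYY007); State Language Commission research projects (YB125-42; ZDI135-3); 863 Program key project (SQ2015AA0100074); Major Project of Key Research Institutes of Humanities and Social Sciences, Ministry of Education (16JJD740004)
Language: Chinese
Page: MESS201606007
Number of pages: 10
CN: 06
ISSN: 11-2325/N
Classification number: 67-76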

Abstract

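Using a diachronic newspaper corpus spanning 70 years, this paper applies nine statistical methods to compute the year-by-year usage of words. By examining stability, coverage, and temporal discriminative power, it obtains a candidate set of 3,013 diachronic steady-state words. Verbs and nouns each account for about one third of the set (the rest are adjectives, adverbs, and function words), and the average word length is about 1.7 characters. The words fall within the top 7,609 ranks of the overall frequency list of the diachronic corpus, densely toward the top and sparsely further down, and cover nearly 90% of the whole corpus. The set contains a large number of core words that build sentence structure, and these shape the word-length and part-of-speech characteristics of steady-state words. Extracting steady-state words can deepen our understanding of the underlying layer of language life and of basic vocabulary, and is significant for Chinese language teaching, Chinese information processing, and language planning.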
Based on a diachronic corpus of modern Chinese newspapers spanning 70 years, statistical measures are applied to detect state-steady words. Altogether, 3,013 words are selected as candidates according to their corpus coverage, time sensitivity, and diachronic classification. Among them, verbs and nouns each account for about one third, and the rest consists of adjectives and function words. The average word length is 1.7 characters; the words are distributed within the top 7,609 ranks of the frequency list and cover nearly 90% of the corpus. Basic morphemes and core words shape the features of the set in POS and length.
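The abstract evaluates candidate words by how much of the corpus they cover and how stable their usage is across years. The following is a minimal sketch of those two ideas on toy data, not the paper's procedure: the function names `corpus_coverage` and `relative_frequency_cv`, the toy counts, and the use of the coefficient of variation as a stability score are all assumptions made for illustration, and the nine statistical methods the paper actually uses are not listed in this record.

```python
from collections import Counter

def corpus_coverage(yearly_counts, word_set):
    """Share of all tokens in the corpus accounted for by words in word_set."""
    total = sum(sum(counts.values()) for counts in yearly_counts.values())
    covered = sum(counts[w] for counts in yearly_counts.values() for w in word_set)
    return covered / total if total else 0.0

def relative_frequency_cv(yearly_counts, word):
    """Coefficient of variation of a word's yearly relative frequency.

    Lower values mean more even usage across years (an assumed stability proxy).
    """
    rel = []
    for counts in yearly_counts.values():
        n = sum(counts.values())
        rel.append(counts[word] / n if n else 0.0)
    mean = sum(rel) / len(rel)
    if mean == 0:
        return float("inf")  # word never occurs
    variance = sum((r - mean) ** 2 for r in rel) / len(rel)
    return variance ** 0.5 / mean

# Hypothetical toy corpus: one token counter per year.
toy = {
    1950: Counter({"的": 50, "是": 30, "合作社": 20}),
    1980: Counter({"的": 50, "是": 30, "改革": 20}),
    2010: Counter({"的": 50, "是": 30, "网络": 20}),
}

coverage = corpus_coverage(toy, {"的", "是"})
cv_stable = relative_frequency_cv(toy, "的")
cv_bursty = relative_frequency_cv(toy, "改革")
```

In this toy setting a word used evenly every year gets a score of zero, while a word confined to a single year gets a high score, which is the kind of contrast a steady-state filter relies on.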

References

[1] 张普. On the steady state of language [J]. Journal of Zhengzhou University (Philosophy and Social Sciences Edition), 2008(02): 105-109.
[2] Fukumoto F, Suzuki Y, Takasu A. Timeline adaptation for text classification [C]// Proceedings of ACM International Conference on Information & Knowledge Management. 2013: 1517-1520.
[3] Degaetano-Ortlieb S. Feature Discovery for Diachronic Register Analysis: a Semi-Automatic Approach [C]// Proceedings of International Conference on Language Resources and Evaluation (LREC'12). 2012: 2786-2790.
[4] 谢晓燕. Extraction and investigation of steady-state words based on 26 years of Shenzhen Special Zone Daily [D]. Doctoral dissertation, Beijing Language and Culture University, 2010.
[5] 荀恩东, 饶高琦, 肖晓悦, et al. Development of the BCC corpus in the context of big data [J]. Corpus Linguistics, 2016, 3(1): 93-118.
[6] 荀恩东, 饶高琦, 谢佳莉, et al. A diachronic retrieval system for modern Chinese vocabulary and its applications [J]. Journal of Chinese Information Processing, 2015(3): 169-176.
[7] Sparck Jones K. A statistical interpretation of term specificity and its application in retrieval [J]. Journal of Documentation, 1972, 28(1): 11-21.
[8] Robertson S E, Sparck Jones K. Relevance weighting of search terms [J]. Journal of the American Society for Information Science, 1976, 27(3): 129-146.
[9] Shannon C E. A mathematical theory of communication [J]. Bell System Technical Journal, 1948, 27: 379-423, 623-656.
[10] Cover T M, Thomas J A. Elements of Information Theory [M]. New Jersey: John Wiley & Sons, 1991: 96-99.
[11] Xu Y, Jones G J F, Li J T, et al. A study on mutual information-based feature selection for text categorization [J]. Journal of Computational Information Systems, 2007, 3(3): 1007-1012.
[12] 顾益军, 樊孝忠, 王建华, et al. Automatic selection of a Chinese stopword list [J]. Journal of Beijing Institute of Technology, 2005, 25(4): 337-340.
[13] 关高娃. A comparative study of Mongolian and English stopwords [J]. Journal of Chinese Information Processing, 2011, 25(4): 35-38.
[14] Lo T W, He B, Ounis I. Automatically Building a Stopword List for an Information Retrieval System [J]. Journal of Digital Information Management, 2005, 3(1): 3-8.
[15] 冯志伟, 胡凤国. Mathematical Linguistics [M]. Beijing: The Commercial Press, 2012: 255.
[16] Rosengren I. The quantitative concept of language and its relation to the structure of frequency dictionaries [J]. Études de Linguistique Appliquée, 1971(1): 103-127.
[17] Zhang H, Huang C, Yu S. Distributional Consistency: A general method for defining a core lexicon [C]// Proceedings of International Conference on Language Resources and Evaluation (LREC'04), 2004.
[18] Department of Language Information Management, Ministry of Education. Report on the State of Language Life in China [M]. Beijing: The Commercial Press, 2015.
[19] Witten I H, Frank E, Hall M A. Data Mining: Practical Machine Learning Tools and Techniques (3rd Edition) [M]. Burlington, Massachusetts: Morgan Kaufmann, 2005: 151-162.
[20] National Chinese Proficiency Test Committee. Graded Vocabulary Syllabus for Chinese Proficiency [M]. Beijing: Economic Science Press, 2001.
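(1) http://bcc.blcu.edu.cn/hc
(2) For various reasons, the People's Daily texts for 2003 to 2008 could not be obtained during the experiments for this paper; that portion was substituted with Guizhou Daily texts of the corresponding years accumulated by the laboratory.
(1) This classification was devised for the methods used in this task and therefore does not exhaustively partition all possible cases.
(1) This paper uses the naive Bayes classification algorithm as implemented in the Weka data mining platform [19], version 3.6.13.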
