Research on Tibetan Stop Words Selection and Automatic Processing Method

Original title (Chinese): 藏文停用词选取与自动处理方法研究
Authors: ZHU Jie (珠杰); LI Tianrui (李天瑞)
Affiliations: School of Information Science and Technology, Southwest Jiaotong University; Department of Computer Science, School of Engineering, Tibet University
Keywords: Tibetan stop word; term frequency (TF); document frequency (DF); entropy
Journal: Journal of Chinese Information Processing (中文信息学报)
Journal code: MESS
Publication date: 2015-03-15
Year: 2015
Volume: 29
Issue: 02
Funding: National Natural Science Foundation of China (61262058, 60763010); CCF Chinese Information Technology Open Fund (CCF2012-02-01); Ministry of Education "Program for Changjiang Scholars and Innovative Research Team in University" for Tibetan information technology (IRT0975)
Language: Chinese
Record ID: MESS201502016
Page count: 8
CN: 11-2325/N
Pages: 129-136

Abstract

Stop word processing is a key preprocessing step in text mining. Building on existing stop word processing techniques, this paper studies statistics-based methods for selecting Tibetan stop words. Experiments analyze how term frequency (TF), document frequency (DF), and entropy perform when used to select Tibetan stop words. The paper then proposes a selection method that combines Tibetan function words, special verbs, and automatic (statistical) processing. The experimental results show that this method can produce a reasonable Tibetan stop word list.
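The three statistics named in the abstract can be illustrated with a minimal Python sketch. This record does not give the paper's exact formulas, so the sketch assumes a common formulation: TF is the total count of a term in the corpus, DF is the number of documents containing it, and entropy is the Shannon entropy of the term's occurrences across documents. The function names `term_statistics` and `candidate_stop_words`, the ranking order, and the `function_words` parameter are hypothetical, and documents are assumed to be already segmented into tokens.

```python
import math
from collections import Counter


def term_statistics(docs):
    """Return {term: (tf, df, entropy)} for a list of tokenized documents.

    Assumed definitions (not taken from the paper):
      tf      = total occurrences of the term in the corpus
      df      = number of documents containing the term
      entropy = -sum(p_i * log2(p_i)), with p_i the share of the term's
                occurrences that fall in document i
    """
    per_doc = [Counter(doc) for doc in docs]
    tf = Counter()
    df = Counter()
    for counts in per_doc:
        tf.update(counts)          # add per-document counts
        df.update(counts.keys())   # each document counts once per term

    stats = {}
    for term, total in tf.items():
        entropy = 0.0
        for counts in per_doc:
            c = counts.get(term, 0)
            if c:
                p = c / total
                entropy -= p * math.log2(p)
        stats[term] = (total, df[term], entropy)
    return stats


def candidate_stop_words(docs, top_k, function_words=frozenset()):
    """Take the top_k terms by (entropy, df, tf) and add a hand-made list.

    The ranking order is an illustrative assumption; function_words stands
    in for a manually compiled list of function words and special verbs.
    """
    stats = term_statistics(docs)
    ranked = sorted(
        stats,
        key=lambda t: (stats[t][2], stats[t][1], stats[t][0]),
        reverse=True,
    )
    return set(ranked[:top_k]) | set(function_words)
```

A term that appears evenly in many documents gets high entropy and high DF, which is the usual signature of a stop word; a term concentrated in one document gets entropy 0 and is a poor candidate under this assumption.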

