Abstract
Keyword extraction underpins techniques such as information retrieval, natural language processing, and ontology construction. This paper presents an unsupervised keyword extraction algorithm based on information entropy. The algorithm requires no prior knowledge such as dictionaries or word segmentation, recognizes out-of-vocabulary words reasonably well, and supports corpora that mix multiple languages. Experimental results indicate that the algorithm achieves good precision and high recall.
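The abstract does not spell out how entropy is used to score candidates, so the following is only a minimal sketch of one common dictionary-free approach, not the paper's own algorithm. It assumes that a character n-gram is a plausible keyword candidate when the characters immediately to its left and right are highly varied (high boundary entropy). The function names `entropy` and `boundary_entropy_candidates`, and the parameters `max_len` and `min_freq`, are hypothetical and chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in bits) of a frequency distribution."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def boundary_entropy_candidates(text, max_len=4, min_freq=2):
    """Rank character n-grams by min(left entropy, right entropy).

    Assumption (not taken from the paper): a string whose neighbouring
    characters vary a lot on both sides behaves like a free-standing unit.
    No dictionary or word segmenter is used; the text is treated as a
    plain character sequence, so mixed-language input is handled uniformly.
    """
    freq = Counter()
    left = defaultdict(Counter)   # n-gram -> counts of preceding characters
    right = defaultdict(Counter)  # n-gram -> counts of following characters
    n = len(text)
    for i in range(n):
        for k in range(2, max_len + 1):
            if i + k > n:
                break
            gram = text[i:i + k]
            freq[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + k < n:
                right[gram][text[i + k]] += 1

    scores = {}
    for gram, f in freq.items():
        # Skip rare n-grams and those with no observed neighbour on a side.
        if f < min_freq or not left[gram] or not right[gram]:
            continue
        scores[gram] = min(entropy(left[gram]), entropy(right[gram]))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `boundary_entropy_candidates(document_text)` returns `(n-gram, score)` pairs sorted from highest to lowest score. Taking the minimum of the two side entropies is a design choice in this sketch: it penalizes fragments that are free on one side but always followed or preceded by the same character, which usually indicates a piece of a longer word.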