Research on Chinese Word Segmentation Technology Based on the Python Language
  • English Title: Chinese Word Segmentation Technology based on Python Language
  • Authors: ZHU Yong-zhi; JING Jing
  • Affiliation: School of Information Science and Engineering, Qufu Normal University
  • Keywords: Python; text segmentation; jieba; word cloud; data visualization
  • Journal: Communications Technology (通信技术)
  • Journal code: TXJS
  • Publication date: 2019-07-10
  • Year: 2019
  • Issue: Vol. 52, No. 331
  • Funding: Shandong Provincial Natural Science Foundation (No. ZR2013FL015); Shandong Province Graduate Education Innovation Program (No. SDYY12060)
  • Language: Chinese
  • Pages: 70-77 (8 pages)
  • CN: 51-1167/TN
  • Record ID: TXJS201907012
Abstract
As an interpreted high-level programming language, Python has penetrated popular fields such as big data and artificial intelligence, and is widely applied in data science, for example in web crawling and data mining. Word segmentation is the process of dividing a continuous character sequence into word sequences that conform to certain specifications. In English, spaces delimit words, but Chinese is more complicated: characters, sentences and paragraphs are easy to delimit, while words in Chinese carry no obvious boundary markers, so segmenting Chinese text is considerably harder. In this work, a Python crawler fetches web-page data as the experimental text, and Python's powerful segmentation library jieba segments the Chinese text. Keywords are then extracted from the segmentation results with the TF-IDF and TextRank algorithms, and the experimental results are clearly better than those of a purely word-frequency-based method. Finally, the keywords are displayed as a word cloud, making the segmentation results clear at a glance.
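The pipeline the abstract describes (jieba segmentation, then TF-IDF and TextRank keyword extraction) can be sketched in plain Python. The sketch below is illustrative only: the function names and the simplified IDF smoothing are our own, and the input is assumed to be already segmented into word lists (e.g. by `jieba.lcut`). In practice jieba ships ready-made implementations as `jieba.analyse.extract_tags` (TF-IDF) and `jieba.analyse.textrank`.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Rank the words of each document by TF-IDF.

    docs: list of documents, each a list of already-segmented words
    (for Chinese text, jieba.lcut() would produce such lists).
    Returns, per document, the top_k (word, score) pairs.
    """
    n_docs = len(docs)
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    ranked_per_doc = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (one common variant; jieba uses its own IDF table).
        scores = {
            w: (c / len(doc)) * math.log((n_docs + 1) / (df[w] + 1) + 1)
            for w, c in tf.items()
        }
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
        ranked_per_doc.append(ranked)
    return ranked_per_doc


def textrank_keywords(words, window=2, damping=0.85, iters=30, top_k=3):
    """Rank the words of one document with a simple unweighted TextRank."""
    # Build an undirected co-occurrence graph over a sliding window.
    neighbors = {}
    for i, u in enumerate(words):
        for v in words[i + 1:i + 1 + window]:
            if u == v:
                continue
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)
    # PageRank-style iteration: a word is important if its neighbors are.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[v] / len(neighbors[v]) for v in neighbors[w])
            for w in neighbors
        }
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
```

Both functions deliberately separate keyword extraction from segmentation, mirroring the paper's pipeline: the crawler produces raw text, jieba turns it into word lists, and either extractor then scores those lists; the resulting (word, weight) pairs are exactly the input a word-cloud library expects.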
