Uygur document classification scheme based on extensive similarity

Authors: Ruxianguli·ABUDUREXITI; Yasen·AIZEZI; NIAN Mei
Affiliations: Department of Information Security Engineering, Xinjiang Police College; School of Computer Science and Technology, Xinjiang Normal University
Keywords: Uygur; document classification; extensive similarity; K-means clustering; term frequency-inverse document frequency
Chinese journal title (code): SJSJ
Journal: Computer Engineering and Design
Publication date: 2017-06-16
Publisher: Computer Engineering and Design
Year: 2017
Issue: v.38; No.366
Funding: Natural Science Foundation of Xinjiang Uygur Autonomous Region, research fund project (2015211A016)
Language: Chinese
Record ID: SJSJ201706052
Page count: 6
Issue number: 06
CN: 11-1775/TP
Pages: 294-299

Abstract

To address the automatic classification of Uygur documents, a document classification scheme based on an extensive similarity measure and K-means clustering is proposed. Uygur documents are first preprocessed, and the term frequency-inverse document frequency (TF-IDF) algorithm is used to obtain a keyword set. The proposed extensive similarity measure then computes the similarity between two documents by also taking into account their distances to the other documents in the corpus. A cluster distance matrix is built from the extensive similarity to obtain a set of base clusters. The centers of these base clusters serve as the initial centers for K-means clustering, which then clusters all documents. Experimental results show that the scheme achieves relatively high classification accuracy with relatively low computation time.
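
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract gives no formula for extensive similarity, so `extensive_distance` uses an assumed definition: two documents that are directly dissimilar get -1, and otherwise their distance is the number of corpus documents that are close to one but not the other. The threshold `theta`, the connected-component grouping in `base_clusters`, all function names, and the English toy tokens (standing in for preprocessed Uygur keywords) are likewise illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenised documents into unit-length TF-IDF vectors."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    # Smoothed IDF, so a term present in every document keeps a nonzero weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[t] / max(len(doc), 1) * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def extensive_distance(vectors, theta):
    """ASSUMED definition (the abstract gives no formula): -1 when two documents
    are directly dissimilar (cosine < theta); otherwise the number of corpus
    documents that are close to one of them but not to the other."""
    n = len(vectors)
    close = [[sum(a * b for a, b in zip(u, v)) >= theta for v in vectors]
             for u in vectors]
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if not close[i][j]:
                dist[i][j] = -1
            else:
                dist[i][j] = sum(close[i][k] != close[j][k] for k in range(n))
    return dist


def base_clusters(dist, max_disagreement=0):
    """Group documents linked by a small non-negative extensive distance
    (a simplification: connected components via union-find)."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if 0 <= dist[i][j] <= max_disagreement:
                parent[find(j)] = find(i)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]  # drop singletons


def kmeans(vectors, centers, iterations=20):
    """Plain K-means (squared Euclidean distance) from the given initial centers."""
    centers = [list(c) for c in centers]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(len(centers)),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
                  for v in vectors]
        for c in range(len(centers)):
            members = [vectors[i] for i, lab in enumerate(labels) if lab == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def classify(docs, theta=0.3):
    vectors = tfidf_vectors(docs)
    seeds = base_clusters(extensive_distance(vectors, theta))
    if not seeds:
        raise ValueError("no base clusters found; try a lower theta")
    # Base-cluster centroids become the initial K-means centers.
    centers = [[sum(col) / len(g) for col in zip(*(vectors[i] for i in g))]
               for g in seeds]
    return kmeans(vectors, centers)


# Hypothetical toy corpus: English tokens stand in for preprocessed keywords.
docs = [
    ["football", "match", "goal", "team"],
    ["team", "goal", "league", "football"],
    ["match", "league", "goal", "player"],
    ["bank", "market", "stock", "price"],
    ["stock", "price", "bank", "trade"],
    ["market", "trade", "price", "invest"],
]
labels = classify(docs)
print(labels)
```

On this toy corpus the first three documents receive one label and the last three another. Seeding K-means with base-cluster centroids rather than random points is what fixes both the number of clusters and the starting centers in this sketch.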
