Research on a Keyword-Weighted Topic Model for Legal Texts

English title: Research on Topic Model of Legal Texts Based on Keyword Weighting
Authors: Zhang Yangwu (张扬武); Li Guohe (李国和); Wang Limei (王立梅)
Authors (English): ZHANG Yangwu; LI Guohe; WANG Limei. Author affiliations: College of Geophysics and Information Engineering, China University of Petroleum-Beijing; School of Information Management for Law, China University of Political Science and Law; Beijing Key Lab of Data Mining for Petroleum Data, China University of Petroleum-Beijing
Keywords: topic model; legal text; keywords; weighting; perplexity
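Journal title (Chinese): JSSG
Journal title (English): Computer & Digital Engineering
Affiliations: College of Geophysics and Information Engineering, China University of Petroleum-Beijing; School of Information Management for Law, China University of Political Science and Law; Beijing Key Lab of Data Mining for Petroleum Data, China University of Petroleum-Beijing
Publication date: 2019-05-20
Publisher: Computer & Digital Engineering (计算机与数字工程)
Year: 2019
Issue: v.47; No.355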
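Funding: National Science and Technology Major Project (No. 2018YFC0831202); National Natural Science Foundation of China (No. 60473125); Scientific Research Startup Fund of China University of Petroleum-Beijing, Karamay Campus (No. RCYJ2016B-03-001)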
Language: Chinese
Page: JSSG201905030
Number of pages: 6
CN: 05
ISSN: 42-1372/TP
Classification number: 161-165+219

Abstract

To reduce the influence of irrelevant words on the classification of legal texts and to emphasize the role of key legal terms, a text classification model based on legal term weighting is built on a topic model. Because different categories of legal texts are characterized by different keywords, the topic model is extended with a keyword-marked distribution from words to topics, and weights are learned for the marked keywords. The learned weights are then used to update the distribution of documents to topics, which improves the accuracy of document similarity computation. In experiments on a real Westlaw data set, the weighted topic model achieves better perplexity and text similarity than the traditional topic model.
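The two evaluation measures named in the abstract, perplexity and document similarity, can be illustrated with a minimal sketch. This is not the authors' implementation: this record does not give their formulas, so the block uses the standard topic-model perplexity and, as an assumption, cosine similarity between document-to-topic vectors. The function names and the toy numbers are hypothetical.

```python
import math


def perplexity(docs, theta, phi):
    """Standard topic-model perplexity on bag-of-words documents.

    docs  : list of dicts mapping word id -> count
    theta : theta[d][k] = p(topic k | document d)
    phi   : phi[k][w]   = p(word w | topic k)
    """
    log_likelihood = 0.0
    total_tokens = 0
    for d, counts in enumerate(docs):
        for w, n in counts.items():
            # p(w | d) is obtained by marginalising over topics.
            p_w = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += n * math.log(p_w)
            total_tokens += n
    # Lower perplexity means the model predicts the words better.
    return math.exp(-log_likelihood / total_tokens)


def cosine_similarity(u, v):
    """Cosine similarity between two document-to-topic vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy example: 2 topics, 3-word vocabulary, made-up probabilities.
phi = [[0.7, 0.2, 0.1],
       [0.1, 0.2, 0.7]]
theta = [[0.9, 0.1],
         [0.2, 0.8]]
docs = [{0: 3, 1: 1}, {2: 2, 1: 1}]

print(perplexity(docs, theta, phi))
print(cosine_similarity(theta[0], theta[1]))
```

Under this sketch, a keyword-weighting scheme would change `theta` (and hence both measures); how the weights are learned is described in the paper itself, not in this record.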

References

[1] David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet Allocation[J]. Journal of Machine Learning Research, 2003, 3: 993-1022.
[2] David M. Blei, J. Lafferty. Correlated Topic Models[C]//The Proceeding of International Conference on Neural Information Processing Systems, 2005, 18: 147-154.
[3] David M. Blei, J. Lafferty, D. John. Dynamic Topic Models[C]//The Proceedings of the International Conference on Machine Learning, 2006: 113-120.
[4] CHENG Jinbao, SHI Qin, CHEN Yikai, et al. Prediction of the Working Condition of Taxi's Braking System Based on Tree Augmented Naive Bayesian Classifier[J]. Computer and Digital Engineering, 2017, 12: 2465-2469.
[5] Y. Yao, Q. Li. Term Weighting Schemes for Emerging Event Detection[C]//The IEEE International Conference on Web Intelligence & Intelligent Agent Technology, 2013, 1: 105-112.
[6] P. A. Chew. Terms Weighting Schemes for Latent Dirichlet Allocation[C]//The Proceeding of the North American Chapters of the Association for Computational Linguistics, 2010, 3: 465-473.
[7] Bo Huang, Yan Yang, A. Mahmood, et al. Microblog Topic Detection Based on LDA Model and Single-Pass Clustering[C]//Proceedings of 7th International Conference on Rough Sets and Current Trends in Computing, Chengdu, China, 2012: 352-359.
[8] JIANG Quan, ZHENG Shanhong, LIU Kai, et al. Design of DOLDA model and analysis of theme evolution[J]. Computer Engineering and Design, 2018(2): 446-451.
[9] XU Yinjie, SUN Chunhua, LIU Yezheng. Joint sentiment/topic model integrating user characteristics[J]. Journal of Computer Applications, 2018(5): 1261-1266.
[10] Y. Ko. A study of term weighting schemes using class information for text classification[C]//The ACM SIGIR International Conference on Research and Development in Information Retrieval, 2012: 1029-1031.
[11] GUO Lantian, LI Yang, et al. A LDA Model Based Topic Detection Method[J]. Journal of Northwestern Polytechnical University, 2016, 34(4): 697-701.
[12] Abdur Rehman, Kashif Javed, Haroon A. Babri. Feature selection based on a normalized difference measure for text classification[J]. Information Processing & Management, 2017, 53(2): 473-489.
[13] Wei Zong, Feng Wu, Lap-Keung Chu, Domenic Sculli. A discriminative and semantic feature selection method for text categorization[J]. International Journal of Production Economics, 2015, 165: 215-222.
[14] Z. Fan, S. Chen, J. Yang, J. Yang. A Text Clustering Approach of Chinese News Based on Neural Network Language Model[J]. International Journal of Parallel Programming, 2016, 44(1): 198-206.
[15] M. Tan, L. Tan, S. Dara, C. Mayeux. Online Defect Prediction for Imbalanced Data[C]//IEEE/ACM International Conference on Software Engineering, 2015, 2: 99-108.
[16] G. Paltoglou, M. Thelwall. A Study of Information Retrieval Weighting Schemes for Sentiment Analysis[C]//The Proceeding of the Association for Computational Linguistics, 2010: 1386-1395.
[17] ZOU Chong, CAI Dunbo, LIU Ying. Research of Combination SVM Classifier in Pedestrian Detection[J]. Computer Science, 2017, 21: 188-191.
[18] S. Maji, A. C. Berg, J. Malik. Classification using intersection kernel support vector machines is efficient[C]//IEEE Conference on Computer Vision & Pattern Recognition, 2008, 21: 1-8.
[19] HUANG Jian. Learning to Rank Bilingual Document Based on Document Similarity[J]. Computer and Digital Engineering, 2017, 10: 1986-1989.
[20] A. Muhič, J. Rupnik, P. Škraba. Cross-lingual document similarity[C]//International Conference on Information Technology Interfaces, 2012: 387-392.
[21] Y. S. Lin, J. Y. Jiang, S. J. Lee. A Similarity Measure for Text Classification and Clustering[J]. IEEE Transactions on Knowledge & Data Engineering, 2014, 7: 1575-1590.
