Regularized Cross-validation Method for Text Data Sets

Authors: WANG Ruibo; WANG Yu; LI Jihong
Keywords: text data sets; regularization; cross-validation; signal-to-noise ratio
Journal code: MESS
Journal: Journal of Chinese Information Processing (中文信息学报)
Affiliation: School of Software, Shanxi University
Publication date: 2019-05-15
Publisher: Journal of Chinese Information Processing
Year: 2019
Volume: 33
Funding: National Social Science Fund of China (16BTJ034)
Language: Chinese
Record ID: MESS201905007
Page count: 12
Issue: 05
CN: 11-2325/N
Pages: 59-70

Abstract

When building models on text data, cross-validation is commonly used for feature selection and model comparison. Many studies have shown that performance estimates for text models are sensitive to how the data are partitioned in cross-validation: an unreasonable partition can yield unstable estimates and make experimental results hard to reproduce. This paper argues that a regularized m×2 cross-validation method (abbreviated m×2 BCV), built from m repetitions of two-fold cross-validation with constraints on the distributional difference between the training set and the validation set, improves the estimation of model performance measures and is well suited to model comparison. The paper first introduces a chi-square measure of the difference between the training-set and validation-set distributions for text data and uses this measure to construct regularization conditions on the data partitioning. Then, with the goal of maximizing the signal-to-noise ratio of the performance measure, it gives an optimization algorithm that constructs m×2 BCV partitions satisfying the regularization conditions. Finally, experiments on semantic role labeling with the Chinese FrameNet validate the effectiveness of the m×2 BCV method.
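The partition-filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it only draws random two-fold splits and keeps those whose Pearson chi-square statistic between the token counts of the two halves stays below a threshold. The signal-to-noise objective and any further regularization conditions from the paper are omitted. The names `chi_square`, `count_tokens`, `regularized_partitions`, `threshold`, `max_tries`, and the toy corpus are all assumptions made for illustration.

```python
import random
from collections import Counter


def chi_square(counts_a, counts_b):
    """Pearson chi-square statistic of a 2 x K table of feature counts."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    total = total_a + total_b
    stat = 0.0
    for key in set(counts_a) | set(counts_b):
        column_total = counts_a[key] + counts_b[key]
        if column_total == 0:
            continue
        for observed, row_total in ((counts_a[key], total_a),
                                    (counts_b[key], total_b)):
            expected = row_total * column_total / total
            stat += (observed - expected) ** 2 / expected
    return stat


def count_tokens(docs, indices):
    """Token counts over the documents selected by `indices`."""
    return Counter(tok for i in indices for tok in docs[i])


def regularized_partitions(docs, m, threshold, max_tries=10000, seed=0):
    """Collect up to m distinct half/half splits with chi-square <= threshold.

    Hypothetical helper: a plain rejection filter, not the paper's optimizer.
    """
    rng = random.Random(seed)
    indices = list(range(len(docs)))
    half = len(indices) // 2
    kept, seen = [], set()
    for _ in range(max_tries):
        if len(kept) == m:
            break
        rng.shuffle(indices)
        first = tuple(sorted(indices[:half]))
        second = tuple(sorted(indices[half:]))
        key = min(first, second)  # canonical form: (A, B) equals (B, A)
        if key in seen:
            continue
        seen.add(key)
        stat = chi_square(count_tokens(docs, first),
                          count_tokens(docs, second))
        if stat <= threshold:
            kept.append((first, second, stat))
    return kept


# Toy corpus (assumed data): each document is a list of tokens.
toy_docs = [
    ["frame", "role", "agent"],
    ["frame", "role", "theme"],
    ["agent", "theme", "frame"],
    ["role", "agent", "theme"],
    ["frame", "agent", "role"],
    ["theme", "role", "frame"],
]

# Each kept split is used twice in 2-fold CV: train on one half,
# validate on the other, then swap.
splits = regularized_partitions(toy_docs, m=3, threshold=5.0)
```

A lower `threshold` enforces more similar token distributions in the two halves, at the cost of rejecting more candidate splits. In practice the counts would come from whatever features the model uses rather than raw tokens.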

