Weakly Supervised Relation Extraction Based on Tri-training and Noise Filtering

Original title (Chinese): 基于Tri-training与噪声过滤的弱监督关系抽取
Authors: JIA Zhen (贾真); YE Zhonglin (冶忠林); YIN Hongfeng (尹红风); HE Dake (何大可)
Keywords: relation extraction; weakly supervised learning; Tri-training; data editing
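Journal: Journal of Chinese Information Processing (中文信息学报)
Journal code: MESS
Affiliations: School of Information Science and Technology, Southwest Jiaotong University; DOCOMO Innovations Inc.
Publication date: 2016-07-15
Publisher: Journal of Chinese Information Processing (中文信息学报)
Year: 2016
Volume: 30
Issue: 04
Funding: National Natural Science Foundation of China (61170111, 61202043, 61262058)
Language: Chinese
Record ID: MESS201604020
Pages: 146-153, 162
Page count: 9
CN: 11-2325/N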

Abstract

Weakly supervised relation extraction uses entity pairs with known relations to obtain training data automatically from a text collection, which effectively addresses the shortage of training data. However, weakly supervised training data is noisy, feature-poor, and imbalanced, which leads to low relation extraction performance. This paper proposes NF-Tri-training (Tri-training with Noise Filtering), a weakly supervised relation extraction algorithm. NF-Tri-training uses under-sampling to address sample imbalance, iteratively learns new samples from unlabeled data with Tri-training to improve the generalization ability of the classifiers, and applies a data editing technique to identify and remove mislabeled samples from both the initial training data and the new samples generated at each iteration. Experiments on a dataset collected from Hudong encyclopedia (互动百科) show that NF-Tri-training effectively improves the performance of relation classifiers.
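The three steps named in the abstract (under-sampling, Tri-training, data editing) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the base classifier, the features, or the exact editing rule, so the sketch assumes a nearest-centroid classifier, numeric feature tuples, and a k-nearest-neighbour editing rule, and it omits the error-rate conditions of the original Tri-training algorithm. All function names are hypothetical.

```python
import random
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def undersample(data, rng):
    """Randomly shrink every class to the size of the smallest class."""
    by_label = {}
    for x, y in data:
        by_label.setdefault(y, []).append((x, y))
    n = min(len(v) for v in by_label.values())
    return [s for v in by_label.values() for s in rng.sample(v, n)]

def edit_noise(data, k=3):
    """Assumed editing rule: drop a sample when the majority label of its
    k nearest neighbours disagrees with the sample's own label."""
    kept = []
    for i, (x, y) in enumerate(data):
        others = [s for j, s in enumerate(data) if j != i]
        nearest = sorted(others, key=lambda s: dist2(x, s[0]))[:k]
        votes = Counter(lbl for _, lbl in nearest)
        if not nearest or votes.most_common(1)[0][0] == y:
            kept.append((x, y))
    return kept

def train_centroids(data):
    """Stand-in base classifier: one mean vector per label."""
    sums, counts = {}, Counter()
    for x, y in data:
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[y] += 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda y: dist2(x, model[y]))

def stratified_bootstrap(data, rng):
    """Resample with replacement inside each class, so no class vanishes."""
    by_label = {}
    for x, y in data:
        by_label.setdefault(y, []).append((x, y))
    return [s for v in by_label.values() for s in rng.choices(v, k=len(v))]

def nf_tri_training(labeled, unlabeled, rounds=3, seed=0):
    """Simplified loop: balance, filter, then let each classifier learn
    from unlabeled samples on which the other two classifiers agree."""
    rng = random.Random(seed)
    base = edit_noise(undersample(labeled, rng))
    train_sets = [stratified_bootstrap(base, rng) for _ in range(3)]
    models = [train_centroids(t) for t in train_sets]
    for _ in range(rounds):
        new_sets = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            agreed = [(x, predict(models[j], x)) for x in unlabeled
                      if predict(models[j], x) == predict(models[k], x)]
            # Filter the pseudo-labeled additions together with the originals.
            new_sets.append(edit_noise(train_sets[i] + agreed))
        models = [train_centroids(t) for t in new_sets]
    return models

def vote(models, x):
    """Final label by majority vote of the three classifiers."""
    return Counter(predict(m, x) for m in models).most_common(1)[0][0]
```

In this sketch the filter runs twice, once on the initial training data and once on each round's enlarged training set, mirroring the two places where the abstract says mislabeled samples are removed.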

References

[1] Chen Liwei, Feng Yansong, Zhao Dongyan. Relation extraction from massive Web data based on weakly supervised learning[J]. Journal of Computer Research and Development, 2013, 50(9): 1825-1835. (in Chinese)
[2] Riedel S, Yao L M, McCallum A. Modeling relations and their mentions without labeled text[J]. Machine Learning and Knowledge Discovery in Databases, 2010, 6323: 148-163.
[3] Zhou Z H, Li M. Tri-training: exploiting unlabeled data using three classifiers[J]. IEEE Transactions on Knowledge and Data Engineering, 2005, 17(11): 1529-1541.
[4] Craven M, Kumlien J. Constructing biological knowledge bases by extracting information from text sources[C]//Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (ISMB1999). Palo Alto, USA, 1999: 77-86.
[5] Wu F, Weld D S. Autonomously semantifying Wikipedia[C]//Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management (CIKM2007). Lisbon, Portugal, 2007: 41-50.
[6] Bunescu R, Mooney R. Learning to extract relations from the Web using minimal supervision[C]//Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL2007). Stroudsburg, USA, 2007, 45(1): 567-583.
[7] Mintz M, Bills S, Snow R, et al. Distant supervision for relation extraction without labeled data[C]//Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics (ACL2009). Singapore, 2009: 1003-1011.
[8] Yao L M, Riedel S, McCallum A. Collective cross-document relation extraction without labeled data[C]//Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP2010). Massachusetts, USA, 2010: 1013-1023.
[9] Takamatsu S, Sato I, Nakagawa H. Reducing wrong labels in distant supervision for relation extraction[C]//Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL2012). Jeju Island, Korea, 2012: 721-729.
[10] Surdeanu M, McClosky D, Tibshirani J, et al. A simple distant supervision approach for the TAC-KBP slot filling task[C]//Proceedings of the TAC-KBP2010 Workshop. USA, 2010: 1-5.
[11] Yang Yufei, Dai Qi, Jia Zhen, et al. Attribute relation extraction method based on weak supervision[J]. Journal of Computer Applications, 2014, 34(1): 64-68. (in Chinese)
[12] Ouyang Dantong, Qu Jianfeng, Ye Yuxin. Ontology-based expansion of distant supervision samples in relation extraction[J]. Journal of Software, 2014, 25(9): 2088-2101. (in Chinese)
[13] Blum A, Mitchell T. Combining labeled and unlabeled data with co-training[C]//Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT1998). Wisconsin, USA, 1998: 92-100.
[14] Goldman S, Zhou Y. Enhancing supervised learning with unlabeled data[C]//Proceedings of the 17th International Conference on Machine Learning (ICML2000). California, USA, 2000: 327-334.
[15] Nigam K, McCallum A K, Thrun S, et al. Text classification from labeled and unlabeled documents using EM[J]. Machine Learning, 2000, 39(2/3): 103-134.
[16] Blum A, Chawla S. Learning from labeled and unlabeled data using graph mincuts[C]//Proceedings of the 18th International Conference on Machine Learning (ICML2001). Williamstown, MA, 2001: 19-26.
[17] Li M, Zhou Z H. SETRED: Self-training with editing[C]//Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2005). Hanoi, Vietnam, 2005: 611-621.
[18] Nigam K, Ghani R. Analyzing the effectiveness and applicability of co-training[C]//Proceedings of the ACM 9th Conference on Information and Knowledge Management (CIKM2000). Washington, DC, 2000: 86-93.
[19] Muhlenbach F, Lallich S, Zighed D A. Identifying and handling mislabeled instances[J]. Journal of Intelligent Information Systems, 2004, 22(1): 89-109.
[20] Deng Chao, Guo Maozu. Semi-supervised clustering algorithm based on Tri-training and data editing[J]. Journal of Software, 2008, 19(3): 663-673. (in Chinese)
[21] Deng Chao, Guo Maozu. Tri-training algorithm based on an adaptive data editing strategy[J]. Chinese Journal of Computers, 2007, 30(8): 1213-1226. (in Chinese)
[22] Yen S, Lee Y. Cluster-based under-sampling approaches for imbalanced data distributions[J]. Expert Systems with Applications, 2009, 36: 5718-5727.
[23] Wang Zhongqing, Li Shoushan, Zhu Qiaoming, et al. Chinese sentiment classification based on imbalanced data[J]. Journal of Chinese Information Processing, 2012, 26(3): 33-37, 64. (in Chinese)
[24] Yin Hongfeng, Jia Zhen, Li Tianrui, et al. Southwest Jiaotong University Chinese word segmentation[OL]. http://ics.swjtu.edu.cn (in Chinese)
(1) www.freebase.com
