Research on Text Classification Algorithms Based on Transfer Learning
Abstract
Because it can transfer knowledge, skills, and experience between domains, transfer learning has become an important technique and an active research topic in cross-domain text classification. This paper reviews the application and development of transfer learning in text classification, analyzes several open problems and difficulties in the field, and proposes several new transfer learning algorithms. First, text classification commonly suffers from the curse of dimensionality and from feature words with ambiguous meaning, which tend to cause low classification accuracy and overfitting; to address this, a feature dimensionality reduction method, HLK, is proposed that combines feature selection with feature extraction. Second, based on characteristics of the text data in the source and target domains, such as their quantity and similarity, two instance-based transfer learning methods, CGTL and IDRTAT, are proposed. Third, for the case where the data distributions of the source-domain and target-domain datasets differ too greatly, a feature-based (feature-representation) transfer learning algorithm, BFRTL, is proposed. The effectiveness of each algorithm is verified experimentally.
