Research on Text Categorization for Content Security
Abstract
With the development of Internet application technologies, the political, economic, military, social, and cultural problems caused by the abuse of information have drawn wide attention, and content security has gradually become a basic component of information security. Text categorization is one of the core techniques and most powerful means for organizing, managing, recognizing, and filtering information by content, and the demands of Internet content security pose new challenges to it.

Content security requires efficient monitoring of, and timely response to, abnormal content, so a classification system must inspect passing text in real time and at high speed. Internet content is diverse and updated frequently; in some situations further labeled samples of the content of interest can be provided for classifier training only at great cost, or not at all, which has become the main bottleneck in building classification systems. Semi-supervised learning, which trains a classifier from a small number of labeled samples together with a large number of unlabeled ones, has therefore become a research hotspot. The diversity of content and the overlap among topics also mean that parties concerned with different areas of content security may wish to respond to similar or even identical content; multi-label learning, which addresses instances that may belong to several categories at once, has consequently emerged as a new research direction.

Centered on text categorization under the requirements of Internet content security, this dissertation studies three problems in depth: efficient training and prediction for text categorization, semi-supervised learning that alleviates the labeling bottleneck, and multi-label text categorization. The main results and contributions are summarized as follows:

1. Efficient multi-class SVM learning. A multi-class method, Roc-SVM, that cascades a Rocchio classifier with SVMs is proposed. The Rocchio stage quickly and accurately filters out most irrelevant categories, greatly reducing the number of binary SVM decisions required; in experiments it cut the classification time of both the one-vs-one and one-vs-rest multi-class SVM methods by about an order of magnitude with essentially no loss of accuracy. To streamline one-vs-rest multi-class training, a concise class-incremental SVM method, CI-SVM, is also proposed; experiments show that its training time is far shorter than that of the one-vs-rest method and that its classification efficiency is also significantly higher.

2. Improving naïve Bayes accuracy through the class hierarchy. The training quality of naïve Bayes is affected by the global class distribution of the subjectively selected training data. Exploiting the characteristics of hierarchical classification, the classifier is improved by introducing a new conditioning event into the computation of class posterior probabilities and by making decisions within the local data of each internal class's subclasses. The improved method, EHNB, reduces the influence of the global data distribution, partially alleviates the imbalance of samples across classes, and noticeably improves naïve Bayes in hierarchical classification.

3. Semi-supervised learning that integrates self-training with EM. The dissertation proposes integrating the training processes of self-training, which aggressively assigns labels to unlabeled samples, and EM, which conservatively adjusts their label status, and provides two semi-supervised methods, ESTM and SEMT. ESTM uses intermediate results during EM iterations to label samples decisively, while SEMT replaces the supervised naïve Bayes learner inside self-training with semi-supervised EM. Experiments show that ESTM and SEMT effectively combine the strengths of self-training and EM and make better use of unlabeled samples to improve classifier accuracy.

4. Feature set splitting for co-training. A measure of conditional independence between feature subsets is defined, and the preservation of independence when feature subsets are merged in groups is proven. On this basis, a localized splitting strategy is proposed that splits each class's local feature subset separately and then merges the parts group by group, together with two splitting methods, one based on locally adaptive clustering of samples and one based on partitioning a feature relevancy graph, both designed to preserve conditional independence between the subsets as far as possible. In tests on two data sets, the resulting feature splits enabled co-training to exploit unlabeled samples and improve naïve Bayes classification more effectively, extending the applicability of co-training based on feature set splitting.

5. Multi-label learning based on label status vectors. By re-mining the multi-label information implicit in the correlations among label status values in the label status vector space (LSVS) of ranking methods, a two-stage multi-label learning framework based on label status vectors is proposed. Within this framework, a bag-of-labels (BOL) model and a Bayes multi-label learning method are given for the kNN LSVS, and the ML-kNN method is improved in the LSVS. In the naïve Bayes LSVS, linear least-squares fitting (LLSF) is used for multi-label training and prediction, and it is proven that the squared error of LLSF yields an upper bound on the classifier's Hamming training loss. Applications to 11 multi-label classification problems show that, under the two-stage framework, the classifiers trained by these methods achieve good multi-label performance.
With the development of Internet application technology, problems induced by the abuse of information techniques in politics, economics, the military, society, culture, and beyond have drawn more and more attention, and content security has become one of the basic issues of information security. Text categorization is one of the powerful means and key techniques for organizing, managing, recognizing, and filtering information by content, and the needs of Internet content security pose new challenges to it.

To ensure the security of information content, abnormal content must be monitored efficiently and responded to in time, so fast, real-time inspection of passing text is necessary. Owing to the variety and frequent change of Internet content, it is difficult, and sometimes impossible, to provide enough labeled samples of interest for classifier training; this has become the bottleneck in constructing a classification system, so semi-supervised learning, which trains with a few labeled and many unlabeled samples, has turned into a research hotspot. The variety of content and the crossing of topics also lead watchers from different areas to attend to similar or even identical content; multi-label learning, which handles instances belonging to more than one class, addresses this problem and has become a new research area.

Aimed at text categorization against the background of Internet content security requirements, this dissertation studies three questions: efficient training and prediction for text categorization, semi-supervised learning to alleviate the labeling bottleneck, and multi-label text categorization. The main work and contributions of this dissertation are as follows:
1. Efficient multi-class SVM learning. A multi-class method cascading Rocchio with SVM is proposed. The Rocchio classifier filters out most of the irrelevant classes and enormously reduces the number of binary SVM judgments needed; in the test experiments the cascading method decreased the classification time of both the one-vs-one and one-vs-rest methods by about an order of magnitude. A concise class-incremental multi-class SVM method, CI-SVM, is also presented; in the experiments its training time was greatly reduced and its testing efficiency was also improved significantly.
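The cascade idea can be sketched in a few lines. This is a hedged illustration, not the dissertation's implementation: `binary_clf` stands in for a trained binary SVM, `top_k` is an assumed cut-off for the candidate classes Rocchio keeps, and cosine similarity to class centroids plays the Rocchio role.

```python
import numpy as np

def rocchio_centroids(X, y, n_classes):
    # Rocchio profile: one centroid per class (mean of its training vectors)
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def roc_svm_predict(x, centroids, binary_clf, top_k=2):
    # Stage 1: Rocchio keeps only the top_k most similar classes, so stage 2
    # needs C(top_k, 2) binary decisions instead of C(n_classes, 2).
    sims = centroids @ x / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(x) + 1e-12)
    candidates = list(np.argsort(sims)[-top_k:])
    votes = {c: 0 for c in candidates}
    for i, a in enumerate(candidates):          # one-vs-one among survivors
        for b in candidates[i + 1:]:
            votes[binary_clf(a, b, x)] += 1
    return max(votes, key=lambda c: (votes[c], sims[c]))
```

With `top_k=2` the cascade performs a single binary decision per document rather than one per class pair, which is where an order-of-magnitude saving can come from.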
2. Enhancement of the naïve Bayes classifier under a class hierarchy. The performance of naïve Bayes text categorization depends heavily on the global distribution over classes of the subjectively selected samples. It can be enhanced by taking advantage of the hierarchical structure and introducing a new conditional probability: the enhanced method makes decisions within the local data belonging to the child classes of each internal class, lightening the influence of the global data distribution and partially overcoming data skewness. Experiments showed that the enhanced method notably improved the effectiveness of hierarchical categorization with naïve Bayes.
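A minimal sketch of the locality idea, assuming a two-level hierarchy with made-up group names: routing first picks a coarse group, then decides among that group's leaf classes using only the group's local training documents. The multinomial model and toy tree are illustrative assumptions, not the exact EHNB formulation.

```python
import numpy as np

def nb_fit(X, y, classes, alpha=1.0):
    # Multinomial naive Bayes with Laplace smoothing over a given class list
    prior = np.array([(y == c).sum() for c in classes], dtype=float)
    counts = np.stack([X[y == c].sum(axis=0) for c in classes]) + alpha
    return np.log(prior / prior.sum()), np.log(counts / counts.sum(axis=1, keepdims=True))

def nb_best(x, classes, model):
    log_prior, log_cond = model
    return classes[int(np.argmax(log_prior + log_cond @ x))]

def hierarchical_predict(x, X, y, groups):
    # Level 1: choose a coarse group; level 2: decide among that group's
    # leaf classes using only the group's local training documents.
    coarse = {leaf: g for g, leaves in groups.items() for leaf in leaves}
    yc = np.array([coarse[c] for c in y])
    gnames = sorted(groups)
    g = nb_best(x, gnames, nb_fit(X, yc, gnames))
    local = yc == g                      # local data of the chosen subtree
    return nb_best(x, groups[g], nb_fit(X[local], y[local], groups[g]))
```

Because the second-level posterior is computed from local counts only, the decision between sibling leaves is insulated from how the other subtree's classes are distributed.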
3. A semi-supervised learning method integrating self-training and EM. The idea of integrating the training processes of EM, which conservatively adjusts the label status of samples, and self-training, which labels samples outright, is proposed, and two semi-supervised learning methods named ESTM and SEMT are provided. ESTM decisively labels some samples using the intermediate results of EM iterations, and SEMT substitutes semi-supervised EM for the supervised naïve Bayes learner. Experiments demonstrated that ESTM and SEMT integrate the advantages of self-training and EM and make much better use of unlabeled samples to improve the classifier.
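A hedged sketch of the integration: EM's soft responsibilities for the unlabeled pool drive the M-step, while samples whose posterior clears an assumed `commit` threshold are decisively labeled mid-iteration, mirroring how ESTM uses EM's intermediate results. The multinomial model and all parameter values are illustrative assumptions.

```python
import numpy as np

def estm(Xl, yl, Xu, n_classes, alpha=1.0, commit=0.95, iters=5):
    # EM over a multinomial naive Bayes model; unlabeled documents whose
    # posterior exceeds `commit` are decisively labeled inside the loop
    # (the self-training step folded into EM).
    Xl, yl, resp = Xl.copy(), yl.copy(), None
    for _ in range(iters):
        # M-step: hard counts from labels plus soft counts from responsibilities
        hard = np.eye(n_classes)[yl]
        word = hard.T @ Xl + (resp.T @ Xu if resp is not None else 0)
        prior = hard.sum(0) + (resp.sum(0) if resp is not None else 0)
        log_prior = np.log(prior / prior.sum())
        word = word + alpha
        log_cond = np.log(word / word.sum(1, keepdims=True))
        if len(Xu) == 0:
            break
        # E-step: posteriors (responsibilities) for the unlabeled pool
        joint = log_prior + Xu @ log_cond.T
        joint -= joint.max(1, keepdims=True)
        resp = np.exp(joint)
        resp /= resp.sum(1, keepdims=True)
        # decisive labeling: move confident samples into the labeled set
        sure = resp.max(1) >= commit
        if sure.any():
            Xl = np.vstack([Xl, Xu[sure]])
            yl = np.concatenate([yl, resp[sure].argmax(1)])
            Xu, resp = Xu[~sure], resp[~sure]
    return log_prior, log_cond
```

Samples that never reach the threshold still influence the model softly through their responsibilities, which is the conservative EM side of the combination.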
4. Feature set splitting for co-training in text categorization. This dissertation presents a quantitative definition of the conditional independence of feature subsets given the class and suggests a strategy for splitting the feature set locally in this sense. The preservation of independence when two groups of feature subsets are united is also proven. Two splitting methods, based respectively on locally adaptive clustering and on relevancy-graph partitioning, are proposed under the precondition of preserving independence. Applications to two data sets show that, using the feature divisions produced by these methods, the combined effectiveness of the co-trained naïve Bayes classifiers is improved by exploiting unlabeled samples; as a result, the applicability of the co-training method is extended.
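The splitting strategy can be illustrated with stand-ins: class-conditional |Pearson correlation| replaces the dissertation's independence measure, and spectral bisection of the resulting relevancy graph replaces its partitioning method. Both substitutions are assumptions made for this sketch.

```python
import numpy as np

def cond_dependence(X, y):
    # Average, over classes, of |Pearson correlation| between feature pairs
    # within that class: a simple stand-in for a measure of conditional
    # dependence between features given the class.
    d = X.shape[1]
    D = np.zeros((d, d))
    classes = np.unique(y)
    for c in classes:
        D += np.abs(np.nan_to_num(np.corrcoef(X[y == c], rowvar=False)))
    D /= len(classes)
    np.fill_diagonal(D, 0.0)
    return D

def split_features(D):
    # Spectral bisection of the dependence graph: the minimum-weight cut
    # separates weakly dependent feature groups, so each co-training view
    # keeps strongly class-conditionally dependent features together.
    L = np.diag(D.sum(1)) - D                # graph Laplacian
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]                     # 2nd-smallest eigenvalue's vector
    return np.where(fiedler >= 0)[0], np.where(fiedler < 0)[0]
```

Cutting along the weakest dependencies is exactly what co-training wants: the two views end up approximately conditionally independent given the class.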
5. A multi-label learning method based on the label status vector (LSV). A two-stage learning framework based on label status vectors is proposed; it re-mines the multi-label information contained among label status values in the label status vector space (LSVS) of ranking methods. Under this framework, the dissertation presents the bag-of-labels (BOL) model in the kNN LSVS, proposes a Bayes method for that model, and improves the ML-kNN method. In the naïve Bayes LSVS, linear least-squares fitting (LLSF) is provided for multi-label training and prediction, and the upper bounding of the Hamming training loss by the squared error of LLSF is also proven. Applications to 11 multi-label problems show that the two-stage framework and the above learning methods are effective for multi-label classification.
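Stage 2 of the framework can be sketched as an ordinary least-squares map from stage-1 label status vectors to the binary label matrix; the data shapes and the 0.5 decision threshold are assumptions of this sketch, not the dissertation's exact construction.

```python
import numpy as np

def llsf_fit(S, Y):
    # Stage 2: linear least-squares map from stage-1 label status vectors
    # S (n x L real-valued scores) to the binary label matrix Y (n x L),
    # re-mining correlations between the labels' scores.
    S1 = np.hstack([S, np.ones((len(S), 1))])   # append a bias column
    W, *_ = np.linalg.lstsq(S1, Y, rcond=None)
    return W

def llsf_predict(S, W, thresh=0.5):
    S1 = np.hstack([S, np.ones((len(S), 1))])
    return (S1 @ W >= thresh).astype(int)
```

The Hamming-loss bound is easy to see in this setting: any entry misclassified after thresholding at 0.5 contributes a squared residual of at least 0.25, so the Hamming training loss is at most four times the mean squared error of the fit.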
引文
[1]方滨兴.信息安全及其关键技术探讨.国家网络与信息安全中心.2005.http://pact518.hit.edu.cn/viewpoint/index.htm
    [2]北京大学公共政策研究所.我国互联网信息内容安全及治理模式研究研究报告.2007.http://www.pkuppi.com/Upfiles/20074477579.doc
    [3]中国电子学会.电子信息学科发展研究报告中文简本.2007.http://www.cie-xh.cn:8000/cie/viewContent.jsp?tableName=1&id=1155
    [4]方滨兴.信息安全四要素:诠释信息安全.哈尔滨工业大学计算机网络与信息安全技术研究中心.2006.http://pact518.hit.edu.cn/viewpoint/index.htm
    [5]IDC:垃圾邮件毁了Email成全IM.2007.http://info.3see.com/snapshot/2007/04/11/98253.shtml
    [6]中国互联网络信息中心.第19次中国互联网络发展状况统计报告.2007.http://www.cnnic.net.cn/index/0E/00/11/index.htm
    [7]TIPSTER Text Program.2001.http://www.itl.nist.gov/iaui/894.02/related_projects/tipster/
    [8]TREC Home Page.2007.http://trec.nist.gov/
    [9]Spam Track.2007.http://trec.nist.gov/data/spam.html
    [10]Lynam T.R.,Cormack G.V.On-line spam filter fusion,in SIGIR-06,29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.2006,Seattle,Washington:ACM Press,123-130.
    [11]国家计算机网络与信息内容安全重点实验室.2007.http://pact518.hit.edu.cn/
    [12]中国IT实验室.还我一片澄清天空—浅析网络内容过滤技术.2007.http://www.cnetnews.com.cn/2007/0922/517969.shtml
    [13]王琨玥.内容安全渐成潮流.ZDNet China.2007.http://net.zdnet.com.cn/network_security_zone/2007/0612/409310.shtml
    [14]许洪波,程学旗,王斌,骆卫华.文本挖掘与机器学习,信息技术快报.2005,3(2):1-13.
    [15]Hayes P.J.,Weinstein S.P.Construe/Tis:A system for content-based indexing of a database of news stories,in Proceedings of IAAI-90,2nd Conference on Innovative Applications of Artificial Intelligence.1990,AAAI Press,49-66.
    [16]Sebastiani F.Machine learning in automated text categorization,ACM Computing Surveys.2002,34(1):1-47.
    [17]Dumais S.T.,Platt J.,Heckerman D.,Sahami M.Inductive learning algorithms and representations for text categorization,in Proceedings of CIKM-98,7th ACM International Conference on Information and Knowledge Management.1998,ACM Press:Bethesda,US,148-155.
    [18]黄萱菁,夏迎炬,吴立德.基于向量空间模型的文本过滤系统,软件学报.2003,14(3):435-442.
    [19]Bigi B.Using Kullback-Leibler distance for text categorization,in Proceedings of ECIR-03,25th European Conference on Information Retrieval.2003,Springer Vedag:Pisa,IT,305-319.
    [20]Nunzio G.M.D.A bidimensional view of documents for text categorisation,in ECIR-04,26th European Conference on Information Retrieval Research.2004,Sunderland,UK:Springer Verlag,112-126.
    [21]解冲锋,李星.基于序列的文本自动分类算法,软件学报.2002,13(4):783-789.
    [22]Caropreso M.F.,Matwin S.,Sebastiani F.A learner-independent evaluation of the usefulness of statistical phrases for automated text categorization,in Text Databases and Document Management:Theory and Practice.2001,Idea Group Publishing:Hershey,US,78-102.
    [23]Jacobs P.S.Joining statistics with NLP for text categorization,in Proceedings of ANLP-92,3rd Conference on Applied Natural Language Processing.1992,Association for Computational Linguistics,Morristown,US:Trento,IT,178-185.
    [24]Basili R.,Moschitti A.,Pazienza M.T.NLP-driven IR:Evaluating performances over a text classification task,in Proceeding of IJCAI-01,17th International Joint Conference on Artificial Intelligence.2001:Seattle,US,1286-1291.
    [25]Debole F.,Sebastiani F.Supervised term weighting for automated text categorization,in SAC-03,18th ACM Symposium on Applied Computing.2003,Melboume,US:ACM Press,784-788.
    [26]Xue D.,Sun M.Chinese text categorization based on the binary weighting model with non-binary smoothing,in ECIR-03,25th European Conference on Information Retrieval.2003,Pisa,IT:Springer Verlag,408-419.
    [27]Lertnattee V.,Theeramunkong T.Effect of term distributions on centroid-based text categorization, Information Sciences. 2004,158(1): 89-115.
    
    [28] Moschitti A., Basili R. Complex linguistic features for text classification: A comprehensive study, in ECIR-04, 26th European Conference on Information Retrieval Research. 2004, Sunderland, UK: Springer Verlag, 181-196.
    
    [29] Kehagias A., Petridis V., Kaburlasos V.G, Fragkou P. A comparison of word-and sense-based text categorization using several classification algorithms,Journal of Intelligent Information Systems. 2003,21(3): 227-247.
    
    [30] Apte C, Damerau F.J., Weiss S.M. Automated learning of decision rules for text categorization, ACM Transactions on Information Systems. 1994, 12(3):233-251.
    
    [31] Li Y.H., Jain A.K. Classification of text documents, The Computer Journal. 1998,41(8): 537-546.
    
    [32] Ng H.T., Goh W.B., Low K.L. Feature selection, perceptron learning, and a usability case study for text categorization, in SIGIR-97, 20th ACM International Conference on Research and Development in Information Retrieval. 1997, Philadelphia, US: ACM Press, 67-73.
    
    [33] Sable C.L., Hatzivassiloglou V. Text-based approaches for non-topical image categorization, International Journal of Digital Libraries. 2000, 3(3): 261-275.
    
    [34] Schutze H., Hull D.A., Pedersen J.O. A comparison of classifiers and document representations for the routing problem, in SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval. 1995,Seattle, US: ACM Press, 229-237.
    
    [35] Mladenic D. Feature subset selection in text learning, in Proceedings of ECML-98, 10th European Conference on Machine Learning. 1998, Springer Verlag: Heidelberg, 95-100.
    
    [36] Yang Y. An evaluation of statistical approaches to text categorization,Information Retrieval. 1999,1(1/2): 69-90.
    
    [37] Yang Y, Pedersen J.O. A comparative study on feature selection in text categorization, in Proceedings of ICML-97, 14th International Conference on Machine Learning. 1997, Morgan Kaufmann Publishers: Nashville, US,412-420.
    
    [38] Slonim N., Tishby N. The power of word clusters for text classification, in ECIR-01, 23rd European Colloquium on Information Retrieval Research. 2001,Darmstadt, DE: Springer Verlag.
    [39] Bekkerman R., Yaniv R.E., Tishby N., Winter Y. Distributional word clusters vs.words for text categorization, Journal of Machine Learning Research. 2003, 3:1183-1208.
    
    [40] Baker L.D., McCallum A.K. Distributional clustering of words for text classification, in SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval. 1998, Melbourne, AU: ACM Press,96-103.
    
    [41] Zelikovitz S., Hirsh H. Using LSI for Text Classification in the Presence of Background Text, in CIKM-01, 10th ACM International Conference on Information and Knowledge Management. 2001, Atlanta, US: ACM Press,113-118.
    
    [42] Chen L., Tokuda N., Nagai A. A new differential LSI space-based probabilistic document classifier, Information Processing Letters. 2003,88(5): 203-212.
    
    [43] Forman G. An extensive empirical study of feature selection metrics for text classification, Journal of Machine Learning Research. 2003,3(1): 1533-7928.
    
    [44] Kim H., Howland P., Park H. Dimension reduction in text classification with support vector machines, Journal of Machine Learning Research. 2005, 6:37-53.
    
    [45] Makrehchi M., Kamel M.S. Text classification using small number of features,in MLDM-05,4th International Conference Machine Learning and Data Mining in Pattern Recognition. 2005, Leipzig, Germany, 580-589.
    
    [46] Gabrilovich E., Markovitch S. Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5, in ICML-04,21st International Conference on Machine Learning. 2004, Banff, CA:Morgan Kaufmann Publishers, 41-42.
    
    [47] Soucy P., Mineau G.W. Feature selection strategies for text categorization, in CSCSI-03, 16th Conference of the Canadian Society for Computational Studies of Intelligence. 2003, Halifax, CA, 505-509.
    
    [48] Chen W., Chang X., Wang H., Zhu J., Tianshun: Y. Automatic word clustering for text categorization using global information, in AIRS-04, Information Retrieval Technology, Asia Information Retrieval Symposium. 2004, Beijing:Springer Verlag, 1-11.
    
    [49] Rogati M., Yang Y. High-performing feature selection for text classification, in CIKM-02, 11th ACM International Conference on Information and Knowledge Management. 2002, McLean, Virginia, USA: ACM Press, 659-661.
    [50] Combarro E.F., Montanes E., Diaz I., Ranilla J., Mones R. Introducing a family of linear measures for feature selection in text categorization, IEEE Transactions on Knowledge and Data Engineering. 2005,17(9): 1223-1232.
    
    [51] Lee C, Lee G.G Information gain and divergence-based feature selection for machine learning-based text categorization, Information Processing and Management. 2006,42(1): 155-165.
    
    [52] Mitchell T.M. Machine Learing. 1996, New York: McGraw Hill.
    
    [53] Li H., Yamanishi K. Text classification using ESC-based stochastic decision lists,Information Processing and Management. 2002, 38(3): 343-361.
    
    [54] Joachims T. Text categorization with support vector machines: learning with many relevant features, in Proceedings of ECML-98, 10th European Conference on Machine Learning. 1998, Springer Verlag: Chemnitz, DE, 137-142.
    
    [55] Yang Y., Liu X. A re-examination of text categorization methods, in Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval. 1999, ACM Press: Berkeley, US, 42-49.
    
    [56] Ruiz M., Srinivasan P. Hierarchical text classification using neural networks,Information Retrieval. 2002, 5(1): 87-118.
    
    [57] Selamat A., Omatu S. Web page feature selection and classification using neural networks, Information Sciences. 2004,158(1): 69-88.
    
    [58] Zhang M.-L., Zhou Z.-H. Multilabel neural networks with applications to functional genomics and text categorization, IEEE Transactions on Knowledge and Data Engineering 2006,18(10): 1338-1351.
    
    [59] Zhou S., Guan J. Chinese documents classification based on N-grams, in CICLING-02, 3rd International Conference on Computational Linguistics and Intelligent Text Processing. 2002, Mexico City: Springer Verlag, 405-414.
    
    [60] McCallum A.K., Rosenfeld R., Mitchell T.M., Ng A.Y. Improving text classification by shrinkage in a hierarchy of classes, in ICML-98, 15th International Conference on Machine Learning. 1998, Madison, US: Morgan Kaufmann Publishers, 359-367.
    
    [61] McCallum A., Nigam K. A comparison of event models for Naive Bayes text classification, in Proceedings of AAAI-98, Workshop on Learning for Text Categorization. 1998.
    
    [62] Joachims T. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, in Proceedings of ICML-97, 14th International Conference on Machine Learning. 1997, Nashville, US: Morgan Kaufmann Publishers,143-151.
    
    [63] Kim S.-B., Han K.-S., Rim H.-C, Myaeng S.H. Some effective techniques for naive bayes text classification, IEEE Transactions on Knowledge and Data Engineering. 2006,18(11): 1457-1466.
    
    [64] Nigam K.: Using unlabeled data to improve text classification, [Ph.D.Dissertation]. 2001, Pittsburgh, US, Computer Science Department, Carnegie Mellon University.
    
    [65] Yang Y., Chute C.G. An application of least squares fit mapping to text information retrieval, in Proceedings of SIGIR-93, 16th ACM International Conference on Research and Development in Information Retrieval. 1993, ACM Press: New York US, 281-290.
    
    [66] Zhang T., Oles F.J. Text categorization based on regularized linear classification methods, Information Retrieval. 2001,4(1): 5-31.
    
    [67] Dagan I., Karov Y., Roth D. Mistake-driven learning in text categorization, in Proceedings of EMNLP-97, 2nd Conference on Empirical Methods in Natural Language Processing. 1997, Association for Computational Linguistics:Morristown US, 55-63.
    
    [68] Ragas H., Koster C.H. Four text classification algorithms compared on a Dutch corpus, in SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval. 1998, Melbourne, AU: ACM Press,369-370.
    
    [69] Siefkes C, Assis F., Chhabra S., Yerazunis W.S. Combining Winnow and orthogonal sparse bigrams for incremental spam filtering, in Knowledge Discovery in Databases: Pkdd 2004, Proceedings, Lecture Notes in Artificial Intelligence 3202. 2004, Springer Verlag: Heidelberg, 410-421.
    
    [70] Koster C.H.A., Seutter M., Beney J. Multi-classification of patent applications with Winnow, in Perspectives of System Informatics, Lecture Notes in Computer Science 2890.2003, 546-555.
    
    [71] Schapire R.E., Singer Y., Singhal A. Boosting and Rocchio applied to text filtering, in SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval. 1998, Melbourne, AU: ACM Press,215-223.
    
    [72] Crammer K., Singer Y. A new family of online algorithms for category ranking,in SIGIR-02,25th ACM International Conference on Research and Development in Information Retrieval 2002, Tampere, FI: ACM Press, 151-158.
    
    [73] Moschitti A. A study on optimal parameter tuning for Rocchio text classifier, in Proceedings of ECIR-03, 25th European Conference on Information Retrieval.2003, Springer Verlag: Pisa, IT, 420-435.
    
    [74] Xu X., Zhang B., Zhong Q. Text categorization using SVMs with Rocchio ensemble for internet information classification, in ICCNMC-05, Networking and Mobile Computing: 3rd International Conference. 2005, Zhangjiajie, CN:Springer Verlag, 1022-1031.
    
    [75] Han E.-H., Karypis G., Kumar V. Text categorization using weight-adjusted k-nearest neighbor classification, in Proceedings of PAKDD-01, 5th Pacific-Asia Conferenece on Knowledge Discovery and Data Mining. 2001,Springer Verlag: Hong Kong, CN, 53-65.
    
    [76] Soucy P., Mineau GW. A Simple kNN algorithm for text categorization, in ICDM-01, IEEE International Conference on Data Mining. 2001, San Jose, CA:IEEE Computer Society Press, 647-648.
    
    [77] Guo G., Wang H., Bell D.A., Bi Y., Greer K. An kNN model-based approach and its application in text categorization, in CICLING-04, 5th International Conference on Computational Linguistics and Intelligent Text Processing. 2004,Seoul, KO: Springer Verlag, 559-570.
    
    [78] Rahal I., Perrizo W. An optimized approach for KNN text categorization using P-trees, in SAC-04, 9th ACM Symposium on Applied Computing. 2004, Nicosia,CY.
    
    [79] Tsay J.-J., Wang J.-D. Improving linear classifier for Chinese text categorization,Information Processing and Management. 2004,40(2): 223-237.
    
    [80] Debole F., Sebastiani F. An analysis of the relative hardness of Reuters-21578 subsets, Journal of the American Society for Information Science and Technology. 2004, 56(6): 584-596.
    
    [81] Lewis D.D., Li F., Rose T., Yang Y. RCV1: A new benchmark collection for text categorization research, Journal of Machine Learning Research. 2004, 5:361-397.
    
    [82] Forman G, Cohen I. Learning from little: Comparison of classifiers given little training, in Proceedings of PKDD-04, 8th European Conference on Principles of Data Mining and Knowledge Discovery. 2004, Springer Verlag: Pisa, IT,161-172.
    [83]Kazama J.,Tsujii J.Maximum entropy models with inequality constraints:a case study on text categorization,Machine Learning.2005,60(1-3):159-194.
    [84]李荣陆,王建会,陈晓云,陶晓鹏,胡运发.使用最大熵模型进行中文文本分类,计算机研究与发展.2005,42(1):94-101.
    [85]Liu W.Y.,Song N.A fuzzy approach to classification of text documents,Journal of Computer Science and Technology.2003,18(5):640-647.
    [86]Widyantoro D.H.,Yen J.A fuzzy similarity approach in text classification task,in Proceedings of Fuzz-IEEE-00,9th IEEE International Conference on Fuzzy Systems,Vols 1 and 2.2000,653-658.
    [87]王建会,王洪伟,申展,胡运发.一种实用高效的文本分类算法,计算机研究与发展.2005,42(1):85-93.
    [88]Wu H.,Phang T.H.,Liu B.,Li X.A refinement approach to handling model misfit in text categorization,in SIGKDD-02,8th ACM international conference on Knowledge discovery and data mining.2002,Edmonton,Alberta,Canada:ACM Press,207-216.
    [89]Tan S.,Cheng X.,Ghanem M.M.,Wang B.,Xu H.A novel refinement approach for text categorization,in CIKM-05,14th ACM Conference on Information and Knowledge Management.2005,Bremen,Germany:ACM Press,469-476.
    [90]Chakrabarti S.,Roy S.,Soundalgekar M.Fast and accurate text classification via multiple linear discriminant projections,International Journal on Very Large Data Bases.2003,12(2):170-185.
    [91]Lam W.,Lai K.-Y.Automatic textual document categorization based on generalized instance sets and a metamodel,IEEE Transactions on Pattern Analysis and Machine Intelligence.2003,25(5):628-633.
    [92]Wei Y.-G,Tsay J.-J.:A study of multiple classifier systems in automated text categorization,[Master Dissertation].2002,Chiayi,Taiwan,College of Engineering National Chung Cheng University.
    [93]Breiman L.Bagging predictiors,Machine learning.1996,24(2):123-140.
    [94]Aas K.,Eikvil L.Text categorization:A survey,Technical Report,NR 941.1999,Oslo,Norwegian Computing Center.
    [95]刁力力,胡可云,陆玉昌,石纯一.用Boosting方法组合增强Stumps进行文本分类,软件学报.2002,13(8):1361-1367.
    [96]Schapire R.E.,Singer Y.BoosTexter:A boosting-based system for text categorization,Machine Learning.2000,39(2-3):135-168.
    [97]Kim Y.-H.,Hahn S.-Y.,Zhang B.-T.Text filtering by boosting naive Bayes classifiers,in SIGIR-00,23rd ACM International Conference on Research and Development in Information Retrieval.2000,Athens,GR:ACM Press,168-175.
    [98]Nardiello P.,Sebastiani F.,Sperduti A.Discretizing continuous attributes in AdaBoost for text categorization,in Proceedings of ECIR-03,25th European Conference on Information Retrieval.2003,Springer Verlag:Pisa,IT,320-334.
    [99]Kim H.J.,Kim J.Combining active learning and boosting for Naive Bayes text classifiers,in Advances in Web-Age Information Management:Proceedings,Lecture Notes in Computer Science 3129.2004,Springer Verlag:Heidelberg,519-527.
    [100]Diao L.L.,Hu K.Y.,Lu Y.C.,Shi C.Y.Boosting simple decision trees with Bayesian learning for text categorization,in Proceedings of the 4th World Congress on Intelligent Control and Automation,Vols 1-4.2002,321-325.
    [101]Esuli A.,Fagni T.,Sebastiani F.TreeBoost.MH:A boosting algorithm for multi-label hierarchical text categorization,in SPIRE 2006,Lecture Notes in Computer Science 4209.2006,Springer Verlag:Heidelberg,13-24.
    [102]Lee B.,Park Y.An e-mail monitoring system for detecting outflow of confidential documents,in Intelligence and Security Informatics,Proceedings,Lecture Notes in Computer Science 2665.2003,371-374.
    [103]Zhu X.Semi-Supervised Learning Literature Survey,Technical Report,Computer Sciences TR 1530.2006,Madison,University of Wisconsin-Madison.
    [104]Nigam K.,McCallum A.,Mitchell T.Semi-supervised Text Classification Using EM,in Semi-Supervised Learning.2006,MIT Press:Boston.
    [105]Yang Y.,Zhang J.,Kisiel B.A scalability analysis of classifiers in text categorization,in SIGIR-03,26th ACM International Conference on Research and Development in Information Retrieval.2003,Toronto,CA:ACM Press,96-103.
    [106]Liu T.-Y.,Yang Y.,Wan H.,Zeng H.-J.,Chen Z.,Ma W.-Y.Support vector machines classification with a very large-scale taxonomy,SIGKDD Explor.Newsl.2005,7(1):36-43.
    [107]Liu T.-Y.,Yang Y.,Wan H.,Zhou Q.,Gao B.,Zeng H.-J.,Chen Z.,Ma W.-Y.An experimental study on large-scale Web categorization,in WWW-05,14th International World Wide Web Conference.2005,Chiba,Japan:ACM Press, 1106-1107.
    [108]Oh H.-J.,Myaeng S.H.,Lee M.-H.A,practical hypertext categorization method using links and incrementally available class information,in SIGIR-00,23rd ACM International Conference on Research and Development in Information Retrieval.2000,Athens,GR:ACM Press,264-271.
    [109]Yang Y.,Slattery S.,Ghani R.A study of approaches to hypertext categorization,Journal of Intelligent Information Systems.2002,18(2-3):219-241.
    [110]Glover E.J.,Tsioutsiouliklis K.,Lawrence S.,Pennock D.M.,Flake G.W.Using Web structure for classifying and describing Web pages,in WWW-02,International Conference on the World Wide Web.2002,Honolulu,US:ACM Press,562-569.
    [111]Furnkranz J.Exploiting structural information for text classification on the WWW,in Advances in Intelligent Data Analysis,Proceedings,Lecture Notes in Computer Science 1642.1999,Springer Verlag,487-497.
    [112]Kan M.-Y.,Thi H.O.N.Fast Webpage classification using URL features,in CIKM-05,14th ACM Conference on Information and Knowledge Management.2005,Bremen,Germany:ACM Press,325-326.
    [113]Shih L.K.,Karger D.R.Using URLs and table layout for Web classification tasks,in WWW-04,13th International Conference on the World Wide Web.2004,New York:ACM Press,193-202.
    [114]Chakrabarti S.,Dom B.E.,Indyk P.Enhanced hypertext categorization using hypedinks,in SIGMOD-98,ACM International Conference on Management of Data.1998,Seattle,US:ACM Press,307-318.
    [115]Nigam K.,Ghani R.Analyzing the applicability and effectiveness of co-training,in CIKM-00,9th ACM International Conference on Information and Knowledge Management.2000,McLean,US:ACM Press,86-93.
    [116]樊兴华,孙茂松.一种高性能的两类中文文本分类方法,计算机学报.2006,29(1):124-131.
    [117]Cristianini N.,Shawe-Taylor J.,Lodhi H.Latent semantic kernels,Journal of Intelligent Information Systems.2002,18(2/3):127-152.
    [118]Lodhi H.,Saunders C.,Shawe-Taylor J.,Cristianini N.,Watkins C.Text classification using string kernels,Journal of Machine Learning Research.2002,2:419-444.
    [119]Cancedda N.,Gaussier E.,Goutte C.,Renders J.M.Word sequence kernels, Journal of Machine Learning Research.2003,3(6):1059-1082.
    [120]Leslie C.,Kuang R.Fast kernels for inexact string matching,in 16th Annual Conference on Learning Theory and 7th Kernel Workshop(COLT/Kernel 2003).2003,Washington,DC:Springer Verlag,114-128.
    [121]Wang J.,Wang H.,Zhang S.,Hu Y.A simple and efficient algorithm to classify a large scale of text,Journal of Computer Research and Development.2005,42(1):85-93.
    [122]Debole F.,Sebastiani F.An analysis of the relative hardness of Reuters-21578subsets,in Proceedings of LREC-04,4th International Conference on Language Resources and Evaluation.2004:Lisbon,PT,971-974.
    [123]Tan S.,Cheng X.,Wang B.,Xu H.,Ghanem M.M.,Guo Y.Using dragpushing to refine centroid text classifiers,in SIGIR-05,28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 2005,Salvador,Brazil:ACM Press,653-654.
    [124]Hsu C.,Liu C.A comparison of methods for multi-class support vector machines,IEEE Transactions on Neural Networks.2002,13:415-425.
    [125]Ghani R.Using error-correcting codes for text classification,in Proceedings of ICML-00,17th International Conference on Machine Learning.2000,Morgan Kaufmann Publishers:San Francisco,US,303-310.
    [126]Vural V.,Dy J.G.A hierarchical methods for multi-class Support Vector Machines,in ICML-04,21st International Conference on Machine Learning.2004,Banff,Canada,105-112.
    [127]Fei B.,Liu J.Binary tree of SVM:A new fast multiclass training and classification algorithm,IEEE Transactions on Neural Networks.2006,17(3):696-704.
    [128]The 20 Newsgroups data set.2007.http://people.csail.mit.edu/jrennie/20Newsgroups/
    [129]Lewis D.D.Reuters-21578 text categorization test collection.Distribution 1.0.1997.http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt
    [130]谭松波,王月粉.中文文本分类语料库-TanCorpV1.0.2005.http://www.searchforum.org.cn/tansongbo/corpus1.php
    [131]Chang C.-C.,Lin C.-J.LIBSVM:A library for support vector machines.2002.http://www.csie.ntu.edu.tw/~cjlin/libsvm
    [132]Sun A.,Lim E.-P.,Ng W.-K.Hierarchical text classification methods and their specification,in Cooperative Internet Computing.2003,Kluwer Academic Publishers:Dordrecht,NL,236-256.
    [133]Sun A.,Lim E.-P.,Ng W.-K.,Srivastava J.Blocking reduction strategies in hierarchical text classification,IEEE Transactions on Knowledge and Data Engineering.2004,16(10):1305-1308.
    [134]Koller D.,Sahami M.Hierarchically classifying documents using very few words,in ICML-97,14th International Conference on Machine Learning.1997,Nashville,US:Morgan Kaufmann Publishers,170-178.
    [135]Sun A.,Lim E.-P.Hierarchical text classification and evaluation,in ICDM-01,1st IEEE International Conference on Data Mining.2001,San Jose,CA:IEEE Computer Society,521-528.
    [136]Cheng C.H.,Tang J.,Fu A.W.-C.,King I.Hierarchical classification of documents with error control,in PAKDD-01,5th Pacific-Asia Conference on Knowledge Discovery and Data Mining.2001,Hong Kong:Springer-Verlag.
    [137]Sun A.,Lim E.-P.,Ng W.-K.Performance measurement framework for hierarchical text classification,Journal of the American Society for Information Science and Technology.2003,54(11):1014-1028.
    [138]Dempster A.P.,Laird N.M.,Rubin D.B.Maximum likelihood from incomplete data via the EM algorithm,Journal of the Royal Statistical Society,Series B (Methodological).1997,39(1):1-38.
    [139] Cong G., Lee W.S., Wu H.R., Liu B. Semi-supervised text classification using partitioned EM, in Database Systems for Advanced Applications, Lecture Notes in Computer Science 2973. 2004, 482-493.
    [140] Joachims T. Transductive inference for text classification using support vector machines, in ICML-99, 16th International Conference on Machine Learning. 1999, Bled, SL: Morgan Kaufmann Publishers, 200-209.
    [141] Taira H., Haruno M. Text categorization using transductive boosting, in ECML-01, 12th European Conference on Machine Learning. 2001, Springer Verlag: Freiburg, 454-465.
    [142] Chen Y.-S., Wang G.-P., Dong S.-H. A progressive transductive inference classification algorithm based on support vector machines, Journal of Software. 2003, 14(3):451-460. (in Chinese)
    [143] Blum A., Mitchell T. Combining labeled and unlabeled data with co-training, in COLT-98, 11th Conference on Computational Learning Theory. 1998, Madison, Wisconsin: ACM, 92-100.
    [144] Kiritchenko S., Matwin S. Email classification with co-training, in Proceedings of the 2001 Conference of the Centre for Advanced Studies on Collaborative Research. 2001, IBM Press: Toronto, Ontario, Canada, 8-17.
    [145] Park S.-B., Zhang B.-T. Co-trained support vector machines for large scale unstructured document classification using unlabeled data and syntactic information, Information Processing and Management. 2004, 40(3):421-439.
    [146] Zhou Z.-H., Li M. Tri-training: Exploiting unlabeled data using three classifiers, IEEE Transactions on Knowledge and Data Engineering. 2005, 17(11):1529-1541.
    [147] Deng C., Guo M.-Z. Tri-training algorithm based on an adaptive data editing strategy, Chinese Journal of Computers. 2007, 30(8):1213-1226. (in Chinese)
    [148] Li M., Zhou Z.-H. SETRED: Self-training with editing, in PAKDD-05, 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2005, Hanoi, Vietnam: Springer Verlag, 500-504.
    [149] Yu H., Han J., Chang K.C.-C. PEBL: Web page classification without negative examples, IEEE Transactions on Knowledge and Data Engineering. 2004, 16(1):70-81.
    [150] Fung G.P.C., Yu J.X., Lu H., Yu P.S. Text classification without negative examples revisit, IEEE Transactions on Knowledge and Data Engineering. 2006, 18(1):6-20.
    [151] Li X., Liu B. Learning to classify texts using positive and unlabeled data, in IJCAI-03, 18th International Joint Conference on Artificial Intelligence. 2003, Acapulco, MX: Morgan Kaufmann Publishers, 587-594.
    [152] Angluin D., Laird P. Learning from noisy examples, Machine Learning. 1988, 2(4):343-370.
    [153] Wang W., Zhou Z.-H. Analyzing co-training style algorithms, in ECML-07, 18th European Conference on Machine Learning. 2007, Warsaw, Poland: Springer Verlag, 454-465.
    [154] Goldman S., Zhou Y. Enhancing supervised learning with unlabeled data, in ICML-00, 17th International Conference on Machine Learning. 2000, San Francisco, CA, 327-334.
    [155] Zhou Z.-H., Li M. Semi-supervised regression with co-training, in IJCAI-05, 19th International Joint Conference on Artificial Intelligence. 2005, Edinburgh, Scotland, 908-913.
    [156] Kang N., Domeniconi C., Barbara D. Categorization and keyword identification of unlabeled documents, in ICDM-05, 5th IEEE International Conference on Data Mining. 2005, Houston, Texas: IEEE Computer Society, 677-680.
    [157] Fjällström P.-O. Algorithms for graph partitioning: A survey, Linköping Electronic Articles in Computer and Information Science, 3. 1998.
    [158] Karypis G., Kumar V. METIS: Serial graph partitioning and fill-reducing matrix ordering. 2007. http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
    [159] CMU World Wide Knowledge Base (Web->KB) project. 2001. http://www.cs.cmu.edu/afs/cs/project/theo-11/www/wwkb/
    [160] Zhang M.-L., Zhou Z.-H. Multi-label learning by instance differentiation, in AAAI-07, 22nd AAAI Conference on Artificial Intelligence. 2007, Vancouver, Canada, 669-674.
    [161] Zhang M.-L., Zhou Z.-H. ML-kNN: A lazy learning approach to multi-label learning, Pattern Recognition. 2007, 40(7):2038-2048.
    [162] Elisseeff A., Weston J. A kernel method for multi-labelled classification, in Advances in Neural Information Processing Systems 14. 2002, MIT Press: Cambridge, MA, 681-687.
    [163] Boutell M.R., Luo J.B., Shen X.P., Brown C.M. Learning multi-label scene classification, Pattern Recognition. 2004, 37(9):1757-1771.
    [164] Yang Y. A study on thresholding strategies for text categorization, in SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval. 2001, New Orleans, US: ACM Press, 137-145.
    [165] Lee K.H., Kay J., Kang B.H., Rosebrock U. A comparative study on statistical machine learning algorithms and thresholding strategies for automatic text categorization, in PRICAI-02, 7th Pacific Rim International Conference on Artificial Intelligence. 2002, Springer Verlag: Tokyo, JP, 444-453.
    [166] Li T., Zhang C., Zhu S. Empirical studies on multi-label classification, in ICTAI-06, 18th IEEE International Conference on Tools with Artificial Intelligence. 2006, Washington D.C., US: IEEE Computer Society, 86-92.
    [167] Gao S., Wu W., Lee C.-H., Chua T.-S. A maximal figure-of-merit learning approach to text categorization, in SIGIR-03, 26th ACM International Conference on Research and Development in Information Retrieval. 2003, Toronto, CA: ACM Press, 174-181.
    [168] McCallum A.K. Multi-label multiclass document classification with a mixture model trained by EM, in Working Notes of the AAAI-99 Workshop on Text Learning. 1999. http://citeseer.ist.psu.edu/article/mccallum99multilabel.html
    [169] Kaneda Y., Ueda N., Saito K. Extended parametric mixture model for robust multi-labeled text categorization, in Knowledge-Based Intelligent Information and Engineering Systems, Pt 2, Proceedings, Lecture Notes in Computer Science 3214. 2004, Springer Verlag: Heidelberg, 616-623.
    [170] Ueda N., Saito K. Parametric mixture models for multi-label text, in Advances in Neural Information Processing Systems 15. 2003, MIT Press: Cambridge, MA, 721-728.
    [171] Zhu S., Ji X., Xu W., Gong Y. Multi-labelled classification using maximum entropy method, in SIGIR-05, 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2005, Salvador, Brazil: ACM Press, 274-281.
    [172] Comite F.D., Gilleron R., Tommasi M. Learning multi-label alternating decision tree from texts and data, in Lecture Notes in Computer Science 2734. 2003, Springer Berlin: Heidelberg, 35-49.
    [173] Kazawa H., Izumitani T., Taira H., Maeda E. Maximal margin labeling for multi-topic text categorization, in Advances in Neural Information Processing Systems 17. 2005, MIT Press: Cambridge, MA, 649-656.
    [174] Press W.H., Teukolsky S.A., Vetterling W.T., Flannery B.P. Numerical Recipes in C: The Art of Scientific Computing. 1992, New York: Cambridge University Press.
