Research on Pruning Patent Training Data Based on Supervised Clustering
Abstract
We live in an era of information explosion, in which every industry has accumulated large, even massive, volumes of data. According to statistics from the World Intellectual Property Organization (WIPO), patent documents contain 90% to 95% of the world's annual inventive output; more than one million new applications are filed worldwide each year, and the cumulative total is now close to 40 million. Making full use of these patent documents for technical innovation can save 60% of development time and 40% of research funding. Each patent is assigned, according to its content, to a code in the International Patent Classification (IPC). Because of the scale of the data, classifying patents entirely by human experts would consume enormous manpower and resources, which has spurred research on automatic patent classification. Naive Bayes, nearest neighbor, decision trees, and support vector machines have all been applied to text classification and achieved some success. Patent classification, however, is a large-scale, imbalanced, hierarchical, and multi-label text classification problem, and most traditional classification methods cannot handle a problem this complex. Even the best-performing classifier, the support vector machine, struggles here: its training is a quadratic programming problem, so training time grows nearly quadratically with the number of training samples. Bao-Liang Lu and his collaborators therefore proposed the min-max modular (M³) network, whose most notable feature is its parallel, modular structure. Its basic idea is "divide and conquer": decompose a large-scale problem into a number of independent small-scale problems, solve these small problems separately, and then combine their solutions into a solution of the original large-scale problem.
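As a concrete illustration of this divide-and-conquer scheme, here is a minimal Python sketch of min-max modular decomposition and combination for a two-class problem. The base module is scikit-learn's SVC, chosen only for brevity (the thesis's modules use Gaussian zero-crossing functions); the parameters n_pos_splits and n_neg_splits are illustrative assumptions.

```python
# Minimal sketch of min-max modular (M3) decomposition and combination.
import numpy as np
from sklearn.svm import SVC

def train_m3(X_pos, X_neg, n_pos_splits=2, n_neg_splits=2):
    """Split each class into subsets and train one small module per
    (positive subset, negative subset) pair; each subproblem is small."""
    pos_parts = np.array_split(X_pos, n_pos_splits)
    neg_parts = np.array_split(X_neg, n_neg_splits)
    modules = []
    for Xp in pos_parts:
        row = []
        for Xn in neg_parts:
            X = np.vstack([Xp, Xn])
            y = np.hstack([np.ones(len(Xp)), -np.ones(len(Xn))])
            row.append(SVC(kernel="rbf").fit(X, y))
        modules.append(row)
    return modules

def predict_m3(modules, X):
    """MIN over modules sharing a positive subset, then MAX over positive subsets."""
    # scores[i, j, n]: output of module (i, j) on sample n
    scores = np.array([[m.decision_function(X) for m in row] for row in modules])
    return np.sign(scores.min(axis=1).max(axis=0))  # min-max combination
```

Incremental learning falls out of this structure: when new samples arrive as new subsets, only the modules that pair with those subsets need to be trained, and all existing modules are left untouched.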
     The contribution of this thesis is to introduce a supervised clustering algorithm based on the min-max modular network with Gaussian zero-crossing functions, use it to prune the training data, and apply it successfully to patent classification. The main contributions are as follows.
     1) We analyze the characteristics of the min-max modular network with Gaussian zero-crossing functions: it is highly modular, can output "unknown", and supports incremental learning.
     2) We analyze the receptive field of this network and, based on it, cluster the training samples during learning to remove redundant samples (steps 2 and 3 are illustrated in the sketch after this list).
     3) After clustering, some clusters may contain very few samples, and those samples may be noise. We post-process the samples with a noise-removal and cluster-merging algorithm.
     4) We run patent classification experiments on the NTCIR-5 patent collection and compare performance with and without clustering. The results show that the proposed clustering algorithm removes redundant samples and, with a much smaller training set, maintains or even improves generalization ability.
     5) Through simulation experiments, we also verify the incremental learning ability of the min-max modular network with Gaussian zero-crossing functions.
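To make items 2) and 3) concrete, the following minimal sketch shows one plausible form of the pruning pipeline: greedy class-wise clustering with a radius adapted to each sample's distance to the nearest opposite-class point (a crude stand-in for the Gaussian zero-crossing receptive field described in the thesis), followed by removal of tiny clusters as noise and merging of nearby centers. The radius factor lam and the thresholds min_size and merge_dist are illustrative assumptions, not the thesis's exact criteria.

```python
# Illustrative sketch of supervised clustering plus noise-removal/merging post-processing.
import numpy as np

def supervised_cluster(X, X_other, lam=0.5):
    """Greedy leader clustering within one class: a sample joins the first
    existing center within a radius set by the sample's own distance to the
    nearest opposite-class point; otherwise it founds a new cluster."""
    centers, members = [], []
    for x in X:
        r_local = lam * np.linalg.norm(X_other - x, axis=1).min()
        placed = False
        for k, c in enumerate(centers):
            if np.linalg.norm(x - c) <= r_local:
                members[k].append(x)
                placed = True
                break
        if not placed:
            centers.append(x)
            members.append([x])
    return centers, members

def postprocess(centers, members, min_size=2, merge_dist=0.3):
    """Drop clusters with too few samples (likely noise), then merge centers
    closer than merge_dist, keeping one representative per merged group."""
    kept = [(c, m) for c, m in zip(centers, members) if len(m) >= min_size]
    pruned = []
    for c, _ in kept:
        if all(np.linalg.norm(c - p) > merge_dist for p in pruned):
            pruned.append(c)
    return np.array(pruned)
```

In a variant like this, the surviving representatives (rather than all raw samples) form the reduced training set fed to the M³ network, which is what shrinks the training data while aiming to preserve the decision boundary.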
