Research on Web Page Classification Based on RBF Neural Networks
Abstract
With the spread of the Internet, the web has become the main channel through which people obtain information. Automatic web page classification emerged to help people find useful information among massive numbers of web pages: it applies machine learning to assign category labels to web pages automatically, so that large volumes of web content can be analyzed and organized quickly and effectively. Among the many web page classification algorithms, the RBF neural network has become a research focus in machine learning because of its strong classification ability.
     This thesis describes the web page classification workflow, reviews the development, principles, and related techniques of RBF neural networks, and discusses their role in web page classification. It surveys the training algorithms commonly used for RBF neural networks and studies MIMLRBF, an RBF neural network model developed within the multi-instance multi-label (MIML) framework. To address the poor classification performance of MIMLRBF on imbalanced samples, an improved training algorithm is proposed. It takes the overall distribution of the samples into account so that the numbers of hidden-layer neurons generated for the different classes tend toward balance, which reduces the influence of imbalanced samples on the network model.
     When the training data are noisy or hard to discriminate, the SVD method increases the overall error of the network. To relieve this problem, a weight training algorithm optimized by the steepest descent method is proposed. The weight matrix is first initialized with the SVD method and then optimized by steepest descent; the learning rate matrix is computed by minimizing the sum-of-squared-errors function of the new weight matrix. This improves the classification accuracy of the MIMLRBF neural network on sample sets that contain noisy data.
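The two-stage idea of initializing with SVD and then refining by steepest descent can be illustrated with a minimal sketch. It is not the thesis's algorithm, and everything specific in it is an assumption made for illustration: the function names (`svd_init`, `refine_steepest_descent`, `sse`), the relative truncation threshold `rel_tol`, and the use of a single scalar learning rate obtained by exact line search, which stands in for the learning rate matrix mentioned in the abstract. Small singular values are truncated during initialization because a full-rank pseudoinverse already minimizes the squared error and would leave steepest descent nothing to do.

```python
import numpy as np

def sse(H, W, T):
    """Sum of squared errors of the linear output layer H @ W against targets T."""
    E = H @ W - T
    return float(np.sum(E * E))

def svd_init(H, T, rel_tol=0.1):
    """Truncated-SVD least-squares initialization (rel_tol is an assumed threshold)."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    keep = s > rel_tol * s[0]          # drop directions with small singular values
    return (Vt[keep].T / s[keep]) @ (U[:, keep].T @ T)

def refine_steepest_descent(H, T, W, n_iter=20):
    """Refine W by steepest descent on 0.5 * ||H W - T||^2 with exact line search."""
    W = W.copy()
    for _ in range(n_iter):
        G = H.T @ (H @ W - T)          # gradient with respect to W
        HG = H @ G
        denom = np.sum(HG * HG)
        if denom < 1e-12:              # gradient is numerically zero, stop
            break
        eta = np.sum(G * G) / denom    # step size that minimizes the error along -G
        W -= eta * G
    return W

# Hypothetical data: H stands for hidden-layer activations, T for noisy label targets.
rng = np.random.default_rng(0)
H = rng.normal(size=(40, 6))
T = H @ rng.normal(size=(6, 3)) + 0.1 * rng.normal(size=(40, 3))
W0 = svd_init(H, T, rel_tol=0.5)
W1 = refine_steepest_descent(H, T, W0)
print(sse(H, W0, T), sse(H, W1, T))
```

Because the objective is quadratic, each line-search step cannot increase the training error, so the number of iterations controls how far the weights move from the truncated-SVD starting point toward the full least-squares fit.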
     Finally, the improved training algorithms are applied to a web page classification system, and their performance is analyzed and compared experimentally. The experimental data show that the proposed algorithms achieve higher classification efficiency and accuracy.