Efficient Machine Learning Technique for Web Page Classification
  • Authors: S. Markkandeyan; M. Indra Devi
  • Keywords: Web page classification; Feature selection; Attribute-selected classifier; Principal component analysis; Genetic search; Rank search
  • Journal: Arabian Journal for Science and Engineering
  • Publication year: 2015
  • Publication date: December 2015
  • Volume: 40
  • Issue: 12
  • Pages: 3555-3566
  • Full-text size: 649 KB
  • Affiliations: S. Markkandeyan (1)
    M. Indra Devi (2)

    1. Department of Information Technology, Ratnavel Subramaniam College of Engineering and Technology, Dindigul, Tamilnadu, 624005, India
    2. Department of Computer Science and Engineering, Kamaraj College of Engineering and Technology, Virudhunagar, Tamilnadu, India
  • Journal category: Engineering
  • Journal subjects: Engineering, general
    Mathematics
    Science, general
  • Publisher: Springer Berlin / Heidelberg
Abstract
Web page classification plays a major role in information management and retrieval tasks. Feature selection is an important step toward accurate classification of Web pages. Web pages contain many features, and an excessive number of features reduces classification accuracy. We propose a hybrid feature selection approach that is both efficient and effective for automatic Web page classification and that also helps Web search tools return relevant results in the relevant category. We conducted experiments with various feature selection methods for the Web page classification and keyword search problems. These experiments showed that some features in the initial feature set (IFS) are irrelevant, redundant, or noisy; they consume more memory, increase computational time, and degrade predictive performance. Such features can be eliminated using evaluator methods such as principal component analysis and the consistency subset evaluator, together with search methods such as genetic search and rank search, yielding a smaller and more relevant set of features. We call this the intermediate feature set (IMFS); further optimization of this set gives more accurate results. Finally, an attribute-selected classifier, which is a machine learning meta-classifier, was applied to the IMFS to obtain the final feature set (FFS). On the WebKB (Faculty and Course) and ODP (Sports) benchmark datasets, accuracy increased to as much as 97% and computational time was reduced for all classifiers compared with the IFS. The proposed method yields better classification performance and reduces space requirements and search time in Web documents compared with existing methods.
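
To make the idea of pairing a search method with a subset evaluator concrete, here is a minimal sketch of a rank search scored by a consistency measure. It is not the authors' implementation or configuration: the toy term-presence data, the feature names, the use of information gain as the ranking criterion, and all function names (`info_gain`, `consistency`, `rank_search`) are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(column, labels):
    """Information gain of one discrete feature column w.r.t. the labels."""
    n = len(labels)
    groups = defaultdict(list)
    for value, y in zip(column, labels):
        groups[value].append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def consistency(rows, labels, subset):
    """1 - inconsistency rate of the data projected onto `subset`.

    Rows that share a feature pattern but not the majority class of
    that pattern are counted as inconsistent.
    """
    patterns = defaultdict(Counter)
    for row, y in zip(rows, labels):
        patterns[tuple(row[j] for j in subset)][y] += 1
    inconsistent = sum(sum(c.values()) - max(c.values())
                       for c in patterns.values())
    return 1 - inconsistent / len(rows)

def rank_search(rows, labels):
    """Rank features by information gain, then score nested top-k subsets."""
    n_features = len(rows[0])
    ranked = sorted(range(n_features),
                    key=lambda j: info_gain([r[j] for r in rows], labels),
                    reverse=True)
    best_subset, best_score = [], -1.0
    for k in range(1, n_features + 1):
        subset = ranked[:k]
        score = consistency(rows, labels, subset)
        if score > best_score:  # strict '>' keeps the smaller subset on ties
            best_subset, best_score = subset, score
    return sorted(best_subset), best_score

# Hypothetical initial feature set: presence (1) or absence (0) of four terms.
feature_names = ["course", "professor", "the", "home"]
rows = [
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
]
labels = ["course", "course", "course", "faculty", "faculty", "faculty"]

selected, score = rank_search(rows, labels)
print([feature_names[j] for j in selected], score)
```

On this toy data the constant term "the" and the weakly informative term "home" are discarded, because the two top-ranked terms already separate the classes without any inconsistent rows. A real pipeline would run the selection inside the training folds only, which is the role a meta-classifier that wraps attribute selection plays.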
