Clustering of Wikipedia Texts Based on Keywords
  • Keywords: Documents; Text; Spectral clustering; IR; Wikipedia; HCI
  • Journal: Lecture Notes in Computer Science
  • Publication year: 2016
  • Volume: 9790
  • Issue: 1
  • Pages: 513-529
  • Full-text size: 1,993 KB
  • Author affiliations: Jalalaldin Gharibi Karyak (22)
    Fardin Yazdanpanah Sisakht (22)
    Sadrollah Abbasi (23)

    22. Technical and Vocational University, Yasooj, Iran
    23. Department of Computer Engineering, Iran Health Insurance Organization, Yasouj, Iran
  • Book series: Computational Science and Its Applications – ICCSA 2016
  • ISBN: 978-3-319-42092-9
  • Publication category: Computer Science
  • Publication subjects: Artificial Intelligence and Robotics
    Computer Communication Networks
    Software Engineering
    Data Encryption
    Database Management
    Computation by Abstract Devices
    Algorithm Analysis and Problem Complexity
  • Publisher: Springer Berlin / Heidelberg
  • ISSN: 1611-3349
  • Volume sort order: 9790
Abstract
The paper presents an application of spectral clustering algorithms to grouping Wikipedia search results. The main contribution of the paper is a representation method for Wikipedia articles that combines words and links and is used to categorize search results in this repository. We evaluate the proposed approach with Principal Component Analysis and show, on test data, how using a cosine transformation to create combined representations influences data variability. On sample test datasets we also show how the combined representation improves data separation, which improves the overall results of data categorization. We review the main spectral clustering methods and compare them using external validation criteria with standard clustering quality measures. A discussion of the descriptiveness of the evaluation measures, together with experiments on test datasets, allows us to select the one spectral clustering algorithm that is implemented in our system. We briefly describe the architecture of the system, which groups online Wikipedia articles retrieved with specified keywords. Using the system, we show how clustering increases information retrieval effectiveness for the Wikipedia data repository.
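
The abstract does not say how the word-based and link-based representations are combined, so the following is only a minimal sketch under stated assumptions. It computes cosine similarity separately on word vectors and on link vectors and mixes the two with a weight. The function names, the mixing weight `alpha`, and the toy vectors are all hypothetical and are not taken from the paper. A matrix of this kind could serve as the affinity input to a spectral clustering algorithm.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(word_vecs, link_vecs, alpha=0.5):
    """Pairwise article similarity mixing word-based and link-based cosine.

    alpha is a hypothetical mixing weight (an assumption, not from the paper):
    1.0 uses words only, 0.0 uses links only.
    """
    n = len(word_vecs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            by_words = cosine(word_vecs[i], word_vecs[j])
            by_links = cosine(link_vecs[i], link_vecs[j])
            sim[i][j] = alpha * by_words + (1.0 - alpha) * by_links
    return sim


# Toy data (invented for illustration): three articles, four terms, three link targets.
word_vecs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 3, 1]]
link_vecs = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]

S = combined_similarity(word_vecs, link_vecs)
```

In this toy data the first two articles share both terms and a link target, so their combined similarity is higher than their similarity to the third article, which shares neither.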
