Statistical word sense aware topic models


Authors: Guoyu Tang (1)
Yunqing Xia (1)
Jun Sun (2)
Min Zhang (3)
Thomas Fang Zheng (1)

1. Department of Computer Science and Technology ; TNList ; Tsinghua University ; Beijing ; China
2. Institute for Infocomm Research ; A-STAR ; Singapore ; Singapore
3. Soochow University ; Suzhou ; China
Keywords: Topic modeling ; Word sense induction ; Document representation ; Document clustering
Journal: Soft Computing - A Fusion of Foundations, Methodologies and Applications
Publication year: 2015
Publication date: January 2015
Volume: 19
Issue: 1
Pages: 13-27
Journal category: Engineering
Journal subjects: Numerical and Computational Methods in Engineering
Theory of Computation
Computing Methodologies
Mathematical Logic and Foundations
Control Engineering
Publisher: Springer Berlin / Heidelberg
ISSN: 1433-7479

Abstract

Latent Dirichlet allocation (LDA) has proven effective at modeling the semantic relations between surface words, and this semantic information is useful for estimating the topic distribution of a document. In general, a surface word may contribute significantly to several topics in a document collection. LDA measures the contribution of a surface word to each topic and treats a surface word as identical across all documents. However, a surface word can carry different meanings in different contexts: polysemous words are used with different senses in different contexts. Intuitively, disambiguating word senses should enhance the discriminative capability of topic models. In this work, we propose a joint model that induces document topics and word senses simultaneously. Instead of relying on predefined word sense resources, we capture word sense information via a latent variable and induce it directly from the corpora in a fully unsupervised manner. Experimental results show that the proposed joint model significantly outperforms the baselines in document clustering and also improves word sense induction over a standalone non-parametric model.
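
To make the contrast in the abstract concrete, the following is a minimal toy sketch, not the authors' model. It compares the standard LDA mixture, p(w | d) = sum over topics k of p(k | d) p(w | k), with a hypothetical variant that routes each word through a latent sense, p(w | d) = sum over k and s of p(k | d) p(s | k) p(w | s). The sense layer, all variable names (PHI, PSI, SENSE_WORD), and every probability below are assumptions invented for illustration; the abstract does not specify how the paper parameterizes or infers its latent sense variable.

```python
# Toy illustration only: the numbers, names, and the topic -> sense -> word
# factorization are assumptions, not the model from the paper.

# Standard LDA view: each topic is a distribution over surface words.
PHI = [
    {"bank": 0.5, "money": 0.5},    # topic 0 (finance-like)
    {"bank": 0.25, "water": 0.75},  # topic 1 (nature-like)
]

# Hypothetical sense layer: each topic is a distribution over senses,
# and each sense is a distribution over surface words.
PSI = [
    {"bank_fin": 0.5, "money_s": 0.5},
    {"bank_river": 0.25, "water_s": 0.75},
]
SENSE_WORD = {
    "bank_fin": {"bank": 1.0},
    "bank_river": {"bank": 1.0},
    "money_s": {"money": 1.0},
    "water_s": {"water": 1.0},
}


def word_prob_lda(theta_d, phi, word):
    """p(word | d) = sum_k theta_d[k] * phi[k][word]."""
    return sum(w * phi[k].get(word, 0.0) for k, w in enumerate(theta_d))


def sense_scores(theta_d, psi, sense_word, word):
    """Unnormalized joint p(sense, word | d) for every sense."""
    scores = {}
    for k, topic_weight in enumerate(theta_d):
        for sense, sense_weight in psi[k].items():
            p = topic_weight * sense_weight * sense_word[sense].get(word, 0.0)
            scores[sense] = scores.get(sense, 0.0) + p
    return scores


def word_prob_with_senses(theta_d, psi, sense_word, word):
    """p(word | d) with the sense variable summed out."""
    return sum(sense_scores(theta_d, psi, sense_word, word).values())


def sense_posterior(theta_d, psi, sense_word, word):
    """p(sense | word, d): which sense a word most likely carries in d."""
    scores = sense_scores(theta_d, psi, sense_word, word)
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError(f"word {word!r} has zero probability in this document")
    return {sense: p / total for sense, p in scores.items()}


if __name__ == "__main__":
    finance_doc = [0.75, 0.25]  # topic mixture of a finance-heavy document
    nature_doc = [0.25, 0.75]   # topic mixture of a nature-heavy document
    print(sense_posterior(finance_doc, PSI, SENSE_WORD, "bank"))
    print(sense_posterior(nature_doc, PSI, SENSE_WORD, "bank"))
```

In this toy setup both formulations assign the same probability to the surface word "bank", but only the sense-layer variant can say which sense the word most likely carries in a given document, which is the kind of extra signal the abstract argues should help document clustering.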
