Constructing and utilizing wordnets using statistical methods

Authors: Gerard de Melo (1) demelo@mpi-inf.mpg.de
Gerhard Weikum (1) weikum@mpi-inf.mpg.de
Keywords: Lexical resources; WordNet; Machine learning
Journal: Language Resources and Evaluation
Publication year: 2012
Publication date: June 2012
Volume: 46
Issue: 2
Pages: 287-311
Full-text size: 552.3 KB
References: 1. Atserias, J., Climent, S., Farreres, X., Rigau, G., & Rodríguez, H. (1997). Combining multiple methods for the automatic construction of multilingual WordNets. In Proceedings of the international conference on recent advances in NLP 1997 (pp. 143–149).
2. Baker, C., & Fellbaum, C. (2008). Can wordnet and framenet be made “interoperable”? In Proceedings of the first international conference on global interoperability for language resources.
3. Benitez, L., Cervell, S., Escudero, G., Lopez, M., Rigau, G., & Taulé, M. (1998). Methods and tools for building the Catalan WordNet. In Proceedings of the ELRA workshop on language res. for Europ. Minority Lang., 1st international conference on language resources and evaluation.
4. Bentivogli, L., Forner, P., Magnini, B., & Pianta, E. (2004). Revising the WordNet domains hierarchy. In COLING 2004 multiling. Ling. Resources, Geneva, Switzerland (pp. 94–101).
5. Bizer, C., Heath, T., & Berners-Lee, T. (2009). Linked data—the story so far. International Journal on Semantic Web and Information Systems, 5(3), 1–22.
6. Buscaldi, D., & Rosso, P. (2008). Geo-wordnet: Automatic georeferencing of wordnet. In (ELRA) ELRA (Ed.), Proceedings of the 6th international language resources and evaluation (LREC’08), Marrakech, Morocco.
7. Chang, C. C., & Lin, C. J. (2001). LIBSVM: A library for support vector machines. URL http://www.csie.ntu.edu.tw/cjlin/libsvm.
8. Chen, H. H., Lin, C. C., & Lin, W. C. (2000). Construction of a Chinese-English WordNet and its application to CLIR. In Proceedings of the fifth international workshop on information retrieval with Asian languages, IRAL ’00 (pp. 189–196). New York, NY, USA: ACM Press.
9. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
10. Cycorp Inc. (2008). Opencyc. http://www.opencyc.org/.
11. Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Li, F. F. (2009). ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR 2009).
12. de Melo, G., & Siersdorfer, S. (2007). Multilingual text classification using ontologies. In G. Amati (Ed.), Proceedings of the 29th European conference on information retrieval (ECIR 2007). Springer, Rome, Italy, Lecture Notes in Computer Science, Vol. 4425.
13. de Melo, G., & Weikum, G. (2009). Towards a universal wordnet by learning from combined evidence. In Proceedings of the 18th ACM conference on information and knowledge management (CIKM 2009) (pp. 513–522). New York, NY, USA: ACM.
14. Fellbaum, C. (Ed.) (1998). WordNet: An electronic lexical database (Language, Speech, and Communication). Cambridge: The MIT Press.
15. Francopoulo, G., Declerck, T., Sornlertlamvanich, V., de la Clergerie, E., & Monachini, M. (2008). Data category registry: Morpho-syntactic and syntactic profiles. In Proceedings of the workshop on use and usage of language resource-related standards at the LREC 2008.
16. Gangemi, A., Navigli, R., & Velardi, P. (2003). The ontowordnet project: Extension and axiomatization of conceptual relations in wordnet. In On the move to meaningful internet systems 2003: CoopIS, DOA, and ODBASE (pp. 820–838).
17. Gurevych, I. (2005). Using the structure of a conceptual network in computing semantic relatedness. In Proceedings of the second international joint conference on natural language processing, IJCNLP, Jeju Island, Republic of Korea.
18. Gurevych, I., Müller, C., & Zesch, T. (2007). What to be? - Electronic career guidance based on semantic relatedness. In Proceedings of the 45th annual meeting of the association for computational linguistics, Association for Computational Linguistics, Prague, Czech Republic (pp. 1032–1039).
19. Harabagiu, S. M., Bunescu, R. C., & Maiorano, S. J. (2001). Text and knowledge mining for coreference resolution. In NAACL ’01: Second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies 2001, Association for Computational Linguistics, Morristown, NJ, USA (pp. 1–8).
20. Joachims, T. (1999). Making large-scale support vector machine learning practical. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods: Support vector machines. Cambridge, MA, USA: MIT Press.
21. Kipper, K., Dang, H. T., & Palmer, M. (2000). Class-based construction of a verb lexicon. In AAAI (pp. 691–696).
22. Knight, K. (1993). Building a large ontology for machine translation. In Proceedings of the workshop human language technology (pp. 185–190).
23. Kunze, C., & Lemnitzer, L. (2002). GermaNet—representation, visualization, application. In Proceedings of the LREC 2002 (pp. 1485–1491).
24. Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th annual international conference on systems documentation, SIGDOC ’86 (pp. 24–26). New York, NY, USA: ACM Press.
25. Lin, H. T., Lin, C. J., & Weng, R. C. (2007). A note on platt’s probabilistic outputs for support vector machines. Machine Learning, 68(3), 267–276.
26. Lyons, J. (1977). Semantics, Vol. 1. Cambridge: Cambridge University Press.
27. Miháltz, M., & Prószéky, G. (2004). Results and evaluation of Hungarian Nominal WordNet v1.0. In Proceedings of the second global WordNet conference. Brno, Czech Republic: Masaryk University.
28. Niles, I., & Pease, A. (2003). Linking lexicons and ontologies: Mapping WordNet to the suggested upper merged ontology. In Proceedings of the 2003 international conference information and knowledge engineering, Las Vegas, NV, USA.
29. Okumura, A., & Hovy, E. (1994). Building Japanese-English dictionary based on ontology for machine translation. In Proceedings of the workshop on human language technology (pp. 141–146).
30. Ordan, N., & Wintner, S. (2007). Hebrew WordNet: A test case of aligning lexical databases across languages. International Journal of Translation, 19(1), 39–58.
31. Patwardhan, S., Banerjee, S., & Pedersen, T. (2003). Using measures of semantic relatedness for word sense disambiguation. In Proceedings 4th international conference on computational linguistics and intelligent text processing (CICLing), Mexico City, Mexico.
32. Pianta, E., Bentivogli, L., & Girardi, C. (2002). MultiWordNet: Developing an aligned multilingual database. In Proceedings of the 1st international global WordNet conference, Mysore, India (pp. 293–302).
33. Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization (pp. 185–208). Cambridge, MA, USA: MIT Press.
34. Platt, J. C. (2000). Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers (pp. 61–74). Cambridge, MA, USA: MIT Press.
35. Reuters. (2000a). Reuters Corpus, Vol. 1: English language, 1996-08-20 to 1997-08-19. URL http://trec.nist.gov/data/reuters/reuters.html.
36. Reuters. (2000b). Reuters Corpus, Vol. 2: Multilingual, 1996-08-20 to 1997-08-19. http://trec.nist.gov/data/reuters/reuters.html.
37. Richter, F. (2007). Ding version 1.5. http://www-user.tu-chemnitz.de/~fri/ding/.
38. Rigau, G., & Agirre, E. (1995). Disambiguating bilingual nominal entries against WordNet. In Proceedings of the Workshop ‘The Computational Lexicon’ at European summer school logic, language & information.
39. Sathapornrungkij, P., & Pluempitiwiriyawej, C. (2005). Construction of Thai WordNet lexical database from machine readable dictionaries. In Proceedings of the 10th machine translation summit, Phuket, Thailand.
40. Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International conference on new methods in language processing, Manchester, UK.
41. Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
42. Suchanek, F. M., Kasneci, G., & Weikum, G. (2007). Yago: A Core of semantic knowledge. In 16th International World Wide Web Conference (WWW 2007). New York: ACM Press.
43. Tufiş, D., Ion, R., & Ide, N. (2004). Fine-grained word sense disambiguation based on parallel corpora, word alignment, word clustering and aligned wordnets. In COLING ’04: Proceedings of the 20th international conference on computational Linguistics, Association for Computational Linguistics, Morristown, NJ, USA (p. 1312).
44. Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley-Interscience.
45. Vossen, P. (Ed.) (1998). EuroWordNet: A multilingual database with lexical semantic networks. Berlin: Springer.
46. Zesch, T., & Gurevych, I. (2006). Automatically creating datasets for measures of semantic relatedness. In COLING/ACL 2006 workshop on linguistic distances, Sydney, Australia (pp. 16–24).
Author affiliation: 1. Max Planck Institute for Informatics, Campus E1 4, 66123 Saarbrücken, Germany
ISSN: 1574-0218

Abstract

Lexical databases following the wordnet paradigm capture information about words, word senses, and their relationships. A large number of existing tools and datasets are based on the original WordNet, so extending the landscape of resources aligned with WordNet leads to great potential for interoperability and to substantial synergies. Wordnets are being compiled for a considerable number of languages; however, most have yet to reach a comparable level of coverage. We propose a method for automatically producing such resources for new languages based on WordNet, and analyse the implications of this approach both from a linguistic perspective and with respect to natural language processing tasks. Our approach takes advantage of the original WordNet in conjunction with translation dictionaries. A small set of training associations is used to learn a statistical model for predicting associations between terms and senses. The associations are represented using a variety of scores that take into account structural properties as well as semantic relatedness and corpus frequency information. Although the resulting wordnets are imperfect in terms of their quality and coverage of language-specific phenomena, we show that they constitute a cheap and suitable alternative for many applications, both for monolingual tasks and for cross-lingual interoperability. Apart from analysing the resources directly, we conducted tests on semantic relatedness assessment and cross-lingual text classification, with very promising results.
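
The pipeline outlined in the abstract (candidate senses obtained through a translation dictionary, each term-sense pair described by scores, and a model learned from a small set of labelled associations) might be sketched roughly as follows. This is an illustration only, not the authors' implementation: the sense identifiers, the toy dictionary, the two features, and the choice of a tiny logistic regression are all assumptions made for the example, since the abstract does not specify the actual scores or learner.

```python
import math
from collections import defaultdict

# Hypothetical toy data: English word -> sense ids (stand-in for WordNet).
EN_SENSES = {
    "bank": ["bank.n.finance", "bank.n.river"],
    "shore": ["bank.n.river"],
    "financial institution": ["bank.n.finance"],
    "bench": ["bench.n.seat"],
}
# Hypothetical translation dictionary: German term -> English translations.
TRANSLATIONS = {
    "Ufer": ["bank", "shore"],
    "Bank": ["bank", "bench", "financial institution"],
}

def candidate_senses(term):
    """Map each candidate sense to the translations of `term` that carry it."""
    support = defaultdict(list)
    for en_word in TRANSLATIONS.get(term, []):
        for sense in EN_SENSES.get(en_word, []):
            support[sense].append(en_word)
    return dict(support)

def features(term, sense):
    """Two illustrative scores for a (term, candidate sense) pair."""
    support = candidate_senses(term)[sense]
    # Share of the term's translations that have this sense.
    share = len(support) / len(TRANSLATIONS[term])
    # Least ambiguous supporting translation (1 / number of its senses).
    unambiguous = max(1.0 / len(EN_SENSES[w]) for w in support)
    return [share, unambiguous]

def predict(model, x):
    """Estimated probability that the term-sense association holds."""
    weights, bias = model
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=2000, lr=0.5):
    """Plain stochastic gradient descent on the logistic loss."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in examples:
            error = predict((weights, bias), x) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias

# A small set of labelled training associations (assumed for illustration).
TRAINING = [("Ufer", "bank.n.river", 1), ("Ufer", "bank.n.finance", 0)]
model = train([(features(t, s), y) for t, s, y in TRAINING])

# Rank the candidate senses of an unseen term by predicted probability.
ranked = sorted(
    candidate_senses("Bank"),
    key=lambda s: predict(model, features("Bank", s)),
    reverse=True,
)
print(ranked)
```

In a realistic setting the two toy features would be replaced by the structural, semantic relatedness, and corpus frequency scores mentioned in the abstract, and predictions above a chosen probability threshold would be accepted as entries of the new wordnet.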
