Exploiting a Web-Based Encyclopedia as a Knowledge Base for the Extraction of Multilingual Terminology

Author: Fatiha Sadat (1) sadat.fatiha@uqam.ca
Keywords: terminology; comparable corpora; translation; Cross-Language Information Retrieval; linguistics-based information
Publication: Lecture Notes in Computer Science
Publication year: 2012
Volume: 7614
Issue: 1
Pages: 88-96
Full-text size: 375.1 KB
Author affiliation: 1. Université du Quebec à Montréal, 201 av. President Kennedy, Montréal, QC H3X 2Y3, Canada
ISSN: 1611-3349

Abstract

Multilingual linguistic resources are usually constructed from parallel corpora, but because such corpora are available only for selected text domains and language pairs, the potential of other resources is being explored as well. This article explores the idea of using multilingual web-based encyclopaedias such as Wikipedia as comparable corpora for bilingual terminology extraction. We propose an approach that extracts terms and their translations from different types of Wikipedia link information and data. The next step is to use linguistics-based information to re-rank and filter the extracted term candidates in the target language. Preliminary evaluations of the combined statistics-based and linguistics-based approaches were carried out on different language pairs involving Japanese, French and English. These evaluations showed a clear improvement and good quality of the extracted term candidates for building or enriching multilingual anthologies and dictionaries, or for feeding a cross-language information retrieval system with expansion terms related to the source query.