Contemporaneous text as side-information in statistical language modeling

Details

Authors: Khudanpur, Sanjeev ; Kim, Woosung
Keywords: Multi-lingual processing ; Statistical language modeling ; Automatic speech recognition ; Resource-deficient languages ; Lexical triggers ; Maximum entropy
Journal: Computer Speech & Language
Year: 2004
Issue: 2
Source: Elsevier
Type: Journal

Abstract

We propose new methods to exploit contemporaneous text, such as on-line news articles, to improve language models for automatic speech recognition and other natural language processing applications. In particular, we investigate the use of text from a resource-rich language to sharpen language models for processing a news story or article in a language with scarce linguistic resources. We demonstrate that even with fairly crude cross-language information retrieval and simple machine translation, one can construct story-specific Chinese language models which exploit cues from a side-corpus of English newswire to significantly improve the performance of language models estimated from a static Chinese corpus. Our investigations cover cases when the amount of available Chinese text is small, and a case when a large Chinese text corpus is available. We examine the effectiveness of our techniques both when the side-corpus contains English documents that are near-translations of the Chinese documents being processed, and when the English side-corpus is merely from contemporaneous and independent news sources. We present experimental results for automatic transcription of speech from the Mandarin Broadcast News corpus.