A compression algorithm using integrated record information for translation dictionaries
Abstract
The trie is a well-known structure for retrieving keys from natural language (NL) dictionaries used in morphological analysis, machine translation, and similar tasks. As a variety of NL processing systems have been developed, the dictionaries stored on a single computer's hard disk have come to share a large amount of common information. This paper presents a method for merging individual dictionaries into one generalized dictionary. Merging reduces the total dictionary size and makes the information in each individual dictionary available to other applications. The merged dictionary contains many long strings, such as compound words and idioms, which take up much space when a huge key set is stored in a trie. A fast trie structure, called the double-array structure, is therefore introduced for key retrieval, and a compression scheme is proposed that replaces long strings with the corresponding leaf node numbers of the trie. Although each merged record becomes larger, the total number of records decreases sharply because common information is merged. Experimental results on nine dictionaries show that the new method is more efficient than previous ones.
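
The merging and string-replacement ideas described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: a plain nested-dict trie stands in for the double-array structure, and the names `Trie`, `merge`, `intern`, and the two sample dictionaries are all hypothetical. It assumes one possible reading of the scheme, in which a record field holding a long string is stored as the leaf number of that string in the shared trie.

```python
class Trie:
    """Nested-dict trie (a simple stand-in, not a double-array)."""

    def __init__(self):
        self.root = {}
        self.keys = []  # leaf number -> key string

    def insert(self, key):
        """Insert key if absent and return its leaf number."""
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        if None not in node:  # None marks the end of a key
            node[None] = len(self.keys)
            self.keys.append(key)
        return node[None]

    def lookup(self, key):
        """Return the leaf number of key, or None if key is not stored."""
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return None
        return node.get(None)


def merge(dictionaries, intern):
    """Merge several dictionaries into one trie plus one record table.

    dictionaries: {dictionary name: {key: value}}
    intern: names of dictionaries whose values are long strings; those
            values are replaced by leaf numbers (an assumption of this sketch).
    """
    trie = Trie()
    records = {}  # leaf number -> {dictionary name: value or leaf number}
    for name, entries in dictionaries.items():
        for key, value in entries.items():
            leaf = trie.insert(key)
            if name in intern:
                value = trie.insert(value)  # store a leaf number, not the string
            records.setdefault(leaf, {})[name] = value
    return trie, records


# Hypothetical sample data: a morphological and a synonym dictionary.
dictionaries = {
    "morph": {"machine translation": "noun", "dictionary": "noun"},
    "synonym": {"MT": "machine translation", "dictionary": "lexicon"},
}
trie, records = merge(dictionaries, intern={"synonym"})

leaf = trie.lookup("dictionary")
print(records[leaf])                        # one merged record for both dictionaries
print(trie.keys[records[leaf]["synonym"]])  # leaf number resolved back to the string
```

In this sketch the key "dictionary" appears in both inputs but yields a single record, which grows by one field; the long string "machine translation" is stored once in the trie and referenced elsewhere by its leaf number.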
