Hybrid method for modeless Japanese input using N-gram based binary classification and dictionary
  • Authors: Yukino Ikegami ; Setsuo Tsuruta
  • Keywords: Multilingual documents ; Modeless Japanese input
  • Journal: Multimedia Tools and Applications
  • Publication year: 2015
  • Publication date: June 2015
  • Volume: 74
  • Issue: 11
  • Pages: 3933-3946
  • Full-text size: 333 KB
  • Affiliations: Yukino Ikegami (1)
    Setsuo Tsuruta (1)

    1. Tokyo Denki University, 2-1200 Muzai Gakuendai, Inzai-shi, Chiba, Japan
  • Journal category: Computer Science
  • Journal subjects: Multimedia Information Systems
    Computer Communication Networks
    Data Structures, Cryptology and Information Theory
    Special Purpose and Application-Based Systems
  • Publisher: Springer Netherlands
  • ISSN:1573-7721
Abstract
The rapid growth of globalization requires handling a large number of multilingual documents, in which Japanese text coexists with English and other languages that use the Roman alphabet. Conventional Japanese input methods require users to switch the input mode between Japanese and the Latin alphabet. Existing modeless Japanese input methods address this by switching the input mode automatically. However, they need to be trained on a large amount of text data to perform well. This paper proposes a hybrid modeless Japanese input method that uses a non-Japanese word dictionary together with n-gram character sequence features to decide whether to convert the input and switch to Kana input. The non-Japanese word dictionary is intended to reduce false positives on non-Japanese words; it is compiled from text data available on the Web. The n-gram-based discriminative model is learned by a Support Vector Machine from a balanced corpus that contains texts from various domains. The evaluation shows that the method's accuracy, measured by F-measure for the prediction of non-Kana characters, improves by 7.7 % compared with the n-gram-only method. In addition, in a test with real users, the average rating for input time was on the "agree" side for our method, versus the "disagree" side for the conventional Japanese input method, which requires switching the input mode.
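The two-stage decision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the toy dictionary, and the hand-picked weights are all hypothetical, and a plain linear score over character n-grams stands in for the trained Support Vector Machine.

```python
from typing import Dict, List, Set


def char_ngrams(text: str, n_max: int = 3) -> List[str]:
    """Return all character n-grams of length 1..n_max, with boundary markers."""
    padded = f"^{text.lower()}$"
    return [
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def linear_score(text: str, weights: Dict[str, float], bias: float = 0.0) -> float:
    """Decision value of a linear model over n-gram counts.

    A positive value means "looks like romanized Japanese". The weights are
    an assumption for illustration; a real system would learn them from a corpus.
    """
    return bias + sum(weights.get(gram, 0.0) for gram in char_ngrams(text))


def should_convert_to_kana(
    token: str,
    non_japanese_words: Set[str],
    weights: Dict[str, float],
    bias: float = 0.0,
) -> bool:
    """Hybrid decision: dictionary lookup first, n-gram classifier second."""
    # Stage 1: a known non-Japanese word is never converted, which guards
    # against false positives on words that happen to look like romaji.
    if token.lower() in non_japanese_words:
        return False
    # Stage 2: otherwise fall back to the n-gram based binary classifier.
    return linear_score(token, weights, bias) > 0.0


# Toy data, invented for this example only.
TOY_DICTIONARY = {"made", "internet"}
TOY_WEIGHTS = {"shi": 1.0, "ma": 0.8, "de": 0.5, "th": -1.5, "ck": -1.5}
```

With this toy data, `should_convert_to_kana("watashi", TOY_DICTIONARY, TOY_WEIGHTS)` is true and `should_convert_to_kana("the", TOY_DICTIONARY, TOY_WEIGHTS)` is false. The English word "made" scores positively on n-grams alone, so only the dictionary lookup keeps it in Latin script.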