Post OCR Correction of Swedish Patent Text
  • Authors: Linda Andersson (17)
    Helena Rastas (18)
    Andreas Rauber (17)
  • Keywords: Optical character recognition (OCR); error correction algorithm; manual error correction
  • Publication: Lecture Notes in Computer Science
  • Year of publication: 2014
  • Volume: 8849
  • Issue: 1
  • Pages: 1-9
  • Full text size: 202 KB
  • Affiliations: Linda Andersson (17)
    Helena Rastas (18)
    Andreas Rauber (17)

    17. Vienna University of Technology, Austria
    18. Uppdragshuset AB, Sweden
  • ISSN:1611-3349
Abstract
The purpose of this paper is to compare two basic post-processing algorithms for correcting optical character recognition (OCR) errors in Swedish text. One is based on language knowledge and manual correction (the lexical filter); the other is a generic algorithm that uses limited language knowledge to perform corrections (the generic filter). Both methods aim to improve the quality of OCR-generated Swedish patent text. Tests are conducted on 7,721 randomly selected patent claims generated by different OCR software tools. The OCR-generated texts and the automatically corrected texts (from the lexical or generic filter) are compared with manually corrected ground truth. The preliminary results indicate that the OCR tools are biased toward different characters when generating text, and that the language knowledge used in post-correction influences the final results.