Studying machine translation technologies for large-data CLIR tasks: a patent prior-art search case study
  • Authors: Walid Magdy (1)
    Gareth J. F. Jones (2)
  • Keywords: Cross-language patent retrieval; Prior-art patent search; Cross-language information retrieval; Large-data CLIR; Machine translation
  • Journal: Information Retrieval
  • Publication year: 2014
  • Publication date: October 2014
  • Volume: 17
  • Issue: 5-6
  • Pages: 492-519
  • Full-text size: 3,636 KB
  • Author affiliations: Walid Magdy (1)
    Gareth J. F. Jones (2)

    1. Qatar Computing Research Institute, Qatar Foundation, Doha, Qatar
    2. Centre of Next Generation Localization, School of Computing, Dublin City University, Dublin 9, Ireland
  • ISSN:1573-7659
Abstract
Prior-art search in patent retrieval is concerned with finding all existing patents relevant to a patent application. Since patents often appear in different languages, cross-language information retrieval (CLIR) is an essential component of effective patent search. In recent years machine translation (MT) has become the dominant approach to translation in CLIR. Standard MT systems focus on generating proper translations that are morphologically and syntactically correct. Development of effective MT systems of this type requires large training resources and high computational power for training and translation. This is an important issue for patent CLIR, where queries are typically very long, sometimes taking the form of a full patent application, meaning that query translation using MT systems can be very slow. However, in contrast to MT, the focus for information retrieval (IR) is on the conceptual meaning of the search words, regardless of their surface form or the linguistic structure of the output. Thus much of the complexity of MT is not required for effective CLIR. We present an adapted MT technique specifically designed for CLIR. In this method, IR text pre-processing in the form of stop word removal and stemming is applied to the MT training corpus prior to the training phase. Applying this step leads to a significant decrease in the computational and training resource requirements of MT. Experimental application of the new approach to the cross-language patent retrieval task from CLEF-IP 2010 shows the new technique to be up to 23 times faster than standard MT for query translation, while maintaining IR effectiveness statistically indistinguishable from standard MT when large training resources are used. Furthermore, the new method is significantly better than standard MT when only limited translation training resources are available, which can be a significant issue for translation in specialized domains.
The new MT technique also enables patent document translation in a practical amount of time with a resulting significant improvement in the retrieval effectiveness.
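
The pre-processing step described in the abstract (stop word removal and stemming applied to the MT training corpus before training) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the stop-word lists, the naive suffix-stripping `stem` function, the English/French language pair, and all function names are hypothetical stand-ins for whatever full stop-word lists and stemmers a real system would use.

```python
import re

# Hypothetical, deliberately tiny stop-word lists (assumption for illustration);
# a real system would use complete lists for each language.
STOP_WORDS = {
    "en": {"the", "a", "an", "of", "is", "for", "to", "and", "in"},
    "fr": {"le", "la", "les", "un", "une", "de", "des", "du", "est", "pour", "et"},
}

# Naive suffix stripping as a stand-in for a real stemmer (assumption).
SUFFIXES = {
    "en": ("ing", "ed", "es", "s"),
    "fr": ("ements", "ement", "es", "s", "e"),
}

TOKEN_RE = re.compile(r"\w+")


def stem(token, lang):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suffix in SUFFIXES[lang]:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(sentence, lang):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = TOKEN_RE.findall(sentence.lower())
    return [stem(t, lang) for t in tokens if t not in STOP_WORDS[lang]]


def preprocess_parallel(pairs, src_lang, tgt_lang):
    """Apply IR pre-processing to both sides of a parallel corpus.

    Sentence pairs in which either side becomes empty are dropped,
    since they carry nothing for the translation model to learn from.
    """
    processed = []
    for src, tgt in pairs:
        src_tokens = preprocess(src, src_lang)
        tgt_tokens = preprocess(tgt, tgt_lang)
        if src_tokens and tgt_tokens:
            processed.append((" ".join(src_tokens), " ".join(tgt_tokens)))
    return processed


if __name__ == "__main__":
    corpus = [
        ("The translating of patents", "la traduction des brevets"),
        ("the of", "de la"),
    ]
    for src, tgt in preprocess_parallel(corpus, "en", "fr"):
        print(src, "|||", tgt)
```

The output of such a step would then be passed to the MT training pipeline in place of the raw corpus. Queries would presumably need to go through the same `preprocess` function before translation, so that they match the vocabulary the translation model was trained on.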
