Over the past decade, patent mining has flourished. Patent search and retrieval used to be studied mainly by the database community, but in recent years the focus has shifted to the Natural Language Processing (NLP) and Information Retrieval (IR) communities. This progress can be attributed to two factors: the boom in statistical machine learning, which has supplied new methodology for patent mining and NLP tasks, and advances in NLP and IR technology themselves. International patent evaluation campaigns and workshops provide a forum in which researchers and practitioners from the relevant communities can share ideas, approaches, perspectives, and experiences from their work in progress.
     In this work, we study the content characteristics of patent text and the data statistics of several patent corpora, and then analyse the difficulties of the patent mining task. From an NLP perspective, patent mining poses several problems: (1) The patent corpus is huge, containing several million patent documents; (2) Patent text covers all technology domains, and overlap between domains is common, which hampers text classification; (3) The distribution of patent text over the International Patent Classification (IPC) system is imbalanced, and the training data for the main classes is insufficient; (4) The patent classification system differs from that of traditional text classification; in particular, the IPC system has a very large number of classes organized in a hierarchy; (5) A patent text carries multiple classification labels.
     This dissertation addresses the main problems of the patent mining task and investigates techniques for improving the performance of a patent mining system. Building on previous work, we propose several models and methods for the task, focusing on the following issues:
     (1) Treating the patent mining task within an NLP-based text classification framework.
     (2) Using inverted indexing, a common technique in the Information Retrieval community, to speed up text classification.
     (3) Proposing a class clustering method to alleviate the data imbalance problem.
     (4) Applying several similarity calculation methods to the patent mining task.
     (5) Proposing several ranking methods for the class decision process, in particular a method based on a log-linear model and a system combination method based on the Rank-SVM model.
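
     To make items (2) and (4) concrete, the following is a minimal sketch, not the system described in this dissertation. It assumes a k-nearest-neighbour classifier over TF-IDF vectors with cosine similarity, where an inverted index restricts scoring to training patents that share at least one term with the query. The function names, the toy documents, the IPC-style labels attached to them, and the choice of k are all illustrative assumptions.

```python
from collections import Counter, defaultdict
import math


def tf_idf_vectors(docs):
    """Turn tokenized documents into L2-normalized TF-IDF vectors (dicts)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF; this particular formula is an assumption of the sketch.
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        vec = {t: f * idf[t] for t, f in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors, idf


def build_inverted_index(vectors):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def rank_classes(query_tokens, index, idf, labels, k=3):
    """Rank classes by the summed cosine similarity of the k nearest patents.

    Only documents found in the postings lists of the query terms are
    scored, which is where the inverted index saves work.
    """
    tf = Counter(t for t in query_tokens if t in idf)
    q = {t: f * idf[t] for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in q.values())) or 1.0
    sims = defaultdict(float)
    for term, w in q.items():
        for doc_id, doc_w in index.get(term, ()):
            sims[doc_id] += (w / norm) * doc_w
    nearest = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:k]
    class_scores = defaultdict(float)
    for doc_id, sim in nearest:
        for label in labels[doc_id]:  # a patent may carry several labels
            class_scores[label] += sim
    return sorted(class_scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy training set; documents and labels are made up for illustration.
train_docs = [
    "engine fuel injection valve".split(),
    "fuel pump engine cylinder".split(),
    "neural network image recognition".split(),
    "image sensor pixel array".split(),
]
train_labels = [["F02M"], ["F02M", "F04B"], ["G06N"], ["H04N"]]

vectors, idf = tf_idf_vectors(train_docs)
index = build_inverted_index(vectors)
ranking = rank_classes("fuel injection engine".split(), index, idf,
                       train_labels, k=2)
print(ranking)
```

     On this toy data the query shares terms only with the first two patents, so the other two are never scored, and F02M is ranked ahead of F04B. A log-linear or Rank-SVM ranking step such as the one named in item (5) would replace the simple summed-similarity score used here.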
     All of the research in this work is based on the Patent Mining evaluation task of NTCIR-7; we build a reliable patent mining system using U.S. patents and English translations of Japanese patent data.
