Research on Patent Mining Techniques Based on Content Analysis

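English title: Content Analysis Based Patent Mining Research
Author: Cao Feifei (曹菲菲)
Degree level: Master's
Discipline: Computer Software and Theory
Chinese keywords: patent mining; text classification; similarity calculation; decision techniques
English keywords: Patent Mining; text classification; similarity calculation; Ranking
Degree year: 2008
Advisor: Zhu Jingbo (朱靖波)
Discipline code: 081202
Degree-granting institution: Northeastern University (东北大学)
Thesis submission date: 2008-06-01
Chair of the defense committee: Yao Tianshun (姚天顺)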

Abstract

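Over the past decade or so, patent mining has attracted growing research attention. Earlier patent research relied mainly on patent databases; in recent years it has shifted toward techniques from natural language processing (NLP) and information retrieval (IR). Two main factors have driven the development of patent mining. First, statistical machine learning methods have been steadily developed and refined, providing a powerful methodology for patent mining and natural language processing. Second, advances in NLP and IR techniques have promoted the development of patent text mining. In addition, patent mining evaluation campaigns have provided a platform for technical exchange, advanced research on patent mining, and pointed out directions for patent text processing.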
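     This thesis studies the characteristics of patent text, collects statistics on different training corpora, and analyzes the difficulties of the patent mining task. Patent mining based on natural language processing faces several major problems: (1) patent mining is a large-scale text analysis task; (2) patent texts cover every field of technology, and the heavy overlap between fields hinders text classification; (3) patent texts are unevenly distributed across fields, and a large number of categories have insufficient training data; (4) the patent classification scheme differs from traditional classification schemes, and the International Patent Classification (IPC) in particular has a very large, multi-level category space; (5) patents carry multiple IPC labels, so patent classification is a multi-label classification problem. These problems distinguish patent text processing from traditional text processing.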
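     Focusing on these problems, this thesis investigates several ways to improve the performance of a patent mining system. Building on previous work and combining techniques from several fields, the author proposes a number of processing techniques for patent mining. The main techniques used in this thesis are:
     (1) A classification system framework based on natural language processing is adopted to handle the patent mining task.
     (2) For classification over large-scale data, the inverted index, a retrieval technique widely used in information retrieval, is applied to the classification model to speed up its computation.
     (3) A category merging method is proposed to address imbalanced data distribution. Under the IPC system a large number of categories contain very few samples; several merging methods are used to aggregate small categories into larger ones.
     (4) Computing the similarity between texts is a key step in patent mining. Several similarity measures are compared; on tasks where the data come from different sources, BM25 performs well and is relatively stable.
     (5) Several decision methods for category ranking are proposed. The classifier outputs similarities between samples, which must be converted by some mechanism into a ranking of category labels. This thesis proposes a similarity summation method that uses category information and a linear combination method based on a log-linear model to rank categories; experiments show that both methods perform well.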
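     The work is based on the NTCIR-7 Patent Mining evaluation task. A patent classification system is implemented on U.S. patents and English translations of Japanese patents, extensive experiments are carried out on the main problems and core techniques of patent mining, and the results are analyzed in detail. Finally, the most reliable system configuration for the patent mining task is determined.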
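     As an illustration of points (2), (4) and (5), the following minimal Python sketch shows one way an inverted index, BM25 scoring and similarity summation over nearest neighbours could fit together. It is not the system described in the thesis: the function names, the BM25 parameters k1 = 1.2 and b = 0.75, the neighbourhood size k, and the toy documents and IPC-style labels are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index: term -> list of (doc_id, term_frequency)."""
    index = defaultdict(list)
    lengths = []
    for doc_id, tokens in enumerate(docs):
        lengths.append(len(tokens))
        for term, tf in Counter(tokens).items():
            index[term].append((doc_id, tf))
    return index, lengths


def bm25_scores(query, index, lengths, k1=1.2, b=0.75):
    """Score only the training documents that share a term with the query."""
    n = len(lengths)
    avg_len = sum(lengths) / n
    scores = defaultdict(float)
    for term in set(query):
        postings = index.get(term, [])
        if not postings:
            continue
        df = len(postings)
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings:
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return scores


def rank_categories(query, index, lengths, labels, k=3):
    """Sum the BM25 scores of the k nearest neighbours for each of their labels."""
    scores = bm25_scores(query, index, lengths)
    neighbours = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    totals = defaultdict(float)
    for doc_id, score in neighbours:
        for label in labels[doc_id]:  # multi-label: one document may vote for several classes
            totals[label] += score
    return sorted(totals, key=totals.get, reverse=True)


# Toy training set (hypothetical): tokenised texts with illustrative IPC subclass labels.
docs = [
    ["engine", "fuel", "injection", "valve"],
    ["engine", "piston", "cylinder"],
    ["image", "sensor", "pixel", "array"],
    ["image", "compression", "pixel"],
]
labels = [["F02M"], ["F02B"], ["H04N"], ["H04N", "G06T"]]

index, lengths = build_index(docs)
query = ["fuel", "engine", "valve"]
ranking = rank_categories(query, index, lengths, labels)
print(ranking)
```

     Because scoring walks only the posting lists of the query terms, training documents that share no term with the query are never touched, which is where the speed-up on a large corpus would come from.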
Over the past decade, patent mining has flourished. In the past, work on patent search and retrieval came mainly from the database community; in recent years it has come from the Natural Language Processing (NLP) and Information Retrieval (IR) communities. The progress of patent mining can be attributed to two factors: the boom in statistical machine learning, which has provided new methodology for patent mining and NLP tasks, and advances in NLP and IR technology. International patent evaluation campaigns and workshops provide a forum in which researchers and practitioners from relevant communities can share their ideas, approaches, perspectives, and experiences from their work in progress.
     In this thesis, we study the content characteristics of patent text and collect statistics on different patent corpora. We then analyze the difficulties of the patent mining task. NLP-based patent mining faces several problems: (1) the patent corpus is huge, containing several million patent documents; (2) patent text covers all technology domains, and overlap between domains is common, which hinders text classification; (3) patent text is unevenly distributed over the International Patent Classification (IPC) system, and training data are insufficient for many classes; (4) the patent classification system differs from those of traditional text classification, and the IPC system in particular has a very large, hierarchical set of classes; (5) patent text carries multiple class labels.
     This dissertation focuses on resolving the main problems of the patent mining task and on techniques that improve the performance of a patent mining system. Building on previous work, we propose several models and methods for the task. We focus on the following issues:
     (1) Using an NLP-based text classification framework to handle the patent mining task.
     (2) Using inverted indexing, a common technique in the Information Retrieval community, to speed up text classification.
     (3) Proposing a class clustering method to mitigate the data imbalance problem.
     (4) Applying several similarity calculation methods to the patent mining task.
     (5) Proposing several ranking methods for the class decision process, in particular a method based on a log-linear model and a system combination method based on the Rank-SVM model.
     All the research in this thesis is based on the Patent Mining evaluation task of NTCIR-7, and we build a reliable system for the task using U.S. patents and English translations of Japanese patents.
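     The class clustering idea in point (3) can be illustrated with a small Python sketch. The abstract does not say how classes are merged, so the rule used here is purely an assumption: any class with fewer than `min_count` training samples is folded into its IPC subclass, taken to be the first four characters of the label. The function name, the threshold and the sample labels are hypothetical.

```python
from collections import Counter


def merge_small_classes(labels, min_count, parent_len=4):
    """Map every label with fewer than min_count samples to its prefix (assumed parent class)."""
    counts = Counter(labels)
    mapping = {
        label: label if count >= min_count else label[:parent_len]
        for label, count in counts.items()
    }
    merged = [mapping[label] for label in labels]
    return merged, mapping


# Toy label list (hypothetical): one main-group style label per training sample.
sample_labels = ["A61K31", "A61K31", "A61K31", "A61K35", "A61K38", "B60R21"]
merged_labels, label_map = merge_small_classes(sample_labels, min_count=2)
print(Counter(merged_labels))
```

     In this toy run the two rare A61K groups are pooled into one class with two samples, while the lone B60R21 sample moves to B60R but remains a singleton, so a prefix rule alone does not guarantee a minimum class size.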

