System Integration and Construction of Science and Technology Policy Database (科技政策库的系统集成与建设)
  • Title: System Integration and Construction of Science and Technology Policy Database
  • Authors: WU Hong (武虹); YANG Bao-Long (杨宝龙); DU Zhi-Gao (杜治高); LI Han-Lu (李涵露)
  • Affiliations: National Academy of Innovation Strategy, China Association for Science and Technology (中国科协创新战略研究院); Beihang University (北京航空航天大学)
  • Keywords: science and technology policy database; web crawler; data cleaning; machine learning; natural language processing
  • Journal code: XTYY
  • Journal: Computer Systems & Applications (计算机系统应用)
  • Publication date: 2019-07-15
  • Publisher: Computer Systems & Applications (计算机系统应用)
  • Year: 2019
  • Issue: v.28
  • Language: Chinese
  • Page: XTYY201907009
  • Page count: 7
  • CN: 07
  • ISSN: 11-2854/TP
  • Classification number: 62-68
Abstract
To meet the needs of science and technology policy research, the China Association for Science and Technology designed and implemented a science and technology policy database system. This paper first introduces the overall design and workflow of the database, then describes its components in detail. The system consists of three subsystems: data acquisition, data cleaning, and data analysis. The data acquisition subsystem uses the Scrapy web-crawling framework to build manageable crawlers for a large number of heterogeneous sites, and uses ABBYY FineReader (document recognition software published by the Russian company ABBYY) to perform OCR (Optical Character Recognition) on historical documents and load them into the database. The data cleaning subsystem uses machine learning algorithms to implement data deduplication, identification of irrelevant data, and detection of defective data attributes. The data analysis subsystem performs text classification, association analysis, and full-text search on the policies that pass cleaning. Since its launch in October 2018, the system has collected 564,749 records from 226 data sources, of which 404,083 were stored after cleaning, providing strong support for science and technology policy research.
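The abstract does not name the deduplication algorithm used by the data cleaning subsystem. The record's reference list does cite Charikar's similarity-estimation paper, which is the basis of SimHash, so the following is a minimal sketch of near-duplicate removal under the assumption that a SimHash-style fingerprint is used. The function names, the 64-bit width, the word tokenizer, and the distance threshold of 3 are all illustrative choices, not details taken from the paper.

```python
import hashlib
import re


def tokens(text):
    """Split text into lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=64):
    """Return a SimHash fingerprint; similar texts tend to differ in few bits."""
    weights = [0] * bits
    for tok in tokens(text):
        # Hash each token to a 64-bit integer with BLAKE2b (8-byte digest).
        digest = hashlib.blake2b(tok.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(docs, max_distance=3):
    """Keep the first document of every group of near-duplicates.

    max_distance is a hypothetical threshold; a real system would tune it.
    """
    kept = []  # (fingerprint, document) pairs retained so far
    for doc in docs:
        fp = simhash(doc)
        if all(hamming(fp, seen) > max_distance for seen, _ in kept):
            kept.append((fp, doc))
    return [doc for _, doc in kept]
```

Calling `deduplicate(records)` on a list of crawled policy texts returns the list with later near-duplicates dropped. This sketch compares every new fingerprint against all retained ones, which is quadratic; at the scale of hundreds of thousands of records a production system would index fingerprints by bit blocks instead. The word tokenizer also treats an unsegmented Chinese sentence as a single token, so Chinese policy texts would need a word segmenter first.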
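References
1. Fan Chunliang, Ma Xiaoliang. The development of the science of science policy in the United States and its implications for China. China Soft Science, 2013, (10): 168–181. [doi:10.3969/j.issn.1002-9753.2013.10.016]
2. Xiao Xiaoxi, Yang Guoliang, Li Xiaoxuan. The US science of science policy (SoSP) and its implications for China. Studies in Science of Science, 2011, 29(7): 961–964.
3. NSTC, OSTP. The science of science policy: A federal research roadmap. Washington: The White House, 2008.
4. Fan Chunliang. Ideas and practice of the science of science policy. Studies in Science of Science, 2014, 32(11): 1601–1607. [doi:10.3969/j.issn.1003-2053.2014.11.001]
5. Chen Guang, Fang Xin. On the methodology of the science of science policy. Studies in Science of Science, 2014, 32(3): 321–326. [doi:10.3969/j.issn.1003-2053.2014.03.001]
6. Fan Chunliang. The knowledge composition and system of the science of science policy. Studies in Science of Science, 2017, 35(2): 161–169. [doi:10.3969/j.issn.1003-2053.2017.02.001]
7. Li Yanping, Wu Shaotang, Gao Fei, et al. Evolution, review, and trends of China's research funding management policies since the reform and opening-up: A content analysis of policy texts. Studies in Science of Science, 2009, 27(10): 1441–1447, 1453.
8. Xu Xiang, Nie Ming. A review of research on China's science and technology innovation policy. Science & Technology Progress and Policy, 2005, 22(11): 178–180. [doi:10.3969/j.issn.1001-7348.2005.11.066]
9. Li Meng. New thoughts on the development of China's science and technology intelligence work in the big data era. China Soft Science, 2016, (12): 1–4. [doi:10.3969/j.issn.1002-9753.2016.12.001]
10. Fan Yuhao. Design and implementation of a distributed web crawler system based on Scrapy [Master's thesis]. Chengdu: University of Electronic Science and Technology of China, 2018.
11. Charikar MS. Similarity estimation techniques from rounding algorithms. Proceedings of the 34th Annual ACM Symposium on Theory of Computing. Montreal, Quebec, Canada. 2002. 380–388.
12. Bin L, Yuan GY. Improvement of TF-IDF algorithm based on Hadoop framework. Proceedings of the 2nd International Conference on Computer Application and System Modeling. Paris, France. 2012. 391–393.
13. Wang Jichuan, Guo Zhigang. Logistic Regression Models: Methods and Applications. Beijing: Higher Education Press, 2001.
14. Agrawal R, Srikant R. Fast algorithms for mining association rules. Proceedings of the 20th International Conference on Very Large Data Bases. Santiago, Chile. 1994. 487–499.
15. Zhao Chen. Research and application of association rule mining algorithms [Master's thesis]. Xi'an: Xidian University, 2011.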
