Research and Implementation of a Birth Defect Early Warning System Based on Association Rule Mining

English title: Research and Implementation of Birth Defect Early Warning System Based on Association Rules
Author: Zhao Jialu (赵佳璐)
Degree level: Master's
Discipline: Computer Science and Technology
Keywords: association rules; algorithm; constraints; birth defects; data preprocessing
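Degree year: 2013
Supervisor: Yang Jun (杨俊)
Discipline code: 0812
Degree-granting institution: Beijing University of Posts and Telecommunications
Thesis submission date: 2012-12-30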

Abstract

The incidence of birth defects in China has risen year by year and poses a growing threat to sustainable human development and to social and economic development. Association rule mining, a data mining technique, can identify the pathogenic factors associated with birth defects and thereby support their prevention. Traditional association rule mining algorithms, however, are slow and produce redundant rules, and they cannot be applied directly to distributed, numeric medical data. To address these two challenges, this thesis presents an exploratory study of association rule mining methods for medical data. The topic comes from the "Eleventh Five-Year Plan" National Science and Technology Support Program project "Research on Key Technologies for a Secure and Trusted Carrier-Grade Reproductive Health Operation Support System". The main problem addressed is how to mine the factors related to birth defects from the more than 1.6 million family records collected, so as to achieve early warning.
     The work of the thesis covers four aspects: 1. It studies the theory of association rule mining, including its basic concepts and classification, and examines and compares the two most influential algorithms, Apriori and FP-growth. 2. It proposes ACARMT, a new algorithm that introduces user-interest constraints into association rule mining, addressing the long running time and rule redundancy of existing algorithms. 3. It designs a preprocessing model for medical data. The model integrates distributed data and defines data transformation rules that convert the large volume of source data into intermediate data suitable for direct mining, which solves the problem that medical data cannot be mined for association rules directly. 4. It designs and implements a birth defect early warning system that mines the pathogenic factors of birth defects and issues real-time warnings for suspicious records.
     The main contributions are a new constraint-based association rule mining algorithm, ACARMT, which improves mining efficiency and the relevance of the mined rules, and a data preprocessing model for medical data mining, which allows massive medical data sets to be mined with the new algorithm. Finally, ACARMT and the preprocessing model are applied in the design and implementation of the birth defect early warning system. Mining association rules from the more than one million records collected by the "National Free Pre-pregnancy Eugenic Health Examination Information Service Management Platform" verifies the effectiveness of the algorithm and the model and ultimately realizes birth defect early warning.
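
The abstract names two technical ideas without giving details: converting numeric medical records into items that can be mined, and restricting mining to rules the user cares about. The Python sketch below illustrates both in a generic way. It is not ACARMT, whose internals the abstract does not describe. The toy records, the age cut-off of 35, the thresholds, and the names `to_items`, `support` and `constrained_rules` are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical toy records: (maternal_age, folic_acid_taken, outcome).
RECORDS = [
    (38, False, "defect"),
    (36, False, "defect"),
    (27, True, "healthy"),
    (29, True, "healthy"),
    (37, True, "healthy"),
    (25, False, "defect"),
]


def to_items(age, folic_acid, outcome):
    """Turn one numeric/boolean record into a set of categorical items."""
    age_item = "age>=35" if age >= 35 else "age<35"  # assumed cut-off
    folic_item = "folic_acid=yes" if folic_acid else "folic_acid=no"
    return frozenset({age_item, folic_item, f"outcome={outcome}"})


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def constrained_rules(transactions, target, min_support, min_confidence):
    """Level-wise (Apriori-style) search for rules X -> target only.

    The constraint is that the consequent is always `target`, so itemsets
    that do not lead to the target are never counted.
    """
    items = sorted({i for t in transactions for i in t if i != target})
    level = [frozenset([i]) for i in items]
    rules = []
    while level:
        frequent = []
        for x in level:
            s = support(x | {target}, transactions)
            if s == 0 or s < min_support:
                continue  # no superset of x can be frequent with the target
            frequent.append(x)
            confidence = s / support(x, transactions)
            if confidence >= min_confidence:
                rules.append((x, s, confidence))
        # Join frequent antecedents into candidates that are one item larger,
        # keeping only candidates whose smaller subsets are all frequent.
        size = len(level[0]) + 1
        known = set(frequent)
        candidates = {
            a | b
            for a, b in combinations(frequent, 2)
            if len(a | b) == size
            and all(frozenset(c) in known for c in combinations(a | b, size - 1))
        }
        level = sorted(candidates, key=sorted)
    return rules


transactions = [to_items(*record) for record in RECORDS]
rules = constrained_rules(
    transactions, "outcome=defect", min_support=0.3, min_confidence=0.8
)
for antecedent, s, confidence in rules:
    print(sorted(antecedent), "-> outcome=defect", round(s, 2), round(confidence, 2))
```

Fixing the consequent in advance is one simple form of a user-interest constraint: candidate itemsets that cannot lead to the outcome of interest are pruned early, which reduces both the counting work and the number of rules reported.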

