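Data Mining and Graphical Display of Influential Points (强影响点的数据挖掘和图示)

English title: Data Mining and Graphics Mode on Influential Point
Author: Zhang Sen (张森)
Thesis level: Master's
Discipline: Applied Mathematics
Keywords: influential point; data mining; diagnosis; graphics mode; warp-departure degree; dimension reduction; influential distance; Cook's distance
Degree year: 2002
Advisors: He Zhongshi (何中市); Yang Hu (杨虎)
Discipline code: 070104
Degree-granting institution: Chongqing University (重庆大学)
Submission date: 2002-05-15
Chair of the defense committee: Yang Xiaofan (杨小帆)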

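Abstract

As data mining techniques are widely applied in modern business, mining outliers and influential points has become a widely studied topic in economics, statistics, and related fields. Data mining and statistical diagnostics are young disciplines that developed only over the past half century; although they have produced many results, many problems are still being explored.
    Building on an analysis of existing methods for mining influential points and the state of research in China and abroad, this thesis takes the perspective of exploratory data analysis and proposes two new methods for mining influential points: the relationship-based warp-departure method and the contribution-score dimension reduction method. The main work and conclusions are as follows:
    · Relationship-based warp-departure method: using relationship analysis, the warp coefficient and the departure coefficient of the k-th observation relative to the center are computed, and the warp-departure degree, obtained from their inner product, is used to identify influential points. Computing programs were written for several typical examples. Theoretical analysis and the computed results show that: (1) the method identifies the same influential points as the classical methods; (2) the method works with very small samples, since the warp-departure degree can be computed and analyzed with more than 3 data points; (3) the method has a low computational cost, with time complexity O().
    · Contribution-score dimension reduction method: principal component analysis is applied to the variables and contribution scores are computed in order to reduce the dimension of high-dimensional data; after data are deleted, K-means clustering is used to obtain the influential distance, which identifies influential points. Worked examples show that: (1) before and after dimension reduction, the influential distance and Cook's distance identify the same influential points, which indicates that the dimension reduction is feasible; (2) the influential distance identifies the same influential points as the classical method, Cook's distance, which indicates that the influential distance method proposed here is also feasible; (3) after dimension reduction, influential points in high-dimensional data can be displayed graphically.
    · A system for mining influential points was designed and developed.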
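
    The abstract does not give the formula for the contribution score. For orientation only, the block below lists the standard principal component quantities that a score of this kind is usually built from. The symbols (S, \lambda_j, v_j, c_j, C_m, z_{kj}) are assumptions introduced here and are not the thesis's notation; the thesis's own contribution score may be defined differently.

```latex
% Standard PCA quantities (assumed notation, not taken from the thesis).
% S: sample covariance matrix of the p variables, with eigenpairs (\lambda_j, v_j).
S v_j = \lambda_j v_j, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0

% Contribution rate of the j-th component and cumulative rate of the first m:
c_j = \frac{\lambda_j}{\sum_{i=1}^{p} \lambda_i}, \qquad
C_m = \sum_{j=1}^{m} c_j

% Score of observation x_k on component j (\bar{x} is the sample mean):
z_{kj} = v_j^{\top} (x_k - \bar{x})
```

    Keeping only the first m = 2 or 3 components is what makes a scatter plot of the observations possible, which is presumably the sense in which dimension reduction enables the graphical display mentioned in point (3).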
With the wide application of data mining in modern business, research on mining outliers and influential points has received close attention in economics and statistics. Although data mining and statistical diagnostics have a history of only about fifty years, many results have been achieved. However, many problems remain unsolved.
    Based on an analysis of domestic and international research on influential points and on exploratory data analysis, this thesis presents two new approaches to mining influential points: relationship-based warp-departure analysis and contribution-score dimension reduction analysis.
    The main work and conclusions of this thesis are listed below:
    · Relationship-based warp-departure analysis: First, the warp coefficient and the departure coefficient are computed by the method of relationship analysis. Then the warp-departure degree, the inner product of the two coefficients, is used to decide which points are influential. The method is applied to several typical examples, and the analytical and numerical results show that: (1) its conclusions about influential points agree with those of classical diagnostic methods; (2) it is applicable to small samples, with any sample size larger than 3; (3) it has a low computational cost, with computational complexity O().
    · Contribution-score dimension reduction analysis: The contribution score, obtained from principal component analysis, is used to reduce the dimension of the data. The influential distance, computed after removing sample data, is then used to identify influential points. The computational results for several typical examples show that: (1) the points with the largest influential distance and the largest Cook's distance are the same before and after dimension reduction, which indicates that the dimension reduction method is acceptable; (2) the influential distance method and the classical Cook's distance method reach the same conclusions about influential points, which indicates that the influential distance method is acceptable; (3) dimension reduction makes it possible to display influential points graphically.
    · A data mining application system is developed to diagnose influential points.
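
    Cook's distance is the classical benchmark against which both proposed methods are compared. As a point of reference, a minimal sketch of that benchmark for simple linear regression follows. It is not the thesis's program: the function name `cooks_distance` and the sample data are hypothetical, and the code uses only the standard formula D_i = e_i^2 / (p s^2) * h_i / (1 - h_i)^2, where e_i is the residual, h_i the leverage, p = 2 the number of fitted coefficients, and s^2 the residual variance estimate.

```python
from typing import List


def cooks_distance(x: List[float], y: List[float]) -> List[float]:
    """Cook's distance of each observation for the fit y = b0 + b1 * x.

    Illustrative helper (hypothetical name), restricted to one predictor.
    """
    n = len(x)
    p = 2  # fitted coefficients: intercept and slope
    if n != len(y):
        raise ValueError("x and y must have the same length")
    if n <= p:
        raise ValueError("need more than two observations")

    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

    # Ordinary least squares estimates.
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # residual variance estimate
    if s2 == 0:
        return [0.0] * n  # perfect fit: no observation changes the estimate

    distances = []
    for xi, e in zip(x, residuals):
        h = 1.0 / n + (xi - x_mean) ** 2 / sxx  # leverage of this observation
        if 1.0 - h < 1e-12:
            distances.append(float("nan"))  # undefined at leverage 1
        else:
            distances.append(e * e / (p * s2) * h / (1.0 - h) ** 2)
    return distances


if __name__ == "__main__":
    # Hypothetical data: five points near a line plus one isolated point.
    xs = [1, 2, 3, 4, 5, 10]
    ys = [1.1, 1.9, 3.2, 3.9, 5.1, 4.0]
    d = cooks_distance(xs, ys)
    worst = max(range(len(d)), key=lambda i: d[i])
    print("largest Cook's distance at index", worst)
```

    In this made-up data set the isolated observation at x = 10 receives by far the largest distance, because it combines high leverage with a sizeable residual. A new influence measure such as the influential distance would be checked against this kind of ranking.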

