The Application of Clustering and Principal Component Regression in the Economic Indicator Data

Author: Jiang Yang (姜扬)
Degree level: Master's
Discipline: Software Engineering
Keywords: SPSS; Clustering Analysis; K-means Clustering; Principal Component Analysis; Regression Analysis
Degree year: 2010
Advisors: Zhou Chunguang (周春光); Wang Zhe (王喆)
Discipline code: 081202
Degree-granting institution: Jilin University
Thesis submission date: 2010-04-01

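Abstract

In the 60 years since the founding of the People's Republic, China's urban social and economic development has changed dramatically. Urban construction advances day by day, and the quality of life and living environment of urban residents have improved greatly. The main data source of this paper is Table 10-3 of the China Statistical Yearbook 2009, "Main Economic Indicators of Provincial Capitals and Cities with Independent Planning Status". The data describe 23 economic indicators for these cities (36 in total).
     The methods used in this paper are clustering and principal component regression.
     The main results of this paper are:
     1. Clustering: the 36 cities are classified on the basis of 22 economic indicators. The K-means algorithm in SPSS is used to divide the 36 cities into two classes and into three classes, and the resulting classes are used to discuss the gaps in economic development between cities.
     2. Regression: a regression model is built for total urban population. Indicators related to total population are sought among the 22 economic indicator variables, which amounts to reducing the dimension of those 22 variables. Ten indicators are selected, and a multiple linear regression model between total urban population and these 10 indicators is established.
     Through the clustering and principal component regression work, the author became proficient in operating and applying the social-statistics package SPSS and gained a deeper understanding of clustering, principal component analysis, and regression analysis.
     The focus and main difficulty of this paper is principal component regression. Principal component analysis performs dimension reduction; regression analysis builds a regression model between independent and dependent variables; principal component regression combines the two, achieving dimension reduction and regression at once.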
The Application of Clustering and Principal Component Regression in the Economic Indicator Data
     China's urban social and economic development has changed tremendously since the founding of the People's Republic in 1949.
     With rapid urbanization, the layout of urban development is becoming more rational, and the economic structure has improved further. The urban economy plays an important role in the national economy. Urban construction is advancing rapidly, and the quality of life and living conditions of urban residents have improved greatly.
     The main data source of this paper is Table 10-3 of the China Statistical Yearbook 2009, "Main Economic Indicators of Provincial Capitals and Cities with Independent Planning Status". The data describe 23 economic indicators for these cities (36 in total). The values of these indicators show the development gaps between cities, mainly in medical care, public health, education, transportation, and other areas.
     The paper analyzes the economic indicator data with the SPSS statistical software. SPSS offers complete data management and statistical analysis functions. It is simple to operate, requires no programming, is powerful, and has a convenient data interface; its function modules can also be combined flexibly. Its functions include data entry, editing, statistical analysis, reporting, and chart production, amounting to 136 functions in 11 categories. SPSS provides both simple descriptive statistics and complex multivariate methods, such as exploratory data analysis, statistical description, contingency table analysis, bivariate correlation, rank correlation, partial correlation, analysis of variance, nonparametric tests, multiple regression, survival analysis, analysis of covariance, discriminant analysis, factor analysis, cluster analysis, nonlinear regression, and logistic regression.
     The economic indicator data were processed in SPSS, and the research in this paper covers the following two aspects:
     1. Clustering analysis
     Clustering analysis is used to classify the 36 cities according to 22 economic indicator attributes. The resulting classes reveal the gaps between categories of provincial capitals and cities with independent planning status.
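The clustering step can be sketched in Python as below. This is only an illustration under stated assumptions: the thesis ran K-means inside SPSS, whereas this is a plain NumPy version of Lloyd's algorithm; the data are a synthetic stand-in for the 36 x 22 table rather than the yearbook figures; and the names `standardize` and `kmeans` are hypothetical. The choice of two and three classes follows the abstract.

```python
import numpy as np

def standardize(X):
    """Z-score each column so indicators on large scales do not dominate distances."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each row to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic stand-in for the 36 x 22 city-by-indicator table (assumption).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 22)),  # "typical" cities
    rng.normal(6.0, 1.0, size=(6, 22)),   # a few much larger economies
])
Z = standardize(X)

labels2, _ = kmeans(Z, k=2)
labels3, _ = kmeans(Z, k=3)
print("class sizes, k=2:", np.bincount(labels2, minlength=2))
print("class sizes, k=3:", np.bincount(labels3, minlength=3))
```

Standardizing first is a design choice of this sketch: without it, indicators measured in large units would dominate the Euclidean distances.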
     2. Principal Component Regression
     Principal component regression is the focus of this paper. It combines principal component analysis with regression analysis: principal component analysis is first applied to several attributes to reduce their dimension, and a regression relationship is then established between the target variable and a small number of independent variables.
     The main purpose of principal component analysis is to explain most of the variation in the original data with fewer variables. It transforms a set of correlated variables into variables that are mutually independent or uncorrelated.
     Usually a few new variables, fewer than the original number, are chosen that explain most of the variation. These are called principal components, and each serves as a composite index of the information. Principal component analysis is in effect a dimension reduction method.
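In symbols, the idea above can be sketched as follows, assuming the usual correlation-matrix formulation of principal component analysis with columns standardized by the sample standard deviation. The notation (Z, R, v_j, λ_j, F_j) is introduced here for illustration and does not come from the thesis.

```latex
% Z: n x p matrix of standardized indicators; R: their p x p correlation matrix.
R = \frac{1}{n-1} Z^{\top} Z, \qquad
R\, v_j = \lambda_j v_j, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0, \qquad
v_j^{\top} v_k = \delta_{jk}.

% j-th principal component (score vector), its variance, and uncorrelatedness.
F_j = Z v_j, \qquad
\operatorname{Var}(F_j) = \lambda_j, \qquad
\operatorname{Cov}(F_j, F_k) = 0 \quad (j \ne k).

% Cumulative contribution rate of the first m components (trace of R equals p).
\frac{\sum_{j=1}^{m} \lambda_j}{\sum_{j=1}^{p} \lambda_j}
  = \frac{1}{p} \sum_{j=1}^{m} \lambda_j .
```

Dimension reduction then means keeping only the first m components, with m chosen so that the cumulative contribution rate is large enough.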
     The main purpose of regression analysis is to establish a regression model. It determines the causal relationship between variables by specifying the dependent and independent variables, sets up the regression model, estimates the model parameters from the observed data, and then evaluates how well the model fits the measured data. If the fit is good, the model can be used with the independent variables for further prediction.
     This paper describes applied research on principal component regression for economic indicator data. It uses principal component regression to study the relationship between total urban population (Y) and a number of economic indicators.
     First, collinearity is checked by regression analysis. A regression model is established between total population and the 21 economic indicators, and backward elimination ("the back-out method") yields the 10 economic indicators related to total population. Because the model reveals collinearity, principal component analysis is applied to these 10 indicators.
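One common way to detect collinearity among regressors is the variance inflation factor (VIF). The sketch below is an assumption for illustration only: the abstract does not say which diagnostic SPSS reported, the function name `vif` is hypothetical, the data are synthetic, and the threshold of 10 mentioned in the comment is a rule of thumb rather than a value from the thesis.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R^2), where R^2 comes
    from regressing that column on all the other columns plus an intercept."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic example (assumption): two nearly duplicate indicators and one unrelated one.
rng = np.random.default_rng(1)
base = rng.normal(size=36)
X = np.column_stack([
    base + 0.05 * rng.normal(size=36),
    base + 0.05 * rng.normal(size=36),
    rng.normal(size=36),
])

# Values far above roughly 10 are often read as a sign of collinearity.
print(np.round(vif(X), 1))
```

When several retained indicators show large VIFs, regressing on principal components instead of the raw indicators is one way to sidestep the instability, which is the route the paper takes.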
     Second, the suitability of the data for extracting principal components must be checked. The KMO value was 0.8 or above, and the line in the scree plot showed a "steep slope", so the data were suitable for principal component analysis. Two principal components were extracted from the 10 economic indicators. Together they reflect more than 80% of the information in the 10 indicators: the cumulative contribution rate of the first two eigenvalues reached 83.887%. From the initial factor loadings, expressions for the two principal components (F1, F2) in terms of the 10 economic indicators were obtained. The principal component scores were obtained by multiplying the eigenvectors by the standardized data. In addition, a comprehensive principal component was computed.
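The quantities in this step can be sketched numerically as below. The assumptions are: the data are synthetic, not the yearbook table; `kmo` is a hypothetical helper implementing the standard Kaiser-Meyer-Olkin formula from correlations and partial correlations; eigenvectors are recovered as loadings divided by the square root of the eigenvalue; and the comprehensive score weights each component by its share of the retained eigenvalues, which is a common convention but is not spelled out in the abstract.

```python
import numpy as np

def kmo(R):
    """Kaiser-Meyer-Olkin measure computed from a correlation matrix R."""
    P = np.linalg.inv(R)
    d = np.sqrt(np.diag(P))
    partial = -P / np.outer(d, d)           # partial correlations (off-diagonal)
    off = ~np.eye(R.shape[0], dtype=bool)   # mask selecting off-diagonal entries
    r2 = np.sum(R[off] ** 2)
    a2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + a2)

# Synthetic stand-in (assumption): 36 cities, 10 indicators driven by 2 hidden factors.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(36, 2))
X = hidden @ rng.normal(size=(2, 10)) + 0.5 * rng.normal(size=(36, 10))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(Z, rowvar=False)

# Eigen-decomposition of the correlation matrix, sorted from largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

m = 2                                              # retained components (assumed)
loadings = eigvecs[:, :m] * np.sqrt(eigvals[:m])   # loading (component) matrix
recovered = loadings / np.sqrt(eigvals[:m])        # loadings -> eigenvectors
scores = Z @ eigvecs[:, :m]                        # F1, F2 for every city
weights = eigvals[:m] / eigvals[:m].sum()
composite = scores @ weights                       # comprehensive score

print("KMO:", round(kmo(R), 3))
print("cumulative contribution:", np.round(np.cumsum(eigvals)[:m] / eigvals.sum(), 3))
```

Plotting `eigvals` against the component number gives the scree plot; a sharp drop after the first few points is the "steep slope" the text refers to.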
     Finally, the paper established the regression model between total urban population and the 10 economic indicators. First, a regression model between total urban population and the two principal components, F1 and F2, was fitted with the SPSS regression procedure. Then the expressions for the principal components were substituted into it to obtain the regression model between total urban population and the 10 economic indicators.
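The two-stage construction above can be sketched as follows. The data are synthetic, all variable names are hypothetical, and ordinary least squares via NumPy stands in for the SPSS regression procedure. The point of the sketch is the back-substitution: the model on F1 and F2 and the model on the 10 original indicators give identical predictions.

```python
import numpy as np

# Synthetic stand-in (assumption): 36 cities, 10 indicators, population y.
rng = np.random.default_rng(0)
n, p, m = 36, 10, 2
hidden = rng.normal(size=(n, 2))
X = hidden @ rng.normal(size=(2, p)) + 0.5 * rng.normal(size=(n, p))
y = 500 + 80 * hidden[:, 0] + 30 * hidden[:, 1] + 10 * rng.normal(size=n)

# Standardize the indicators and extract the first m principal components.
mu, sd = X.mean(axis=0), X.std(axis=0, ddof=1)
Z = (X - mu) / sd
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
V = eigvecs[:, np.argsort(eigvals)[::-1][:m]]   # p x m eigenvectors
F = Z @ V                                       # scores F1, F2

# Stage 1: least-squares regression of y on the component scores.
A = np.column_stack([np.ones(n), F])
gamma, *_ = np.linalg.lstsq(A, y, rcond=None)   # [intercept, g1, g2]

# Stage 2: substitute F = Z V to get coefficients on the standardized
# indicators, then undo the standardization for the raw indicators.
beta_std = V @ gamma[1:]
beta_raw = beta_std / sd
intercept_raw = gamma[0] - beta_raw @ mu

pred_components = A @ gamma
pred_raw = intercept_raw + X @ beta_raw
print("max prediction difference:", np.max(np.abs(pred_components - pred_raw)))
```

Because F1 and F2 are uncorrelated, the stage 1 coefficients are estimated without the instability that collinearity causes when regressing directly on the raw indicators.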
     Through this research on economic indicator data, I gained a better understanding of clustering, principal component analysis, and regression analysis. I learned the principles behind principal component regression and mastered several SPSS operations.

