On Critical-to-Quality Characteristics Identification for Complex Products Using Data Mining

Original title: 基于数据挖掘的复杂产品关键质量特性识别的方法研究
Author: Yan Wei (闫伟)
Thesis level: Doctoral
Discipline: Management Science and Engineering
Keywords: Complex product; Critical-to-quality characteristics; Feature selection; Information gain; EM clustering
Degree year: 2012
Advisor: He Zhen (何桢)
Discipline code: 1201
Degree-granting institution: Tianjin University
Submission date: 2012-05-01

Abstract

Complex products are products characterized by complex customer requirements, complex composition, complex technology, complex manufacturing processes, and complex project management. In quality control for such products, the effectiveness of the quality-characteristic monitoring points determines how controllable product quality is. As the number of control points grows, however, control costs rise sharply and the efficiency of quality control falls substantially. To improve quality control efficiency, the critical-to-quality characteristics (CTQs) that significantly affect product quality must be identified effectively, so that the number of control points can be reduced, control efficiency improved, and costs lowered.
     This dissertation studies CTQ identification from two perspectives, balanced data sets and unbalanced data sets. Taking engine blades as an example, the data are called balanced when conforming and nonconforming blades are roughly equal in number, and unbalanced when conforming blades outnumber nonconforming ones by a factor of several tens.
     (1) For balanced data sets, in which the quality classes of a complex product contain roughly equal numbers of samples, the concept of information entropy from information theory is introduced, and information gain is used to judge the relevance of each quality characteristic to the quality class. This bypasses the complex interdependencies among quality characteristics and measures their importance from a new angle, so that the true CTQs can be identified. A worked example shows that the method effectively reduces the dimensionality of the quality characteristics, improves the efficiency and level of quality control, and saves considerable time and cost.
     (2) In the real production of complex products, the numbers of samples in the quality classes usually differ greatly, and such unbalanced data sets make CTQ identification harder than balanced ones do. Three CTQ identification methods for unbalanced data sets are developed. First, the decision criterion of the ReliefF method is modified so that the class-division criterion shifts toward the majority class, reducing the risk that minority-class samples are removed as outliers. Second, the Wrapper method is modified by integrating sequential backward selection (SBS) and sequential forward selection (SFS) with cost-sensitive learning, on the basis of balanced classification accuracy, to build a cyclic selection mechanism for quality characteristics; this reduces the negative effect of the unbalanced data on CTQ identification and improves identification efficiency. Third, the EM algorithm is improved: clustering filters redundant samples out of the unbalanced data to construct a balanced data set, on which CTQ identification is then performed. This improves identification performance and substantially lowers the Type II error rate.
     Drawing on investigations at complex-product manufacturers, the dissertation analyzes in detail the current bottleneck in quality control, namely CTQ identification, and verifies the proposed methods with worked examples. The work therefore offers a useful reference for future research on quality control in complex-product manufacturing.
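
The information-gain ranking described in (1) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the dissertation's implementation: the blade data, the characteristic names `thickness` and `roughness`, and the helper functions `entropy` and `information_gain` are all hypothetical, and the characteristics are assumed to be already discretized into in-tolerance and out-of-tolerance values.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in class entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Entropy of the class labels that remains once the feature is known.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical balanced toy data: 4 conforming and 4 nonconforming blades.
labels = ["ok", "ok", "ok", "ok", "bad", "bad", "bad", "bad"]
features = {
    "thickness": ["in", "in", "in", "in", "out", "out", "out", "out"],
    "roughness": ["in", "out", "in", "out", "in", "out", "in", "out"],
}

# Rank the quality characteristics by information gain, highest first.
ranking = sorted(features,
                 key=lambda name: information_gain(features[name], labels),
                 reverse=True)
print(ranking)
```

In this toy data `thickness` separates the two classes perfectly (a gain of 1 bit), while `roughness` carries no class information (a gain of 0), so only the former would be retained as a CTQ candidate.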
