Knowledge Discovery Based on Artificial Intelligence (基于人工智能的知识发现)

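English title: Knowledge Discovery Based on Artificial Intelligence
Author: Zang Qishi (臧其事)
Degree level: Master's
Discipline: Information Science
Keywords: knowledge discovery (知识发现); artificial intelligence (人工智能); knowledge representation (知识表示); dynamic parameterized evaluation (动态参数化评估)
Degree year: 2008
Advisors: Fan Bingsi (范并思); Wang Renwu (王仁武)
Discipline code: 120502
Degree-granting institution: East China Normal University (华东师范大学)
Thesis submission date: 2008-05-01
Chair of the defense committee: Li Guoqiu (李国秋)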

Abstract

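Knowledge discovery is, in essence, a problem of mathematical computation in high-dimensional space. Humans have studied conventional space for thousands of years, but the mathematical study of high-dimensional space has only just begun. Even so, thanks to advances in artificial intelligence, its core technology, knowledge discovery has already produced substantial results. Problems that traditional mathematics could not solve, such as image recognition, spam filtering, and web page similarity matching, have been solved to some degree. From a technical standpoint, however, these are all low-level applications of the support vector machine (SVM), and further technical development leaves much wider possibilities open. This thesis reviews, in as much detail as possible, the theory and history of knowledge and knowledge discovery, the development of artificial intelligence, and its core algorithms: the BP network (back-propagation neural network) and the support vector machine. On that basis, it identifies three major problems facing knowledge discovery: insufficient interdisciplinary reach, with work confined to traditional fields such as science and engineering and little coverage of the humanities and business; weak handling of unstructured data, especially unstructured and semi-structured data such as Word documents and web pages; and inconsistent knowledge representation, for which no unified standard yet exists.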
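     To address these three problems, the thesis designs three experiments:
     1. Using a chapter of the Word file 《说文·玉篇》 as the data source, the thesis applies rule-based extraction to quantify the characters in the Word-format dictionary. The quantified results are loaded into Matlab, where an SVM toolbox is used to classify and recognize variant characters (异体字). Finally, the definition of the variant-character classification is stated in Z notation.
     2. For the historical bidding records of Shanghai vehicle license plates published on the Shanghai Guopai bidding site (上海国拍劲标网, www.alltobid.com), the thesis uses web crawling to collect all data from the first auction up to the time of writing. With these data as the source, a BP network fits the multivariate function that determines the plate price and forecasts the subsequent price trend. The results are also compared with those of traditional economic methods, a comparison the thesis reports as favoring the AI algorithm. Finally, the fitted function is described in Z notation.
     3. Written from the standpoint of a master's candidate in management, the thesis reviews parameterized evaluation in management science and its various evolved forms, then combines the BP network with SVM classification to propose the concept of dynamic parameterized evaluation. The argument is that in older evaluation methods the parameters are set arbitrarily by people and the weights are rigid, so those being evaluated can deliberately work around them and the evaluation fails to give a correct result. To avoid this, the thesis holds that evaluation should start from the samples themselves and let the samples describe the nature of the problem. First, an SVM extracts features from the samples to obtain the parameter items; second, the samples are iterated over with those parameter items to obtain a weight for each item; finally, the weights are function-fitted and forecast. In an evaluation system built this way, whenever a new sample arrives the system recomputes and adjusts the weights and parameters, which the thesis argues gives better adaptivity and a closer match to practical requirements. The thesis carries out an empirical study of dynamic parameters on the Shanghai housing price index: transaction data from the Shanghai real estate trading center (Fangdi.com.cn) are obtained by web crawling; the influence of each district on Shanghai housing prices serves as a parameter and the degree of influence as its weight; finally, the whole dynamic parameterized evaluation system is described in Z notation.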
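The recompute-on-new-sample loop described in item 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the class `DynamicEvaluator`, its method names, and the parameter names are hypothetical, and the weight rule (normalized absolute Pearson correlation between each parameter and the target) is a simple stand-in for the SVM feature extraction and BP-network fitting that the thesis actually uses.

```python
from statistics import mean


def _corr(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


class DynamicEvaluator:
    """Hypothetical evaluator whose weights are re-derived from the samples."""

    def __init__(self, names):
        self.names = list(names)
        self.samples = []  # list of (features dict, target value)
        # Start from equal weights until enough samples have arrived.
        self.weights = {n: 1.0 / len(self.names) for n in self.names}

    def add_sample(self, features, target):
        """Store a new sample, then recompute every weight from all samples."""
        self.samples.append((dict(features), target))
        self._refit()

    def _refit(self):
        if len(self.samples) < 3:
            return  # too few samples for a meaningful correlation
        targets = [t for _, t in self.samples]
        raw = {
            n: abs(_corr([f[n] for f, _ in self.samples], targets))
            for n in self.names
        }
        total = sum(raw.values())
        if total > 0:
            # Assumed weight rule: share of total absolute correlation.
            self.weights = {n: r / total for n, r in raw.items()}

    def score(self, features):
        """Weighted sum of the parameter values under the current weights."""
        return sum(self.weights[n] * features[n] for n in self.names)
```

Because the weights are never set by hand, there is no fixed weighting for an evaluated party to target; each call to `add_sample` may shift them.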
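     The thesis discusses and explains the problems it raises through these three experiments. On the interdisciplinary problem, it draws on Chinese language studies by extracting variant characters from a Word file and recognizing them with an SVM; it draws on the author's undergraduate background in economics by using a BP network to fit a function to web-sourced Shanghai license plate auction data; and, from the standpoint of a master's candidate in management, it combines SVM classification with BP-network function fitting to propose dynamic parameterized evaluation as an improvement on parameterized evaluation in management science. On the problem of unstructured data sources, the Word file and web data used in the experiments are all unstructured data, which rule-based extraction converts into quasi-structured or structured data for knowledge discovery. On the problem of knowledge representation, the thesis uses Z notation to give a structured description of the knowledge obtained in each experiment.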
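As an illustration of rule-based extraction, the sketch below turns free-form dictionary lines into structured records with a single regular-expression rule. The entry layout (`character [radical, stroke count] gloss`), the rule `ENTRY_RULE`, and the function `extract_entries` are all hypothetical; the record does not describe the layout of the Word file or the rules the thesis used.

```python
import re

# Hypothetical rule: one character, then "[radical, strokes]", then a gloss.
ENTRY_RULE = re.compile(
    r"^(?P<char>\S)\s*\[(?P<radical>\S)\s*,\s*(?P<strokes>\d+)\]\s*(?P<gloss>.*)$"
)


def extract_entries(text):
    """Return one structured record per line that matches ENTRY_RULE."""
    records = []
    for line in text.splitlines():
        match = ENTRY_RULE.match(line.strip())
        if match is None:
            continue  # lines that fit no rule are skipped, not guessed at
        records.append(
            {
                "char": match["char"],
                "radical": match["radical"],
                "strokes": int(match["strokes"]),  # quantified field
                "gloss": match["gloss"].strip(),
            }
        )
    return records


sample = "玉 [玉, 5] jade\nthis line matches no rule\n珏 [玉, 9] two pieces of jade"
records = extract_entries(sample)
```

Numeric fields such as `strokes` are what a downstream classifier could consume as features; lines that match no rule are dropped rather than repaired, which is one way noise can enter or leave such a pipeline.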
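     Although the thesis explores and improves on these problems of knowledge discovery, many shortcomings remain. For dynamic parameterized evaluation, the parameters clearly cannot all have the same priority, and priority ordering has not yet been studied sufficiently. For character recognition, noise and mislabeling cause sizable deviations. For fitting time-series economic functions, accuracy can be improved further. All of these await further work.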
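Fitting accuracy of the kind mentioned above can be quantified as mean squared error before and after training. The sketch below is a minimal one-hidden-layer back-propagation network in plain Python, fitted to a toy curve. It is an assumption-laden illustration rather than the thesis's Matlab model: the function `train_bp`, the network size, the learning rate, and the toy curve y = x² are all hypothetical choices, and no auction data are used.

```python
import math
import random


def train_bp(xs, ys, hidden=4, lr=0.05, epochs=2000, seed=0):
    """Fit y ~ f(x) with a 1-input, tanh-hidden, linear-output network."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # input -> hidden
    b1 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # hidden biases
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # hidden -> output
    b2 = 0.0  # output bias

    def predict(x):
        h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
        return sum(w2[j] * h[j] for j in range(hidden)) + b2

    def mse():
        return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

    initial = mse()
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Forward pass.
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
            out = sum(w2[j] * h[j] for j in range(hidden)) + b2
            err = out - y  # gradient of 0.5 * err**2 with respect to out
            # Backward pass: propagate the error through each hidden unit.
            for j in range(hidden):
                grad_h = err * w2[j] * (1.0 - h[j] ** 2)  # tanh derivative
                w2[j] -= lr * err * h[j]
                w1[j] -= lr * grad_h * x
                b1[j] -= lr * grad_h
            b2 -= lr * err
    return predict, initial, mse()


# Toy data: 20 evenly spaced points on y = x**2 (not real prices).
xs = [i / 19 for i in range(20)]
ys = [x * x for x in xs]
predict, mse_before, mse_after = train_bp(xs, ys)
```

Comparing `mse_before` with `mse_after` gives a simple accuracy measure; a real time series would also need a held-out test period, since error on the training points alone says little about forecasting.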
