Analysis of Crime-Related Factors Based on Data Mining Technology
Abstract
Through years of practical work, public security organs have steadily advanced informatization and, at the same time, accumulated a considerable volume of both specialized policing data and social information. Applying data mining to analyze the factors behind crime is therefore an important and meaningful topic for the public security system. Unlike traditional data analysis techniques, data mining extracts patterns and regularities from existing data and distills that data into knowledge.
     This thesis applies several classification and clustering methods, together with the improved Bayesian network methods proposed here, to jointly mine criminals' background, psychological, and genetic information, in order to discover the factors that influence and give rise to crime. The main research work is as follows:
     1) Several classification and clustering methods are applied to the criminal data set for preliminary mining of crime factors. For classification, the ID3 decision tree, the C4.5 decision tree, and the Naive Bayes classifier are chosen; for clustering, k-means partitioning and BIRCH hierarchical clustering. For the specific problem of crime factor analysis, however, the knowledge these classification and clustering algorithms express proves insufficiently detailed and clear.
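As an illustration of what a classifier of this kind does, distilling records into class priors and conditional probabilities, the following is a minimal Naive Bayes sketch on invented toy records. The feature names (`age_group`, `drd4`) and values are hypothetical stand-ins, not the thesis's actual data or pipeline:

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (crime-type label, categorical features).
data = [
    ("violent",  {"age_group": "young", "drd4": "long"}),
    ("violent",  {"age_group": "young", "drd4": "short"}),
    ("property", {"age_group": "adult", "drd4": "short"}),
    ("property", {"age_group": "adult", "drd4": "short"}),
]

def train_naive_bayes(rows):
    """Estimate P(class) and P(feature=value | class) with add-one smoothing."""
    class_counts = Counter(label for label, _ in rows)
    cond = defaultdict(Counter)   # (class, feature) -> value counts
    values = defaultdict(set)     # feature -> set of observed values
    for label, feats in rows:
        for f, v in feats.items():
            cond[(label, f)][v] += 1
            values[f].add(v)
    return class_counts, cond, values, len(rows)

def predict(model, feats):
    """Return the class maximizing P(class) * prod P(feature | class)."""
    class_counts, cond, values, n = model
    best, best_p = None, -1.0
    for label, c in class_counts.items():
        p = c / n
        for f, v in feats.items():
            p *= (cond[(label, f)][v] + 1) / (c + len(values[f]))
        if p > best_p:
            best, best_p = label, p
    return best

model = train_naive_bayes(data)
print(predict(model, {"age_group": "adult", "drd4": "short"}))  # "property"
```

The smoothed counts are exactly the kind of aggregate knowledge the abstract says such classifiers extract; the point made in 1) is that this representation is too coarse for fine-grained factor analysis.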
     2) Because the traditional K2 algorithm restricts its search space with a randomly generated variable ordering, which is somewhat blind, this thesis proposes K2-P, an improved Bayesian network structure learning algorithm. The new algorithm uses the conditional-independence-based SGS and PC2 algorithms to improve structure learning, producing a topology graph that encodes the knowledge in the original data; a full-topology filter then derives topological orderings from this graph, which serve as the variable orders for the next structure learning step. Comparative experiments show that K2-P finds Bayesian networks with higher scores than K2.
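The role the variable ordering plays in K2 can be sketched with the Cooper-Herskovits K2 score: each node may only draw parents from variables that precede it in the ordering, so a poor ordering makes good structures unreachable, which is what motivates deriving orderings from a learned topology instead of at random. A minimal sketch on invented binary data follows; the `DATA` rows and the ordering are this sketch's assumptions, not the thesis's:

```python
from itertools import product
from math import lgamma

# Invented binary data set; each tuple is one sample over 3 variables.
DATA = [
    (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1),
    (0, 0, 0), (1, 1, 1), (0, 1, 1), (1, 0, 0),
]

def log_k2_score(node, parents, data, r=2):
    """Cooper-Herskovits log K2 score of one node for a candidate parent set:
    sum over parent configurations j of
        log[(r-1)! / (N_j + r - 1)!] + sum_k log(N_jk!)
    where N_jk counts rows with parent config j and node value k."""
    total = 0.0
    for config in product(range(r), repeat=len(parents)):
        counts = [0] * r
        for row in data:
            if all(row[p] == c for p, c in zip(parents, config)):
                counts[row[node]] += 1
        n_j = sum(counts)
        total += lgamma(r) - lgamma(n_j + r)         # log (r-1)!/(N_j+r-1)!
        total += sum(lgamma(c + 1) for c in counts)  # log prod_k N_jk!
    return total

# K2 only considers parents that precede a node in the given ordering,
# so the ordering decides which parent sets can ever be scored.
order = [0, 1, 2]
for node in order:
    candidates = order[:order.index(node)]
    print(node, candidates, log_k2_score(node, candidates, DATA))
```

K2-P's contribution, per the abstract, is to feed this search orderings consistent with a conditional-independence topology rather than random ones.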
     3) Bayesian network structure search is an NP-hard problem. When the traditional K2 algorithm searches for the possible parent set of each attribute node, it uses a greedy strategy and may discard better solutions, so this thesis proposes the K2-EX algorithm. It performs jump searches to obtain better Bayesian Dirichlet scores and, further, defines an adaptive function to control the number of jumps. Experiments on different data sets demonstrate that K2-EX obtains better network structures.
     4) Finally, the improved Bayesian network algorithms are applied to crime factor analysis, revealing several significantly associated attributes, for example the DRD4 gene and crime type, and psychological factors and offender age. Some conclusions of practical value to the public security system are drawn.
