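Research on Content-Based Sensitive Image Filtering Technology

English title: Research on the Technologies of Content-Based Erotic Image Filtering
Author: Sun Jingyuan (孙竞媛)
Thesis level: Master's
Discipline: Computer Application Technology
Chinese keywords (translated): sensitive image; skin-color detection; texture model; face detection; histogram probability model; Bayesian classification algorithm
English keywords: Erotic image; Detection of Skin-Color; Texture model; Detection of Human Face; Histogram probability model; Bayes Classifier Algorithm
Degree year: 2007
Advisor: Shen Xuanjing (申铉京)
Discipline code: 081203
Degree-granting institution: Jilin University (吉林大学)
Thesis submission date: 2007-04-01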

Abstract

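As the Internet develops rapidly in China, harmful content online has brought discord to the virtual world. Research on curbing the spread of online pornography is no longer content with URL blocking and sensitive-keyword matching, and research on content-based image filtering has deepened as a result. This thesis is supported by the 2004 Zhuhai Science and Technology Project (PC20041101), "Research on Content-Based Sensitive Image Filtering Technology and Its Implementation in the IE Browser". It focuses on skin-color detection, the key technique used in previously proposed content-based sensitive image filters; improves the Bayesian classification model based on statistical histograms; extracts five efficient classification features; and builds a sensitive image classifier on that basis.
     First, a base database was built to collect statistics on the RGB color distribution of skin and non-skin pixels, and the subsequent research was carried out on this basis. To address the shortcomings of earlier histogram-based Bayesian skin classification algorithms, an improved statistical histogram model was built from a large body of statistical data, prior and conditional probability formulas were then derived, and finally a new Bayesian classification model was implemented. The conditional probability data used for pixel-level skin classification were then screened repeatedly to achieve more accurate skin detection. To classify sensitive images effectively, ten efficient classification features were extracted, and after testing, five of them were selected as the classifier's input vector. Fast face detection based on the AdaBoost algorithm was included to reduce the false detection rate for photo-sticker-style (大头贴) portrait images. On the standard mask library, the model achieves a correct detection rate of 80.83% with a false detection rate of 13.20%. On a test library of 4624 images, the overall correct detection rate is 88.51% (71.15% for the sensitive class and 91.23% for the normal class).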
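The histogram-based Bayesian skin model described above can be illustrated with a minimal Python sketch. This is not the thesis implementation: the class name `HistogramSkinModel`, its methods, the toy pixel data, and the default threshold `theta=0.5` are all assumptions made for illustration. The only ideas taken from the abstract are counting RGB values separately for skin and non-skin pixels at full 256-level resolution and turning those counts into prior and conditional probabilities.

```python
from collections import Counter


class HistogramSkinModel:
    """Toy Bayes skin classifier over full-resolution RGB histograms.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self):
        self.skin = Counter()      # RGB triple -> count among skin pixels
        self.non_skin = Counter()  # RGB triple -> count among non-skin pixels

    def fit(self, pixels, mask):
        """pixels: iterable of (r, g, b); mask: iterable of bools (True = skin)."""
        for rgb, is_skin in zip(pixels, mask):
            (self.skin if is_skin else self.non_skin)[rgb] += 1

    def posterior(self, rgb):
        """P(skin | rgb) by Bayes' rule; 0.0 for colors never seen in training."""
        n_skin = sum(self.skin.values())
        n_non = sum(self.non_skin.values())
        if n_skin + n_non == 0:
            return 0.0
        prior_skin = n_skin / (n_skin + n_non)                    # P(skin)
        prior_non = 1.0 - prior_skin                              # P(non-skin)
        like_skin = self.skin[rgb] / n_skin if n_skin else 0.0    # P(rgb | skin)
        like_non = self.non_skin[rgb] / n_non if n_non else 0.0   # P(rgb | non-skin)
        evidence = like_skin * prior_skin + like_non * prior_non  # P(rgb)
        return like_skin * prior_skin / evidence if evidence else 0.0

    def is_skin(self, rgb, theta=0.5):
        """Label a pixel as skin when the posterior reaches theta (assumed value)."""
        return self.posterior(rgb) >= theta


# Toy training data: one skin-like color seen 3 times as skin and once as non-skin.
model = HistogramSkinModel()
model.fit(
    [(200, 150, 130)] * 4 + [(30, 30, 30)] * 4,
    [True, True, True, False] + [False] * 4,
)
print(model.posterior((200, 150, 130)))  # 0.75 for this toy data
print(model.is_skin((30, 30, 30)))       # False
```

With count-based priors and likelihoods, the posterior reduces to the share of skin occurrences among all occurrences of a color, which is why a per-color count table is sufficient to drive the classifier in this sketch.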
The Internet is flooded with a rapidly growing volume of erotic and pornographic material, which damages the cleanliness and harmony of the virtual world. Traditional techniques such as IP-based blocking and sensitive-keyword matching no longer work effectively against the spread of this material, and research on image filtering has therefore developed rapidly. This thesis is supported by the 2004 Zhuhai Science and Technology Planning Project "Research on Content-Based Erotic Image Filtering Technique and Its Realization in IE". It studies skin-color detection, the key technology of content-based erotic image filtering, constructs a Bayes classifier model based on a skin-color statistical histogram, and then extracts five features for classifying erotic images.
     Erotic images are characterized by exposed skin, so we use a skin detection model and a texture model to detect skin-color regions and build a binary image, then extract a feature vector, and finally apply a classification algorithm to filter images. We construct a more complete image database, containing a labeled skin-mask bank of 1442 images and a test image bank of 15890 images, and label the images according to the classification strategy. All of the work in this thesis is based on this image bank.
     The main work of the thesis is as follows:
     (1) Construct a database with two types of tables: the statistics table, which stores statistical data, and the tests table, which stores the conditional probabilities of the skin-color detection model. The data in both tables are derived from the pixels of the standard skin-masked image bank, which contains 1442 images and 0.75 billion pixels. The database is built with Microsoft SQL Server 2000 and accessed through ADO (ActiveX Data Objects), and it contains 16,777,216 (256 × 256 × 256) rows of records. These records serve two purposes: they support the study of the distribution of RGB values of skin and non-skin pixels, and they provide data for further tests of the model.
     (2) Research on the skin-color detection model. Skin-color detection seems simple but is complicated by factors such as race, illumination, and noise. Three skin-color detection methods are in common use in the field: the chroma space algorithm, the Bayes classifier algorithm based on a skin-color statistical histogram, and the seed diffusion algorithm based on neighborhood information. This thesis improves on three weaknesses of the histogram-based Bayes classifier algorithm of Jones and Rehg: the construction of the skin and non-skin models from images that contain skin and images that do not, the way pixel statistics are collected, and the use of 32 bins per channel in RGB color space.
     First, we construct two kinds of RGB histogram model from images containing skin, using 256 bins per channel in RGB color space. Based on this more suitable model, we improve the Bayes classifier algorithm. After comparing the automatically generated mask images with the hand-generated mask images, we collect and analyze the omission rate and the false positive rate, derive the required prior probability and conditional probability formulas, and finally build the Bayes classifier model. Next, by comparing the cnt values in the statistics table, we select the relatively valuable records and insert them into the test table as conditional probabilities. The cnt value of a record is the number of times skin pixels occurred for that record. To check the correctness and completeness of the selection, we collect the omission rate and the false positive rate from the generated mask images and build a check table of omission rates. Then, by adding the omitted RGB values of skin pixels, we complete the test table and obtain the final conditional probabilities for the Bayes classifier model based on the RGB histogram.
     After skin detection, we adopt first-order gray-level statistics as the texture model. Skin-colored regions that are not skin (such as a yellow sofa or a yellow woolen blanket) are then masked as non-skin. This decreases the false positive rate and provides valid features for the subsequent classification algorithm.
     Compared with the model of Jones and Rehg, our model reduces the influence of non-skin pixels in skin-labeled images. We estimate the optimal threshold from the equal error rate and choose the threshold θ = 0.07 on our training set. Compared with the 76.55% correct detection rate and 14.59% omission rate of Jones and Rehg's model, our model achieves a correct detection rate of 80.83% and an omission rate of 13.20% on the test set of 1442 images.
     (3) Feature extraction and evaluation for classifying erotic images. Before classification, we extract ten features that are relatively well suited to classification from the mask images and their corresponding original images, evaluate each feature by its capability for classification, and finally select five features to form the classification feature set. To reduce the false positive rate for portrait images, a face detection mechanism is used in the filter. Taking both precision and computing speed into account, we use the face detection mechanism proposed by P. Viola, which combines AdaBoost with a cascade structure and is implemented in OpenCV. The results show that adding face detection to our erotic image classifier improves the precision of the system substantially (by about 10% on our test set).
     (4) Experiments and analysis show that our erotic image classifier can distinguish benign images from erotic images effectively, with an overall precision of about 88.51% (71.15% for erotic images and 91.23% for benign images) on our test set of 4624 images.
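As a consistency check on the figures in item (4), the short Python sketch below works backwards from the three reported rates to the class mix they imply. It assumes the overall rate is the image-weighted average of the two per-class rates; the abstract does not state this, and the function name `implied_positive_share` is a hypothetical helper. Because the reported percentages are rounded, the result is only approximate.

```python
def implied_positive_share(overall, rate_pos, rate_neg):
    """Solve overall = p * rate_pos + (1 - p) * rate_neg for the share p."""
    if rate_pos == rate_neg:
        raise ValueError("per-class rates must differ to identify the share")
    return (overall - rate_neg) / (rate_pos - rate_neg)


# Rates reported in the abstract: overall, erotic class, benign class.
share = implied_positive_share(0.8851, 0.7115, 0.9123)
n_erotic = round(share * 4624)  # implied erotic images among the 4624 test images

print(f"implied erotic share: {share:.3f}")  # about 0.135
print(f"implied erotic images: {n_erotic}")
```

Under that assumption, roughly one test image in seven would be erotic, which would explain why the overall figure sits much closer to the benign-class rate than to the erotic-class rate.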
     Many aspects of our filtering system still need to be improved, such as a more efficient skin-color pixel detection model, the accuracy of the face detection mechanism, and the real-time performance of the system. These are left for future work.
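The threshold selection by equal error rate mentioned in item (2) can be sketched as follows. This is a generic illustration, not the procedure used in the thesis: the function name `equal_error_threshold`, the candidate threshold grid, and the toy scores are assumptions. The idea is to pick the threshold at which the omission rate (skin pixels rejected) and the false positive rate (non-skin pixels accepted) are closest to equal.

```python
def equal_error_threshold(scores, labels, thresholds):
    """Return (threshold, omission_rate, false_positive_rate) at the point
    where the two error rates are closest to each other.

    scores: per-pixel skin scores; labels: True for skin; thresholds: candidates.
    A pixel is accepted as skin when its score is >= the threshold.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one skin and one non-skin sample")
    best = None
    for t in thresholds:
        omission = sum(s < t for s in positives) / len(positives)
        false_pos = sum(s >= t for s in negatives) / len(negatives)
        gap = abs(omission - false_pos)
        if best is None or gap < best[0]:
            best = (gap, t, omission, false_pos)
    if best is None:
        raise ValueError("no candidate thresholds given")
    return best[1:]


# Toy scores: four skin pixels followed by four non-skin pixels.
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
labels = [True] * 4 + [False] * 4
print(equal_error_threshold(scores, labels, [0.25, 0.5, 0.75]))  # (0.5, 0.25, 0.25)
```

On real data the candidate grid would be much finer, and the two rates rarely match exactly, so the sketch returns the closest crossing rather than an interpolated value.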

References

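[1] The 19th Statistical Report on Internet Development in China, http://www.cnnic.cn/index/0E/00/11/.
    [2] Jian Ge (剑歌). Seducing China's Telecom Industry: Managing Sexual Desire in the Internet Age. Communications Market (《通信市场》) magazine, June 2006. http://www.ctm.com.cn/ctmphp/rcp.php?paper_id=189.
    [3] Internet pornography problems and regulation: http://www.e21times.com.
    [4] 网络爸爸 (filtering software): http://baba.tueagles.com.
    [5] 科利华学生浏览器 (Kelihua student browser): http://www.cleverie.com.cn.
    [6] 美萍反黄专家 (Meiping anti-pornography software): http://www.mpsoft.net/shield.htm.
    [7] 火眼金睛 (filtering software): http://www.iflytek.com/.
    [8] 五行卫士 (filtering software): no website; distributed on CD.
    [9] 护花使者 (filtering software): http://www.18ie.com/.
    [10] Tiresias Plugin: http://softbbs.pconline.com.cn/topic.jsp?tid=2737345&pageSize=10.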
    [11] B. Starynkevitch, M. Daoudi et al., POESIA Software Architecture Definition Document. http://www.poesia-filter.org/pdf/Deliverable_3_1.pdf, Deliverable 3.1:7_9, December, 2002.
    [12] D.A. Forsyth, M. Fleck, and C. Bregler. Finding naked people. In Proc. Fourth European Conference on Computer Vision, 1996:593-602.
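    [13] Michael J. Jones and James M. Rehg. Statistical Color Models with Application to Skin Detection. In Proc. of the CVPR '99, vol. 1: 274-280.
    [14] Duan Lijuan, Cui Guoqin, Gao Wen, Zhang Hongming. A multi-level filtering method for specific types of images [J]. Journal of Computer-Aided Design & Computer Graphics, 2002, 14(5): 404-409.
    [15] Hu Guanyu. A study of skin-color-based nude image detection. Master's thesis, National Cheng Kung University, Taiwan, 2004.
    [16] Yang Jinfeng, Fu Zhouyu, Tan Tieniu, Hu Weiming. A novel content-based image recognition and filtering method. Journal on Communications, 2004, 25(7): 93-106.
    [17] Qiushi Keji (求是科技). Visual C++ 6.0 Database Development Technology and Engineering Practice [M]. Posts & Telecom Press, 1st ed., January 2004: 210-300.
    [18] Advanced Database Programming with Visual C++ 6.0. Beijing Hope Electronic Press, 1st ed., January 2002: 264-268.
    [19] Tian Xin. Skin-color models based on different color spaces [J]. Journal of Xi'an Institute of Science and Technology (西安科技学院学报), 2001, 21(4): 369-371.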
    [20] E. Angelopoulou. The Reflectance Spectrum of Human Skin. Technical Report MS-CIS-99-29, University of Pennsylvania, 1999.
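    [21] Color spaces: http://www.ekany.com/wdg98/cg/tutorial/chapter8/lesson8-6.htm, 2005.3.10.
    [22] Han Hai. Skin-color detection in the (r, g) and (Cr, Cb) color spaces [J]. Computer and Modernization, 2003, 90(2): 7-10.
    [23] Yao Hongxun, Liu Mingbao, Gao Wen, et al. A face locating and tracking method based on color-coordinate transformation of color images [J]. Chinese Journal of Computers, 2000, 23(2): 158-165.
    [24] Lei Ming, Zhang Junying, Dong Jiyang. A skin-color detection algorithm under varying illumination [J]. Computer Engineering and Applications, 2004, 24: 123-125.
    [25] Wu Xianghao, Shen Xuanjing. Comparison and study of three pixel-based skin-color detection models [J]. Application Research of Computers, 2003.9 精扩本: 430-432.
    [26] J. Ruiz-del-Solar et al. Skin Detection using Neighborhood Information. 6th Int. Conf. on Face and Gesture Recognition (FGR 2004), pp. 463-468, Seoul, Korea, May 2004.
    [27] Jinfeng Yang, Xuanjing Shen. Research on Key Technologies of Content-Based Erotic Image Filtering and Its Application. In Proc. Fifth International Conference on Machine Learning and Cybernetics. 2006.
    [28] Histograms: http://www.aswiser.org/Article_Show.asp?ArticleID=180.
    [29] Hou Xiaojing. Research on Bayesian classifiers and their application to Web document classification. Master's thesis, Zhengzhou University, 2005: 8-9.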
    [30] R. M. Haralick. Statistical and Structural Approaches to Texture. Proc. of IEEE. 1979, vol. 67, No. 5, pp. 786-804.
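    [31] Digital Image Processing (2nd ed.). Beijing: Publishing House of Electronics Industry, 2nd ed., March 2003: 445-451.
    [32] Yang Jinfeng. Research on and application of key technologies for content-based sensitive image filtering. Master's thesis, Jilin University, 2005.
    [33] J. P. Marques de Sá, translated by Wu Yifei. Pattern Recognition: Concepts, Methods and Applications (Chinese edition). Tsinghua University Press, Beijing, 2002.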
    [34] Y. Freund and R. E. Schapire. Experiments with a New Boosting Algorithm. In Proc. 13th International Conference on Machine Learning, pp. 148-156, Morgan Kaufmann, 1996.
    [35] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1): 119-139, 1997.
    [36] R. Schapire. “A brief introduction to boosting”. In Proc. 16th International Joint Conference on Artificial Intelligence, 1999.
    [37] R. Schapire, Y. Freund, P. Bartlett, and W. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. In Proc. 14th International Conference on Machine Learning, pages 322-330. Morgan Kaufmann, 1997.
    [38] Paul Viola, Michael Jones. Rapid Object Detection using a Boosted Cascade of Simple Features. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Kauai, Hawaii, USA, 2001.
    [39] LIENHART R, MAYDT J. An Extended Set of Haar-Like Features for Rapid Object Detection [J]. IEEE ICIP 2002, 2002, 1: 900-903.
    [40] OpenCV Reference Manual.
