Research on Key Technologies for Network Image Data Capture and Sensitive Image Recognition
Abstract
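The growth and spread of the Internet have brought people unprecedented convenience in accessing information. Advances in Internet technology have, on the one hand, greatly enriched the information available to ordinary network users and, on the other, given producers and distributors of pornography more advanced means and channels of dissemination. Against this background, and supported by the 2004 Zhuhai Science and Technology Project (PC20041101) “Research on Content-Based Sensitive Image Filtering Technology and Its Implementation in the IE Browser”, this thesis studies several key technologies in content-based sensitive image filtering, extracts five efficient classification features, and on that basis builds an effective sensitive image classifier.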
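     This thesis first analyzes the structure of network packets. Having captured the packets that carry images on web pages, it parses their contents according to the network protocols and reassembles and restores the images. To gather statistics on the RGB color distribution of skin and non-skin pixels, an image library of 14338 normal images and 1608 sensitive images was built. Drawing on the characteristics of sensitive images, and building on skin-color detection, ten classification features are extracted from the mask image together with the original image. Experimental evaluation selects the five with the best classification performance to form the feature vector: skin area as a percentage of the image, skin area as a percentage of the region, the largest connected skin region as a percentage of the image, the ratio of the width of the largest connected skin region to the image width, and the mean skin probability. Given the characteristics of sensitive image classification, a support vector machine is used to build the classifier; the RBF kernel is selected and parameters with high classification performance are found experimentally. On a test library of 14792 images the overall correct detection rate reaches 89.40% (76.61% for the sensitive class and 91.47% for the normal class).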
The development and spread of the Internet have made information far more accessible than before. Advances in Internet technology have, on the one hand, greatly enriched the information available to ordinary users and, on the other, given producers and distributors of pornography more advanced means and channels of dissemination. The proliferation of Internet pornography not only seriously harms the physical and mental health of young people but also inconveniences people who use the Internet normally. Traditional techniques such as blocking by IP address or matching sensitive keywords no longer work effectively against the spread of Internet pornography, so image filtering must be integrated to handle the problem more effectively. Supported by the 2004 Zhuhai Science and Technology Planning Project “Research on Content-Based Erotic Image Filtering Technique and its Application in IE”, we study the key technologies of content-based erotic image filtering, extract five features for classifying erotic images, and construct a classifier based on the Support Vector Machine.
     This thesis discusses several key technologies of content-based erotic image filtering. After reviewing existing research results, we design and implement an effective erotic image filter. The main work of the dissertation is as follows:
     (1) Capture and reassembly of packets. On Ethernet the Maximum Transmission Unit (MTU) is 1500 bytes, so an IP packet can carry at most 1480 bytes of data after the 20-byte IP header, and a TCP segment can carry at most 1460 bytes of data after the 20-byte TCP header. Data longer than this maximum is split into pieces. A picture may therefore be transmitted in many packets, which we need to reassemble and eventually restore to the picture.
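     As a rough illustration of the arithmetic above, the following sketch computes the largest TCP payload per Ethernet frame and the minimum number of segments a picture of a given size would need. The function names are hypothetical, the header sizes assume no IP or TCP options, and the HTTP response header (which also occupies part of the first segment) is ignored.

```python
import math

ETHERNET_MTU = 1500  # bytes, as stated in the text
IP_HEADER = 20       # bytes, assuming an IP header without options
TCP_HEADER = 20      # bytes, assuming a TCP header without options


def max_tcp_payload(mtu: int = ETHERNET_MTU) -> int:
    """Largest TCP payload that fits in one frame of the given MTU."""
    return mtu - IP_HEADER - TCP_HEADER


def segments_needed(data_len: int, mtu: int = ETHERNET_MTU) -> int:
    """Minimum number of TCP segments needed to carry data_len bytes."""
    if data_len < 0:
        raise ValueError("data_len must be non-negative")
    return math.ceil(data_len / max_tcp_payload(mtu))


if __name__ == "__main__":
    print(max_tcp_payload())         # payload bytes per segment
    print(segments_needed(100_000))  # segments for a 100 000-byte picture
```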
     First, we parse the packet header according to the IP protocol and use the source and destination addresses to decide whether the packet was sent from the website to our computer. Second, we parse the packet content according to the TCP and HTTP protocols, decide whether the content is a picture, and look up the picture's size, name, type, end marker, and other information; we then reassemble the packet contents with an insertion sort. We use two methods to decide whether the picture data has been received completely. In the first case, the server gives the keyword “Content-Length:”, which indicates the size of the picture, in the HTTP header carried by the first fragment of the picture in the response packets. We read the picture size from “Content-Length:” and sum the data lengths of all packets received for this picture; if the sum equals the picture size, the data is complete, otherwise it is not. In the second case, the server sends the picture without “Content-Length:”, so there is no size information and the integrity of the data cannot be judged by summing the fragment lengths. We found that a seven-byte end marker is appended to the last fragment of the picture, so for a given picture, if the sequence numbers of the packets are continuous from the first fragment to the last, the picture is complete; otherwise it is not.
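     The reassembly and the two completeness checks described above might be sketched as follows. This is not the thesis implementation: the class and field names are hypothetical, payloads are assumed to hold picture data only (HTTP header already stripped), and the seven-byte end marker is assumed here to be the terminator of HTTP chunked transfer coding, which the source does not confirm.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: the seven-byte end marker is the chunked-transfer terminator.
END_MARKER = b"\r\n0\r\n\r\n"


@dataclass
class Segment:
    seq: int        # TCP sequence number of the first payload byte
    payload: bytes  # picture data carried by this segment


@dataclass
class ImageStream:
    content_length: Optional[int] = None  # value of "Content-Length:", if given
    segments: List[Segment] = field(default_factory=list)

    def add(self, seg: Segment) -> None:
        """Insertion-sort step: place seg so the list stays ordered by seq."""
        i = len(self.segments)
        while i > 0 and self.segments[i - 1].seq > seg.seq:
            i -= 1
        if i > 0 and self.segments[i - 1].seq == seg.seq:
            return  # retransmitted duplicate, ignore it
        self.segments.insert(i, seg)

    def is_contiguous(self) -> bool:
        """True if each segment starts where the previous one ended."""
        return all(a.seq + len(a.payload) == b.seq
                   for a, b in zip(self.segments, self.segments[1:]))

    def is_complete(self) -> bool:
        if not self.segments:
            return False
        if self.content_length is not None:
            # Case 1: summed payload length must equal Content-Length.
            return sum(len(s.payload) for s in self.segments) == self.content_length
        # Case 2: no size given; need the end marker and no sequence gaps.
        return (self.segments[-1].payload.endswith(END_MARKER)
                and self.is_contiguous())

    def body(self) -> bytes:
        """Concatenate the payloads in sequence order."""
        return b"".join(s.payload for s in self.segments)
```

     Segments may arrive out of order, so `add` keeps the list sorted as each one arrives; `body` can then be called once `is_complete` returns true.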
     (2) We construct a more complete image database, containing a training image bank of 1154 images and a test image bank of 14792 images, and label the images according to the classification strategy. All the work in this thesis is based on this image bank.
     (3) Research on the skin-color detection model. Skin-color detection seems simple but is in fact complicated, mainly because of factors such as race, illumination, and noise. Three skin-color detection methods are currently in common use in the research field: the chroma space algorithm, the Bayes classifier algorithm based on a skin-color statistical histogram, and the seed diffusion algorithm based on neighborhood information. This thesis chooses the Bayes classifier algorithm based on a skin-color statistical histogram.
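     A histogram-based Bayes skin classifier of the kind named above can be sketched as a likelihood-ratio test on quantized RGB values. The bin count (32 per channel), the threshold, and all names below are assumptions for illustration; the source does not give the parameters of its model.

```python
from collections import Counter
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


class HistogramSkinModel:
    """Skin / non-skin color histograms with a likelihood-ratio decision."""

    def __init__(self, bins_per_channel: int = 32):
        if 256 % bins_per_channel != 0:
            raise ValueError("bins_per_channel must divide 256")
        self.step = 256 // bins_per_channel
        self.skin: Counter = Counter()
        self.non_skin: Counter = Counter()
        self.n_skin = 0
        self.n_non_skin = 0

    def _bin(self, p: Pixel) -> Tuple[int, int, int]:
        r, g, b = p
        return (r // self.step, g // self.step, b // self.step)

    def train(self, skin_pixels: Iterable[Pixel],
              non_skin_pixels: Iterable[Pixel]) -> None:
        """Accumulate labeled pixels into the two histograms."""
        for p in skin_pixels:
            self.skin[self._bin(p)] += 1
            self.n_skin += 1
        for p in non_skin_pixels:
            self.non_skin[self._bin(p)] += 1
            self.n_non_skin += 1

    def likelihood_ratio(self, p: Pixel) -> float:
        """P(color | skin) / P(color | non-skin), estimated from the counts."""
        k = self._bin(p)
        p_skin = self.skin[k] / self.n_skin if self.n_skin else 0.0
        p_non = self.non_skin[k] / self.n_non_skin if self.n_non_skin else 0.0
        if p_non == 0.0:
            return float("inf") if p_skin > 0.0 else 0.0
        return p_skin / p_non

    def is_skin(self, p: Pixel, threshold: float = 1.0) -> bool:
        """Label a pixel as skin when the ratio reaches the threshold."""
        return self.likelihood_ratio(p) >= threshold
```

     Applying `is_skin` to every pixel of an image yields a binary mask, and the ratio itself can serve as a per-pixel skin score.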
     (4) Feature extraction and evaluation for classifying erotic images. Before classification we extract ten features that are favorable to classification from the mask image and the corresponding original image, evaluate the classification capability of each, and then select five features as our feature set.
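     To make the mask-based features concrete, the sketch below computes three quantities from a binary skin mask: the skin area as a fraction of the image, the largest connected skin region as a fraction of the image, and the width of that region relative to the image width. The function name, the dictionary keys, and the use of 4-connectivity are assumptions; the thesis may define its features differently.

```python
from collections import deque
from typing import Dict, List

Mask = List[List[int]]  # rows of 0/1 values, 1 = skin pixel


def mask_features(mask: Mask) -> Dict[str, float]:
    """Compute simple skin-mask features (4-connected regions assumed)."""
    h = len(mask)
    w = len(mask[0]) if h else 0
    if h == 0 or w == 0:
        raise ValueError("mask must be non-empty")

    seen = [[False] * w for _ in range(h)]
    skin_total = 0
    best_area = 0
    best_width = 0

    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            skin_total += 1
            if seen[y][x]:
                continue
            # Breadth-first search over one connected skin region.
            seen[y][x] = True
            queue = deque([(y, x)])
            area = 0
            min_x = max_x = x
            while queue:
                cy, cx = queue.popleft()
                area += 1
                min_x = min(min_x, cx)
                max_x = max(max_x, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area > best_area:
                best_area = area
                best_width = max_x - min_x + 1

    return {
        "skin_area_ratio": skin_total / (h * w),
        "largest_region_ratio": best_area / (h * w),
        "largest_region_width_ratio": best_width / w,
    }
```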
     (5) Construction of the classifier. Common classification methods include clustering, Bayesian methods, neural networks, k-nearest neighbors, the Fisher linear discriminant, and the Support Vector Machine (SVM). An SVM classifies by constructing an optimal separating hyperplane in the feature space, which suits our problem of dividing images into sensitive and non-sensitive by their feature vectors. The SVM is built on statistical learning theory and the principle of structural risk minimization (SRM), does not require prior knowledge of the specific problem, and works well with limited training samples, so we ultimately choose it to construct the image classifier. After evaluating four kernel functions (linear, polynomial, Gaussian, and sigmoid), we choose the Gaussian function (Radial Basis Function, RBF) because it has the following advantages: first, the RBF maps the data into a high-dimensional space and so can handle a nonlinear relationship between labels and attributes; second, the sigmoid function behaves much like the RBF for certain parameters; third, model selection is harder for the polynomial function because it has more parameters than the RBF.
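     For reference, the Gaussian (RBF) kernel is commonly written as K(x, y) = exp(-gamma * ||x - y||^2). A minimal sketch follows; the function name is hypothetical and the symbol gamma for the width parameter is the usual convention, not something the source specifies.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Gaussian (RBF) kernel value exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

     Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart, faster for larger gamma.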
     With the RBF kernel there are two parameters to tune, and different parameter values give different recognition accuracy. To find the best parameters we used m-fold cross-validation and obtained parameters with a high recognition rate.
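     Parameter selection by m-fold cross-validation can be sketched as a grid search. The sketch assumes the two parameters are the SVM penalty C and the kernel width gamma, which is the usual pairing for an RBF-kernel SVM but is not stated in the source. Training is left to a caller-supplied `score` callback (hypothetical) that fits a classifier on the training indices and returns its accuracy on the held-out indices.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

# score(train_idx, test_idx, c, gamma) -> accuracy on the held-out fold
Scorer = Callable[[List[int], List[int], float, float], float]


def m_fold_indices(n: int, m: int) -> List[List[int]]:
    """Split indices 0..n-1 into m folds whose sizes differ by at most one."""
    if not 1 < m <= n:
        raise ValueError("need 1 < m <= n")
    return [list(range(i, n, m)) for i in range(m)]


def grid_search(n: int, m: int,
                c_values: Sequence[float],
                gamma_values: Sequence[float],
                score: Scorer) -> Tuple[float, float, float]:
    """Return (c, gamma, mean accuracy) with the best m-fold mean accuracy."""
    if not c_values or not gamma_values:
        raise ValueError("parameter grids must be non-empty")
    folds = m_fold_indices(n, m)
    best = None
    for c, gamma in product(c_values, gamma_values):
        total = 0.0
        for k, held_out in enumerate(folds):
            # Train on every fold except the k-th, test on the k-th.
            train = [i for j, f in enumerate(folds) if j != k for i in f]
            total += score(train, held_out, c, gamma)
        mean = total / m
        if best is None or mean > best[2]:
            best = (c, gamma, mean)
    return best
```

     A coarse exponential grid such as powers of two for both parameters is a common starting point, refined around the best cell afterwards.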
     Experiments and analysis show that our erotic image classifier distinguishes benign and erotic images effectively, with an overall recognition rate of about 89.39% on our test set (76.61% for erotic images and 91.47% for benign images).
     Many parts of our filtering system still need to be improved, such as a more efficient skin-color pixel detection model and the detection of human faces, human bodies, and specific body parts. These are left as future work.
