Research on a Topic-Oriented Web Page Filtering Mechanism (面向主题的网页过滤机制研究)


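English title: Research on Scheme of Topic-Specific Web Pages Filtering
Author: Zhang Haibo (张海波)
Degree level: Master's
Discipline: Computer Software and Theory
Chinese keywords: 网页 (web pages) ; 主题信息 (topic information) ; 过滤 (filtering) ; 神经网络 (neural network)
English keywords: Web Pages ; Topic Information ; Filtering ; Neural Network
Degree year: 2007
Advisor: Meng Yingjie (蒙应杰)
Discipline code: 081202
Degree-granting institution: Lanzhou University
Thesis submission date: 2007-04-01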

Abstract

With the growing popularity and rapid development of the Internet, people depend on the network more and more, and it has changed the way they look for information. However, the Internet's openness, equality, and lack of boundaries have also led to unrestricted abuse: large amounts of junk and sensitive information flood the network and dilute the useful information. How to filter out this unwanted information and eliminate its negative influence has become one of the key problems that Internet information services must solve. Information filtering is the most effective technical means of addressing it, and machine learning techniques have recently been applied in many studies to classify documents automatically in support of filtering.
     Building on a study of the general principles of information filtering and of common web page filtering techniques, this thesis takes a requirements-driven, function-oriented approach to propose and construct a topic-based web page filtering architecture, and studies it in depth. The main work and contributions are as follows:
     First, the thesis analyzes the information streams currently transmitted over the Internet, classifies the information that needs to be filtered according to filtering requirements, and clearly defines the topic domains under study. On this basis it designs a topic-specific information filtering system (TSIFS). The system adopts a layered web page filtering strategy and introduces neural network techniques into the classification scheme for information filtering, using the learning capacity and adaptability of neural networks to compensate for the shortcomings of general filtering mechanisms and thereby improve filtering accuracy.
     Second, for ease of processing, a normalization strategy transforms the multiple types of data contained in a web page into text. This transformation takes the filtering features of topic information into account: text vectors are represented using a topic-specific vocabulary and a manually edited dictionary, the issue of classification efficiency degradation is raised, and a new feature-word weighting function is designed. In addition, an algorithm for identifying a page's character encoding is proposed and designed.
     Third, a BP (back propagation) network is used to build a neural-network-based classification model for filtered information, and the processing mechanism of the TSIFS filtering engine is constructed. Key issues such as normalization of input vectors and selection of network parameters are discussed in detail.
     Finally, simulation experiments are used to verify and analyze the feasibility, effectiveness, and accuracy of the proposed topic-based filtering system.
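
The abstract names the stages of the filtering pipeline but not their details. As a rough illustration only, the sketch below strings together a lexicon-based text vector, unit-length normalization of the input vector, and a small back propagation network. The lexicon, the training pages, the network size, the learning rate, and the names `vectorize`, `BPNetwork`, and `should_block` are all assumptions made for this example, and plain term counts stand in for the thesis's own feature-word weighting function, which this record does not reproduce.

```python
import math
import random

# Hypothetical topic lexicon; the thesis's vocabulary and dictionary are not given here.
LEXICON = ["casino", "bet", "jackpot", "lecture", "library", "thesis"]


def vectorize(text):
    """Count lexicon terms in the text and scale the vector to unit length."""
    words = text.lower().split()
    counts = [float(words.count(term)) for term in LEXICON]
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm > 0 else counts


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class BPNetwork:
    """One hidden layer of sigmoid units, trained by plain back propagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each row holds the weights of one hidden unit plus a trailing bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0])))
                  for row in self.w_hidden]
        output = sigmoid(sum(w * v for w, v in zip(self.w_out, hidden + [1.0])))
        return hidden, output

    def train(self, samples, epochs=2000, rate=0.5):
        for _ in range(epochs):
            for x, target in samples:
                hidden, output = self.forward(x)
                # Output delta for squared error with a sigmoid output unit.
                delta_out = (target - output) * output * (1.0 - output)
                # Hidden deltas use the output weights before they are updated.
                delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                                for j, h in enumerate(hidden)]
                for j, v in enumerate(hidden + [1.0]):
                    self.w_out[j] += rate * delta_out * v
                for j, row in enumerate(self.w_hidden):
                    for i, v in enumerate(x + [1.0]):
                        row[i] += rate * delta_hidden[j] * v

    def should_block(self, text, threshold=0.5):
        """Return True when the network scores the page at or above the threshold."""
        return self.forward(vectorize(text))[1] >= threshold


# Made-up training pages: label 1.0 means "filter out", 0.0 means "let through".
TRAINING = [
    ("casino jackpot bet bet", 1.0),
    ("bet on the jackpot at the casino", 1.0),
    ("library lecture on thesis writing", 0.0),
    ("thesis lecture at the library", 0.0),
]

net = BPNetwork(n_in=len(LEXICON), n_hidden=3)
net.train([(vectorize(text), label) for text, label in TRAINING])
```

Under these assumptions, `net.should_block("casino bet jackpot")` would be expected to flag the page, while `net.should_block("library thesis")` would not. A usable filter would need a far larger lexicon and training set, and the choice of hidden-layer size and learning rate is exactly the kind of parameter selection the abstract says the thesis discusses.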

