Research on Ontology-Based Web Page Text Classification
Abstract
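Most traditional text classification methods represent a document with word-frequency statistics and classify it using a Vector Space Model (VSM) of weighted keywords. They generally lack semantic guidance, so the resulting document representation is merely a pile of words. To overcome the limitations of keyword matching in traditional methods and to make full use of the semantic information in web page text, this paper introduces the domain ontology WordNet, integrates linguistic knowledge into the vector space representation of text, and proposes an ontology-based web page text classification algorithm together with a framework for implementing the system. The algorithm considers the actual content of a document from a semantic perspective. It relies on the concept hierarchy in WordNet, on WordNet's detailed description of the relations between concepts (that is, their semantics), and on other ontology-related methods to compute the semantic similarity between features. It performs semantic expansion to reduce the dimensionality of the text features, and it merges similar features to lessen the effect that keeping them separate has on the classification result; the classifier is built on this basis. Whereas traditional methods compute similarity solely from the statistics of the data itself, this method combines the semantic relations between concepts with the observed statistics. This helps model the real world more accurately and uncover implicit regularities or patterns, so that the classification results are closer to human understanding and more accurate. Experiments confirm the effectiveness of the method.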
Traditional text classification methods mostly represent a text by term frequency and classify it by computing term weights in the Vector Space Model. They therefore cannot bring useful semantic information into the classification process, and the representation of the text is only a set of words without any semantic information. To overcome this limitation of classic text classification methods and to make full use of the semantic information in the text, this paper introduces WordNet, represents the text with linguistic knowledge, and proposes an ontology-based web document classification algorithm together with its system framework. The algorithm takes semantic information into consideration and uses WordNet, along with other ontology-related methods, to construct the classifier and to calculate the similarity of feature values at different levels of abstraction, improving on the classic similarity calculation, which uses only statistical information from the data. The method combines this statistical information with the semantic relations between concepts, models the real world more accurately, and tries to uncover implicit rules or patterns, so the results are closer to human understanding and more accurate. Experiments demonstrate its effectiveness.
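The idea of merging semantically similar features before classification can be illustrated with a minimal sketch. The abstract does not give the similarity formula or the classifier, so everything concrete here is an assumption: the toy `HYPERNYM` table is a hypothetical stand-in for WordNet's concept hierarchy, the path-based `similarity` measure and the `0.3` threshold are illustrative choices, and the cosine nearest-profile `classify` function is one simple way a classifier could be built on the merged features.

```python
import math
from collections import Counter

# Hypothetical toy is-a hierarchy (child -> parent), standing in for WordNet.
HYPERNYM = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "apple": "fruit", "banana": "fruit",
    "vehicle": "entity", "fruit": "entity",
}

def path_to_root(term):
    """Return the chain term -> parent -> ... -> root in the toy hierarchy."""
    path = [term]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path

def similarity(a, b):
    """Assumed path-based measure: 1 / (1 + edges between a and b), 0 if unrelated."""
    pa, pb = path_to_root(a), path_to_root(b)
    index_in_b = {node: j for j, node in enumerate(pb)}
    for i, node in enumerate(pa):
        if node in index_in_b:
            return 1.0 / (1 + i + index_in_b[node])
    return 0.0

def assign(term, reps, threshold):
    """Return the most similar representative feature, or None if none is close enough."""
    best = max(reps, key=lambda r: similarity(term, r), default=None)
    if best is not None and similarity(term, best) >= threshold:
        return best
    return None

def build_representatives(vocabulary, threshold=0.3):
    """Greedily keep a term as a new feature only if no existing feature is similar."""
    reps = []
    for term in vocabulary:
        if assign(term, reps, threshold) is None:
            reps.append(term)
    return reps

def to_vector(tokens, reps, threshold=0.3):
    """Count tokens after folding each one into its representative feature."""
    vec = Counter()
    for tok in tokens:
        rep = assign(tok, reps, threshold)
        vec[rep if rep is not None else tok] += 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v.get(term, 0) for term, count in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def classify(tokens, profiles, reps, threshold=0.3):
    """Pick the class whose merged-feature profile is closest to the document."""
    doc = to_vector(tokens, reps, threshold)
    return max(profiles, key=lambda label: cosine(doc, profiles[label]))

reps = build_representatives(["car", "automobile", "truck", "apple", "banana"])
profiles = {
    "vehicles": to_vector(["car", "truck", "car"], reps),
    "fruit": to_vector(["apple", "banana"], reps),
}
label = classify(["automobile", "automobile", "banana"], profiles, reps)
```

In this toy setup the five vocabulary terms collapse to two features, and a document that mentions only "automobile" still overlaps with a class profile built from "car" and "truck", which plain keyword matching would miss.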
