Research on Metadata-Based Web Information Extraction Methods
Abstract
Web information extraction is currently an active research topic, but the volume, heterogeneity, and dynamism of web data remain major obstacles. For structured web data, relatively mature solutions already exist; unstructured web data, however, cannot be handled by the underlying machinery of traditional databases, so a method for processing unstructured data is urgently needed. To address this problem, many researchers have proposed building metadata for web data, which can turn unstructured data into structured or semi-structured data. But because web data takes so many forms, it is difficult to define a single, uniform metadata standard that covers all of it.
    This thesis builds a Dublin Core text-metadata table for the textual portion of web data, giving structure to web text, which is otherwise unstructured. The web text metadata is divided into descriptive metadata and semantic metadata; the descriptive metadata is obtained directly by analyzing the HTML source file. The main work of this thesis consists of the following four parts:
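The metadata table described above can be pictured as a record with two groups of elements. The sketch below uses standard Dublin Core element names; the split into descriptive versus semantic groups follows the thesis, but the sample values are illustrative assumptions.

```python
# Descriptive metadata: obtained directly from the HTML source file.
descriptive = {
    "DC.title":   "Research on Metadata-Based Web Information Extraction",
    "DC.creator": "unknown",   # author, extracted from the text stream
}
# Semantic metadata: the main extraction target of the thesis.
semantic = {
    "DC.subject":     [],      # topic keywords, via fuzzy similarity
    "DC.type":        "",      # genre, via text classification
    "DC.description": [],      # content sentences, via clustering
}
record = {**descriptive, **semantic}   # the full text-metadata record
```

Filling in the descriptive group is straightforward HTML analysis; the four parts listed next are about filling in the semantic group.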
    1. The HTML source file is analyzed to separate the tag stream from the text stream. From the tag stream, the title metadata element is extracted; from the text stream, the text is formalized as a matrix model, on top of which the creator (author) metadata element is extracted.
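The tag-stream/text-stream separation described in step 1 can be sketched with Python's standard `html.parser`. This is a minimal illustration, not the thesis's actual implementation; the class and attribute names are my own.

```python
from html.parser import HTMLParser

class StreamSplitter(HTMLParser):
    """Split an HTML source into a tag stream and a text stream,
    and pull the <title> (the DC.title candidate) out of the tag stream."""
    def __init__(self):
        super().__init__()
        self.tags, self.texts = [], []   # tag stream / text stream
        self._in_title = False
        self.title = ""
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "title":
            self._in_title = True
    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif data.strip():
            self.texts.append(data.strip())

p = StreamSplitter()
p.feed("<html><head><title>Demo</title></head>"
       "<body><p>First line.</p></body></html>")
```

After `feed()`, `p.tags` holds the tag stream, `p.texts` the text stream, and `p.title` the title element; the matrix model of the thesis would then be built over `p.texts`.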
    2. Using fuzzy mathematics, a fuzzy set of text states and a fuzzy similarity matrix are built for the text, from which the subject-keyword metadata element is extracted; the genre (type) metadata element is extracted following the basic idea of text classification.
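A fuzzy similarity matrix as used in step 2 is a square matrix whose entries lie in [0, 1], with 1s on the diagonal (reflexive) and symmetric off-diagonal entries. The word-overlap measure below is a simple stand-in for the thesis's fuzzy membership function, which is not specified here; the sentences are illustrative.

```python
def fuzzy_similarity(a, b):
    """Word-overlap (Jaccard) similarity in [0, 1]; an assumed
    stand-in for the thesis's fuzzy similarity measure."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

sentences = [
    "metadata makes web text structured",
    "web text metadata extraction",
    "clustering selects content sentences",
]
# Fuzzy similarity matrix R: reflexive (R[i][i] = 1) and symmetric.
R = [[fuzzy_similarity(si, sj) for sj in sentences] for si in sentences]
```

Row sums of such a matrix give a rough importance score per sentence, and high-weight terms shared across similar sentences are natural subject-keyword candidates.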
    3. To extract the description (content) metadata element, three steps are taken. First, the fuzzy similarity matrix is used to prune lengthy sentences, producing the candidate sentence set WJH1. Next, within WJH1, fuzzy sequential decision theory is used to prune lengthy paragraphs, producing the candidate sentence set WJH2. Finally, WJH2 is clustered using flat clustering and the C-means clustering algorithm, the weakly related sentences in each cluster are discarded, and the remaining sentences form the description metadata element.
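The clustering stage of step 3 can be sketched as a hard C-means (k-means-style) loop over sentence vectors: assign each vector to its nearest centroid, recompute centroids, repeat. This is a toy sketch under assumed 2-D vectors and deterministic initialization, not the thesis's actual configuration.

```python
def c_means(vectors, c, iters=20):
    """Hard c-means clustering: assign each vector to its nearest
    centroid (squared Euclidean distance), then recompute centroids."""
    centroids = [list(v) for v in vectors[:c]]   # simple deterministic init
    for _ in range(iters):
        clusters = [[] for _ in range(c)]
        for v in vectors:
            dists = [sum((x - y) ** 2 for x, y in zip(v, m)) for m in centroids]
            clusters[dists.index(min(dists))].append(v)
        # Recompute each centroid as the mean of its cluster (keep old if empty).
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

# Two tight groups of toy 2-D "sentence vectors" standing in for WJH2.
vecs = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]]
clusters = c_means(vecs, 2)
```

In the thesis's pipeline, each resulting cluster would then be filtered: sentences with low similarity to the rest of their cluster are discarded, and the survivors form DC.description.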
    4. Experimental results show that the system fills in the semantic metadata elements with good accuracy.
Web information extraction is currently an active research field, but the volume, heterogeneity, and dynamism of web data make it difficult. Web data can be divided into two kinds: structured data and unstructured data. Relatively mature methods exist for handling structured data. However, because the underlying machinery of traditional databases cannot handle unstructured data, a method for processing it is needed. Many researchers have proposed web metadata to solve this problem: web metadata can transform unstructured data into structured data, although it is difficult to construct a single metadata standard for all web data. This thesis constructs a Dublin Core metadata scheme for web text, converting web text, which is unstructured, into structured data.
    In this thesis, the Dublin Core metadata is divided into descriptive metadata and semantic metadata. The descriptive metadata is filled in directly from the HTML source; the main work of this thesis is filling in the semantic metadata.
    (1) On the basis of the HTML source, we extract DC.title. To extract the semantic metadata, we construct a matrix model for the web text, from which DC.creator can be filled in.
    (2) On the basis of the matrix model, we apply fuzzy mathematics to fill in DC.subject and DC.type.
    (3) Extracting DC.description is a difficult part of this thesis, and we divide it into three steps. First, we prune lengthy sentences using the fuzzy similarity matrix, forming the DC.description candidate sentence set WJH1. Second, we prune lengthy paragraphs using fuzzy sequential decision theory, forming the candidate sentence set WJH2. Last, we cluster WJH2 using flat clustering and C-means clustering.
    (4) Experimental results show that our information extraction system fills in the semantic metadata with good performance.
