Research on Web News Content Extraction Based on Tag Path Features

English title: Extracting Web News Using Tag Path Features
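Author: Wu Gongqing (吴共庆)
Degree level: Doctoral
Discipline: Computer Application Technology
Keywords: Information Extraction ; Web News ; Distinguishing Path Pattern Mining ; Tag Path Feature ; NP-Complete Problem ; On-Line Extraction
Degree year: 2012
Supervisors: Wu Xindong (吴信东) ; Hu Xuegang (胡学钢)
Discipline code: 081203
Degree-granting institution: Hefei University of Technology (合肥工业大学)
Thesis submission date: 2012-08-01
Defense committee chair: Zeng Zhigang (曾志刚)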

Abstract

Web news extraction is an important step in intelligent Web information processing. It lays the foundation for research and applications in intelligence acquisition and security, online public opinion monitoring, personalized recommendation for mobile users, integration of heterogeneous Web data, information retrieval, and search engines. Research on the key problems of Web news extraction therefore has both scientific and practical value.
     Case studies and further investigation show that many news sites share similar layout structures and styles, and that the content layout of a Web page is implicitly correlated with the tag paths of its parse tree. Traditional path expressions are too rigid to adapt to slight changes in HTML document structure, which lowers extraction accuracy. In addition, Web news pages are massive and heterogeneous, which challenges the generality of both handcrafted wrappers and wrappers based on rule learning. Motivated by these observations, this dissertation studies Web news extraction using tag path features. The work has two parts. For specific websites, it studies high-precision extraction models and methods based on tag path pattern knowledge. For the open environment, it studies generic extraction models and methods based on tag path features.
     The main contributions of this dissertation are as follows:
     (1) Building on the implicit correlation between page layout and the tag path patterns of the parse tree, we propose PP-WNE, a Web news extraction model that uses distinguishing tag path patterns as its extraction knowledge. We define the distinguishing tag path pattern, a special kind of tag path pattern suited to Web news extraction, and give a mining method for such patterns, which solves the problem of constructing the extraction knowledge base. On pages randomly selected from Chinese and English news sites, and with a suitably chosen noise-tolerance threshold, the method achieves an F-score above 98%, which supports the feasibility and effectiveness of tag path patterns for Web news extraction.
     (2) To optimize the size of the knowledge base in PP-WNE, we formulate the distinguishing tag-path-pattern covering problem and prove that it is NP-complete. To obtain a near-optimal solution, we define the minimal distinguishing tag path pattern and design MPM, a polynomial-time (ln|n| + 1)-approximation algorithm, where n is the number of positive examples in the training sample. Experiments on test datasets show that MPM effectively reduces the set of distinguishing tag path patterns while keeping precision, recall, and F-score all above 98% under both node-level and text-level evaluation criteria.
     (3) To meet the needs of Web news extraction in the open environment, we design the text-to-tag-path ratio (TTPR) feature and describe how to compute it by traversing the nodes of a page's parse tree. We propose CEPR, a threshold method that separates content from non-content using the TTPR histogram and thereby addresses on-line Web news extraction. We further propose a Gaussian smoothing method weighted by tag path edit distance, which improves CEPR's ability to extract short text and filters out non-news material embedded in the news content. CEPR is fast, general, and requires no training, and it can handle Web pages from many sources, in many styles, and in many languages. On the CleanEval datasets, CEPR outperforms CETR and other state-of-the-art extraction methods in most cases.
     (4) We design and implement NFaS, a filtering and summarization system for HTML news pages. NFaS identifies Web news pages automatically by combining URL features, page structure features, and content attribute features. It uses Web news extraction to filter the pages. It summarizes the news by extracting keywords based on semantic relations between words: lexical chains are used to build a graph of these relations, from which high-quality keywords are extracted. Evaluation on test datasets confirms the effectiveness of NFaS.
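
The (ln|n| + 1) bound quoted for MPM has the same form as the guarantee of the textbook greedy heuristic for set cover. The abstract does not describe MPM's internals, so the sketch below is only that generic greedy heuristic, not MPM itself. It assumes each candidate pattern has already been reduced to the set of positive examples it covers. The function name `greedy_cover`, the pattern strings, and the node ids are hypothetical and chosen purely for illustration.

```python
from typing import Dict, Hashable, List, Set


def greedy_cover(universe: Set[Hashable],
                 candidates: Dict[str, Set[Hashable]]) -> List[str]:
    """Pick candidate sets greedily until every element of `universe` is covered."""
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        # Take the candidate that covers the most still-uncovered elements.
        # Iterating over sorted names makes tie-breaking deterministic.
        best = max(sorted(candidates),
                   key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("candidates cannot cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical data: ids 1..6 stand for positive (news content) nodes,
# and each pattern maps to the positive nodes it matches.
positives = {1, 2, 3, 4, 5, 6}
patterns = {
    "html/body/div/p": {1, 2, 3, 4},
    "html/body/div/*/p": {3, 4, 5},
    "html/body/div/span": {5, 6},
    "html/body/td": {1, 6},
}
print(greedy_cover(positives, patterns))
```

On this toy input the heuristic keeps two of the four patterns. The real covering problem additionally requires that the kept patterns distinguish content from non-content, which this sketch leaves out.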

