Research on Automatic and Efficient Web Information Extraction Technologies
Abstract
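With the explosive growth and spread of the Internet, web information has become a valuable data resource. The massive volume of web data has opened a new era for data analysis and mining systems, and more and more web applications need to extract, mine, and integrate structured data from different data sources. However, because web documents are semi-structured, the data presented on web pages usually cannot be extracted and understood automatically by machines; the goal of web information extraction research is therefore to extract the structured data of web pages. The massive scale and high heterogeneity of Internet data pose great challenges for web information extraction.
     Focusing on the massive scale and high heterogeneity of web information, this dissertation studies automatic and efficient web information extraction techniques at two levels, data record extraction and data unit extraction. The research covers the following four aspects: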
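     1. To address the high heterogeneity of web information, a new automatic data record extraction method based on anchor trees, MiBAT (Mining data records Based on Anchor Trees), is proposed. The dissertation first shows that existing automatic methods based on similarity detection do not achieve satisfactory results when data records contain a certain amount of irregular content (e.g., user-generated content). It then introduces the concept of a pivot, which corresponds to certain key data units in a data record. For example, every record created and posted by a user (e.g., an online forum post or a user review) contains the post time as a key data unit, which can serve as a pivot obtained from a domain constraint. MiBAT uses domain constraints to detect pivots and then extracts data records automatically around the DOM (Document Object Model) subtrees that contain the pivots. Experiments show that, compared with previous automatic data record extraction methods, MiBAT copes better with the irregularity of data records and achieves higher extraction accuracy.
     2. To address the massive scale of web information at the data record level, a fast and efficient anchor tree finding algorithm is proposed. Traditional web mining algorithms enumerate DOM subtrees top-down; with an anchor tree finding algorithm designed this way, the time complexity of MiBAT is O(n²), where n is the number of nodes in the DOM tree of the input page. The dissertation proposes a new anchor tree finding algorithm based on bottom-up aggregation of tag paths, which reduces the time complexity of MiBAT to O(n log n). Experiments show that the new algorithm greatly improves the running efficiency of MiBAT while maintaining high extraction accuracy.
     3. To address the cross-domain heterogeneity of web information, a method for detecting generic pivots that does not rely on domain constraints is proposed. The concept of a pivot originally derives from domain constraints and corresponds to domain-dependent data units. In practice, the corresponding domain constraints must be specified in advance for each domain, which limits the automatic application of MiBAT to some extent. The dissertation extends the concept to generic pivots and gives methods for detecting and applying them. Experiments show that, with generic pivots, MiBAT can be applied to extraction tasks in different domains with high accuracy and without manually specified domain constraints.
     4. To address the massive scale of web information at the data unit level, a fast and efficient DOM tree matching algorithm is studied and applied to data unit extraction and alignment. The widely used existing tree matching method has complexity O(n²) and is not suited to web-scale extraction. The dissertation proposes a new method based on the longest common subsequence (LCS) of tag path sequences. By exploiting the sparsity of the LCS problem, the algorithm runs in O(r log n) time, where r is the number of node pairs from the two trees that have identical tag paths; when candidate matches between the two trees are sparse, r ≈ O(n) and the complexity approaches O(n log n). Experiments show that, compared with the widely used existing DOM tree matching method, the proposed method runs more efficiently while achieving largely consistent tree matching accuracy and data unit alignment accuracy.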
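     In summary, this dissertation proposes automatic and efficient web information extraction methods at the two levels of data record extraction and data unit extraction. The methods adapt well to the high heterogeneity and massive scale of web information and have considerable theoretical and practical value.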
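     As a rough illustration of how a domain constraint can surface candidate pivots (key data units such as the post date), the following Python sketch scans a page for text nodes that look like dates and groups them by tag path. It is not the MiBAT algorithm itself: the date pattern, the class name PivotFinder, the function find_pivots, and the sample page are all assumptions made for this example, and the sketch stops before any anchor tree is built.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumed domain constraint: a post date written as YYYY-MM-DD.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# Void elements never get an end tag, so they are kept off the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PivotFinder(HTMLParser):
    """Collects text nodes that satisfy the date constraint, keyed by tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # open tags from the root down
        self.pivots = defaultdict(list)  # tag path -> matching text nodes

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags in between.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and DATE_RE.search(text):
            self.pivots["/".join(self.stack)].append(text)


def find_pivots(html):
    """Return {tag path: [text nodes matching the date constraint]}."""
    finder = PivotFinder()
    finder.feed(html)
    finder.close()
    return dict(finder.pivots)


page = """
<html><body>
<div class="post"><span>2010-04-26</span><p>First post, back on 2010-05-01!</p></div>
<div class="post"><span>2010-04-27</span><p>A reply with free text only.</p></div>
</body></html>
"""
candidates = find_pivots(page)
# Heuristic (an assumption of this sketch): the tag path that matches most
# often is the likeliest pivot path; a date inside free text matches only once.
pivot_path = max(candidates, key=lambda path: len(candidates[path]))
print(pivot_path, candidates[pivot_path])
```

     The date mentioned inside the first post body also satisfies the constraint, which is why a raw pattern match alone is not enough: some structural check is still needed to separate real pivots from look-alikes in user-generated text.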
The World Wide Web has become an important information resource owing to its explosive growth and spread over the past two decades. The tremendous amount of web data has opened a new era for data analysis and mining systems. More and more web applications need to extract, mine, and integrate data from enormous data sources. However, because web pages are semi-structured, the data they present is not directly consumable by machines. Web information extraction aims to extract structured data from web pages, which is very challenging because web information is large-scale and highly heterogeneous.
     To handle the large-scale and highly heterogeneous characteristics of web information, this dissertation studies automatic and efficient technologies for web information extraction at two levels: data records and data units. The research includes:
     1. Targeting the high heterogeneity of web information, a novel automatic data record extraction method called MiBAT (Mining data records Based on Anchor Trees) is proposed. Existing similarity-based automatic approaches cannot extract web data records accurately when a large amount of unstructured content exists (e.g., user-generated content). This dissertation presents the concept of pivots, which correspond to some key data units. For example, almost every data record created and posted by users (e.g., online forum posts, user reviews, etc.) contains the publication date as a key data unit, which is a pivot derived from domain constraints. The proposed MiBAT method detects pivots based on domain constraints, identifies anchor trees, which are DOM (Document Object Model) sub-trees containing the pivots, and finally extracts data records around the anchor trees automatically. Experimental results show that, compared to existing approaches, MiBAT is able to overcome the irregularity of data records caused by unstructured content, resulting in high accuracy.
     2. Targeting the large scale of web information at the level of data records, a fast and efficient anchor tree finding algorithm is proposed. In the web mining community, the traditional approach is to enumerate sub-trees in a top-down manner; following this approach, the time complexity of MiBAT is O(n²), where n is the number of nodes in the DOM tree of the web page. This dissertation presents a novel anchor tree finding algorithm based on aggregating tag paths in a bottom-up fashion, which enables MiBAT to run in O(n log n) time. Experimental results demonstrate that the new method significantly improves the efficiency of MiBAT while maintaining high accuracy.
     3. Targeting the cross-domain heterogeneity of web information, the concept of generic pivots is proposed. The concept of pivots originates from domain constraints, corresponding to some domain-dependent key data units. In real applications, different domain constraints must be identified for different domains, which limits the applicability of MiBAT to some extent. To resolve this domain dependency, the dissertation expands the concept of pivots and proposes generic pivots. Experimental results suggest that, when using generic pivots, MiBAT is applicable to different domains and achieves high accuracy without any pre-defined domain constraints.
     4. Targeting the large scale of web information at the level of data units, a fast and efficient method for DOM tree matching is proposed for data unit alignment and extraction. The most widely used tree matching algorithm runs in O(n²) time, which is not appropriate for web-scale processing. This dissertation proposes a novel tree matching method based on the longest common subsequence (LCS) of the tag path sequences of DOM trees. By exploiting the inherent sparsity of the LCS problem, the proposed tree matching method runs in O(r log n) time, where r is the number of pairs of nodes from the two trees that have identical tag paths; when the matching is sparse, r ≈ O(n), and the algorithm runs in approximately O(n log n) time. Extensive experimental results demonstrate that, compared to the existing method, the proposed approach significantly improves running efficiency while achieving similar tree matching and data unit alignment results.
     In summary, this dissertation presents technologies for automatic and efficient web information extraction at the levels of both data records and data units. These technologies handle the large-scale and highly heterogeneous characteristics of web information well and have both theoretical and practical value.
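     To make the sparse LCS idea in item 4 concrete, here is a minimal Python sketch in the style of the classic Hunt-Szymanski algorithm, applied to two flat sequences of tag paths. It visits only the r index pairs whose tag paths are equal and spends one binary search on each, so it runs in roughly O((r + n) log n) time. The function name sparse_lcs and the sample sequences are assumptions for illustration. The abstract does not say how the dissertation turns the LCS into a tree matching, so that step is not shown.

```python
from bisect import bisect_left
from collections import defaultdict


def sparse_lcs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b.

    Only pairs with a[i] == b[j] are visited, each with one binary search.
    """
    positions = defaultdict(list)  # symbol -> increasing positions in b
    for j, sym in enumerate(b):
        positions[sym].append(j)

    tails = []  # tails[k]: smallest j that ends a common subsequence of length k + 1
    links = []  # links[k]: (i, j, previous link) chain for tails[k]
    for i, sym in enumerate(a):
        # Visit matches right to left so a[i] is never paired twice in one chain.
        for j in reversed(positions.get(sym, ())):
            k = bisect_left(tails, j)
            node = (i, j, links[k - 1] if k > 0 else None)
            if k == len(tails):
                tails.append(j)
                links.append(node)
            else:
                tails[k] = j
                links[k] = node

    # Walk the chain back from the longest subsequence found.
    pairs = []
    node = links[-1] if links else None
    while node is not None:
        i, j, node = node
        pairs.append((i, j))
    return pairs[::-1]


# Tag paths of two small DOM trees in document order (made-up sample data).
tree_a = ["html", "html/body", "html/body/div", "html/body/div/span", "html/body/div/p"]
tree_b = ["html", "html/body", "html/body/div", "html/body/div/p", "html/body/div/a"]
print(sparse_lcs(tree_a, tree_b))
```

     When most tag paths occur only a few times in each tree, r stays close to n and the inner loop does little work; if the same path repeats many times in both trees (for example, long runs of identical list items), r can grow toward n² and the advantage over dynamic programming shrinks.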