Research on XML Document Management and Classification-based Retrieval Technology in Web

Original title: 面向Web的XML文档数据管理及分类检索技术研究
Author: Yan Hongcan (阎红灿)
Degree level: Doctoral
Discipline: Management Science and Engineering
Keywords: XML Database ; XQuery Data Model ; Space Vector Model ; Frequent Structure Miner ; XML Pages Classification ; Web Information Classification Retrieval
Degree year: 2009
Advisor: Li Minqiang (李敏强)
Discipline code: 1201
Degree-granting institution: Tianjin University
Thesis submission date: 2008-12-01

Abstract

With the development of computer and Internet technology, the Web has become the largest and most comprehensive information repository, with the greatest quantity and variety of resources. This information falls roughly into two categories: structured data and unstructured data. According to statistics, unstructured data accounts for more than 80% of all information, and 80% of the time spent in information transfer goes to obtaining information. How to obtain information from the Web systematically and efficiently is therefore the motivation for this thesis.
     The development of XML database technology and Web search engine technology offers hope for improving the efficiency of Web information retrieval, especially for unstructured data: XML database technology provides the technical foundation for storing and managing information, while search engine technology provides the operating platform for Web information retrieval. On this basis, this thesis studies in depth XML document data management techniques and Web-oriented classification-based retrieval techniques. The main research content and contributions are as follows.
     First, the thesis reviews and analyzes the management techniques and indexing mechanisms of native XML databases and XML-enabled databases. Based on an analysis of the characteristics of various data models, it discusses the advantages of using a relational database as the storage source and an extended XQuery as the data model. By extending the XQuery data model, it proposes SBXI, an XML data storage and index structure based on Schema constraints; defines an XML document update language, XUL, at the user's logical level; and implements the key techniques of document updating using the Kweelt query system and Java.
     Second, the thesis addresses the key technical problem in XML page classification, namely the information retrieval model. Because the traditional vector space model is not suited to comparing the structural similarity of XML documents, it proposes a Frequent Structure Vector Model based on the TreeMiner algorithm, together with a document feature matrix representation and a similarity function. Extending this model, it further proposes the Frequent Structure Hierarchy Vector Model (FSHVM), which not only mines the structural information of XML documents but also extracts keywords that characterize document content, improving the accuracy of similarity measurement. The frequent structure mining algorithm TreeMiner is also improved to make it better suited to mining frequent structures in large document collections. Experiments show that the retrieval model based on frequent patterns performs well for XML page classification.
     Finally, the thesis proposes a two-stage retrieval strategy that combines classification-based retrieval with full-text retrieval. From a system design perspective, it lays out the architecture of a topic-classification-based full-text search engine for Web documents, which uses FSHVM as the information retrieval model and SBXI as the index structure, and discusses the functions and workflow of its main components.
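
The abstract does not give the thesis's feature definitions or its similarity function, so the following is only a minimal sketch of the general idea behind a frequent-structure vector model. It makes two simplifying assumptions that are not taken from the thesis: root-to-node tag paths stand in for the frequent subtrees that TreeMiner would mine, and cosine similarity stands in for the thesis's similarity function. All function names are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths in one XML document.

    Assumption: tag paths are a simple stand-in for mined frequent subtrees.
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def frequent_structures(docs, min_support):
    """Return, in sorted order, the paths found in at least min_support documents."""
    counts = Counter(p for d in docs for p in tag_paths(d))
    return sorted(p for p, c in counts.items() if c >= min_support)


def feature_matrix(docs, structures):
    """Build a binary document-by-structure matrix (1 = structure present)."""
    rows = []
    for d in docs:
        present = tag_paths(d)
        rows.append([1 if s in present else 0 for s in structures])
    return rows


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "<paper><title/><author/><abstract/></paper>",
    "<paper><title/><author/><refs/></paper>",
    "<order><item/><price/></order>",
    "<paper><title/><refs/></paper>",
]
structures = frequent_structures(docs, min_support=2)
matrix = feature_matrix(docs, structures)
print(structures)
print(cosine(matrix[0], matrix[1]), cosine(matrix[0], matrix[2]))
```

In this toy collection the two similarly structured `paper` documents score high, while the `order` document shares no frequent structure with them and scores zero. A model along the lines of FSHVM would additionally weight keyword content, which this sketch omits.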

