图数据库中的子图查询算法研究

英文题名：Subgraph Query in Graph Databases
作者：邹磊
论文级别：博士
学科专业名称：计算机软件与理论
中文关键词：图数据库 ; 子图匹配查询 ; 频繁结构模式 ; Top-K查询
英文关键词：Graph Database ; Sub-graph Query ; Frequent structural Patterns ; Top-K Query
学位年度：2009
导师：卢炎生
学科代码：081202
学位授予单位：华中科技大学
论文提交日期：2009-05-01

摘要

图是计算机科学中的重要数据结构。随着信息技术地不断发展，出现了越来越多的以图作为逻辑表达的数据，例如化学分子结构式，生物网络，社会网络以及图像中的实体关系等等。另一方面，这些图数据本身的数据量也在不断增大，例如每天就有4,000个新的化学结构被加入到SCF Finder数据库；现在的社会网络图中的结点数目超过了1亿5千万。如何有效地管理和挖掘海量的图数据是图数据库研究的核心问题。具体包括：1)如何建立有效的存储机制和索引策略；2)如何有效地回答图数据库中的查询；3)如何从海量的图数据库中挖掘出有用的信息。
     子图查询是图数据管理中的基本操作。具体地说，给定一个查询图Q，在图数据库中找到所有包含查询图Q的数据图。由于子图同构是典型的NP完全问题，目前的子图查询算法均采用“过滤-精化”的算法来找到结果集。在过滤阶段，根据某种子图同构的必要条件过滤掉不可能包含查询图Q的数据图；然后在精化阶段利用子图同构算法在剩下的数据图中找到结果集。目前的大部分的过滤策略都是基于“特征子结构”(简称“特征”)的方法。这种方法没有考虑到特征之间的拓扑关系。根据特征和特征之间的拓扑关系，设计一种新的过滤策略将加快子图查询的响应时间。另外目前的子图查询算法没有考虑数据库频繁动态更新的情况。当数据库出现增删改时，不得不重新建立索引。为了适应动态的图数据库，根据图谱理论，将图的拓扑信息映射到数据空间中；并根据映射的数值空间，建立相应的索引结构。这种方法不但加快了过滤的时间，而且可以动态的维护索引结构。
     随着社会网络等复杂网络的出现，给定查询图Q，如何在一张大图中找到Q的匹配位置是非常有意义的课题，例如可以帮助我们找到社会网络中的特定的朋友圈，以及生物网络中的功能团。大图上的匹配的定义本身并一定是基于子图同构。因为网络上考虑的更多的是两点之间的连接性关系。根据连接性关系，找到查询图Q的所有匹配位置。因为通常复杂网络中的节点数目是海量的，为了减少搜索空间，将图结构中映射到向量空间。通过在向量空间的操作，找到所有的候选匹配位置；然后在原来的图结构中确定最终的匹配结果集。
     图数据挖掘可以帮助我们从海量的图数据中找到有用的知识，其中频繁结构模式挖掘和结构相关性挖掘是两个重要的课题。目前的结构模式挖掘算法大部分采用“生成-检测”的算法框架。这种方法的缺点是“检测”阶段耗费大量时间。根据图数据的特例“树数据”的特点，采用模式增长的方式设计频繁结构模式挖掘算法，从而避免了检测阶段。给定一个查询图Q，希望从图数据库中找到与Q具有高度相关性的子结构。这个问题的难点在于海量的搜索空间。为了加快算法响应时间，根据模式增长的策略，设计一种有效的过滤策略。
     用图结构来表示关系型数据库中的记录之间的“控制”关系，从而将传统的Top-K查询问题转换为图结构中的遍历问题。与现有的经典Top-K查询算法相比，这种方法的优点是其搜索空间较小。
As an important data structure in computer science, graph can be used to model many complex objects, such as social networks, compound structures and so on. Furthermore, the growing popularity of graph databases has generated interesting data management problems, such as sub-graph search, frequent structural pattern mining and so on. For example, approximate 4000 new compound structures are added into SCI Finder database everyday; there are 150 million active users in Facebook (one of famous social network) Web sites. This dissertation addresses three key issues in graph databases, that are 1) how to design efficient storage and index strategies; 2) how to answer query over graph databases efficiently; 3) how to discovery knowledge from large graph databases.
     Subgraph query is a fundamental operation in graph databases. Specifically, given a query graph Q, subgraph query reports all data graphs that contains Q. Due to NP-complete hardness of subgraph isomorphism, subgraph query algorithms always adopt "filter-and-refine" framework. In filtering process, according to some necessary condition of subgraph isomorphism, we filter out most false alarms (the data graphs that do not contain query Q) to obtain candidates; Then, we perform subgraph isomorphism algorithm on candidates to fix final results. Existing subgraph query algorithms use "feature" to perform filtering process. However, they ignore the relationships between features. Considering both features and the relationships between them, we design an efficient pruning strategy to speed up query processing. Furthermore, existing subgraph query algorithms cannot handle frequent updates in graph database efficiently. They have to rebuild the index from scratch. Based on spectral graph theory, we design a novel graph coding technique, called GCoding. Based on GCoding, we map graph structural information into numerical space. Then, we build the index on the converted numerical space. This method not only speeds up query processing, but also handle graph database maintenance efficiently.
     Given a query graph Q, finding all matching positions of Q in a very large graph G is useful in social network analysis, such as finding some specified community in social network and finding a functional group in a biological network. Since we care more about the connectivity in a large graph, the matching definition is different from subgraph isomorphism. According to the shortest path distance, we propose a novel matching definition. To reduce the huge search space, we map vertices in a large graph into points in vector space. Then, we find candidate matchings in the converted vector space. At last, we fix final results in the original large graph.
     This dissertation also address two graph mining problems, that are frequent structural pattern mining and correlative subgraph mining. We propose an efficient pattern growth algorithm to find frequent structural patterns without expensive verification process. Given a query graph Q, correlation pattern query is to find all correlative patterns with regard to Q. In order to speed up query processing, we design an pattern-growth algorithm to find final results.
     Considering the "domination" between records in vector space, we build a graph-structure index, called DG-index. Based on DG-index, we convert top-k query into a graph travel problem.

引文

[1] 庄毅．海量多媒体数据库的高效查询处理：[博士学位论文]．浙江：浙江大学，2008．
    [2] Yan X, Yu P S, Han J. Graph Indexing: A Frequent Structure-based Approach. in: Proceedings of SIGMOD Conference, 2004, 335-346.
    [3] Shasha D, Wang J T L, Giugno R. Algorithmics and Applications of Tree and Graph Searching. in: Proceedings of PODS, 2002, 39-52.
    [4] Cohen E, Halperin E, Kaplan H, et al. Reachability and Distance Queries via 2-Hop Labels. SIAM J. Comput., 2003, 32(5):937-946.
    [5] Wang H, He H, Yang J, et al. Dual Labeling: Answering Graph Reachability Queries in Constant Time. in: Proceedings of ICDE, 2006, 75.
    [6] Chen Y, Chen Y. An Efficient Algorithm for Answering Graph Reachability Queries. in: Proceedings of ICDE, 2008, 845-856.
    [7] Jing N, Huang Y W, Rundensteiner E A. Hierarchical Encoded Path Views for Path Query Processing: An Optimal Model and Its Performance Evaluation. IEEE Trans. Knowl. Data Eng., 1998, 10(3):409-432.
    [8] Cormen T H, Leiserson C E, Rivest R L, et al. Introduction to algorithms. The MIT Press, 2001.
    [9] Agrawal R, Srikant R. Fast Algorithms for Mining Association Rules in Large Databases. in: Proceedings of VLDB, 1994.
    [10] Hay M, Miklau G, Jensen D, et al. Resisting structural re-identification in anonymized social networks. PVLDB, 2008, 1(1): 102-114.
    [11] Bachman C W. The Programmer as Navigator. ACM Turing Award lecture. Communications of the ACM, 16.
    [12] Zou L, Chen L, Zhang H, et al. Summarization Graph Indexing: Beyond Frequent Structure-based Approach, in: Proceedings of DASFAA, 2008.
    [13] Zou L, Chen L, Yu J X, et al. A novel spectral coding in a large graph database, in: Proceedings of EDBT, 2008, 181-192.
    [14] Zou L, 0002 L C, Lu Y. Top-k subgraph matching query in a large graph, in:Proceedings of PIKM, 2007, 139-146.
    [15] Zou L, Lu Y, Zhang H, et al. PrefixTreeESpan: A Pattern Growth Algorithm for Mining Embedded Subtrees, in: Proceedings of WISE, 2006, 499-505.
    [16] Zou L, Lu Y, Zhang H, et al. Mining Frequent Induced Subtree Patterns with Subtree-Constraint, in: Proceedings of ICDM Workshops, 2006, 3-7.
    [17] Zou L, 0002 L C. Dominant Graph: An Efficient Indexing Structure to Answer Top-K Queries, in: Proceedings of ICDE, 2008, 536-545.
    [18] Petrakis E G M, Faloutsos C. Similarity Searching in Medical Image Databases.IEEE Trans. Knowl. Data Eng., 1997, 9(3):435-447.
    [19] Koo J H, YoO S I. A Structural Matching for Two-Dimensional Visual Pattern Inspection, in: Proceedings of IEEE Int. Conf. Systems, Man and Cybernetics,1998, 4429-4434.
    [20] Gregory L, Kittler J. Using Graph Search Techniques for Contextual Colour Re-trievall in: Proceedings of Proceedings of the Joint IAPR International Workshop on Structural, Syntactic, and Statistical Pattern Recognition, 2002.
    [21] Berretti S, Bimbo A D, Vicario E. Efficient Matching and Indexing of Graph Models in Content-Based Retrieval. IEEE Trans. Pattern Anal. Mach. Intell.,2001,23(10):1089-1105.
    [22] Lladoos J, Martf E, Villanueva J J. Symbol Recognition by Error-Tolerant Sub-graph Matching between Region Adjacency Graphs. IEEE Trans. Pattern Anal.Mach. Intell., 2001, 23(10): 1137-1143.
    [23] Ozer B, Akansu A N, Wolf W. A Graph Based Object Description for Information Retrieval in Digital Image and Video Libraries, in: Proceedings of CBAIVL,1999, 79.
    [24] James C A, Weininger D, Delany J. Daylight theory manual daylisght version 4.82. Daylight Chemical Information Systemsjnc, 2003.
    [25] Willett P, Barnard J M, Downs G M. Chemical Similarity Searching. Journal of Chemical Information and Computer Sciences, 1998, 38(6):983-996.
    [26] Dumay A C, Geest R J, J.Gerbrands J, et al. A Graph Based Object Descriptionfor Information Retrieval in Digital Image and Video Libraries, in: Proceedings of Proc. Int. Conf. Pattern Recognition, Conf. C, 1992, 439-442.
    [27] Wang M, Iwai Y, Yachida M. Expression Recognition from Time-Sequential Facial Images by Use of Expression Change Model, in: Proceedings of FG '98:Proceedings of the 3rd. International Conference on Face & Gesture Recognition,Washington, DC, USA: IEEE Computer Society, 1998, 324.
    [28] Haris K, Efstratiadis S N, Maglaveros N, et al. Model-Based Morphological Seg-mentation and Labeling of Coronary Angiograms. IEEE Trans. Med. Imaging,1999, 18(10): 1003-1015.
    [29] Garey M R, Johnson D S. Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: W.H.Freeman and Company., 1979.
    [30] Foggia P, Sansone C, Vento M. A performance comparison of five algorithms for graph isomorphism, in: Proceedings of the 3rd IAPR-TC15 Workshop on Graph based Representation (GbR2001), 2001.
    [31] Ullmann J R. An Algorithm for Subgraph Isomorphism. J. ACM, 1976, 23(1).
    [32] Cordelia L P, Foggia P, Sansone C, et al. A (Sub)Graph Isomorphism Algo-rithm for Matching Large Graphs. IEEE Trans. Pattern Anal. Mach. Intell., 2004,26(10): 1367-1372.
    [33] Zhang S, Hu M, Yang J. TreePi: A Novel Graph Indexing Method, in: Proceed-ings of ICDE, 2007, 966-975.
    [34] Cheng J, Ke Y, Ng W, et al. Fg-index: towards verification-free query processing on graph databases, in: Proceedings of SIGMOD Conference, 2007, 857-872.
    [35] Jiang H, Wang H, Yu P S, et al. GString: A Novel Approach for Efficient Search in Graph Databases, in: Proceedings of ICDE, 2007, 566-575.
    [36] Williams D W, Huan J, Wang W. Graph Database Indexing Using Structured Graph Decomposition, in: Proceedings of ICDE, 2007, 976-985.
    [37] Zhao P, Yu J X, Yu P S. Graph Indexing: Tree + Delta >= Graph. in: Proceedings of VLDB, 2007,938-949.
    [38] He H, Singh A K. Closure-Tree: An Index Structure for Graph Queries, in:Proceedings of ICDE, 2006, 38.
    [39] Yang Q, Sze S. Path matching and graph matching in biological networks. J Comput Biol., 2007, 14(1).
    [40] Lacroix V, Fernandes C G, Sagot M F. Motif Search in Graphs: Application to Metabolic Networks. IEEE/ACM Trans. Comput. Biology Bioinform., 2006,3(4):360-368.
    [41] Shang H, Zhang Y, Lin X, et al. Taming verification hardness: an efficient algo-rithm for testing subgraph isomorphism. PVLDB, 2008, 1(1):364-375.
    [42] Chen C, Yan X, Yu P S, et al. Towards Graph Containment Search and Indexing, in: Proceedings of VLDB, 2007, 926-937.
    [43] Zhang S, Li J, Gao H, et al. A Novel Approach for Efficient Supergraph Query Processing on Graph Databases, in: Proceedings of EDBT, 2009, 204-215.
    [44] Wang C, Chen L. Continuous Subgraph Pattern Search over Graph Streams, in:Proceedings of ICDE, 2009.
    [45]Tian Y, McEachin R C, Santos C, et al. SAGA: a subgraph matching tool for biological graphs. Bioinformatics, 2007, 23(2):232-239.
    [46]李先通，李建中，高宏．一种高效频繁子图挖掘算法．软件学报．，2007，10(11)：2469-2480．
    [47]汪卫，周皓峰，袁晴晴，et al．基于图论的频繁模式挖掘．计算机研究与发展．，2005，2(7)：230-235．
    [48]Agrawal R, Srikant R. Fast Algorithms for Mining Association Rules in Large Databases. in: Bocca J B, Jarke M, Zaniolo C, editors, Proceedings of VLDB'94, Proceedings of 20th International Conference on Very Large Data Bases, September 12-15, 1994, Santiago de Chile, Chile. Morgan Kaufmann, 1994, 487-499.
    [49]Srikant R, Agrawal R. Mining Generalized Association Rules. in: Proceedings of VLDB, 1995, 407-419.
    [50]Han J, Pei J, Yin Y, et al. Mining Frequent Pattems without Candidate Generation: A Frequent-Pattern Tree Approach. Data Min. Knowl. Discov., 2004,8(1):53-87.
    [51]Pei J, Han J, Mortazavi-Asl B, et al. Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. IEEE Trans. Knowl. Data Eng., 2004, 16(11):1424-1440.
    [52]Agrawal R, Srikant R. Mining Sequential Pattems. in: Proceedings of ICDE, 1995, 3-14.
    [53]Zaki MJ, Aggarwal C C. XRules: an effective structural classifier for XML data. in: Proceedings of KDD, 2003, 316-325.
    [54]Deshpande M, Kuramochi M, Wale N, et al. Frequent Substructure-Based Approaches for Classifying Chemical Compounds. IEEE Trans. Knowl. Data Eng.,2005, 17(8): 1036-1050.
    [55] Kumar R, Tuzhilin A, Faloutsos C, et al. Social networks: looking ahead, in:Proceedings of KDD, 2008, 1060.
    [56] Srikant R, Agrawal R. Mining Sequential Patterns: Generalizations and Perfor-mance Improvements, in: Proceedings of EDBT, 1996, 3-17.
    [57] Zaki M J. Efficiently Mining Frequent Trees in a Forest: Algorithms and Appli-cations. IEEE Trans. Knowl. Data Eng., 2005, 17(8): 1021-1035.
    [58] Kuramochi M, Karypis G. Frequent Subgraph Discovery, in: Proceedings of ICDM,2001, 313-320.
    [59] Han J, Pei J, Yin Y. Mining Frequent Patterns without Candidate Generation,in: Chen W, Naughton J F, Bernstein P A, editors, Proceedings of Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data,May 16-18, 2000, Dallas, Texas, USA. ACM, 2000.
    [60] Pei J, Han J, Mortazavi-Asl B, et al. PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth, in: Proceedings of ICDE, 2001.
    [61] Asai T, Abe K, Kawasoe S, et al. Efficient Substructure Discovery from Large Semi-structured Data, in: Proceedings of Second SIAM Int'l Conf Data Mining,2002.
    [62] Yan X, Han J. gSpan: Graph-Based Substructure Pattern Mining, in: Proceedings of ICDM, 2002, 721-724.
    [63] Wang C, Hong M, Pei J, et al. Efficient Pattern-Growth Methods for Frequent Tree Pattern Mining, in: Proceedings of PAKDD, 2004, 441-451.
    [64] Han J, Kamber M. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
    [65] Yan X, Han J. CloseGraph: mining closed frequent graph patterns, in: Proceed-ings of KDD.2003,286-295.
    [66] Thomas L T, Valluri S R, Karlapalem K. MARGIN: Maximal Frequent Subgraph Mining, in: Proceedings of ICDM, 2006, 1097-1101. [67] Zhu F, Yan X, Han J, et al. gPrune: A Constraint Pushing Framework for Graph Pattern Mining, in: Proceedings of PAKDD, 2007, 388-400.
    [68] Chen C, Yan X, Zhu F, et al. gApprox: Mining Frequent Approximate Patterns from a Massive Network, in: Proceedings of ICDM, 2007, 445-450.
    [69] Fagin R. Fuzzy Queries in Multimedia Database Systems, in: Proceedings of PODS, 1998, 1-10.
    [70] G. Salton. Automatic Text Processing, the Transformation, Analysis and Re-trieval of Information by Computer. MA: Addison - Wesley Pub. Co., 1989.
    [71] Raghavan S, Garcia-Molina H. Complex Queries over Web Repositories, in:Proceedings of VLDB, 2003, 33-44.
    [72] Fagin R, Lotem A, Naor M. Optimal aggregation algorithms for middleware. J.Comput. Syst. Sci., 2003, 66(4):614-656.
    [73] Mamoulis N, Cheng K H, Yiu M L, et al. Efficient Aggregation of Ranked Inputs,in: Proceedings of ICDE, 2006,72.
    [74] Chang Y C, Bergman L D, Castelli V, et al. The Onion Technique: Indexing for Linear Optimization Queries, in: Proceedings of SIGMOD Conference, 2000,391-402.
    [75] Hristidis V, Koudas N, Papakonstantinou Y. PREFER: A System for the Efficient Execution of Multi-parametric Ranked Queries, in: Proceedings of SIGMOD Conference, 2001, 259-270.
    [76] Xin D, Chen C, Han J. Towards Robust Indexing for Ranked Queries, in: Pro-ceedings of VLDB, 2006, 235-246.
    [77] Xin D, Han J, Cheng H, et al. Answering Top-k Queries with Multi-Dimensional Selections: The Ranking Cube Approach, in: Proceedings of VLDB, 2006.
    [78] Bruno N, Chaudhuri S, Gravano L. Top-k selection queries over relational databases: Mapping strategies and performance evaluation. ACM Trans.Database Syst., 2002, 27(2): 153-187.
    [79] Hua M, Pei J, Zhang W, et al. Efficiently Answering Probabilistic Threshold Top-k Queries on Uncertain Data, in: Proceedings of ICDE, 2008, 1403-1405.
    [80] Lian X, 0002 L C. Probabilistic ranked queries in uncertain databases, in: Pro-ceedings of EDBT, 2008, 511-522.
    [81] Soliman M A, Eyas I F, Chang K C C. Top-k Query Processing in Uncertain Databases., in: Proceedings of ICDE, 2007, 896-905.
    [82] Chi Y, Xia Y, Yang Y, et al. Mining Closed and Maximal Frequent Subtrees from Databases of Labeled Rooted Trees. IEEE Trans. Knowl. Data Eng., 2005,17(2): 190-202.
    [83] Fortin S. The graph isomorphism problem. Department of Computing Science,University of Alberta, 1996.
    [84] Faloutsos C, Christodoulakis S. Signature Files: An Access Method for Docu-ments and Its Analytical Performance Evaluation. ACM Trans. Inf. Syst., 1984,2(4):267-288.
    [85] Tousidou E, Bozanis P, Manolopoulos Y. Signature-based structures for objects with set-valued attributes. Inf. Syst., 2002, 27(2):93-121.
    [86] 0002 N Z, Ozsu M T, Eyas IF, et al. FIX: Feature-based Indexing Technique for XML Documents, in: Proceedings of VLDB, 2006, 259-270.
    [87] West D B. Introduction to Graph Theory (2nd Edition). Prentice Hall, 2000.
    [88] Kitagawa H, Fukushima Y, Ishikawa Y, et al. Estimation of False Drops in Set-valued Object Retrieval with Signature Files, in: Proceedings of FODO, 1993,146-163.
    [89] Goldstine H H, Murray F J, Neumann J. The Jacobi Method for Real Symmetric Matrices. J. ACM, 1959, 6(l):59-96.
    [90] Xiong H, Shekhar S, Tan P N, et al. Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs, in: Proceedings of KDD, 2004, 334-343.
    [91] Brin S, Motwani R, Silverstein C. Beyond Market Baskets: Generalizing Asso-ciation Rules to Correlations, in: Peckham J, editor, Proceedings of SIGMOD.ACM Press, 1997, 265-276.
    [92] Omiecinski E. Alternative Interest Measures for Mining Associations in Databases. IEEE Trans. Knowl. Data Eng., 2003, 15(l):57-69.
    [93] Pan J Y, Yang H J, Faloutsos C, et al. Automatic multimedia cross-modal corre-lation discovery, in: Proceedings of KDD, 2004, 653-658.
    [94] Ke Y, Cheng J, Ng W. Correlation search in graph databases, in: Proceedings of KDD, 2007, 390-399.
    [95] Kuramochi M, Karypis G. An Efficient Algorithm for Discovering Frequent Subgraphs. IEEE Trans. Knowl. DataEng., 2004, 16(9):1038-1051.
    [96] Shahabi C, Kolahdouzan M R, Sharifzadeh M. A Road Network Embedding Technique for K-Nearest Neighbor Search in Moving Object Databases. Geoln-formatica, 2003, 7(3):255-273.
    [97] Brinkhoff T, Kriegel H P, Seeger B. Efficient Processing of Spatial Joins Using R-Trees. in: Proceedings of SIGMOD Conference, 1993,237-246.
    [98] Bohm C, Braunmuller B, Krebs F, et al. Epsilon Grid Order: An Algorithm for the Similarity Join on Massive High-Dimensional Data, in: Proceedings of SIGMOD, 2001, 379-388.
    [99] Jagadish H V, Ooi B C, Tan K L, et al. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst., 2005,30(2):364-397.
    [100] Borzsonyi S, Kossmann D, Stocker K. The Skyline Operator, in: Proceedings of ICDE, 2001, 421-430.
    [101] Chaudhuri S, Dalvi N N, Kaushik R. Robust Cardinality and Cost Estimation for Skyline Operator, in: Proceedings of ICDE, 2006, 64.
    [102] Barber C B, Dobkin D P, Huhdanpaa H. The Quickhull algorithm for convex hulls. ACM Trans, on Mathematical Software, 1996.
    [103] Das G, Gunopulos D, Koudas N, et al. Answering Top-k Queries Using Views,in: Proceedings of VLDB, 2006,451-462.
    [104] Cheng J, Yu J X. On-line exact shortest distance query processing, in: Proceedings of EDBT, 2009, 481-492.

地址：北京市海淀区学院路29号邮编：100083

电话：办公室：(+86 10)66554848；文献借阅、咨询服务、科技查新：66554700