面向大规模图数据的分布式子图匹配算法

英文篇名：Distributed Subgraph Matching Algorithm for Large Scale Graph Data
作者：许文 ; 宋文爱 ; 富丽贞 ; 吕伟
英文作者：XU Wen;SONG Wen-ai;FU Li-zhen;LV Wei;School of Software,North University of China;
关键词：子图匹配 ; 子图查询 ; 分布式 ; 图数据 ; 图划分
英文关键词：Subgraph matching;;Subgraph query;;Distributed;;Graph data;;Graph partition
中文刊名：JSJA
英文刊名：Computer Science
机构：中北大学软件学院;
出版日期：2019-04-15
出版单位：计算机科学
年：2019
期：v.46
基金：国家自然科学基金(61602427);; 山西省自然科学基金(201601D202037)资助
语种：中文;
页：JSJA201904005
页数：8
CN：04
ISSN：50-1075/TP
分类号：34-41

摘要

图数据规模的爆发式增长使在单机上的子图匹配变得较为困难。尽管现有的分布式算法可以在一定程度上解决大规模图数据的子图匹配问题,但分布式环境中的网络通信代价仍然影响着算法的性能。为此,文中提出了DSsearch分布式子图匹配算法,包含查询图拆分、数据图预处理、候选顶点过滤、中间结果合并4个步骤。其中,在数据图预处理步骤中使用图划分和完善邻居顶点策略来降低匹配过程中分布式计算节点之间的通信代价;在过滤候选顶点阶段设计DSgraph存储结构存储候选顶点,通过推迟笛卡尔积来减少冗余的中间结果。最后设计了对比实验并在具有7个计算节点的Spark分布式集群上使用真实数据集进行验证。实验结果表明,DSsearch算法能够在秒级时间内完成对百万规模顶点的数据图的子图匹配,尤其是在处理复杂查询图和稠密数据图方面更高效。数据图预处理策略的实验结果说明了通过顶点复制来降低分布式环境中网络通信代价这一策略的可行性。相比TwinTwigJoin、PSgL等算法,随着查询图顶点数量的增加,DSsearch算法的运行时间增长得更缓慢,当查询图顶点数量达到14时,其运行时间是TwinTwigJoin和PSgL算法的一半。实验数据充分说明,分布式环境中的网络通信代价和中间结果数量是影响分布式子图匹配算法的主要因素。实现数据图的预处理和推迟笛卡尔积解决了分布式子图匹配的性能瓶颈问题,有效地完成了大规模图数据的子图匹配。
The explosive growth of graph data size makes it difficult to process subgraph matching on single machine.Although the existing distributed algorithms could solve the problem to some extent,the network communication cost among the distributed computing nodes still affects the performance of algorithm.To solve this problem,this paper proposed a distributed subgraph matching algorithm,named DSsearch.It includes four steps:spliting query graph,preprocessing data graph,filtering candidate and combining intermediate results.In the data graph preprocessing,the strategy of graph partitioning and perfect neighbor vertex are used to reduce the communication cost among the distributed computing nodes.The DSgraph storage structure is designed to store candidate vertices during the filtering candidate vertex stage,and the redundant intermediate results are reduced by delaying the Cartesian product.Finally,a comparative experiment was designed and real data sets verification was performed on a Spark distributed cluster with 7 compute nodes.The experimental results show that the DSsearch algorithm can complete subgraph matching of data graphs of millions of vertices in seconds,especially in dealing with complex query graphs and dense data graphs.The experimental results of the data graph preprocessing strategy illustrate the feasibility of reducing network communication cost in a distributed environment through vertex replication.Compared with other algorithms such as TwinTwigJoin and PSgL,the increase of running time of DSsearch algorithm becomes slowly as the number of vertices in the query graph increases.When the number of query graph vertices reaches 14,the running time is only half of that of TwinTwigJoin and PSgL algorithms.The experimental data fully demonstrates that the data transmission cost and the number of intermediate results in the distributed system are the main factors affecting the efficiency of distributed subgraph matching algorithm.Realization of preprocessing the data graph and delaying the Cartesian product solves the bottleneck problem of the performance of distributed subgraph matching and effectively completes the subgraph matching of large-scale graph data.

引文

[1] ULLMANN J R.An algorithm for subgraph isomorphism[J].Journal of the ACM,1976,23(1):31-42.
    [2] CORDELLA L P,FOGGIA P,SANSONE C,et al.A (sub) graph isomorphism algorithm for matching large graphs[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2004,26(10):1367-1372.
    [3] LEE J,HAN W S,KASPEROVICS R,et al.An in-depth comparison of subgraph isomorphism algorithms in graph databases[J].Proceedings of the VLDB Endowment,2012,6(2):133-144.
    [4] HAN W S,LEE J,LEE J H.Turbo iso:towards ultrafast and robust subgraph isomorphism search in large graph databases[C]//Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data.ACM,2013:337-348.
    [5] BI F,CHANG L,LIN X,et al.Efficient subgraph matching by postponing cartesian products[C]//Proceedings of the 2016 International Conference on Management of Data.ACM,2016:1199-1214.
    [6] GIUGNO R,SHASHA D.Graphgrep:A fast and universal method for querying graphs[C]//16th International Conference on Pattern Recognition.IEEE,2002:112-115.
    [7] ZOU L,CHEN L,YU J X,et al.A novel spectral coding in a large graph database[C]//Proceedings of the 11th International Conference on Extending Database Technology:Advances in database technology.ACM,2008:181-192.
    [8] HE H,SINGH A K.Closure-tree:An index structure for graph queries[C]//Proceedings of the 22nd International Conference on Data Engineering(ICDE’06).IEEE,2006:38-38.
    [9] SUN Z,WANG H,WANG H,et al.Efficient subgraph matc- hing on billion node graphs[J].Proceedings of the VLDB Endowment,2012,5(9):788-799.
    [10] SHAO B,WANG H,LI Y.Trinity:A distributed graph engine on a memory cloud[C]//Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data.ACM,2013:505-516.
    [11] ZHAO Z,WANG G,BUTT A R,et al.Sahad:Subgraph analysis in massive networks using hadoop[C]//2012 IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS).IEEE,2012:390-401.
    [12] ALON N,DAO P,HAJIRASOULIHA I,et al.Biomolecular network motif counting and discovery by color coding[J].Bioinformatics,2008,24(13):i241-i249.
    [13] DEAN J,GHEMAWAT S.MapReduce:simplified data proces- sing on large clusters[J].Communications of the ACM,2008,51(1):107-113.
    [14] LAI L,QIN L,LIN X,et al.Scalable subgraph enumeration in mapreduce[J].Proceedings of the VLDB Endowment,2015,8(10):974-985.
    [15] SHAO Y,CUI B,CHEN L,et al.Parallel subgraph listing in a large-scale graph[C]//Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data.ACM,2014:625-636.
    [16] REZA T,KLYMKO C,RIPEANU M,et al.Towards Practical and Robust Labeled Pattern Matching in Trillion-Edge Graphs[C]//2017 IEEE International Conference on Cluster Computing (CLUSTER).IEEE,2017:1-12.
    [17] SUO B,LI Z,PAN W.Parallel subgraph matching on massive graphs[C]//International Congress on Image and Signal Processing,BioMedical Engineering and Informatics (CISP-BMEI).IEEE,2016:1932-1937.
    [18] BENLIC U,HAO J K.A multilevel memetic approach for improving graph k-partitions[J].IEEE Transactions on Evolutiona-ry Computation,2011,15(5):624-642.
    [19] ZAHARIA M,CHOWDHURY M,FRANKLIN M J,et al.Spark:Cluster computing with working sets[J].HotCloud,2010,10(10-10):95.
    [20] LIU X,ZHOU Y,GUAN X,et al.A feasible graph partition framework for parallel computing of big graph[J].Knowledge-Based Systems,2017,134:228-239.
    [21] KARYPIS G,KUMAR V.A fast and high quality multilevel scheme for partitioning irregular graphs[J].SIAM Journal on Scientific Computing,1998,20(1):359-392.
    [22] KERNIGHAN B W,LIN S.An efficient heuristic procedure for partitioning graphs[J].The Bell System Technical Journal,1970,49(2):291-307.