摘要
针对大规模数据图下基于回溯法的子图查询算法的准确率低、开销大等问题,为提高查询准确率,降低大图下的查询开销,提出一种基于Spark的子图匹配(SQM)算法。首先根据结构信息过滤数据图,再将查询图分割成基本查询单元;然后对每一个基本查询单元分别匹配后进行Join操作;最后运用并行化提高了算法的运行效率,减小了搜索空间。实验结果表明,与Stwig、Turbo ISO算法相比,SQM算法在保证查询结果不变的情况下,速度提高了50%。
Focusing on low accuracy and high costs of backtracking-based subgraph query algorithm applied to large-scale graphs, a Spark-based Subgraph Query Matching( SQM) algorithm was proposed to improve query accuracy and reduce query overhead for large graphs. The data graph was firstly filtered according to structure information, and the query graph was divided into basic query units. Then each basic query unit was matched and joined together. Finally, the algorithm's efficiency was improved and search space was reduced by parallelization. The experimental results show that compared with Stwig( Sub twig)algorithm and Turbo ISO algorithm, SQM algorithm can increase the speed by 50% while ensuring the same query results.
引文
[1]中国互联网信息中心.中国互联网络发展状况统计报告[EB/OL].(2017-02-03)[2018-05-20]. http://www. cac. gov. cn/2018zt/cnnic41/index. htm.(China Internet Information Center.Statistical Report on the Development of China's Internet.[EB/OL].(2017-02-03)[2018-05-20]. http://www. cac. gov. cn/2018zt/cnnic41/index. htm.)
[2]KATSAROU F, NTARMOS N, TRIANTAFILLOU P. Performance and scalability of indexed subgraph query processing methods[J].Proceedings of the VLDB Endowment, 2015, 8(12):1566-1577.
[3]ULLMANN J R. An algorithm for subgraph isomorphism[J]. Journal of the ACM, 1976, 23(1):31-42.
[4]CORDELLA L P, FOGGIA P. A(sub)graph isomorphism algorithm for matching large graphs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004, 26(10):1367-1372.
[5]HE H, SINGH A K. Graphs-at-a-time:query language and access methods for graph databases[C]//SIGMOD 2008:Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. New York:ACM, 2008:405-418.
[6]ZHANG S, LI S, YANG J. GADDI:distance index based subgraph matching in biological networks[C]//Proceedings of the 12th International Conference on Extending Database Technology:Advances in Database Technology. New York:ACM, 2009:192-203.
[7]张硕,李建中,高宏,等.一种多到一子图同构检测方法[J].软件学报,2010,21(3):401-414.(ZHANG S, LI J Z, GAO H, et al.A multi-to-one subgraph isomorphism detection method[J]. Journal of Software, 2010, 21(3):401-414.)
[8]HAN W S, LEE J, LEE J H. Turbo ISO:towards ultrafast and robust subgraph isomorphism search in large graph databases[C]//ICMD2013:Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. New York:ACM, 2013:337-348.
[9]AFRATI F N, FOTAKIS D, ULLMAN J D. Enumerating subgraph instances using Map-Reduce[C]//ICDE2013:Proceedings of the2013 IEEE International Conference on Data Engineering. Piscataway, NJ:IEEE, 2013:62-73.
[10]PLANTENGA T. Inexact subgraph isomorphism in MapReduce[J]. Journal of Parallel&Distributed Computing, 2013, 73(2):164-175.
[11]SUN Z, WANG H Z, WANG H X, et al. Efficient subgraph matching on billion node graphs[J]. Proceedings of the VLDB Endowment, 2012, 5(9):788-799.
[12]于静,刘燕兵,张宇,等.大规模图数据匹配技术综述[J].计算机研究与发展,2015,52(2):391-409.(YU J, LIU Y B,ZHANG Y, et al. Survey on large-scale graph pattern matching[J]. Journal of Computer Research and Development, 2015, 52(2):391-409.)
[13]ZOU L, CHEN L, LU Y. Top-k subgraph matching query in a large graph[C]//PIKM 2007:Proceedings of the ACM first Ph.D. Workshop in Sixteenth ACM Conference on Information and Knowledge Management. New York:ACM, 2007:139-146.
[14]NEUMANN T, WEIKUM G. The RDF-3X engine for scalable management of RDF data[J]. The VLDB Journal, 2010, 19(1):91-113.
[15]ATRE M, CHAOJI V, ZAKI M J, et al. Matrixbitloaded:a scalable lightweight join query processor for RDF data[C]//IW3C2:Proceedings of the 2010 International World Wide Web Conference Committee. New York:ACM, 2010:41-50.
[16]HONG L, ZOU L, LIAN X, et al. Subgraph matching with set similarity in a large graph database[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(9):2507-2521.
[17]YAN X, YU P S, HAN J. Substructure similarity search in graph databases[C]//SIGMOD 2005:Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. New York:ACM, 2005:766-777.
[18]Apache. Spark[EB/OL].[2018-02-03]. http://spark. apache. org.
[19]CHANG Z, ZOU L, LI F. Privacy preserving subgraph matching on large graphs in cloud[C]//ICMD2016:Proceedings of the2016 International Conference on Management of Data. New York:ACM, 2016:199-213.
[20]LESKOVEC J. SNAP[EB/OL].[2018-03-03]. http://snap.stanford. edu/snappy.