SigMR: MapReduce-based SPARQL query processing by signature encoding and multi-way join

Authors: Jinhyun Ahn; Dong-Hyuk Im; Hong-Gee Kim
Keywords: Hadoop; MapReduce; Multi-way join; Signature encoding; SigMR; SPARQL
Journal: The Journal of Supercomputing
Publication year: 2015
Publication date: October 2015
Volume: 71
Issue: 10
Pages: 3695-3725
Full-text size: 2,592 KB
Author affiliations: Jinhyun Ahn (1)
Dong-Hyuk Im (2)
Hong-Gee Kim (1) (3)

1. Biomedical Knowledge Engineering Laboratory, Dental Research Institute, Seoul National University, Seoul, Republic of Korea
2. Department of Computer and Information Engineering, Hoseo University, Asan, Republic of Korea
3. Institute of Human-Environment Interface Biology, Seoul National University, Seoul, Republic of Korea
Journal category: Computer Science
Journal subjects: Programming Languages, Compilers and Interpreters
Processor Architectures
Computer Science, general
Publisher: Springer Netherlands
ISSN: 1573-0484

Abstract

Linked Data contains large numbers of Resource Description Framework (RDF) triples, and the volume can grow exponentially, which makes SPARQL query processing on a single machine infeasible. To address this scalability issue, MapReduce-based SPARQL engines have been proposed, but we note that these methods are limited in how they evaluate joins. The two-way join-based approach evaluates joins via a sequence of binary multiplications that requires multiple MapReduce jobs, with costly disk accesses between jobs. The multi-way join-based approach combines multiple two-way join operations so that the joins are evaluated simultaneously in one MapReduce job. However, the size of the data handled by that job might increase exponentially when a complex query is given. In this study, we propose SigMR, a pruning method for multi-way join-based SPARQL query processing in MapReduce. In the proposed approach, a SPARQL query can be evaluated in a single MapReduce job in which the size of the data is reduced dramatically by pruning based on our signature encoding technique, thereby overcoming the weaknesses of the previous approaches. In experiments, the query processing time required by our approach was lower than that of existing MapReduce-based methods.
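
The abstract does not specify how SigMR encodes signatures, so the following is only a minimal sketch of the general idea of evaluating a multi-way join in one map/reduce pass with signature-based pruning. It is not the authors' algorithm. The sample triples, the function names, and the signature itself (a per-subject bitmask of predicates) are all assumptions made for illustration, and the sketch handles only a star-shaped query whose patterns share the single variable `?x`.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample data, not taken from the paper.
TRIPLES = [
    ("alice", "type", "Person"),
    ("alice", "worksAt", "SNU"),
    ("bob", "type", "Person"),
    ("carol", "worksAt", "Hoseo"),
]

# Star query on ?x:  ?x type Person .  ?x worksAt ?y
PATTERNS = [("?x", "type", "Person"), ("?x", "worksAt", "?y")]


def match(pattern, triple):
    """Return variable bindings if the triple matches the pattern, else None."""
    bindings = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if bindings.get(p, t) != t:
                return None
            bindings[p] = t
        elif p != t:
            return None
    return bindings


def build_signatures(triples):
    """Assumed signature: for each subject, a bitmask of the predicates it has."""
    bit = {}
    sigs = defaultdict(int)
    for s, p, _ in triples:
        sigs[s] |= 1 << bit.setdefault(p, len(bit))
    return bit, sigs


def run(triples, patterns, prune=True):
    """Evaluate the star query in one simulated map/reduce pass.

    Returns the result bindings and the number of records emitted by the map phase.
    """
    bit, sigs = build_signatures(triples)
    query_sig = 0
    for _, p, _ in patterns:
        if p not in bit:          # a required predicate never occurs in the data
            return [], 0
        query_sig |= 1 << bit[p]

    # Map phase: emit (join key, (pattern index, bindings)) for matching triples.
    shuffled = defaultdict(list)
    emitted = 0
    for triple in triples:
        # Pruning: skip subjects whose signature cannot cover the query.
        if prune and (sigs[triple[0]] & query_sig) != query_sig:
            continue
        for i, pat in enumerate(patterns):
            b = match(pat, triple)
            if b is not None:
                shuffled[b["?x"]].append((i, b))
                emitted += 1

    # Reduce phase: join all patterns at once for each key.
    results = []
    for records in shuffled.values():
        per_pattern = defaultdict(list)
        for i, b in records:
            per_pattern[i].append(b)
        if len(per_pattern) < len(patterns):
            continue
        for combo in product(*(per_pattern[i] for i in range(len(patterns)))):
            merged = {}
            for b in combo:
                merged.update(b)
            results.append(merged)
    return results, emitted
```

Calling `run(TRIPLES, PATTERNS, prune=True)` and `run(TRIPLES, PATTERNS, prune=False)` on this toy data gives the same bindings, but the pruned run emits fewer intermediate records in the map phase. That reduction is the kind of effect the abstract attributes to signature-based pruning.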
