An Agglomerative-adapted Partition Approach for Large-scale Graphs

Chen Tao, Rongrong Shan, Hui Li, Dongsheng Wang, Wei Liu

Abstract


In recent years, a growing number of knowledge bases have been built using linked data, and the underlying datasets have grown substantially. Storing a large volume of triples in a single graph is impractical, and storing RDF in named graphs keyed by class URI is also unsuitable, because queries then require many joins across graphs, which degrades performance. This paper presents an agglomerative-adapted partitioning approach for large-scale graphs based on a bottom-up merging process. The proposed algorithm partitions triple data at three levels: blank nodes, associated nodes, and inference nodes. Blank nodes, and the classes or nodes involved in reasoning rules, are better stored in the same partition as an optimal neighbor node than split into separate partitions. Merging of associated nodes starts with the node of smallest cost and is repeated until the target number of partitions is reached. The feasibility and rationality of the merging algorithm are analyzed in detail through bibliographic cases. The proposed partitioning methods can be applied to distributed storage, data retrieval, data export, and semantic reasoning over large-scale triple graphs. In future work, we will investigate setting the number of partitions automatically with machine learning algorithms.
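
The bottom-up merging loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the cost function or the criterion for an "optimal neighbor", so the sketch assumes that a group's cost is the number of triples whose subject it holds, and that the best neighbor is the group sharing the most triples with it. The function name `agglomerative_partition` and the sample triples are hypothetical.

```python
from collections import defaultdict


def agglomerative_partition(triples, k):
    """Merge node groups bottom-up until at most k groups remain."""
    if k < 1:
        raise ValueError("k must be at least 1")

    # Every subject and object starts in its own group.
    members = {}
    for s, _, o in triples:
        members.setdefault(s, {s})
        members.setdefault(o, {o})
    group_of = {node: node for node in members}

    while len(members) > k:
        cost = defaultdict(int)   # ASSUMED cost: triples whose subject is in the group
        links = defaultdict(int)  # ASSUMED affinity: triples crossing two groups
        for s, _, o in triples:
            gs, go = group_of[s], group_of[o]
            cost[gs] += 1
            if gs != go:
                links[gs, go] += 1
                links[go, gs] += 1

        # Start with the group of smallest cost (ties broken by name).
        smallest = min(members, key=lambda g: (cost[g], g))
        others = [g for g in members if g != smallest]
        # Prefer the most strongly linked neighbor; if there is none,
        # fall back to the cheapest remaining group.
        target = max(others, key=lambda g: (links[smallest, g], -cost[g], g))

        # Merge the smallest group into its chosen neighbor.
        for node in members.pop(smallest):
            group_of[node] = target
            members[target].add(node)

    return sorted(sorted(group) for group in members.values())


# Hypothetical bibliographic triples; "_:b1" stands for a blank node.
triples = [
    ("paper1", "author", "alice"),
    ("paper1", "title", "_:b1"),
    ("paper2", "author", "alice"),
    ("paper2", "cites", "paper1"),
    ("paper3", "author", "bob"),
]
print(agglomerative_partition(triples, 3))
```

Under these assumptions a blank node has zero cost, so it is merged into its only neighbor early rather than ending up in a partition of its own. The link and cost tables are recomputed on every pass for clarity; a version meant for large graphs would update them incrementally.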


Keywords


Linked Data, Agglomerative-Adapted Partition, Merging Algorithm, Large-Scale Graph, k-Graph





DOI: https://doi.org/10.23974/ijol.2019.vol4.1.106



Copyright (c) 2019 Chen Tao

This work is licensed under a Creative Commons Attribution 4.0 International License.