Ying Yan¹, Chen Wang², Aoying Zhou¹,³, Weining Qian³, Li Ma², Yue Pan²

¹ Department of Computer Science and Engineering, Fudan University
{yingyan, ayzhou}@fudan.edu.cn

² IBM China Research Laboratory
{chwang, malli, panyue}@cn.ibm.com

³ Institute of Massive Computing, East China Normal University
{ayzhou, wnqian}@sei.ecnu.edu.cn

ABSTRACT

Efficiently querying RDF [1] data is becoming an important factor in applying Semantic Web technologies to real-world applications. In this context, many efforts have been made to store and query RDF data in relational databases using particular schemas. In this paper, we propose a new scheme to store, index, and query RDF data in triple stores. The graph features of RDF data are taken into consideration, which can help reduce join costs on the vertical database structure. We partition RDF triples into overlapping groups, store them in a triple table with one additional column for the group identity, and build a signature tree to index them. Based on this infrastructure, a complex RDF query is decomposed into multiple sub-queries, which can be easily filtered into some RDF groups using the signature tree index, and is finally evaluated with a composed and optimized SQL statement with specific constraints. We compare the performance of our method with prior art on typical queries over large-scale LUBM and UOBM benchmark data (more than 10 million triples) in [3]. In some extreme cases, our method improves performance by 3 to 4 orders of magnitude.

1 Introduction

The Semantic Web is an effort by the World Wide Web Consortium (W3C) to enable data integration and sharing across different applications. It is designed to evolve the general web, which mostly consists of markup or other formats intended for human readers, into a machine-readable web of data. The Semantic Web data model, the Resource Description Framework (RDF) [1] recommended by the W3C, represents data as a labelled graph that connects resources and their property values, with labelled edges representing properties. The graph can be structurally parsed into a set of triples, or statements, of the form <subject, predicate, object>. The RDF data model is very general and can easily express any type of data. Although the RDF data representation is flexible, it can cause serious performance problems, since RDF queries involve intensive self-joins over the triple table. When the number of triples is so large that they cannot be cached in memory for each filter, the joins become very expensive because disk scans and index lookups are required. Moreover, as a general data model, the RDF triple table stores any type of data in a single column. For example, the ages, weights, and even given names of all persons co-exist in the object column. Consequently, the statistics collected on these columns complicate selectivity estimation, which in turn prevents the relational query optimizer from producing a good query plan.
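The self-join cost described above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the storage scheme proposed in this paper: the table name `triples`, the resource and property names, and the helper functions are all hypothetical, and SQLite stands in for a general relational database. Even a simple question ("which department does each student belong to?") needs four references to the same table, and the `object` column mixes names, ages, and resource identifiers.

```python
import sqlite3

# Hypothetical toy data; resource and property names are illustrative only.
TRIPLES = [
    ("person1", "type", "Student"),
    ("person1", "name", "Alice"),
    ("person1", "age", "24"),
    ("person1", "memberOf", "dept1"),
    ("person2", "type", "Student"),
    ("person2", "name", "Bob"),
    ("person2", "age", "31"),
    ("person2", "memberOf", "dept2"),
    ("dept1", "name", "Computer Science"),
    ("dept2", "name", "Physics"),
]


def build_store():
    """Create an in-memory three-column triple table and load the toy data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)"
    )
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", TRIPLES)
    return conn


# Each triple pattern of the query becomes one more alias of the same table:
# t (type), n (student name), m (membership), d (department name).
QUERY = """
SELECT n.object AS student_name, d.object AS dept_name
FROM triples AS t
JOIN triples AS n ON n.subject = t.subject AND n.predicate = 'name'
JOIN triples AS m ON m.subject = t.subject AND m.predicate = 'memberOf'
JOIN triples AS d ON d.subject = m.object AND d.predicate = 'name'
WHERE t.predicate = 'type' AND t.object = 'Student'
ORDER BY student_name
"""


def student_departments(conn):
    """Return (student name, department name) pairs via three self-joins."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    print(student_departments(build_store()))
```

The join on `n` and `m` is a subject-subject join, while the join on `d` is a subject-object join; on a table with millions of rows, each of these requires index lookups or scans over the whole triple table.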

At present, creating indices, splitting data into property tables (2-column schema), and materializing join views (e.g., subject-subject and subject-object) are common methods for improving RDF query performance on the vertical database structure. However, new storage and indexing schemes that make relational databases much more efficient at supporting RDF queries are still urgently needed. In this paper, we propose a novel idea for optimizing the executable SQL of general-purpose triple stores by considering the graph features of RDF data. An intuitive example is given as follows to illustrate the idea and our motivation.

Efficiently Querying RDF Data in Triple Stores
