Semantic Data Cubes

In this context we define semantic data cubes as data cubes encoded and approached with 'semantic web' technologies.

The different pieces of the semantic web approach to data cubes are the following:

In the next sections, we will elaborate on each of the above, and continue to build on the Olympic Games example that we also used to explain the fundamentals of data cubes. We will also introduce Turtle as the syntax to encode RDF statements.

RDF: a domain independent data model

The Resource Description Framework (RDF) is a domain-independent framework for expressing information about resources, intended for machine processing and exchange on the web. Resources can be anything, including documents, people, physical objects, and abstract concepts.

In our example, countries, editions of Olympics, medals, gender, athletes, disciplines, and so on, are all resources that we can refer to and make statements about.

RDF statements always have the structure of a simple sentence:

<subject> <predicate> <object>.

The subject is the thing, the resource the statement is about. This resource has a property or relationship (called predicate), of which the value is the object.

Because RDF statements consist of three resources they are called triples.

For example:

<USA> <is_a> <country> .
<USA> <is_situated_in> <Northern_America> .
<USA> <has_area> 9,833,520 km2 .
<Northern_America> <is_a> <geographic_region> .
    

It looks quite artificial for now, but just take this as a starting point. There are already several things that this example illustrates:

  1. A particular resource may be the subject of one triple and the object of another. This makes it possible to find connections between triples, which is an important part of RDF's power.
  2. Properties are also resources. They are a bit special in that they express some kind of relationship between two other resources, but other than that they are resources just like the subject and object.
  3. We need a way of identifying resources unambiguously for machines to be able to process this and establish correct (sensible) links between the data they are given, and also to re-use resources that have already been identified (by ourselves or by others).

To identify the subjects, properties, and objects in RDF statements unambiguously, the following techniques are available: URIs and literals.

Applying this to the first triple in our example, we could arrive at the following statement:

     <http://dbpedia.org/resource/United_States>
          <http://www.w3.org/1999/02/22-rdf-syntax-ns/type>
                <http://dbpedia.org/ontology/Country> .
    

Step by step:

To illustrate the use of literals, let’s also add this statement:

     <http://dbpedia.org/resource/United_States>
        <http://www.w3.org/2000/01/rdf-schema/label>
           "Verenigde Staten"@nl .
    

Again, step by step:

As may already be apparent from the examples above, when we start making multiple statements related to the same subject, things can quickly get quite verbose and lengthy, and we also see the same URIs being repeated entirely or partially quite often. To reduce this verbosity and make the statements more readable, a more compact syntax is available in the form of Turtle.

Turtle, or Terse RDF Triple Language, uses abbreviations and other shorthand notations:

If we apply this to our sample triples, we can reduce our statements to the following more condensed form (keeping in mind that prefixes only need to be defined once):

     @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
     @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
     @prefix dbo:	<http://dbpedia.org/ontology/> .
     @prefix dbr:	<http://dbpedia.org/resource/> .

     dbr:United_States rdf:type dbo:Country;
         rfds:label "Verenigde Staten"@nl;
        .
    

We will use Turtle as the syntax for our examples from here on forward.

RDFS: adding domain semantics

While RDF provides a way to make statements in triple format about resources, it does not however make any assumptions about what those resources represent. RDF in itself is domain-free.

To apply RDF in your domain, you need to specify “the things” you want to talk about and the properties and relationships of relevance. In RDF-speak this translates into defining the classes and properties to be used in your RDF statements to describe your resources. Such a group of domain-relevant classes and properties are referred to as a vocabulary. The RDF Schema Language (RDFS) is what is used to define such vocabularies.

RDF Data Cube Vocabulary

Fortunately there is already a vocabulary for describing statistical information and data cubes, and what’s more, it is a W3C standard: the RDF Data Cube vocabulary.

RDF Data Cube Classes

Let’s have a quick reminder of the concepts we discovered while exploring the Olympics data:

These are our classes of interest, and nearly all of these have an RDFS translation/definition in the RDF Data Cube vocabulary (in what follows, qb is the prefix for http://purl.org/linked-data/cube#):

Data cubeqb:DataSet
Dimensionqb:DimensionProperty
Observationqb:Observation
Measurementqb:MeasureProperty
UnitNot in RDF Data Cubes but can be found in other vocabularia, e.g. QUDT.
Sliceqb:Slice

Using that vocabulary and adding a few new “words”, a data cube has been defined as follows:

A qb:DataSet has a qb:DataStructureDefinition that defines the structure of the cube. The structure is specified by means of a qb:ComponentSpecification containing a number of qb:ComponentProperty sets, detailing qb:DimensionProperty to define the dimensions of the cube, qb:MeasureProperty to define the measured variables, and qb:AttributeProperty to define structural metadata such as the unit of measurement. The observations themselves are included as a qb:Observation for each cell of the cube.

For those who are interested to know more of the technical details, each of these classes has a formal semantic definition in the RDF Data Cube vocabulary specification. We’ll limit ourselves to just one example here: the definition of the Observation Class.

  qb:Observation
   rdf:type rdfs:Class ;
   rdfs:comment "A single observation in the cube, may have one or more
    associated measured values"@en ;
   rdfs:isDefinedBy <http://purl.org/linked-data/cube> ;
   rdfs:label "Observation"@en ;
   rdfs:subClassOf qb:Attachable ;
   owl:equivalentClass scovo:Item ;
.

Some of this will already be sufficiently familiar and readable:

The formal semantics are expressed in the last two statements:

Omitting the prefixes for a moment, the use of subClassOf means that every instance of an Observation is also an instance of the class Attachable (an abstract superclass for everything that can have attributes and dimensions.). The use of equivalentClass means that every instance of an Observation is also an instance of the class Item defined in another vocabulary called The Statistical Core Vocabulary (SCOVO, meanwhile deprecated), and conversely every such Item is also an instance of Observation.

RDF Data Cube Properties

The RDF data cube vocabulary defines also a list of properties:relations:

rdf data cube properties

Again, more technical details are available in the RDF Data Cube vocabulary specification.

Let’s have a more detailed look at one example:

:X qb:dataSet :Y .

Note a small but important detail here: the lowercase letter d in dataSet. Earlier we mentioned qb:DataSet with a capital D, which is the class for data cubes. Both qb:DataSet and qb:dataSet come from the same vocabulary, but the class qb:DataSet does not equal the property qb:dataSet – they are different things!

The triple states that the resource we identify by means of the URI :X has a qb:dataSet property of which the value is the resource identified by URI :Y. To understand what that means, we need to look at the formal definition of qb:dataSet in the RDF Data Cube vocabulary specification:

qb:dataSet
  rdf:type rdf:Property ;
  rdfs:comment "indicates the data set of which this observation 
     is a part"@en ;
  rdfs:isDefinedBy <http://purl.org/linked-data/cube> ;
  rdfs:label "data set"@en ;
  rdfs:domain qb:Observation ;
  rdfs:range qb:DataSet ;
  owl:equivalentProperty scovo:dataset ;
.

Step by step:

The formal semantics are in the last 3 lines, which say the following things about qb:dataSet:

So, given a triple that states :X qb:dataSet :Y, this means that:

Supporting vocabularies

The Data Cube vocabulary in turn builds upon the following existing RDF vocabularies:

We’ll very briefly elaborate on SKOS, and we’ll also touch on SDMX (Statistical Data and Metadata eXchange).

SKOS

Values for dimensions within a data cube must be unambiguously defined. This can be achieved either by assigning data types to values to make sure they are correctly specified and interpreted, or by defining codes in code-lists in order to create controlled sets of values. Sometimes such code lists will already have been defined and may be suitable for re-use. If however you need to define your own lists, the Simple Knowledge Organization System (SKOS) is the recommended way to define both the codes and the code lists, where codes can be defined as skos:Concept, and the code lists as skos:ConceptScheme or skos:Collection. SKOS can also be used to define hierarchical code lists, thereby enabling aggregation of data in cubes.

Those who are keen to know more of the details of SKOS should add the SKOS primer to their reading lists.

An example:

<https://example.org/id/conceptscheme/competitions>   rdf:type skos:ConceptScheme ;   
       rdfs:label "competitions"@en ;   
       skos:hasTopConcept <https://example.org/id/concept/olympics> ;   
       skos:hasTopConcept <https://example.org/id/concept/paralympics> ; 
.
<https://example.org/id/concept/olympics>   rdf:type skos:Concept ;   
       rdfs:label "olympics"@en ;   
       skos:inScheme <https://example.org/id/conceptscheme/competitions> ; 
. 
<https://example.org/id/concept/paralympics>   rdf:type skos:Concept ;   
       rdfs:label "paralympics"@en ;   
       skos:inScheme <https://example.org/id/conceptscheme/competitions> ; 
. 

We have a controlled list (skos:ConceptScheme) here with label 'competitions' with 2 concepts:

Notice the relationship (skos:hasTopConcept) for relating the controlled list with the concepts and skos:inScheme for the other way around.

SDMX

The model underpinning the RDF Data Cube vocabulary is compatible with the cube model that underlies SDMX (Statistical Data and Metadata eXchange), an ISO standard for exchanging and sharing statistical data and metadata among organizations.

The SDMX standard includes a set of content oriented guidelines (COG) which define cross-domain concepts, code lists, and categories that support interoperability and comparability between datasets by providing a shared terminology between SDMX implementers. A community group has developed RDF encodings of these guidelines. While these encodings do not form part of the RDF Data Cube specification, they are however used by a number of existing Data Cube publications.

More information on SDMX and specifically how it relates to the RDF Data Cube vocabulary is available in the RDF Data Cube vocabulary specification.