Table2qb

Important: this example was worked out on an 'in development' version of table2qb (dated 30 November 2018). For the latest version of all the commands and their parameters, consult the Git repository.

Instructions for downloading and installing the tool are provided in the repository. In this section we only describe the main steps and then apply them to a sample data set of Olympic medals.

Overview of the conversion process

Converting a structured CSV data file with table2qb comprises two main parts:

  1. Preparation of the input and configuration files. Besides the structured CSV data covering the observations, we also need to prepare some structured input files providing information for the following:
    • the 'own' components (dimensions, attributes, measures). In our case we defined two dimensions, https://example.org/ns/olympics#competition and https://example.org/ns/olympics#medaltype, and one measure, https://example.org/ns/olympics#numberofmedals.
    • the codelists that we define ourselves. In our case these are https://example.org/id/conceptscheme/medaltypes and https://example.org/id/conceptscheme/competitions.
    • the mapping between the components and the data (also known as the columns configuration file).
  2. Running table2qb pipelines (i.e. transformations from CSV to Turtle):
    • the components pipeline,
    • the codelist pipeline,
    • the cube pipeline.

When we combine the results of the three pipelines we get an RDF data cube encoded in Turtle.

Preparation of the environment

Before engaging in creating files for table2qb, let's walk through the essential preparation steps for our example.

Preparation of the input file ('observations.csv')

Starting again from the CSV file introduced previously:

Competition,Edition,NOC,Gender,Medal,Value
Olympics,2004,CHN,Male,Bronze,5
Olympics,2004,CHN,Male,Gold,16
Olympics,2004,CHN,Male,Silver,9
Olympics,2004,CHN,Female,Bronze,10
[...]

Line 1 of the file contains the column labels, and we see that our dimensions are there as columns.

The last column, Value, is the measure: the number of medals for the combination of values specified by the preceding columns on each line.

For processing with table2qb, we need to add a tool-specific column, MeasureType, with the value Count, to indicate that we are dealing with numbers/counts, not ratios. As a result, we get an input file that looks as follows (showing only the first few lines):

Competition,Edition,NOC,Gender,Medal,MeasureType,Value
Olympics,2004,CHN,Male,Bronze,Count,5
Olympics,2004,CHN,Male,Gold,Count,16
Olympics,2004,CHN,Male,Silver,Count,9
Olympics,2004,CHN,Female,Bronze,Count,10
Olympics,2004,CHN,Female,Gold,Count,36
[...]
    

As we will be using the http://purl.org/linked-data/sdmx/2009/code#sex controlled vocabulary from SDMX to indicate gender, we also translate our source column Gender to use the values sex-F and sex-M accordingly:

Competition,Edition,NOC,Gender,Medal,MeasureType,Value
Olympics,2004,CHN,sex-M,Bronze,Count,5
Olympics,2004,CHN,sex-M,Gold,Count,16
Olympics,2004,CHN,sex-M,Silver,Count,9
Olympics,2004,CHN,sex-F,Bronze,Count,10
Olympics,2004,CHN,sex-F,Gold,Count,36
[...]
    
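The two preparation steps above (adding the MeasureType column and translating the Gender values) can be scripted. The sketch below uses Python's csv module on a few sample rows taken from the examples above; the column order and code values follow the document, but the script itself is just one possible way to do the conversion.

```python
import csv
import io

# Map the source Gender labels to the SDMX sex code notations
SEX_CODES = {"Male": "sex-M", "Female": "sex-F"}

def prepare(rows):
    """Translate Gender and add the MeasureType column for each row."""
    for row in rows:
        row["Gender"] = SEX_CODES.get(row["Gender"], row["Gender"])
        row["MeasureType"] = "Count"  # counts, not ratios
        yield row

# Sample taken from the first lines of observations.csv
SOURCE = """\
Competition,Edition,NOC,Gender,Medal,Value
Olympics,2004,CHN,Male,Bronze,5
Olympics,2004,CHN,Female,Gold,36
"""

reader = csv.DictReader(io.StringIO(SOURCE))
out = io.StringIO()
writer = csv.DictWriter(
    out, fieldnames=["Competition", "Edition", "NOC", "Gender",
                     "Medal", "MeasureType", "Value"])
writer.writeheader()
writer.writerows(prepare(reader))
print(out.getvalue())
```

In practice you would read 'observations.csv' from disk and write the prepared file back out; the in-memory strings here just keep the sketch self-contained.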

Preparing the components data file

The components data file 'components.csv' describes the components (dimensions, attributes and measures) that we define ourselves. It should have the columns Label, Description, Component Type and Codelist.

Hence our 'components.csv' is as follows:

Label,Description,Component Type,Codelist
Competition,The type of competition,Dimension,https://example.org/id/conceptscheme/competitions
Medal,The type of medal,Dimension,https://example.org/id/conceptscheme/medaltypes 
Count,Number of medals,Measure,
    

Running the components pipeline

The following instructions assume that you have a Java 8 runtime installed, that you have obtained the table2qb.jar from https://github.com/Swirrl/table2qb/releases, and that you are familiar with command-line processing.

To run the components-pipeline use the following command:

table2qb exec components-pipeline \
  --input-csv components.csv \
  --base-uri https://example.org/ \
  --output-file components.ttl
    

This will generate the following output for the two dimension properties that we defined ourselves:

<https://example.org/ns/olympics#competition>
  rdf:type qb:DimensionProperty ;
  rdf:type rdf:Property ;
  dcterms:description "The type of competition" ;
  qb:codeList <https://example.org/id/conceptscheme/competitions> ;
  rdfs:label "Competition" ;
  skos:notation "competition" ;
.
<https://example.org/ns/olympics#medaltype>
  rdf:type qb:DimensionProperty ;
  rdf:type rdf:Property ;
  dcterms:description "The type of medal" ;
  qb:codeList <https://example.org/id/conceptscheme/medaltypes> ;
  rdfs:label "Medal" ;
  skos:notation "medal" ;
.

And the measure:

<https://example.org/ns/olympics#numberofmedals>
  rdf:type qb:MeasureProperty ;
  rdf:type rdf:Property ;
  dcterms:description "Number of medals" ;
  rdfs:label "Number of medals" ;
  skos:notation "number of medals" ;
.    
    

Preparing the codelists data files

Codelists describe the universal set of codes that may appear as the value of a component, not just the (sub)set that has been used within a cube. A separate codelist CSV file, containing all the possible codes for a given codelist, should be created for every component from the input file that is defined by a codelist.

Each codelist CSV file should have the columns Label, Notation and Parent Notation.

In our case, this means we must create codelist files for competitions and medal types.

'competitions.csv' is a simple flat codelist, so there is no parent for the codes:

Label,Notation,Parent Notation
Olympics,olympics,
Paralympics,paralympics,
    

'medaltypes.csv' is a hierarchical codelist, so in addition to the codes used in our data we also need to declare the parent:

Label,Notation,Parent Notation
All medals,allmedals,
Bronze,bronzemedal,allmedals
Silver,silvermedal,allmedals
Gold,goldmedal,allmedals    
    
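To make the Parent Notation semantics concrete: a code with an empty Parent Notation becomes a top concept of the scheme (skos:topConceptOf), while a non-empty Parent Notation produces a skos:broader link, as the pipeline output below shows. A minimal Python sketch of that hierarchy, using the medaltypes codelist from above:

```python
import csv
import io

# Contents of medaltypes.csv as shown above
CODELIST = """\
Label,Notation,Parent Notation
All medals,allmedals,
Bronze,bronzemedal,allmedals
Silver,silvermedal,allmedals
Gold,goldmedal,allmedals
"""

# Build a parent -> children map; the entry under the empty string
# holds the top concepts of the concept scheme
children = {}
for row in csv.DictReader(io.StringIO(CODELIST)):
    children.setdefault(row["Parent Notation"], []).append(row["Notation"])

print("top concepts:", children[""])
print("children of allmedals:", children["allmedals"])
```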

Running the codelist pipeline

The codelist-pipeline must be run for each of the codelist files that we created.

Use the following command for the competitions codelist:

table2qb exec codelist-pipeline \
  --codelist-csv competitions.csv \
  --codelist-name "Competitions" \
  --codelist-slug "competitions" \
  --base-uri https://example.org/ \
  --output-file competitions.ttl
    

This produces the following output:

<https://example.org/id/conceptscheme/competitions>
  rdf:type skos:ConceptScheme ;
  dcterms:title "Competitions"@en ;
  rdfs:label "Competitions"@en ;
  skos:hasTopConcept <https://example.org/id/concept/olympics> ;
  skos:hasTopConcept <https://example.org/id/concept/paralympics> ;
.
<https://example.org/id/concept/olympics>
  rdf:type skos:Concept ;
  rdfs:label "Olympics" ;
  skos:inScheme <https://example.org/id/conceptscheme/competitions> ;
  skos:notation "olympics" ;
  skos:topConceptOf <https://example.org/id/conceptscheme/competitions> ;
.
<https://example.org/id/concept/paralympics>
  rdf:type skos:Concept ;
  rdfs:label "Paralympics" ;
  skos:inScheme <https://example.org/id/conceptscheme/competitions> ;
  skos:notation "paralympics" ;
  skos:topConceptOf <https://example.org/id/conceptscheme/competitions> ;
.
    

For 'medaltypes.csv', run the following command:

table2qb exec codelist-pipeline \
  --codelist-csv medaltypes.csv \
  --codelist-name "Medal types" \
  --codelist-slug "medaltypes" \
  --base-uri https://example.org/ \
  --output-file medaltypes.ttl

This gives us:

<https://example.org/id/conceptscheme/medaltypes>
  rdf:type skos:ConceptScheme ;
  dcterms:title "Medal types"@en ;
  rdfs:label "Medal types"@en ;
  skos:hasTopConcept <https://example.org/id/concept/allmedals> ;
.
<https://example.org/id/concept/allmedals>
  rdf:type skos:Concept ;
  dcterms:title "Medals aggregated"@en ;
  rdfs:label "Medals aggregated"@en ;
  skos:notation "allmedals" ;
  skos:inScheme <https://example.org/id/conceptscheme/medaltypes> ;
  skos:topConceptOf <https://example.org/id/conceptscheme/medaltypes>;
.
<https://example.org/id/concept/bronzemedal>
  rdf:type skos:Concept ;
  rdfs:label "Bronze" ;
  skos:broader <https://example.org/id/concept/allmedals>;
  skos:inScheme <https://example.org/id/conceptscheme/medaltypes> ;
  skos:notation "bronzemedal" ;
.
<https://example.org/id/concept/goldmedal>
  rdf:type skos:Concept ;
  rdfs:label "Gold" ;
  skos:broader <https://example.org/id/concept/allmedals>;
  skos:inScheme <https://example.org/id/conceptscheme/medaltypes> ;
  skos:notation "goldmedal" ;
.
<https://example.org/id/concept/silvermedal>
  rdf:type skos:Concept ;
  rdfs:label "Silver" ;
  skos:broader <https://example.org/id/concept/allmedals>;
  skos:inScheme <https://example.org/id/conceptscheme/medaltypes> ;
  skos:notation "silvermedal" ;
.

Preparing the columns configuration file

The columns configuration file defines the mapping between our column names and the corresponding components. This file also sets out any preparatory transformations and URI templates.

This CSV file should have the columns title, name, component_attachment, property_template, value_template, datatype and value_transformation.

The configuration file should contain one row for each dimension, measure and attribute used in any input file.

Applying the above to our data, we end up with the following 'columns.csv':

title,name,component_attachment,property_template,value_template,datatype,value_transformation
Competition,competition,qb:dimension,https://example.org/ns/olympics#competition,https://example.org/id/concept/{competition},string,
Edition,edition,qb:dimension,http://purl.org/linked-data/sdmx/2009/dimension#refPeriod,,string,
NOC,noc,qb:dimension,http://purl.org/linked-data/sdmx/2009/dimension#refArea,http://publications.europa.eu/resource/authority/country/{noc},string,
Gender,gender,qb:dimension,http://purl.org/linked-data/sdmx/2009/dimension#sex,http://purl.org/linked-data/sdmx/2009/code#{gender},string,slugize
Medal,medal,qb:dimension,https://example.org/ns/olympics#medaltype,https://example.org/id/concept/{medal},string,slugize
MeasureType,measure_type,qb:dimension,http://purl.org/linked-data/cube#measureType,https://example.org/ns/olympics#numberofmedals,string,slugize
Unit,unit,qb:attribute,http://purl.org/linked-data/sdmx/2009/attribute#unitMeasure,http://qudt.org/vocab/unit#{unit},string,unitize
Value,value,,https://example.org/ns/olympics#numberofmedals,,number,
Count,count,qb:measure,https://example.org/ns/olympics#numberofmedals,,string,
    
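The value_transformation column applies a normalisation to the cell value before it is substituted into the value_template. The Python sketch below shows how slugize and template expansion interact for the Medal column; note that the slugize rules here are an approximation for illustration, not table2qb's exact implementation (consult the table2qb source for the precise behaviour).

```python
import re

def slugize(value):
    """Rough approximation of table2qb's slugize transformation:
    lower-case the value and collapse runs of non-alphanumeric
    characters into single hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")

def expand(template, **values):
    """Fill a value_template such as
    https://example.org/id/concept/{medal} with a transformed cell."""
    return template.format(**values)

# The Medal cell "Bronze" slugizes to "bronze" and is substituted
# into the value_template from the columns.csv row for Medal
uri = expand("https://example.org/id/concept/{medal}",
             medal=slugize("Bronze"))
print(uri)
```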

Running the cube pipeline

Finally, we run the cube pipeline with the following command:

table2qb exec cube-pipeline \
  --input-csv observations.csv \
  --dataset-name "Olympics dataset" \
  --dataset-slug "olympics" \
  --column-config columns.csv \
  --base-uri https://example.org/ \
  --output-file olympics.ttl

The result is a Turtle file containing the observations. This extract shows one of the observations from the file:

<https://example.org/id/observation/2004/CHN/sex-f/bronze/count>
  rdf:type qb:Observation ;
  qb:dataSet <https://example.org/id/dataset/olympics> ;
  qb:measureType <https://example.org/ns/olympics#numberofmedals> ;
  sdmx-dimension:refArea
    <http://publications.europa.eu/resource/authority/country/CHN> ;
  sdmx-dimension:refPeriod "2004" ;
  sdmx-dimension:sex <http://purl.org/linked-data/sdmx/2009/code#sex-F> ;
  <https://example.org/ns/olympics#competition> <https://example.org/id/concept/olympics> ;
  <https://example.org/ns/olympics#medaltype> <https://example.org/id/concept/bronze> ;
  <https://example.org/ns/olympics#numberofmedals> "10"^^xsd:integer ;
  .

Results

The result of all of the above operations is that we now have three types of output files: the components (components.ttl), the codelists (competitions.ttl and medaltypes.ttl) and the cube itself (olympics.ttl).

When we import these into an RDF triple store, we have a queryable database.
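Before loading, the pipeline outputs can also be combined into a single Turtle file. Concatenating Turtle files is safe as long as each file is valid on its own, since prefix declarations may be repeated. A small sketch (the filenames are the --output-file values used in the commands above):

```python
from pathlib import Path

# Output files of the three pipelines
PARTS = ["components.ttl", "competitions.ttl",
         "medaltypes.ttl", "olympics.ttl"]

# Concatenate whichever parts exist into one loadable file
combined = "\n".join(Path(p).read_text()
                     for p in PARTS if Path(p).exists())
Path("olympics-cube.ttl").write_text(combined)
print(f"wrote {len(combined)} characters to olympics-cube.ttl")
```

Alternatively, most triple stores will happily load the four files one by one into the same graph.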

See also: