Best practices for creating RDF data cubes

This section presents a summary list of best practices for publishing statistics on the Web following the linked data principles. These practices aim to help data publishers model their data and apply common linked data standards. Adopting them can increase interoperability among portals of statistical data on the Web, and thus facilitate both the integration of relevant datasets and the development of generic software tools that can be reused across different datasets.

The particular case against which these best practices have been formulated is as follows:

John works for the National Statistical Institute of Belgium. He is in charge of publishing last year’s data about unemployment and poverty in the regions of Belgium. These data refer to various groups of people based on their age and gender. John has decided to use linked data technologies in order to improve the quality and reusability of the data. The data at hand are multi-dimensional in nature, and thus they should be modelled as a cube. John needs to define the measures, units, dimensions and code lists. The challenges that he faces include: (i) the definition of multiple measures (unemployment, poverty) per cube, (ii) the definition of multiple units (percentage, count) per measure, and (iii) the re-use of standard vocabularies and code lists.

Defining a measure

Goal: John needs to model the measures of the cube at hand as linked data. He wonders what the best way is to do so.

BP1.1: A new measure property should be defined that is not a sub-property of sdmx-measure:obsValue. The new measure enables the annotation with additional properties (for example: labels, comments).
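
As a rough illustration of the annotation part of BP1.1, the following minimal sketch declares a new measure property with a label and a comment. It represents triples as plain Python tuples of prefixed names rather than using an RDF library; the ex: prefix, the function name define_measure and the label and comment texts are assumptions made for this example only.

```python
# Triples are plain (subject, predicate, object) tuples using prefixed names.
# "ex:" is a hypothetical namespace for John's own terms.
def define_measure(measure, label, comment):
    """Return the triples that declare a new, annotated measure property."""
    return [
        (measure, "rdf:type", "rdf:Property"),
        (measure, "rdf:type", "qb:MeasureProperty"),
        (measure, "rdfs:label", f'"{label}"@en'),
        (measure, "rdfs:comment", f'"{comment}"@en'),
    ]


measure_triples = define_measure(
    "ex:unemployment",
    "Unemployment",
    "Unemployed people in a region of Belgium",
)

# Print the triples in a Turtle-like form for inspection.
for s, p, o in measure_triples:
    print(f"{s} {p} {o} .")
```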

Defining the unit

Goal: John has already defined unemployment as the measure of his cube. Now he wonders (i) whether or not to include the unit of the measure in the cube, (ii) what RDF property to use to define the unit, (iii) where to define the unit, and (iv) what values to assign.

BP2.1: A unit of measure should always be included in the cube. The measure on its own is a plain numerical value, and thus a unit is required to interpret this value correctly.

BP2.2: sdmx-attribute:unitMeasure should always be re-used to define units. This property can be used directly to assign values that are not part of a code list (e.g. QUDT). However, when annotation with additional properties (e.g. labels, code-list, etc.) is required, then new units that are sub-properties of sdmx-attribute:unitMeasure should be defined.

BP2.3: The unit should be defined at the qb:Observation. The unit can be additionally defined at the qb:DataSet in order to facilitate the retrieval of the available units in a cube.

BP2.4: URIs from QUDT should be re-used. If QUDT is not sufficient, then DBpedia or other code lists can be used.
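
BP2.1 to BP2.4 can be sketched together as follows: one observation carries its unit through sdmx-attribute:unitMeasure, and the same unit is optionally repeated on the qb:DataSet. Triples are plain Python tuples; the ex: names, the value 7.1 and the exact QUDT URI for percent are assumptions for illustration and should be checked against the QUDT catalogue before use.

```python
# Assumed QUDT unit URI for "percent"; verify it against the QUDT catalogue.
QUDT_PERCENT = "<http://qudt.org/vocab/unit/PERCENT>"


def observation_with_unit(obs, dataset, measure, value, unit):
    """Return the triples of one observation, including its unit."""
    return [
        (obs, "rdf:type", "qb:Observation"),
        (obs, "qb:dataSet", dataset),
        (obs, measure, value),
        (obs, "sdmx-attribute:unitMeasure", unit),
    ]


# The value 7.1 is made up for illustration; "ex:" names are hypothetical.
unit_triples = observation_with_unit(
    "ex:obs1", "ex:cube", "ex:unemployment", '"7.1"^^xsd:decimal', QUDT_PERCENT
)

# Optionally repeat the unit on the qb:DataSet so that clients can list
# the units available in the cube without scanning every observation.
unit_triples.append(("ex:cube", "sdmx-attribute:unitMeasure", QUDT_PERCENT))
```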

Defining multiple units per measure

Goal: John realizes that the data he wants to publish contain unemployment as both a rate, i.e. a percentage of the labour force, and a count, i.e. the actual number of unemployed people. As a result, he needs to include both units. Now he wonders (i) whether to include both units in the same cube or to define a separate cube for each unit, and (ii) where to define multiple units (at the structure or at the observation).

BP3.1: One cube with multiple units should be created and the unit should be defined at each qb:Observation. Conceptually, it is preferable to have all related units of the same measure in the same cube. The unit can be additionally defined at the qb:DataSet in order to facilitate the retrieval of the available units in a cube.
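
A minimal sketch of BP3.1, assuming hypothetical unit identifiers ex:percent and ex:count and made-up values: every observation states its own unit, and the distinct units are then summarised on the dataset.

```python
def cube_with_units(dataset, measure, observations):
    """Build triples for one cube whose observations use different units.

    `observations` is a list of (observation id, value, unit) tuples.
    """
    triples = []
    for obs_id, value, unit in observations:
        triples.append((obs_id, "qb:dataSet", dataset))
        triples.append((obs_id, measure, value))
        # The unit is stated on every observation.
        triples.append((obs_id, "sdmx-attribute:unitMeasure", unit))
    # Additionally list each distinct unit once at dataset level.
    for unit in sorted({unit for _, _, unit in observations}):
        triples.append((dataset, "sdmx-attribute:unitMeasure", unit))
    return triples


# All identifiers and values below are hypothetical.
multi_unit_triples = cube_with_units(
    "ex:cube",
    "ex:unemployment",
    [
        ("ex:obs1", '"7.1"', "ex:percent"),
        ("ex:obs2", '"350000"', "ex:count"),
        ("ex:obs3", '"6.4"', "ex:percent"),
    ],
)

dataset_units = [
    o for s, p, o in multi_unit_triples
    if s == "ex:cube" and p == "sdmx-attribute:unitMeasure"
]
print(dataset_units)
```

Listing the units once on the dataset is a convenience for retrieval only; the per-observation unit remains the value that a consumer should rely on.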

Defining multiple measures

Goal: John also wants to publish data about poverty in Belgium. He wonders whether to publish the data about unemployment and poverty in the same cube or in separate cubes. If both measures are included in the same cube, he also wonders what the best way to do so is, considering that the measures have multiple units (count and rate).

BP4.1: If the data have multiple measures, it is common to publish them in one cube only when the measures are closely related to a single observational event (e.g. sensor network measurements). However, the approach to be followed is up to the data cube publisher. In case of modelling multiple measures in multiple cubes with one measure each, then follow:

BP4.2: When multiple measures are modelled in one cube, the measure dimension approach (i.e. observations with a single measure) should be followed, and the unit should be defined in each observation (as explained in BP3).
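
The measure dimension approach of BP4.2 can be sketched as below: each observation names its measure through qb:measureType, gives the value with that same measure property, and states its own unit. The ex: identifiers and the values are hypothetical.

```python
def measure_dimension_observation(obs, dataset, measure, value, unit):
    """Return the triples of an observation that carries exactly one measure."""
    return [
        (obs, "rdf:type", "qb:Observation"),
        (obs, "qb:dataSet", dataset),
        (obs, "qb:measureType", measure),  # which measure this observation holds
        (obs, measure, value),             # the value, given with that measure
        (obs, "sdmx-attribute:unitMeasure", unit),
    ]


# Two observations in the same cube, one per measure (made-up values).
md_triples = (
    measure_dimension_observation(
        "ex:obs1", "ex:cube", "ex:unemployment", '"7.1"', "ex:percent"
    )
    + measure_dimension_observation(
        "ex:obs2", "ex:cube", "ex:poverty", '"15.5"', "ex:percent"
    )
)

# Map each observation to the single measure it carries.
measure_types = {s: o for s, p, o in md_triples if p == "qb:measureType"}
print(measure_types)
```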

Defining dimension properties

Goal: John has already defined the measures and the units of his cube. Now he needs to define the dimensions including time, geography, age and gender. He wonders what RDF properties to use for these dimensions.

BP5.1: If a dimension refers to time, geography, or age, then a new qb:DimensionProperty should be defined. This new qb:DimensionProperty should also be defined as rdfs:subPropertyOf the corresponding SDMX dimension. For example, a geospatial dimension of a cube should be defined as a sub-property of sdmx-dimension:refArea. BP6 and BP7 describe the way the values of a new dimension can be defined.

BP5.2: If a dimension refers to gender, then sdmx-dimension:sex should be reused, provided that the associated code list addresses the modelling needs, i.e. additional notions of sex such as hermaphroditism, transgender, and asexual are not needed. Otherwise, a new dimension should be defined along with a controlled vocabulary (see BP6 and BP7).
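
As a sketch of BP5.1 and BP5.2, the following code defines new dimension properties for geography, age and time as sub-properties of the corresponding SDMX dimensions, while gender reuses sdmx-dimension:sex directly. The ex: property names and labels are assumptions for this example.

```python
def define_dimension(dimension, sdmx_parent, label):
    """Return the triples that declare a new dimension under an SDMX dimension."""
    return [
        (dimension, "rdf:type", "rdf:Property"),
        (dimension, "rdf:type", "qb:DimensionProperty"),
        (dimension, "rdfs:subPropertyOf", sdmx_parent),
        (dimension, "rdfs:label", f'"{label}"@en'),
    ]


# New dimensions for geography, age and time ("ex:" names are hypothetical).
dimension_triples = (
    define_dimension("ex:region", "sdmx-dimension:refArea", "Region")
    + define_dimension("ex:ageGroup", "sdmx-dimension:age", "Age group")
    + define_dimension("ex:year", "sdmx-dimension:refPeriod", "Year")
)

# Gender reuses sdmx-dimension:sex as-is, so no new property is declared for it.
cube_dimensions = ["ex:region", "ex:ageGroup", "ex:year", "sdmx-dimension:sex"]
```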

Associating dimensions with their values

Goal: John needs to associate dimensions with their potential values. He wonders what the best way to do so is.

BP6.1: The rdfs:range of a qb:DimensionProperty should always be defined.

BP6.2: If a code list is modelled as skos:ConceptScheme, qb:HierarchicalCodeList, or skos:Collection, then it should be associated with the qb:DimensionProperty using the qb:codeList property. In addition, the rdfs:range of the dimension should be set to skos:Concept (for the way to define a new code list, see BP9).
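
BP6.1 and BP6.2 can be illustrated with a small sketch that attaches a code list to a dimension and then reports dimensions that still lack an rdfs:range. The helper names and the ex: identifiers are hypothetical.

```python
def attach_code_list(dimension, code_list):
    """Link a dimension to its code list and set its range to skos:Concept."""
    return [
        (dimension, "qb:codeList", code_list),
        (dimension, "rdfs:range", "skos:Concept"),
    ]


def dimensions_without_range(dimensions, triples):
    """Return the dimensions that have no rdfs:range triple."""
    with_range = {s for s, p, _ in triples if p == "rdfs:range"}
    return [d for d in dimensions if d not in with_range]


range_triples = attach_code_list("ex:ageGroup", "ex:ageGroups")

# ex:region has not been given a range yet, so the check reports it.
unranged = dimensions_without_range(["ex:ageGroup", "ex:region"], range_triples)
print(unranged)
```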

Defining values of common dimensions

Goal: John now knows how to associate his cube’s dimensions with their values. However, he wonders (i) whether to use data types or URIs and (ii) in case of URIs, what code lists to use to define values of common dimensions including time, geography, age, and gender.

BP7.1a: In case of a specific point in time, a new dimension should be defined. This dimension should be rdfs:subPropertyOf sdmx-dimension:refPeriod and have rdfs:range xsd:dateTime.

BP7.1b: In case of a period of time, a new dimension should be defined. This dimension should be rdfs:subPropertyOf sdmx-dimension:refPeriod and have as rdfs:range the interval:Interval class of the http://reference.data.gov.uk vocabulary, which uses this class to define years. However, if the approach of http://reference.data.gov.uk is not sufficient, then new code lists can also be created and used (see BP9).
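
The two time cases of BP7.1a and BP7.1b differ only in the range of the new dimension, which the sketch below makes explicit. The helper names and ex: identifiers are hypothetical, and the year URI pattern is an assumption about how http://reference.data.gov.uk identifies calendar years that should be verified before use.

```python
# Range to use for each kind of time dimension.
RANGE_BY_KIND = {
    "instant": "xsd:dateTime",       # a specific point in time
    "period": "interval:Interval",   # a period of time, e.g. a year
}


def define_time_dimension(dimension, kind):
    """Return the triples for a time dimension of the given kind."""
    return [
        (dimension, "rdf:type", "qb:DimensionProperty"),
        (dimension, "rdfs:subPropertyOf", "sdmx-dimension:refPeriod"),
        (dimension, "rdfs:range", RANGE_BY_KIND[kind]),
    ]


def year_uri(year):
    """Build a year URI; the pattern is assumed, not taken from the text."""
    return f"<http://reference.data.gov.uk/id/year/{year}>"


year_dimension = define_time_dimension("ex:year", "period")
timestamp_dimension = define_time_dimension("ex:measuredAt", "instant")
print(year_uri(2016))
```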

BP7.2: In case of a geography or age-related dimension, a new dimension should be defined. This dimension should be rdfs:subPropertyOf sdmx-dimension:refArea or sdmx-dimension:age respectively. Moreover, the rdfs:range and/or qb:codeList of this dimension should be defined as described in BP6. If a code list or reference dataset that addresses the modelling needs exists, then it should be re-used. Otherwise, a new code list should be created (see BP9).

Modelling single value dimensions

Goal: The cube at hand includes data only for one year, i.e. 2016. John wonders (i) whether or not to include this single value in the data cube and (ii) if so, what modelling approach to follow.

BP8.1: A single value dimension should always be included in all observations of the cube.
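
A minimal sketch of BP8.1, with observations held as Python dictionaries before they are serialised: the single year value is added to every observation, and a check confirms that none was missed. The ex: identifiers and the values are made up for illustration.

```python
def with_single_value_dimension(observations, dimension, value):
    """Return copies of the observations with the constant dimension added."""
    return [{**obs, dimension: value} for obs in observations]


def all_have_dimension(observations, dimension):
    """Check that every observation states the given dimension."""
    return all(dimension in obs for obs in observations)


# Hypothetical observations that do not yet state the year.
raw_observations = [
    {"ex:region": "ex:region1", "ex:unemployment": 7.1},
    {"ex:region": "ex:region2", "ex:unemployment": 9.8},
]

complete_observations = with_single_value_dimension(
    raw_observations, "ex:year", "ex:year2016"
)
print(all_have_dimension(complete_observations, "ex:year"))
```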

Creating code lists

Goal: John has already defined dimension properties and decided the code lists to use for two dimensions, namely time and gender. However, there are no appropriate code lists for age and geography related dimensions, and thus John has to create them.

BP9.1: A code list should be modelled using SKOS. This is also suggested by the QB vocabulary. Specifically, individual code values should be modelled using skos:Concept and the overall set of values should be modelled using skos:ConceptScheme or skos:Collection. Always define a separate code list for each distinct set of values (e.g. age groups and geographical areas).

BP9.2: In the case of hierarchical data, hierarchical code lists should always be used to describe them. SKOS should be preferred when the hierarchies are simple. Where the hierarchical levels are fully separated and depth is a meaningful concept, XKOS is appropriate. Finally, when there is a need to express relations that are not covered by SKOS or XKOS (e.g. administeredBy vs within), the code list facilities of the QB vocabulary should be preferred.

BP9.3: Aggregate values (e.g., “Total”) should be included in a dimension if the measured variable in this dimension can be aggregated. The aggregate value should be modelled at the top of a hierarchy (see BP9.2).
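
BP9.1 to BP9.3 can be sketched for a simple SKOS hierarchy as follows: the code list is a skos:ConceptScheme, each code is a skos:Concept, and the aggregate value sits at the top with the individual groups below it. The age group identifiers are hypothetical.

```python
def skos_code_list(scheme, top, children):
    """Return the triples of a two-level SKOS code list with an aggregate on top."""
    triples = [
        (scheme, "rdf:type", "skos:ConceptScheme"),
        (top, "rdf:type", "skos:Concept"),
        (top, "skos:inScheme", scheme),
        (top, "skos:topConceptOf", scheme),
        (scheme, "skos:hasTopConcept", top),
    ]
    for child in children:
        triples.append((child, "rdf:type", "skos:Concept"))
        triples.append((child, "skos:inScheme", scheme))
        # Each group is narrower than the aggregate value.
        triples.append((child, "skos:broader", top))
        triples.append((top, "skos:narrower", child))
    return triples


# Hypothetical age groups with "Total" as the aggregate top concept.
age_code_list = skos_code_list(
    "ex:ageGroups", "ex:age-total", ["ex:age-15-24", "ex:age-25-64"]
)
```

A separate call with its own scheme would be used for the geography code list, in line with the advice to keep one code list per distinct set of values.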

Feedback:

Questions and remarks can be addressed to Evangelos Kalampokis, University of Macedonia, Greece and CERTH/ITI.