Data Cubes: from tabular data to data cubes

This section introduces the concepts of dimensions, measures, data cubes, and slices.

Tabular Data

Consider these Olympic medal results, shown in familiar tabular form:

The rows and columns format is familiar from spreadsheets, but this is in fact, multi-dimensional data. These data have two dimensions: country, and medal colour. The country dimension is shown as rows, and the medal colour dimension as columns:

These data show medal results for the 2012 Olympics, so we can think of the data as including a third dimension, the year of the results:

Here, the year dimension has the same value for all the data in our table, so we describe this dimension as locked to a value (in this case, 2012). By contrast, the country and medal dimensions can take different values.

To obtain one value, we lock one dimension at a time — by choosing what value it takes — until all the dimensions are locked. For example, if we choose:

then we get to one cell.

Where our locked dimension values intersect, we have located the observation we are interested in:

We call the number in the cell the measure. The measure has a value (46) and a unit (medals won). This measure is, intuitively, a count — i.e. how many of something were there?. It’s also very common to encounter ratios, expressed as a percentage, or as something per something.

Two dimensions, plus one

Let’s imagine that we obtain two more spreadsheets of medal data, covering 2008 and 2004:

Note that our observation from before — because it locked values for all of its dimensions — remains valid, but we’ve added a bunch of new observations.

Now, suppose we are interested in how the number of gold medals won by each team has changed over time. Our data set contains all this data (the cells that have each been filled with a colour below), but it’s not very convenient to extract it (even in a small data set like this).

Wouldn’t it be more convenient to have a table which locked the medal dimension to ‘gold’ and then showed us how that changed over time, as shown here?

If all you have is spreadsheets, then you can get there with the required spreadsheet skills and maybe a little bit of trial and error. But there is a better way to do it: data cubes!

Data cubes

To explain the concept of data cubes, let’s make our illustrations a bit clearer by getting rid of the numbers, and just using colour to show what’s in each cell. This is just for the illustration though — bear in mind that the measures are all still there!

Next, let’s adjust how we draw it a little bit. Imagine stacking the spreadsheets front to back. Notice that the data isn’t changing, we’re just adding a third dimension to the illustration.

The result is a data cube. The one illustrated is an actual cube, but in practice, the dimensions don’t need to be  - and indeed most likely won’t be - the same size. A date cube is not a "cube" in the strict mathematical sense, as all the sides are not necessarily equal. For instance, we’ve only shown three countries, but the example could include a hundred more rows for a hundred more countries.

The important change here is that the data are no longer tied to a two-dimensional representation, they fill space.

Let’s have a closer look at what we have here.

Our three original spreadsheets, showing the “free” dimensions Country and Medal and the locked dimension, Years, are still there. Each of them is a slice through the cube. It becomes clear now also that there are other ways of slicing the cube as well …

Each of these slices is a potential two dimensional spreadsheet.

This is where we start to see the usefulness of arranging our data as a cube. By choosing a dimension to lock, and its value, we can obtain a table of exactly the view onto our data that we need.

More than three dimensions

Recall how, a little earlier, looking at our 2-D table, we realised we had been assuming a third (year) dimension? Perhaps we might now realise that there is a fourth dimension — the gender of the athlete. Our tables so far have locked the gender dimension to ‘All’ but maybe we also want to provide charts for male and female athletes:

There’s no reason to stop with just four dimensions. We can go on to imagine datasets of any dimensionality. For example, let’s add a fifth dimension with data for paralympics:

The term hypercube is sometimes used, especially for data with more than three dimensions.

This also illustrates that dimensions can be organized as a hierarchy: a set of parent-child relationships, typically where a parent dimension summarizes its children. Parent elements can further be aggregated as the children of another parent.

As we add dimensions, it becomes increasingly difficult to visualise this, but you don’t really need to. Just lock values for each dimension in turn, and you will eventually end up with an easily-handled table.

We’ve now reached the limit of what would be practical in everyday spreadsheet software — this would need 54 worksheets, but the data cube can keep on adding dimensions indefinitely.

And what about our example observation? Even with five dimensions, it is still there, at the following co-ordinates: