Data is magical. When I talk to members of my family, I get the impression they view it like the matrix, the numbers all pouring from the sky and somehow falling in the Tableau bucket to make charts and who knows what else.
Outside the Matrix, when I get into data, it sometimes looks like this:
Sometimes (okay – public data – often) this:
On really bad days, it looks like this:
And I spend a lot of time doing this:
But then, sometimes, I get lucky and get this:
This is a
nice reasonably tidy data set. While it has numerous levels (more to come), it tells me this, so I don’t end up counting the wrong thing. A lot of the data sets I get handed from my teammates look a bit like this. Some of them even make a lot of nice folders.
So, why does this matter? The shape of the data affects the analysis you can do. Let me repeat – these rows and columns can make your life easy or difficult, depending on how they’re made.
What is shape?
If you look around there’s lots of posts on tall versus wide data (Brittany Fong for one, and I’m sure there’s others I’m forgetting – sorry). You can use columns and put loads of numbers in there and have fewer rows (the SPSS kids love this), or you can have more rows and less columns (presto, Tableau!). This, in a nutshell, is shape.
But it goes deeper.
How you decide what makes a row or column influences how you display it. Certain set ups make it easier to do certain types of analysis. I generally encourage my clients to balance between flexibility (all the ways we can analyze the data) and ease (being able to make analyze in a particular way). What you want to avoid: the dreaded single data source for each chart as you lose the ability to facet your data. And, it’s like crowding 20 elephants in 1 room. The first one is hilarious. By the time you shove the 5th elephant in there, it’s monstrous.
Tableau does wonders with low-level, or transactional, data. It can lump it up (aggregate, if you want to be proper, but you came to the toddler’s house – we’re going lumpy) however you like and it does magic. 1 row of the data usually has some type of meaning. Maybe 1 row is a call, a daily summary, or an interaction on a web page. The kids over in data warehousing call this the grain of the data – I think they’ve been playing wood shop too much. Maybe I’m I’m still at the data shed stage.
But, I’ve come to realize, many businesses have their own form of the data shed. Maybe, it’s some data that comes from some process that never got optimized or from some process typically named after an animal with some nod to numbers in front of it (yes, DataFerrett, I’m calling you out).
The data shed gets messy, just like a real shed. After extensive hunting, I found some data hidden under the cobwebs:
This data looks tidy. At least compared to the original (see above in messy data), I can toss this in Tableau and be on my merry way. It gives me a breakout by show at the show level, so 1 row = 1 show. Check.
I plug into the matrix and this is what I get:
This is a problem, kids. A big, massive ‘herd of elephants in the room’ type problem. It didn’t look like a problem, but now I see that half the information I need lives in the measure names area – the demographic breakouts of populations and the AQH (average quarter hour people) versus regular people. If I want to break this down by population versus target and averaged people versus regular people (this is a TV thing), I can’t. No dice at all. I have the data – it’s all there. But the format prevents me from doing anything useful.
Fortunately, this data is small. I can sit there and cut and paste magic into Excel and make it work. Someday, Tableau will give me Project Maestro. Teknion has given me Alteryx, but I haven’t the slightest in how to work it (yet).
Copy and paste gets me nice trim data from a format side. I’ve dropped this down to 2 measures, population and rating. I’ve added new dimensions for demographic and type of person (it seems some people are averaged into quarter-hour persons versus others who get to be whole people. Now I know what really happens when my head is in the clouds…)
Hitting the magic refresh button gets me this:
Now I can see I’ve dropped down from 8 (not including Tableau generated ones) to 2 (same deal) measures and up from 4 dimensions to 6. This gives me more flexibility to slice and dice. Also, with Tableau, I can add columns (what shows up in the dimension and measure windows), but not rows (the data inside those fields). More rows means better automation as the data updates, as you don’t have to keep adding columns to the equation.
This is one way of putting the data together. I could also leave AQH versus the regular persons separated. and have 4, not 2 measures. It all depends on business rules (would AQH populations be examined with regular people or is it apples and oranges?), questions around the data (would certain things ever be added together or compared), and long term uses (that whole flexibility versus ease question).