How We Collect Data Matters

Recently, I saw something on Twitter that made me stop. Too often in the data field, we talk about data as a raw product. I can’t say I’ve ever fully agreed with that sentiment, but this thread put it far more eloquently than I ever could.

There’s a whole thread from Arrianna Planey, and it’s worth reading in full.

The data we touch is curated. Someone is making choices and, too often, we analysts are not in the room when those decisions are made. More often than not, we’re handed a pile of data divorced from its origins and real-life workflow. With that removal, it’s too easy to see the data as the baseline without seeing what wasn’t included or even considered.

I’m manually collecting data right now for a personal task. It’s meaningful to me, but so many elements are subjective. I judge the outcomes; I determine what’s important and what gets collected. Far too much of the narrative still lives only in my head.

[Image: a color-coded bar chart with no context and no legend; some bars overlap while most vary in distance. Caption: “The great mystery data set.”]

How I collect it also matters. Yes, I have timestamps. How accurate are they? Usually within about 15 minutes, but sometimes they can be off by as much as an hour if I forget to enter the data. How much does this affect my data? I use the same clock, so at least that’s consistent.

Other values are even less concrete. I have several yes-no variables and a few very subjective fields on status. These data elements are totally biased. Even my goal biases my data. In no way is this data “raw”; it’s been cultivated, pruned, fermented, and even shaped before it makes it into Tableau.

Even the more concrete tools like Fitbit don’t always tell the truth. Did I get credit for that ride in the car where I was jostling around just enough to register steps? You bet, and you’d better believe I’m letting it count toward the goal. And, if you’re like me, the phone spends half the day elsewhere. No data collection method captures the whole truth. My cute hand-curated data set has been appended to and backfilled after the fact; how accurate is it? For my purposes, accurate enough, but I won’t win any prizes for it.

Even our brains are not objective. You see, visual perception is learned. We learn that shadows are not literal objects, but extensions of the things that cast them. We learn to identify shapes and colors as meaningful. Even vision can lie to us, and not in any universal way.

If our eyes aren’t objective, our brains are even worse. Culture shapes how we think, what we perceive, and how it’s framed. Lived experience? Same deal. Science evolved from ‘natural philosophy’ as a way to explain the world. We’ve matured the process into a way to extend our own experiences, but those experiences still bias the work. Just get 10 scientists of different backgrounds together and ask how they’d do the experiment. Just like with dashboard design, you’re likely to get some varied answers.

When we understand the bias that drives our data collection, we can start correcting for it. Maybe we bring varied perspectives into the room to check the work and build counterweights where harm may creep in. Or, for data like mine, we spell out clearly what the bias is and use the data within reason. I’m not publishing my data or using it beyond my very limited scope, but let me say: a little hand-collected data has gone a long way and made a huge difference in what I’m doing.

Are you collecting data? Here are some things to consider that blend my ideas and Arrianna Planey’s:

  1. What’s the reason you’re collecting this data?
  2. What methods are you using?
    • What limits exist in the methods? (ex: you can only track what you can see)
    • What affects the entry? (I left the phone behind)
    • What are you consciously not collecting?
    • How subjective are your measures? If someone else collected the data, how much would the answers differ? Can you explain why?
  3. How does data collection fit within the process?
    • If the job is X, at what points is data collected? At what points is it not?
    • How much does form factor influence the answer? (ex: drop-downs vs. open text) What should be consistent vs. free-form?
  4. When you have subjective fields, what differentiates them? Is there a way to define consistency? (See the sketch after this list.)
  5. How can you disclose your judgements and methods?
  6. What level of accuracy is appropriate?
    • Are you clearly disclosing how accurate/inaccurate your data is?
    • Have you vetted this with others to validate interpretations and find biases?
  7. Have you checked with people affected by your data to ensure you have the whole picture?
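
To make numbers 4 through 6 concrete, here’s a minimal sketch in Python. Everything in it is hypothetical (the Entry fields, the Status codebook, and the methods note are invented for illustration, not pulled from my actual tracker); the point is simply that subjective definitions and accuracy caveats can travel with the data instead of living only in the collector’s head.

```python
# Minimal sketch: carry the subjectivity of a hand-collected data set
# along with the data itself. All names here are hypothetical.
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Status(Enum):
    # A tiny "codebook": each subjective value gets a written definition
    # so it stays consistent instead of living in my head (number 4).
    GOOD = "went as planned"
    MIXED = "partly as planned; outcome judged acceptable"
    POOR = "did not go as planned"


@dataclass
class Entry:
    recorded_at: datetime       # when I typed the entry in
    happened_at: datetime       # my best guess of when it actually happened
    status: Status              # subjective, but defined in the codebook above
    entered_late: bool = False  # flags entries that may be off by up to an hour
    notes: str = ""             # narrative that would otherwise stay in my head


# Disclosure travels with the data set, not just in my memory (numbers 5 and 6).
METHODS_NOTE = (
    "Timestamps are usually accurate to about 15 minutes; entries flagged "
    "entered_late may be off by as much as an hour. Status is judged by a "
    "single collector against the Status codebook."
)

entry = Entry(
    recorded_at=datetime(2019, 9, 17, 18, 45),
    happened_at=datetime(2019, 9, 17, 18, 30),
    status=Status.MIXED,
    notes="Phone left at home until noon; step data incomplete.",
)
print(METHODS_NOTE)
print(entry)
```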

Number 7 is last, but it’s probably the one we miss the most. If you need more proof, count how many people can’t get automated faucets to work. Yes, kids, there is a clear, bias-driven reason some of us are invisible to automation, and it’s not that we’re soulless or from another realm. We make these same mistakes with data collection.

So, when you look at raw data, I urge you to ask what counts, to find what isn’t included, and to spot what’s been pre-decided in your data.

Cheers from the kitchen. It’s time to eat this stew.

[Image: cilantro, garlic, and turmeric for a stew. Caption: “You only think this is raw. The truth is hidden underneath.”]

1 Comment

  • Michelle Kosmicki
    September 18, 2019 11:39 am

    OMG. Yes! The only way to know your data is to clean and prep it….just like your food. Healthy Data = Healthy Analysis = Healthy Viz