Show Me the Data! A Case for Truly Open Data Portals

At one point in my life, ‘search’ meant hitting the library with a base-level proficiency in the Dewey Decimal System.  When all else failed, you asked the librarian.  Then came the internet and Google.  Now you ask a search bar.

Contests like the Tableau IronViz prompt deep dives into particular topics and lots of Googling.  Last year, it was food – literally, it fried my hard drive, but no one wanted that burnt crisp.  Some things aren’t too bad fried, like avocados.  Others, like hard drives, are best served chilled.

This year, it’s ‘safari,’ or anything plant or animal related.  So I go to Bermuda, figuratively, or else I’d still be on a beach hunting fried avocados.  Bermuda, well known for its waters and protected coral reefs, has lots of data.  Data on the ocean, data on cruise ships, and data on plankton.  It has so much data, it should be perfect, right?!

BEACON! It’s a ray of light in a sea of data!  They even have charts, surely, we’re on to something!  A little more navigating takes you to this delightful page:

Oh, this is looking less promising by the minute…clicking on a report gets you a PDF.  Well, Tableau 10.3 now has a PDF reader!  Except…

It’s a scanned image.  If you notice the different font sizes, this was typed.  A little more Googling finds this:

Yes, that is 428 Excel files for readings since 1988.  No, really, go count them.  Ya wanna play this game, public data?

I work at Teknion where they’ve given me Alteryx.  We’ll try this!

Lightning bolts do more than show that Harry Potter was a horcrux – they also download!  Okay, maybe it’s time to bring in the heavy hitters if I ever want to get to visualizing the data…1 expert and 2 hours later, I get this.

It’s so big, I can’t screenshot the whole thing.  But it downloads all the files. And makes 1 TDE.  Which gets me this:

Yes, you read this right, all this work for 2,857,594 rows of data.  Only on water statistics.  If I want the plankton, I’m going to need to double this work.  And I won’t start in yet on decimal dates.  Maybe there’s other data that requires less work…

The wonderful David Suzuki has all kinds of information on nature.  Let’s check his site:

The paper offers some interesting tidbits, but where’s the data?

Surely, there are ways to find data…how about some more research papers?

It’s easier to find data about the paper than usable data in the paper…

And, of course, there are research papers calling for better data management and accessibility:

Did you catch that?  How about “concerted governmental attention to the challenges is critical” or my favorite about a system “that embraces technology, standards, open access, and education.”  Not sold yet?

Are you seeing a pattern yet?  We need more modern storage of data for scientific data sets and engagement with the greater community.  Not just so real scientists can complete an easier review, but so contest-crazy people like myself can download one data set and answer a few questions, instead of several sources for 1 question each.  (Do I seem like I have an axe to grind on this?  Maybe…)

This isn’t to say there’s no data.  There’s tons of it.  DataONE offers 136,798 data sets to download, including triplicates of research, because ¯\_(ツ)_/¯.  Warning: most come in several files, with decimal dates, and offer a very isolated view into a creature.  Blame the nature of these studies.  But at some point, we need a way to roll this up to a more unified view, besides BYODB (bring your own database).
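
For anyone who hasn’t met decimal dates: they store a timestamp as a fractional year, so 1988.5 lands around the middle of 1988.  Here is a minimal Python sketch of one way to unroll them.  It assumes the fraction means “share of that calendar year elapsed,” which may not hold for every data set (some use a fixed 365.25-day year, so check the metadata first).  The function name is my own invention, not something from any of these portals.

```python
from datetime import datetime


def decimal_year_to_datetime(decimal_year: float) -> datetime:
    """Convert a fractional year (e.g. 1988.5) to a datetime.

    Assumption: the fractional part is the share of that specific
    calendar year that has elapsed, so leap years are 366 days long.
    """
    year = int(decimal_year)
    fraction = decimal_year - year
    start = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    # Scale the length of this particular year by the fractional part.
    return start + (end - start) * fraction


print(decimal_year_to_datetime(1988.5))
```

Because 1988 is a leap year, the halfway point comes out as midnight on July 2 rather than July 1, which is exactly the kind of detail that makes these columns fun to join across sources.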

We need to strive for better public systems to allow a greater understanding of our world.  Target and Facebook (which I’m not even on) have more insight on me as a human being than I have of any given ecosystem of this world.  It’s not for lack of data, but it is for lack of cohesiveness, distributive models, and access to data at a wide array of levels.

The concrete jungle may have more to offer me.  They even make a nice case for how we could (nudge, nudge) share data.  And a fair amount of it – sit down for this – is at a low level, meaning we can find patterns and roll it up as needed.  That may just be wild enough for this contest.  No decimal dates required.

PS – if any libraries or scientists want to help me find tidy data around climate change and its impact on a whole system, well, I might just have ice cream to trade.  Or maybe I should do shirts…