A Linguistics-Centric Take on Data Literacy

Ask anyone in data these days, and the term du jour is ‘data literacy’. We define this in a myriad of ways, often centering on the use and comprehension of all things data. The most common part of this is the end result – the charts, dashboards, or infographics created – but sometimes, we also consider what it takes to get there, such as the raw data itself.

My background is in interpreting, and my training spent a fair whack of time delving into linguistics. That included near-esoteric definitions of language parts (phonemes, morphemes, and sign language parameters), but also written forms of language. How and why did we start writing in the first place? How is it we learn to read? What should we consider for bilingual reading? These ideas directly affect how I see data literacy.

What is literacy?

Merriam-Webster defines being ‘literate’ as:

With this, we see nuance already. Most data literacy discussions fall somewhere around 1b (“reading and writing”) and 2c (“competency within a specific skill”), with a flavor of 1a (“cultured”) added. These slight variations add to some of the confusion. The suffix -acy turns the adjective into a noun for the state of being literate. It’s a broader focus and usually includes terms like population.

We sometimes try to fold reading and writing (1b) and competence (2c) into the same idea. Let’s dig in more.

A (Horribly Brief) History of Literacy

When we think of the history of writing, we often assume everyone writes and has for a long time. Even today, there are cultures that solely preserve their history orally. Of the estimated 7,111 known languages (living and dead), somewhere around 44% do not have a known written form. Sometimes, that means there is some level of writing, but it’s not standardized. Other times, another language is used for documents and writing, so things still get written.

Some of the earliest documents use a variety of symbols for writing, such as cuneiform and hieroglyphics some 5,000 years ago. Written Chinese is the oldest writing system still in use today. Languages with a written system may influence other non-related languages to mirror their writing style, despite differences in phonemes. The Greeks can blame the Phoenicians for inspiring their alphabet and we, in turn, can blame the Greeks. Incidentally, we can also thank/blame the Greeks for including vowels as letters versus hiding them as tick marks (aka diacriticals if you want to be fancy), as is more common in languages like Arabic and Hebrew (from your non-fluent alphabet-familiar friend, these are hard at first…maybe later they get easier).

Matt Baker, Useful Charts.

As more people interact, we’ve sought to preserve more languages in writing. This is in part because many of us move less, sit more, and need something to do. Socrates was not a fan of writing. He thought it made us stupider. Lucky for us, Plato quietly disagreed (and took notes).

Learning to Read

So, we see how written language has evolved and we recognize it’s a newer skill in the broad range of human history. We also know there’s great variety in how it’s done. Great, let’s look at the mechanics of language learning.

This is your brain. Okay, probably not, since it’s missing the cerebellum and you’re reading this. This is a brain (sans cerebellum) looking at the letters b and d. Without phonemes attached, this brain thinks one is just flipped a different way and it may as well be the same. Hooray, pre-literate brain.

Let’s take this brain to school. We break reading down into a series of tasks, such as recognizing letters and matching them to their sounds. In English, this is much harder than it looks. While doing that, we also typically teach a series of words to be recognized by sight alone – these are typically words like ‘and’, ‘the’ and even ‘look’.

Take a look (sight word!) at the timeline here: fall Kindergarten vs spring, then the same pattern for 1st. After that, we increment to 3rd (assume spring from here out), 5th, and 8th. These are typical benchmarks in the US and the points at which we administer a whole slew of tests.

By the end of first grade, most kids recognize the letters as different. They may not get the word right, but you can count on them to sound out the start of the word and, with decent accuracy, the end of the word. This part counts towards that ‘reading and writing’ (1b) definition. From there, we expand and build vocabulary via a variety of strategies, including one of the last, extrapolation.

Extrapolation is a really neat skill as it relies heavily on using prior knowledge. This is where reading and writing also start to include a bit of that ‘cultured’ (1a) definition. Notice how even by 8th grade, we’re only seeing around 80% of the kids sampled demonstrating proficiency in this task. Extrapolation requires exposure to vocabulary, knowledge of the topic, and the ability to use closure skills to fill the gap. Think about this in the context of unfamiliar charts or data. We’ll come back to this.

Creating connections

We’ve seen that literacy requires a broad range of skills. What does this do to the brain?

The brain compartmentalizes information, much as we like to apply normalization to table structures in databases. It builds synapses that act as keys to each of these parts. When all of these areas have connections, we find reading and writing easier. The more we train that loop, the more pathways we create and the better we typically do.
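For readers less familiar with the database half of that analogy, here is a minimal sketch of what normalization with keys looks like. It only illustrates the table idea, not how the brain actually works. The table names, IDs, and the one-letter-one-sound mapping are all made up for illustration (real English spelling is far messier).

```python
# Hypothetical, hand-made "tables": each kind of information lives in one place.
letters = {1: "b", 2: "d", 3: "a"}            # letter_id -> written shape
phonemes = {10: "/b/", 11: "/d/", 12: "/æ/"}  # phoneme_id -> sound
letter_to_phoneme = {1: 10, 2: 11, 3: 12}     # the keys that link the two tables


def sound_out(word):
    """Follow the keys from each letter shape to its sound."""
    shape_to_id = {shape: letter_id for letter_id, shape in letters.items()}
    return [phonemes[letter_to_phoneme[shape_to_id[ch]]] for ch in word]


print(sound_out("bad"))
print(sound_out("dab"))
```

Without the linking table, b and d are just two shapes. The connection is what gives each one a distinct sound.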

These models rely on alphabet-based languages and spoken languages. What happens if one of those conditions changes?

Deaf Children and Literacy

As an interpreter, one topic we discussed was reading and Deafness. At that time, roughly 60% of the people in my program would likely enter the school system as educational interpreters. With the advent of video relay, I suspect that number has shifted drastically. We still learned a fair bit about literacy.

We’ve seen from above that most educational models rely on building phonemic awareness. This model assumes you’ve had exposure to spoken English. What do you do when that language isn’t accessible?

Some of the latest research continues to highlight that the best way to teach literacy to Deaf students is to build natural first language fluency (translation: American Sign Language in this context). By having comparable fluency in the first language (ASL), those skills can be transferred to reading (English). Some strategies include drawing comparisons to techniques and intentionally building vocabulary. This sounds a lot like building towards extrapolation techniques using both languages.

Study after study highlights that greater ASL fluency helps improve literacy levels. This includes having access to advanced ASL constructs, such as storytelling and classifiers (these are awful to learn as second-language learners, by the way). Additionally, Deaf children themselves often try to find comparable models, whether trying to find parallels in specific signs from ASL or using what they know of English phonemes (this part will vary widely based on when they went deaf, exposures to languages and modes, and general differences).

What we can take from this is that fluency in another language helps, rather than hinders. This, too, will become important.

What does this have to do with data?

The deep dive above helps us understand literacy in its various forms. There’s the technical element of reading and writing, the assessment view of proficiency, and a cultured view that includes exposure, experience, and acclimation. We can take this view a few ways.

Data Visualization as a Linguistic Construct

There are some debates around whether charts and graphs count as language-like creations. Various views of this highlight that charts have a semantic layer (smarter people here and my take here), so there’s more meaning behind a chart than what we see. I blame this for most of the debates on chart selection.

Consistent with definitions of language supplied by Baker-Shenk and Cokely, charts are arbitrary with (potential) grammatical signals. We’ve seen changes across time to more abstracted products, as well as the effects that culture has on them. Charts are not a primary form of communication, but they are used by a community to interact (using a primary language) to transmit ideas, emotions, and intents. Lastly, there is a long and established history of using charts to highlight cultural aspects.

If a language exists in only a written form, is it a language or simply a form of communication?

Reading and Writing Data

If we take the path that data visualization is a true language, we assume:

  1. It has rules and grammar specific to it, independent of other languages
  2. It has a culture
  3. There are universals within data visualization that exist outside of other languages (viz practices all around the world would share these characteristics outside of languages used.)

I won’t lie, I personally am not ready to make this leap. From what I see so far, other languages play a noteworthy role and local cultural artifacts (e.g., color) are a factor as well.

The Assessment View of Literacy

In some circles, this is also called functional literacy. It’s what’s needed to survive in society. Sometimes, this is phrased as grade-level reading, which goes back to some of our reading charts and skills. Our standard benchmark is 8th grade (where we expect the majority to have extrapolation skills).

When we deal with technology, this is how we often define literacy. It’s the ability to navigate around a specific area. We speak of being computer literate, internet savvy, and so forth. With data, we struggle to define the exact benchmarks around the following parts:

  • Input
  • Storage
  • Modeling / Preparing
  • Defining
  • Analyzing
  • Output

To be data literate, how much do I need to understand about data inputs?

USDA Agricultural Census Survey

This example is fairly straightforward and I know what’s being collected. How about this example?

I don’t know what exactly Google wants to collect, how often, and what else goes into this. I have to know this exists to opt out. Is this a part of data literacy?

Storage is another big issue. How data is kept (and modeled) can affect the analysis. Retention policies are a part of this discussion as well, which delves into ethics (my other favorite topic).

Data modeling (including ad hoc prep) is where people often start the discussion. It goes somewhere along these lines: “you have to know how to shape the data to analyze it.” Shaping preferences vary widely.

Here’s what I feel gets missed: defining. What is and isn’t this dataset? Can it really show what I’m trying to understand? To some of the points above, what isn’t collected and why? Do I understand all the terms used in this data set or is there jargon (and possible nuance I’m missing)? This part also dovetails with ethics. Often, data is about people and our findings can have an effect far beyond our anticipation.

The other part that’s hairy in literacy is analyzing. Depending on inputs (what did I collect?), storage (how much of it do I have?), models (what linkages can I make?), and definitions (do I know what I have and what’s missing?), analyzing becomes far more complex. Without the above framework, we miss a lot. It’s like reading phonemes without understanding the words (which some of us can do since we studied various alphabets).

The final piece of this is the output, which would include charts, vizzes, graphs, dashboards, and whatever else you want to call data-made products. Often, data literacy conversations are framed around the ability to read without the ability to make. Is this really literacy?

If we look at reading again, we can see some patterns:

  1. We first start identifying letters by their sounds.
  2. We move from phonemes to words.
  3. We move from words to concepts and grammar, thus evaluating the whole piece.
  4. This entire process also includes writing.

Is a chart equivalent to a word or a phoneme?

Data Literacy: A Third Tier Skill

Linguistics relies heavily on the ability to break up language into meaningful bits. Phonemes are meaningful in that we select some and not others across languages. Identifying these bits helps us find languages that share sounds and identify rules around word construction. ‘Glick’ doesn’t have meaning, but is allowable under English rules. ‘Vladrufsky’ would not appear in English, but might be allowable in Russian. People get paid good money to study this and make languages for TV programs like Game of Thrones (oh, what missed career opportunities I’ve had).

Words can apply to a concept or serve to link other words. ‘A’ not only links certain words, but specifies a thing is singular and not specific. You can’t have a dogs and, if you’ve mentioned a specific dog before, it’s likely the dog. We like to call this semantics.

When dealing with charts, we have a few things at play: the chart, the data, the level of aggregation, and the categories driving the aggregation.
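As a rough sketch of how those pieces interact, the snippet below takes the same data and aggregates it by two different categories. The order records and the `profit_ratio_by` helper are hypothetical, made up purely for illustration, and the profit ratio here is assumed to be total profit divided by total sales within each group.

```python
from collections import defaultdict

# Hypothetical order records: (month, category, sales, profit)
orders = [
    ("2019-01", "Furniture", 200.0, 20.0),
    ("2019-01", "Technology", 300.0, 90.0),
    ("2019-02", "Furniture", 100.0, -10.0),
    ("2019-02", "Technology", 400.0, 100.0),
]


def profit_ratio_by(records, key_index):
    """Sum sales and profit per group, then divide the sums."""
    totals = defaultdict(lambda: [0.0, 0.0])  # group -> [sales, profit]
    for record in records:
        group = record[key_index]
        totals[group][0] += record[2]
        totals[group][1] += record[3]
    return {group: profit / sales for group, (sales, profit) in totals.items()}


by_month = profit_ratio_by(orders, 0)     # category driving the aggregation: month
by_category = profit_ratio_by(orders, 1)  # same data, different category

print(by_month)
print(by_category)
```

The rows never change, yet the two results tell different stories: one is a trend over time, the other a comparison between product lines. A reader needs to know which category and level of aggregation sit behind the chart to read it properly.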

Sometimes, we make it more complicated by throwing in various calculations, filters/restrictions, and visual tricks (think logarithmic scales). To read this chart, you need a few things:

  1. Fluency in a written language (English in this chart)
  2. Familiarity with numbers, math, and business (‘profit ratio’)
  3. Some level of graphicacy or chart knowledge (line for over time, dots as call-out)

Some graphicacy comes from math classes or statistics. Other times, it comes from far scarier places, like learning on the job, which is – if you ask me – how we got here in the first place, some 2,000 words later.

To truly understand charts, we need a language (first tier) and numeracy (second tier) to reach the third tier – data literacy. Without this, we evaluate charts similar to the way an untrained eye does art. We either like it or we don’t and we rarely do much beyond look.

When we lack training, we also fall into habits, and habits win over reason every time. Habits are easy and require less thought. I can learn alphabets and produce words without knowing what they are – this does not make me fluent by any stretch of the imagination. I can learn to play the piano by rote without ever learning to read music – this does not make me proficient. And we can see charts without having a clue what they mean or what we should do. Without a solid base, we lack the ability to extrapolate meaning.

The bigger problem, though, in data literacy is not charts, but understanding data in the first place. Call these the phonemes of data. We live in a world where so many of our actions generate a data point. How many people understand this? If they don’t understand this and what data exists, along with the relationships that can be created, do people have informed consent around their data?

If we, as analysts, take this data without looking at the process that created it, what are we creating? Call this using sight words. Sure, we can get by, but we’re missing so much. Beyond the data at rest, there are forms, processes, and people who create the bits and bytes we examine. There are pieces missing. There’s nuance in specific terms – do we truly understand these or are we hoping others will?

We cannot talk about data literacy without delving into ethics. Call this extrapolation and reading comprehension. Beyond fluency, we have effects beyond our imagined reach. This returns to the ‘cultured’ definition (1a) from before. We have to understand things beyond the data to be effectively literate in our charts.

When we deal with ethics, we dive headfirst into practice professional discussions. Doctors, lawyers, and interpreters are all examples of practice professionals. Which leads us to this next, very painful question…

Is data literacy for the masses possible?

When we look at something as complex as healthcare, we see a close relative of data. Like data, health problems are usually below the surface. They have a myriad of complexities and patterns, which ideally a trained professional can identify. Can individual patients contribute to this process? Yes, absolutely.

In the US, The Joint Commission provides rankings of hospitals and other facilities. What they say matters to patients, providers, and hospitals. Their assessments can affect reimbursements and policies. The Joint Commission also spends a ton of time trying to educate patients directly and indirectly through the hospital. They were instrumental in shifting the doctor-patient relationship from one where the doctor held all the power to one where patients could speak up and help frame treatments and diagnoses.

Yet, many people struggle to navigate the healthcare system. As an interpreter, I helped people understand their conditions. Each person had to develop a mental model for what was fairly abstract and removed. Some did better than others, likely because they had enough of the fundamental building blocks. This lack of health literacy – which includes cultures, norms, and other barriers to change – has a direct correlation to outcomes.

You can also find a full presentation here.

The Joint Commission first started their ‘Speak Up’ campaign in 2002. I spent time around it as early as 2007 and spent a few years directly educating. As of 2019, this initiative is still in play.

Data literacy is no different. It, too, is highly abstract. What data is meaningful? What patterns highlight something? When people don’t have a mental model for data, they often ignore it or acquiesce to whatever is easiest. Some give in to powerlessness. This lack of data literacy – which is affected by culture, norms, and other barriers to change – also affects outcomes.

Beyond the reading of charts, we have to help others learn to speak up about their understanding of data. We have to provide a framework that gives mental models to what likely feels like The Matrix to so many people. Our world is growing ever more complex and abstract. So much exists in the digital space, but few can tangibly imagine it.

It is not enough to ‘read and write’ data. We must be educated and cultured about what data is, what it can do, and then how it’s interpreted and displayed. Only then do we have the extrapolation skills needed to be functionally literate. We must be competent to exist in this space where data – like writing – is everywhere. That’s what data literacy for the masses means to this linguist.
