The numbers never lie but they can be misleading

Spurious correlations, statistical hurdles, and mathematical hiccups

There’s a risk with Big Data: encourage everyone to use the tools to explore the company’s data lake, and people will do exactly that.

It’s what you want, right?

But the risk is that people may think they have found some significant, industry-changing insight, one that will not only catapult the business to the pinnacle of its industry but also give them the chance to make the name for themselves that they so richly deserve.

Spoiler alert: every data scientist will agree that finding truly valuable and realistic insights is never easy.  Actually, it can be very hard.

After all, while the numbers never lie they can be misleading.

Prepare to be disappointed

Data Analytics requires a great deal of exploration, trial and error, disappointment, and perhaps grey hair.  Yet, the business expects you to deliver the holy grail, week after week.  They must think you’re Gandalf.

The answers aren’t just there, waiting for you to find them.  They need to be teased out and, once discovered, they need to be tested until they are rock solid:

  • They must stand up to the rigour of daily use
  • They must be reliable and trustworthy
  • Their boundaries must be discovered

Yet the excitement of finding something that appears to be incredible is hard to suppress.  When you’ve spent days, weeks, or even months exploring data and finding nothing, and then come across results that appear mind-blowing, it’s hard not to think about the possibilities.

Even for the most seasoned Data Scientists.

But they know there’s more to do to prove or disprove these findings.  As most statisticians will attest, correlation does not imply causation.  Just because two sets of data appear to be related doesn’t mean that they are, and spurious correlations can be the undoing of many hypotheses.

Spurious Correlations

Spurious correlations are statistical anomalies in which two data sets move together, so they look related, but have no real connection.  The match comes from coincidence or from a hidden third factor.
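To see how easily coincidence produces a convincing match, here is a minimal Python sketch. It is an illustration, not a method taken from any particular tool: the series are made-up random walks standing in for unrelated metrics in a data lake, and the names `pearson`, `random_walk` and `strongest_pair` are hypothetical helpers written for this example. It compares every pair of 30 unrelated series and reports the strongest correlation it finds.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def random_walk(rng, length):
    """A series whose steps are independent coin flips (+1 or -1)."""
    level, walk = 0, []
    for _ in range(length):
        level += rng.choice((-1, 1))
        walk.append(level)
    return walk


def strongest_pair(series):
    """Return (|r|, i, j) for the most strongly correlated pair of series."""
    return max(
        (abs(pearson(series[i], series[j])), i, j)
        for i, j in combinations(range(len(series)), 2)
    )


rng = random.Random(42)  # fixed seed so the run is repeatable
# 30 made-up, mutually unrelated "metrics" (an assumption for illustration)
walks = [random_walk(rng, 50) for _ in range(30)]
best_r, i, j = strongest_pair(walks)
print(f"Strongest of {30 * 29 // 2} pairs: series {i} vs {j}, |r| = {best_r:.2f}")
```

None of these series has anything to do with any other, yet with 435 pairs to choose from, the best match tends to look impressive. The more data people explore, the more of these coincidences they will stumble across.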

There are many examples around us, and some real clangers are worth sharing.

There’s even a website dedicated to spurious correlations:

The number of people who drowned after falling out of a fishing boat correlates with the marriage rate in Kentucky

The number of people who died falling out of their bed correlates with the number of lawyers in Puerto Rico

My favourite: Cheese consumption and death by bedsheet entanglement

Perhaps eating cheese before bed causes really bad nightmares?

The pitfalls and problems with statistical analysis

It’s obvious that these examples are spurious, yet many correlated data sets are not so obviously spurious.

Any data analysis has to be rigorously tested before reliable conclusions can be drawn.  There are statistical techniques that help check whether a correlation reflects a genuine relationship, but spurious correlations are just one of the pitfalls of working with big data.
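As one example of such a check, and not necessarily the one a given data scientist would reach for, here is a rough Python sketch of first differencing: instead of correlating the raw values, correlate the period-to-period changes. Two series that merely share an upward trend look strongly related in their levels, but that relationship usually fades once the trend is removed. The `sales` and `signups` series are invented for illustration, and `pearson` and `first_differences` are hypothetical helper names.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def first_differences(xs):
    """Change from each observation to the next."""
    return [b - a for a, b in zip(xs, xs[1:])]


rng = random.Random(7)  # fixed seed so the run is repeatable
n = 200

# Two invented metrics that both drift upwards over time but whose
# fluctuations are generated independently of each other.
sales = [0.5 * t + rng.gauss(0, 1) for t in range(n)]
signups = [0.3 * t + rng.gauss(0, 1) for t in range(n)]

r_levels = pearson(sales, signups)
r_changes = pearson(first_differences(sales), first_differences(signups))

print(f"Correlation of raw values: {r_levels:.2f}")
print(f"Correlation of changes:    {r_changes:.2f}")
```

A high correlation in the raw values alongside a weak one in the changes suggests the two series simply share a trend. It does not prove anything on its own, and a correlation that survives differencing still says nothing about causation, but it is a cheap first sanity check before getting excited.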

It’s great to encourage the organisation to explore your big data, but make sure a data scientist is on hand, either to help avoid these statistical and mathematical pitfalls or to console the budding data analysts when they fall back down to earth.

After all, while the numbers never lie they can be misleading.