Covariations vs Correlations in BigData

Recently, I wrote about how #BigData and #BigScience differ, taking almost opposite approaches to looking at data. Needless to say, I remain skeptical about the varying quality of what’s being said and written about data, big or not. As a matter of fact, my main concern is about what one can infer, or pretend to infer, from that data. Data helps us think about the world, yes. Yet it isn’t the whole story. Reading posts on the Internet and the sky-rocketing amount of new material about it, one must honestly ask oneself: Is Data, especially since it became Big, an object of knowledge by itself?

In this post, I want to discuss the difference between covariations and correlations. In the context of data-driven decisions (a concept I’ve read about in the two books I mentioned last time), failing to distinguish covariations from correlations might lead to unexpected consequences, to say the least. The least damaging being, probably, to remain ignorant after all…

In the previous post mentioned above, I cited this sequence of tweets:

Here is the image of the original tweet:


Talking to strangers, and telling them they are wrong. What else is the Internet about?…


(xkcd: Duty Calls)

Anyway. As the days passed, I couldn’t help but keep thinking about this « fitting » problem. I think I have a (natural? normal? scientist’s?) reflex saying that data isn’t telling the whole story; it is just a means, among others, to climb the ladder and stand on the shoulders of giants. The existing story we build upon, with the help of data, is the knowledge and understanding we have of our world, and the history of the discoveries that led to its current state. And that knowledge is based on correlations. Correlations that were observed, checked, verified, and understood (to put it in computer-science terms, these correlations were entirely ‘debugged’, since debugging is understanding).

But correlations and covariations do not have the same meaning! Simply stated, a covariation is the observation that when one parameter varies, another one does as well, and vice versa. Covariations are (I love this rule from mathematics:) necessary but not sufficient to make correlations. Covariations are merely a hint that something is happening under the hood. Covariations can have various ‘shapes’ or, in other words, can be represented graphically with various figures. The shape of that figure is certainly an excellent hint about the underlying phenomenon, but it is not the explanation by itself. On the other hand, understanding means giving a cause, or an explanation, to a covariation. While the study of covariations is full of lessons, it is usually not enough to reach an explanation. And it is not a matter of quantity. Correlations live in a different space. ‘Data-points fitting’ isn’t equal to understanding (obvious, isn’t it? Or not?). Stated simply, a correlation joins the corpus of knowledge, while a covariation joins the corpus of observations.
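To make this concrete, here is a minimal numerical sketch (with entirely invented numbers) of a strong covariation that explains nothing by itself:

```python
import numpy as np

# Invented illustration: two quantities that merely covary.
# Both are driven by a hidden third variable (say, outside temperature),
# so they rise and fall together without explaining one another.
temperature = np.array([12, 15, 18, 21, 24, 27, 30, 33], dtype=float)  # hidden driver
ice_cream_sales = 50 + 4.0 * temperature + np.array([2, -1, 3, 0, -2, 1, -3, 2])
drowning_count = 1 + 0.3 * temperature + np.array([0.2, -0.1, 0.0, 0.3, -0.2, 0.1, 0.0, -0.1])

# A near-perfect covariation...
r = np.corrcoef(ice_cream_sales, drowning_count)[0, 1]
print(f"Pearson coefficient: {r:.3f}")   # close to 1.0

# ...yet the number alone carries no explanation: the 'correlation'
# (in the sense used in this post) only appears once the common cause
# has been identified and understood.
```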

What amazes me most in this journey into BigData, as I navigate it and the dozens of articles about it in every corner of the Internet, is – again – the very weak presence of words such as understanding, knowledge, research, ‘Nature’, etc. They are utterly dominated by the presence of ‘insights’, ‘obvious’, ‘noise’, ‘pattern discovery’, and also ‘revolutionary’, ‘potential’; words that belong a lot more to marketing than to, well, science. (Note: this little game about the number of occurrences of words in BigData articles should prompt me one day to perform a semantic and quantitative analysis of them… with BigData tools, of course!)
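For the record, the kind of quick-and-dirty word counting I have in mind could look like this (a hypothetical sketch with a placeholder file name; no BigData tools required, ironically):

```python
from collections import Counter
import re

# Hypothetical sketch: count how often 'science' words vs 'marketing' words
# show up in a BigData article saved as a plain-text file.
science_words = {"understanding", "knowledge", "research", "nature"}
marketing_words = {"insight", "insights", "revolutionary", "potential", "pattern"}

def word_counts(text: str) -> Counter:
    """Lower-case the text, split on non-letters, and tally the words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

article = open("some_bigdata_article.txt").read()   # placeholder file name
counts = word_counts(article)

print("science-ish  :", sum(counts[w] for w in science_words))
print("marketing-ish:", sum(counts[w] for w in marketing_words))
```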

Recently, I stumbled upon a truly excellent website that illustrates very well the general considerations outlined above. It is entitled A Visual Introduction to Machine Learning. (Machine Learning, for those who aren’t really immersed in BigData, is one of the key techniques for manipulating the data. See the detailed Wikipedia entry about it.) The article is really well crafted (even if it doesn’t fully work on Safari – prefer Chrome or Firefox). To follow what comes next in this post, please read it (~10 min) and come back. I’ll wait.


In the meantime, here is a small visual interlude, with the first image of an exoplanet. Are you seeing a large white-blueish dot and a small red dot too? How do you know they are not just dots? And what about the fundamental process of crafting meaning by placing, in a spatially-structured manner, variations of colours in a limited rectangular 2D space, also known as an ‘image’? How could this process even make sense to you? Isn’t an image already a graphical representation of a lot of data?

Knowing how the electromagnetic fields of light truly combine to form constructive fringes, leading to a measurement of the spatial coherence along a line projected onto a plane, would already change forever your vision of what an image is.


Image Credits: E.S.O.


Ok, back to our business. If you have just read the article, you probably have an idea of where I am heading with this post.

The article beautifully exemplifies the use of a Machine Learning technique. In this particular example, it seemingly allows classifying the members of a dataset into one of two categories: a home is either in New York or in San Francisco. We have 7 different types of data points. Literally: ‘elevation’, ‘year built’, ‘bathrooms’, ‘bedrooms’, ‘price’, ‘square feet’, ‘price per sqft’.
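For the curious, here is a minimal sketch of the kind of classifier the article animates, assuming scikit-learn and a handful of invented rows (this is not the article’s actual dataset or code):

```python
from sklearn.tree import DecisionTreeClassifier

# Invented toy data standing in for the article's dataset.
# Columns: elevation (m), year built, bathrooms, bedrooms,
#          price ($), square feet, price per sqft.
X = [
    [10, 1960, 1, 2,  900000, 1000,  900],   # meant to look like SF
    [73, 1920, 2, 3, 1300000, 1400,  929],   # SF
    [ 4, 2001, 1, 1,  750000,  850,  882],   # NY
    [ 9, 1985, 2, 2, 1100000, 1100, 1000],   # NY
]
y = ["SF", "SF", "NY", "NY"]

# A decision tree learns threshold-based splits on these columns,
# much like the visual essay's animated boundaries.
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict([[50, 1950, 2, 3, 1200000, 1300, 923]]))
```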

Before saying anything about it, the immediate question that obviously should have struck you as well is: why not simply obtain the geographical coordinates of these houses?!! Given the problem they set out to solve, that would be the immediate and logical question to raise. (We note that the goal seems to change a little bit between the introduction – ‘distinguish homes in New York from homes in San Francisco‘ – and the first section – ‘determine whether a home is in San Francisco or in New York‘ – which is not really the same question. Anyway.)

But okay, that’s an example. And examples are often a little bit silly, for the sake of demonstration; they rarely demonstrate intelligence, but rather skills.

What this example beautifully illustrates is that machines are powerful, but not smart. And those who claim here and there that « BigData will revolutionise the way we think about man or the world » are probably seeking power rather than intelligence…

Here is a list of problematic points that the article does not even gently touch upon:

  • How were the data types chosen?
  • Are the data types relevant to the question? Is there any other relevant quantity that could help solve the problem? (okay, okay…)
  • Are the data types enough to solve the question?
  • How were these data points obtained? Measured? Is there any error associated with them?
  • Are there any statistical biases? Instrumental ones? Data isn’t just numbers, you know…
  • Were the data points taken all at the same time? How? By how many different people? Were there some outliers?
  • How do we know that the distributions of the points of each type can be compared? Are all these types meaningful to the question?
  • How do we know that all points have the same weight?
  • How do we know the problem is ‘solved’? 
  • Actually, is the problem well- or ill-posed?

Ok, there are more than that, but that’s enough.

There is an obvious conclusion to all of this. But I am never sure I haven’t simply missed something obvious myself. The conclusion I would formulate seems rather obvious to me, but if it is also obvious to other people, why do we (I) never hear about it?

Conclusion: A data analysis does not amount to data science, even less to science pure and simple. And when you see ‘exciting’ data-scientist positions at companies listing a number of technologies you have to master before applying, simply be aware that science is probably everything but those required technical skills.

 

BigData vs BigScience: same as Day and Night

I am reading the book « Data-ism » by Steve Lohr these days. Something struck me in the first chapter. Steve explains (first chapter!) that BigData follows the principle « Measure everything first, ask questions later » (we could even say « find questions to ask later »). Boom. In one sentence, this could summarise what BigData « is ». Funnily enough, a French-language book I am reading in parallel (a bit pompously subtitled « BigData: think man and world differently ») says the exact same thing. And here too, to my surprise again…

Science, and more specifically Big Science (which is not a new and buzzy expression, but rather a term coined by historians), is just plainly the opposite: the questions it explores have been asked since humans have been humans. And we are still struggling to gather data about them. When I say struggling, I mean it. Not only is the data difficult to obtain in itself (the tools required are so expensive that only states or large organisations can afford them), but the right to obtain it is also the result of fierce competition and a months-long process.

I propose here to quickly illustrate the fantastic contrast between the data in BigData and that in a Big Science such as modern observational astrophysics. They compare as Day and Night. And you’ll see that there is a key reason for that difference.

Let’s start with some context. For a long time, the use of the word ‘data’ was probably concentrated in places like universities, research laboratories, and some governmental agencies. Then came the fantastic revolution that is the Internet, and the enormous increase in the amount of data collected and processed today is one of its by-products. Not only does BigData induce important technology shifts, but as quantity becomes a quality, it induces totally new ways of considering, and using, this data. One could even say that because it is too large to be seen (read: grasped in a glance), new ways of thinking arise.

I think that BigData can be… seen quite simply as the result of a combination of successive technical advances. First, the fundamental step of interconnecting all computers (the www). Second, storage becoming a quasi-unlimited and extremely cheap resource (remember the first day of Gmail, with 1GB free, April 1st 2004?). Sounds normal today, but it was not in 2004! And finally, the mobile revolution, where everybody lives with a computer-data-sensor-communicator all the time (2007, the iPhone). The sensor is key here.

With all these technological advances combined, the amount of data produced by everybody started to follow an exponential curve. That scale created new, previously unseen problems concerning the collection, transmission, storage, organisation and structure, mining, sharing, and analysis of that data. Most of this is now covered by the generic term ‘Cloud’ (although some people keep warning us that there is no such cloud…). Hence the new tools developed for it: for instance MongoDB – a non-relational database – or Hadoop, a Java framework for working with thousands of nodes and petabytes of data. Ok. #BigDeal.
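For readers who have never touched a non-relational database, here is a tiny hypothetical illustration of what ‘schemaless’ means in practice, assuming a local MongoDB server and the pymongo driver:

```python
from pymongo import MongoClient

# Assumes a MongoDB server is running locally on the default port.
client = MongoClient("mongodb://localhost:27017")
collection = client["demo"]["posts"]

# 'Non-relational' in practice: documents in the same collection do not
# need to share a schema; you store whatever dictionary you happen to have.
collection.insert_one({"title": "BigData vs BigScience", "tags": ["data", "science"]})
collection.insert_one({"author": "anonymous", "word_count": 1200})

print(collection.find_one({"title": "BigData vs BigScience"}))
```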

It is apparently such exciting new stuff for many people in the IT industry that they seem to be discovering a toy so large they couldn’t have dreamed of it before. The world, or more precisely – and this is key – the vision they have of the (economic) world, starts to be quantifiable. It’s all about startups, algorithms, data « intelligence », companies being reshaped to « accept data », or re-organised around data, or data marketing, GAFA (Google, Apple, Facebook, Amazon), etc. Not to mention the bazillions of dollars it drives (more on that later).

Measure everything first, ask questions later. How strong the contrast is with #BigScience!

Astrophysics is a Big Science because of the size and scale of the tools it requires to carry out even its first mission: exploration. Telescopes, observatories, satellites, instruments, arrays of gigantic antennas (thanks to the Photo Ambassadors of E.S.O. for sharing the amazing pictures that make this site beautiful), etc. are large and expensive tools requiring very specialised, trained people from many different disciplines. Astrophysics is also known to produce bazillions of bytes of data. (Note: astrophysics is not the science producing the largest amount of data, however. That title probably remains the property of … CERN, where the web was created. For another post.)

For the sake of comparison with the amount of data an iPhone can take (dozens of GBs per month), and how easy it is to share it with various services, let’s briefly outline the process a single astronomer has to go through to obtain his/her data, taking the example of major world-class observatories. It goes as follows: Every 6 months, a « Call for proposals » is opened. Proposals are very specialised forms to be prepared by a scientific team. A proposal must contain (and this is absolutely key) a meaningful combination of, first and foremost, science motivation (is the subject worth the effort?), operational conditions (are you in the right place at the right moment, with the right observing conditions? – think about coordinates, brightness, the phase of the phenomenon, moon phase, latitude, etc.) and technical capabilities (are the telescope, its instrument, and the requested sophisticated configuration the right ones to possibly answer the scientific question you want to ask, assuming this question is valid?)…

It is hard. You usually need an advanced university degree to reach that level. Simply because it is very hard to ask sensible questions.

Let’s assume you have this combination, and you managed to write it down in a very precise yet concise way, and… on time. Your proposal is reviewed by a panel of ‘experts’ judging all proposals and ranking them (a necessarily imperfect choice, and tensions and conflicts arise regularly, but nobody has proposed a better way so far). A threshold is set, and the number of nights above that threshold (assuming there are no conflicts of targets, dates, or configurations between proposals) is compared to the number of nights available. And well, there are only about 180 nights in 6 months of time, once special / technical nights are accounted for. So only the top proposals are granted. At that stage you still have 0 bytes of data.

Let’s assume that, given a pressure factor between 3 and 20, your proposal is granted. Woohoo! Congratulations. Between 3 and 9 months after that day (i.e. between 6 and 12 months after you submitted your proposal), you travel to the observatory. There, you are welcomed by support astronomers who will guide you through the complex process of preparing your observations, with dedicated custom software, giving you the latest details about the instrument, the observatory, the constraints, the news, etc. Assuming the observatory, the telescope and the instrument are all running well (that’s far from guaranteed in small observatories), you finally cross your fingers for the wind to remain low, the humidity not too high and, more importantly, for there to be no clouds. And if by bad luck clouds prevent you from obtaining a single bit of data, too bad for you! Thanks for coming, but other people are waiting in line for the coming nights. Please come back next year. If your new proposal is accepted.

If all goes well (and yes, most of the time it does go pretty well, fortunately), you work through that night, manipulating complex opto-mechanical instruments with dedicated software, to obtain… raw data. That is, data full of imperfections, full of problems, more or less manageable. Once back home, you’ll have to work, sometimes for entire weeks, to transform this raw data into usable, true scientific data. That’s it. Only now can the work of thinking about your scientific question continue…
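Just to give a flavour of that reduction work, here is a caricature of its very first steps, with plain numpy arrays standing in for real CCD frames (real data comes as FITS files, with bad pixels, cosmic rays, and many more calibration steps):

```python
import numpy as np

# Toy stand-ins for real detector frames.
raw_frame  = np.random.poisson(1200.0, size=(512, 512)).astype(float)
bias_frame = np.full((512, 512), 300.0)                 # electronic offset
flat_frame = np.random.normal(1.0, 0.02, (512, 512))    # pixel-to-pixel sensitivity

# The classic first steps: remove the offset,
# then divide by the normalised flat field.
calibrated = (raw_frame - bias_frame) / (flat_frame / flat_frame.mean())

print(calibrated.mean())  # the 'science' only starts after many more such steps
```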

Isn’t the contrast amazing? The scientific data in this case is just extremely expensive in terms of energy, effort, risk of failure, people involved, and time spent preparing and justifying it! Day and Night.

In the end, to me, this difference all comes down to the difference of approach I mentioned at the beginning: BigScience has tons of open questions. But they are very, very hard to answer. And they require very sophisticated tooling and observational performance just to scratch the surface of the question. BigData flows through our devices. And yet we are still looking for questions to ask of it. But what questions? New business questions? Some call « revolutions » things that are only innovations, or more simply progress in a field…

I may be a bit too simplistic here. There are indeed domains very important to humans (such as health, quite obviously – too obviously?) that would benefit from a « measure first, think later » approach (that’s the first example in Steve Lohr’s book). So the key difference is not so much the volume of data, nor its variety or velocity (the 3 Vs). BigScience is accustomed to at least the first one.

No, what struck me most when reading about BigData or DataScience is the absence of two words: knowledge, and understanding. It seems that BigData doesn’t work to increase knowledge. I do not mean detecting « patterns » (which some are so fascinated by) in highly noisy data. I mean reproducible knowledge, gained through the understanding of the underlying phenomena. You know, science…

Calling « science » something that does not focus on knowledge and understanding is a bit problematic to me. The rush towards the new gold era of BigData and DataScience (which is real, with skyrocketing amounts of investment) will appear slightly artificial to usual (academic) « scientists ». For sure, if they embrace the business side of the force, scientists leaving academia have definite experience in thinking about data, and hence a critical opinion about it.

Talking about thinking… (ok, these are only Tweets).

« Fitting » isn’t meaningful by itself. (Click on the image link – pic.twitter.com… – to see what I mean).

When it’s beautiful, it tends to be over-interpreted. I wish everyone could follow Edward Tufte’s courses, or read his absolutely stunning and brilliant books… See what I mean?

Thinking is slow. Thinking right is very slow. Business is fast. Decision making must be fast, especially(?) in business. Bringing the two together is an interesting challenge!

As a matter of fact, doesn’t all that make perfect sense? It may be a great stroke of luck that BigData mostly contains very noisy, hard-to-reproduce, poorly meaningful « patterns ». That’s the only way thinking with its 3 Vs – Volume, Variety, Velocity – is humanly possible. Just imagine that amount of data, at that rate, but as meaningful as what BigScience data can produce… That’s an interesting question for the next post!