Covariations vs Correlations in BigData

Recently, I wrote about how #BigData and #BigScience differ, having almost opposite approaches to looking at data. Needless to say, I remain skeptical about the varying quality of what’s being said and written about data, big or not. As a matter of fact, my main concern is what one can infer, or pretend to infer, from that data. Data helps us think about the world, yes. Yet it isn’t the whole story. Reading posts on the Internet and the sky-rocketing amount of new material about it, one must honestly ask oneself: is Data, especially since it became Big, an object of knowledge by itself?

In this post, I want to discuss the difference between covariations and correlations. In a context of data-driven decisions (a concept I’ve read about in the two books I mentioned last time), failing to distinguish covariations from correlations might lead to unexpected consequences, to say the least. The least damaging being, probably, to remain ignorant after all…

In my previous post mentioned above, I cited this sequence of tweets:

Here is the image of the original tweet:

The image of the original tweet.

Talking to strangers, and telling them they are wrong. What else is the Internet about?…


(xkcd: Duty Calls)

Anyway. As the days passed, I couldn’t help but keep thinking about this « fitting » problem. I think I have a (natural? normal? scientist’s?) reflex saying that data isn’t telling the whole story, but is just a means, among others, to climb the ladder and stand on the shoulders of giants. The existing story we build upon, with the help of data, is the knowledge and understanding we have of our world, and the history of the discoveries that led to its current state. And that knowledge is based on correlations. Correlations that were observed, checked, verified, and understood (if I may coin a phrase, in computer science we would say these correlations were entirely ‘debugged’, since debugging is understanding).

But correlations and covariations don’t have the same meaning! Simply stated, a covariation is the observation that when one parameter varies, another does as well, and vice versa. Covariations are (I love this rule from mathematics:) necessary but not sufficient to make correlations. Covariations are merely a hint about something happening under the hood. Covariations can have various ‘shapes’ or, in other words, can be represented graphically with various figures. The shape of that figure is certainly an excellent hint about the underlying phenomenon, but it is not the explanation by itself. On the other hand, understanding means giving a cause, or an explanation, to a covariation. While the study of covariations is full of lessons, it isn’t usually enough to reach an explanation. And it is not a matter of quantity. Correlations live in a different space. ‘Data-points fitting’ isn’t equal to understanding (obvious, isn’t it? Or not?). Stated simply, a correlation integrates the corpus of knowledge, while a covariation integrates the corpus of observations.
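To make the distinction concrete, here is a minimal sketch in plain Python. The scenario and all numbers are invented for illustration: two quantities covary strongly (a high Pearson coefficient, which statisticians call a ‘correlation coefficient’ but which, in this post’s vocabulary, only establishes a covariation) solely because both depend on a hidden third quantity.

```python
import random
from math import sqrt

random.seed(42)
n = 1000

# A hidden common cause: e.g. daily temperature (invented numbers).
temperature = [random.gauss(20.0, 5.0) for _ in range(n)]

# Two quantities that each depend on temperature, but not on each other.
ice_cream = [3.0 * t + random.gauss(0.0, 2.0) for t in temperature]
sunburns = [1.5 * t + random.gauss(0.0, 2.0) for t in temperature]

def pearson(xs, ys):
    """Pearson coefficient: normalised measure of linear covariation."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

r = pearson(ice_cream, sunburns)
print(f"Pearson coefficient = {r:.2f}")  # strong covariation
```

The coefficient comes out close to 1, yet forcing ice-cream sales up would do nothing to sunburns: the covariation is a hint that something shared drives both, while the explanation — the correlation, in the sense of this post — is ‘temperature’.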

What amazes me most in this journey into BigData, as I navigate through it and the dozens of articles about it in every corner of the Internet, is – again – the very weak presence of words such as ‘understanding’, ‘knowledge’, ‘research’, ‘Nature’, etc. They are utterly dominated by the presence of ‘insights’, ‘obvious’, ‘noise’, ‘pattern discovery’, and also ‘revolutionary’ and ‘potential’; words that belong a lot more to marketing than to, well, science. (This little game about the number of occurrences of words in BigData articles should prompt me one day to perform a semantic and quantitative analysis of them… with BigData tools, of course!)

Recently, I stumbled upon a truly excellent website that illustrates very well the general considerations outlined above. It is entitled A Visual Introduction to Machine Learning. (Machine Learning, for those who aren’t really immersed in BigData, is one of the key techniques for manipulating the data. See the detailed Wikipedia entry about it.) The article is really well crafted (even if it doesn’t fully work in Safari – prefer Chrome or Firefox). Please, to follow what’s next in this post, read it (~10 min) and come back. I’ll wait.


In the meantime, here is a small visual interlude, with the first image of an exoplanet. Are you seeing a large white-blueish dot and a small red dot too? How do you know they are not only dots? And what about the fundamental process of crafting meaning by placing, in a spatially-structured manner, variations of colours in a limited rectangular 2D space, also known as an ‘image’? How could this process even make sense to you? Isn’t an image already a graphical representation of a lot of data?

Knowing how the electromagnetic fields of light truly combine to form constructive fringes that lead to a measurement of the spatial coherence along a line projected onto a plane would already change forever your vision of what an image is.


Image Credits: E.S.O.


Okay, back to our business. If you have freshly read the article, you probably have an idea of where I am heading in this post.

The article beautifully exemplifies the use of a Machine Learning technique. In this particular example, it allows, seemingly, to classify members of a dataset into one of two categories: a home is either in New York or in San Francisco. We have 7 different types of data points, literally: ‘elevation’, ‘year built’, ‘bathrooms’, ‘bedrooms’, ‘price’, ‘square feet’, ‘price per sqft’.
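To see what such a classifier actually does under the hood, here is a minimal sketch of its most basic building block: a single-feature threshold split (a ‘decision stump’). The data points below are invented for illustration, not taken from the article’s dataset:

```python
# Each point: (elevation_in_m, city). Invented values for illustration only.
homes = [
    (3, "NY"), (8, "NY"), (12, "NY"), (20, "NY"),
    (60, "SF"), (90, "SF"), (130, "SF"), (15, "SF"),  # one 'low' SF home
]

def best_threshold(points):
    """Try a split between each pair of consecutive values and keep the
    one that misclassifies the fewest points (rule: 'SF' if above)."""
    values = sorted(v for v, _ in points)
    best = (None, len(points) + 1)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        wrong = sum(1 for v, city in points if (city == "SF") != (v > t))
        if wrong < best[1]:
            best = (t, wrong)
    return best

threshold, errors = best_threshold(homes)
print(f"predict SF above {threshold} m, with {errors} misclassified point(s)")
```

A real decision tree simply repeats this greedy threshold search recursively on each side of the split, and across all seven features. Note that nothing in the procedure knows anything about cities, elevations, or geography: it only counts mismatches.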

Before saying anything about it, the immediate question that should obviously have struck you as well is: why not simply obtain the geographical coordinates of these houses?!! Given the problem they set themselves to solve, that would be the immediate and logical question to raise. (We note that the goal seems to change a little bit between the introduction – ‘distinguish homes in New York from homes in San Francisco’ – and the first section – ‘determine whether a home is in San Francisco or in New York’ – which is not really the same question. Anyway.)

But okay, that’s an example. And examples are often a little bit silly, for the sake of demonstration; they rarely demonstrate intelligence, but rather skills.

What this example beautifully illustrates is that machines are powerful, but not smart. And those who pretend here and there that « BigData will revolutionise the way we think about man or the world » are probably seeking power rather than intelligence…

Here is a list of problematic points that the article does not even gently touch upon:

  • How were the data types chosen? 
  • Are the data types relevant to the question? Is there any other relevant quantity that could help solve the problem? (okay, okay…)
  • Are the data types enough to solve the question?
  • How were these data points obtained? Measured? Is there any error associated with them?
  • Are there any statistical biases? Instrumental ones? Data isn’t just numbers, you know…
  • Were the data points taken all at the same time? How? By how many different people? Were there some outliers?
  • How do we know that the distributions of the points of each type can be compared? Are all these types meaningful to the question?
  • How do we know that all points have the same weight?
  • How do we know the problem is ‘solved’? 
  • Actually, is the problem well- or ill-posed?

Okay, there are more than that, but enough. 
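Just to show how much even one of these points matters — the outlier question — here is a toy sketch with invented measurements, where a single bad data point drags the mean far from the bulk of the data while the median barely notices:

```python
from statistics import mean, median

# Nine plausible measurements of the same quantity, plus one
# instrumental glitch at the end (all values invented).
measurements = [9.8, 10.1, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0, 98.0]

print(f"mean   = {mean(measurements):.2f}")    # dragged by the glitch
print(f"median = {median(measurements):.2f}")  # robust to it
```

A fitting procedure fed with such unexamined points will happily converge — to something. Whether that something means anything is precisely what the list of questions above is about.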

There is an obvious conclusion to all of this. But I am never sure I haven’t just missed something obvious myself. I would formulate a conclusion that seems somewhat obvious; but if it is obvious to other people too, why do we (I) never hear of it?

Conclusion: data analysis does not amount to data science, even less to science pure and simple. And when you see ‘exciting’ data-scientist positions in companies that list a number of technologies you have to master before applying, simply be aware that science is probably everything but these required technical skills.

 

Introducing exoplanets, observing sites, telegrams and converters @ arcsecond.io

What an exciting time! I am very happy to announce that partial implementations of exoplanets, observing sites, telegrams and converters are now available at arcsecond.io! Moreover, People, Finding Charts and Publications are also in the pipeline and will be announced later, once available.

Let’s go through it. First, exoplanets. I’ve been using the exoplanet.eu catalog for a long time in iObserve, and the current implementation in arcsecond.io uses that source. By itself, this catalog is already an aggregation of multiple sources. But the real value of arcsecond.io is integration. Hence, the exoplanets will be a consolidation of what exoplanet.eu returns with the content of the NASA Exoplanet Archive.

To access an individual planet, follow this example:

api.arcsecond.io/1/exoplanets/51 Peg b

To access the whole catalog, leave the exoplanet name empty.
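Note that a name like ‘51 Peg b’ contains spaces, which have to be percent-encoded in an actual request. A minimal stdlib-Python sketch of building such URLs (the HTTPS scheme and the helper name are my assumptions, not part of the API documentation):

```python
from urllib.parse import quote

BASE = "https://api.arcsecond.io/1"  # assuming the API is served over HTTPS

def exoplanet_url(name=""):
    """Build the exoplanet endpoint URL; an empty name means the whole catalog."""
    return f"{BASE}/exoplanets/{quote(name)}"

print(exoplanet_url("51 Peg b"))  # spaces become %20
print(exoplanet_url())            # whole catalog

# To actually fetch, one could then do e.g.:
#   from urllib.request import urlopen
#   data = urlopen(exoplanet_url("51 Peg b")).read()
```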

Second, observing sites. The story of these observing sites is amazingly long, since they were a concern from the very origin of iObserve, when it was only a … dashboard widget, 10 years ago, with OS X Tiger! This is a new implementation now, and the definitive one, in arcsecond.io. It will allow us to build a community-based source of information about observatories and observing sites.

For now, the static list of observatories available in iObserve is available through GET only. But a lot more will come. Here is an example:

api.arcsecond.io/1/observingsites/La Silla

Same as before, leave the site name empty to get the whole list. You can also filter the list by continent (more filters will come):

api.arcsecond.io/1/observingsites?continent=Europe

Next, telegrams! I am very happy to have started this section. Troubles will come with the IAU Circulars, but at least here is a partial implementation of the Astronomer’s Telegram (GRBs will also be added soon):

api.arcsecond.io/1/telegrams/ATel/7899

And finally, converters. This is the logical completion of the sections, following iObserve’s own sections. For now, I have only basic support for converting coordinates. For instance:

api.arcsecond.io/1/converters/coordinates/ra/3.456/dec/70.2455
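For the curious, here is roughly what such a coordinate converter has to do for the example above: turn decimal degrees into sexagesimal form. This is a generic sketch, not arcsecond.io’s actual implementation, and I am assuming the RA value in the URL is given in degrees:

```python
def ra_deg_to_hms(ra_deg):
    """Right ascension: 360 degrees = 24 hours, i.e. 15 degrees per hour."""
    hours = ra_deg / 15.0
    h = int(hours)
    minutes = (hours - h) * 60.0
    m = int(minutes)
    s = (minutes - m) * 60.0
    return h, m, s

def dec_deg_to_dms(dec_deg):
    """Declination: plain degrees, arcminutes, arcseconds, with a sign."""
    sign = "-" if dec_deg < 0 else "+"
    dec = abs(dec_deg)
    d = int(dec)
    minutes = (dec - d) * 60.0
    m = int(minutes)
    s = (minutes - m) * 60.0
    return sign, d, m, s

h, m, s = ra_deg_to_hms(3.456)
print(f"RA  = {h:02d}h {m:02d}m {s:05.2f}s")
sign, d, am, asec = dec_deg_to_dms(70.2455)
print(f"Dec = {sign}{d:02d}d {am:02d}m {asec:04.1f}s")
```

A production converter would of course also validate ranges (0–360° for RA, −90° to +90° for Dec) and handle rounding at the 60-second boundary; this sketch skips both.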

Stay tuned for a lot more!

Image Credits: E.S.O. / B. Tafreshi (twanight.org)

iObserve 1.4.2 just pushed for review!

iObserve 1.4.2 has just been pushed for review. This update fixes two small issues with manual coordinates for new objects, as well as the webservice URL for individual exoplanets. This should prevent the planet mass and other important parameters from disappearing upon update. Oh, and the buttons (and other small visual tweaks) in Mavericks have been restored to their previous shine… It is interesting to realise that many users are still on Mavericks! I guess you are all waiting to be sure that iraf, MIDAS and ds9 work as usual on Yosemite…