BigData vs BigScience: as different as Day and Night

I am reading Steve Lohr's book « Data-ism » these days. Something struck me in the first chapter. Steve explains (first chapter!) that BigData follows the principle « measure everything first, ask questions later » (we could even say « find questions to ask later »). Boom. In one sentence, this could summarise what BigData « is ». Funny enough, a French-language book I am reading in parallel (a bit pompously subtitled « BigData: think man and world differently ») says the exact same thing, again to my surprise…

Science, and more specifically Big Science (which is not a new and buzzy expression, but rather a term coined by historians), is just the plain opposite: the questions it explores have been asked since humans have been humans. And we are still struggling to gather the data about them. When I say struggling, I mean it. Not only is the data itself difficult to obtain (the tools required are so expensive that only states or large organisations can afford them), but the right to obtain it is the result of fierce competition and a months-long process.

I propose here to quickly illustrate the fantastic contrast between the data in BigData and the data in a Big Science such as modern observational astrophysics. They are like Day and Night. And you'll see that there is a key reason for that difference.

Let's start with some context. For a long time, the use of the word 'data' was probably concentrated in spots like universities, research laboratories, probably some governmental agencies, etc. Then came the fantastic revolution that is the Internet, and the enormous increase in the amount of data collected and processed today is one of its by-products. Not only does BigData induce important technology shifts, but as quantity becomes a quality, it induces totally new ways of considering, and using, this data. One could even say that because the data is too large to be seen (read: grasped in a single look), new ways of thinking arise.

I think that BigData can be… seen quite simply as the result of a combination of successive technical advances. First, the fundamental step of interconnecting all computers (the www). Second, storage becoming a quasi-unlimited and extremely cheap resource (remember the first day of Gmail, with 1 GB free, April 1st 2004?). Sounds normal today, but it was not in 2004! And finally, the mobile revolution, where everybody lives with a computer-data-sensor-communicator all the time (2007, the iPhone). The sensor is key here.

With all these technological advances combined, the amount of data produced by everybody starts to grow exponentially. The scale of that amount created new problems, unseen before, about the collection, transmission, storage, organisation and structure, mining, sharing and analysis of that data. Most of that is now covered by the generic term 'Cloud' (although some people keep warning us that there is no such thing as the cloud…). Hence the new tools developed for it: for instance MongoDB, a non-relational database, or Hadoop, a Java framework for working with thousands of nodes and petabytes of data. Ok. #BigDeal.

It is apparently such new and exciting stuff for many people in the IT industry that they look like they are discovering a toy so large they couldn't have dreamed about it before. The world, or more precisely (and this is key) the vision they have of the (economic) world, starts to be quantifiable. It's all about startups, algorithms, data « intelligence », companies being reshaped to « accept data » or re-organised around data, data marketing, GAFA (Google, Apple, Facebook, Amazon), etc. Not to mention the bazillions of dollars it drives (more on that later).

Measure everything first, ask questions later. How strong the contrast is with #BigScience!

Astrophysics is a Big Science because of the size and scale of the tools it requires to perform even its first mission: exploration. Telescopes, observatories, satellites, instruments, arrays of gigantic antennas (thanks to the Photo Ambassadors of E.S.O. for sharing the amazing pictures that make this site beautiful), etc. are large and expensive tools requiring very specialised and trained people from many different disciplines. Astrophysics is also known to produce bazillions of bytes of data. (Astrophysics is not the science producing the largest amount of data, however. That title probably belongs to… CERN, where the web was created. But that's for another post.)

For the sake of comparison with the amount of data an iPhone can capture (dozens of GBs per month), and how easy it is to share it with various services, let's briefly outline the process a single astronomer has to go through to obtain his/her data, taking the example of major world-class observatories. It goes as follows. Every 6 months, a « Call for Proposals » is opened. Proposals are very specialised forms to be prepared by a scientific team. Each must contain (and this is absolutely key) a meaningful combination of, first and foremost, science motivation (is the subject worth the effort?), operational conditions (are you in the right place at the right moment, with the right observing conditions? Think coordinates, brightness, the phase of the phenomenon, moon phase, latitude, etc.) and technical capabilities (are the telescope, its instrument and the requested sophisticated configuration the right ones to possibly answer the scientific question you want to ask, assuming this question is valid?)…

It is hard. You usually need an advanced university degree to reach that level. Simply because it is very hard to ask sensible questions.

Let's assume you have this combination, and you managed to write it down in a very precise yet concise way and… on time. Your proposal is reviewed by a panel of 'experts' judging all proposals and ranking them (a necessarily imperfect choice; tensions and conflicts arise regularly, but nobody has proposed a better way so far). A threshold is set, and the number of nights above that threshold (assuming there are no conflicts of targets, dates or configurations between proposals) is compared to the number of nights available. And well, there are only about 180 nights in 6 months of time, once special / technical nights are accounted for. So only the top proposals are granted. At that stage you still have 0 bytes of data.
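To put numbers on that oversubscription (the « pressure factor » mentioned just below), here is a toy calculation in Swift; every figure is illustrative, not from any real observatory:

```swift
// Toy illustration of telescope-time oversubscription. All numbers made up.
let availableNights = 180.0   // ~6 months, once special/technical nights are removed
let requestedNights = 900.0   // hypothetical sum of nights asked by all proposals

// The "pressure factor": how many times the telescope is oversubscribed.
let pressure = requestedNights / availableNights
print("Pressure factor: \(pressure)")   // 5.0: only ~1 request in 5 can be granted
```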

Let's assume that, given a pressure factor between 3 and 20, your proposal is granted. Woohoo! Congratulations. Between 3 and 9 months after that day (i.e. between 6 and 12 months after you submitted your proposal), you travel to the observatory. There, you are welcomed by support astronomers who guide you through the complex process of preparing your observations, with dedicated custom software, giving you the latest details about the instrument, the observatory, the constraints, the news, etc. Assuming the observatory, telescope and instrument are all running well (far from guaranteed in small observatories), you finally cross your fingers for the wind to remain low, the humidity not too high and, more importantly, for there to be no clouds. And if by bad luck clouds prevent you from obtaining a single bit of data, too bad for you! Thanks for coming, but other people are waiting in line for the coming nights. Please come back next year. If your new proposal is accepted.

If all goes well (and yes, a majority of the time it does go pretty well, fortunately), you work through that night, manipulating complex opto-mechanical instruments with dedicated software, to obtain… raw data. That is, data full of imperfections, full of problems, more or less manageable. Once back home, you'll have to work, sometimes for entire weeks, to transform this raw data into usable, true scientific data. That's it. Only now can the work of thinking about your scientific question continue…

Isn't the contrast amazing? The scientific data in this case is just extremely expensive in terms of energy, effort, risk of failure, people involved, and time spent preparing and justifying it! Day and Night.

In the end, to me, this difference all comes from the difference of approach I mentioned at the beginning. BigScience has tons of open questions, but they are very, very hard to answer, and they require very sophisticated tooling and observational performance just to scratch the surface of the question. BigData is flowing through our devices, and yet we are still looking for questions to ask of it. But what questions? New business questions? Some call « revolutions » things that are merely innovations, or more simply progress in a field…

I may be a bit too simplistic here. There are indeed domains very important for humans (such as health, quite obviously; too obviously?) that would benefit from a « measure first, think later » approach (that's the first example in Steve Lohr's book). So the key difference is not so much the volume of data, nor its variety or velocity (the 3Vs). BigScience is accustomed to at least the first one.

No, what struck me most when reading about BigData or DataScience is the absence of two words: knowledge, and understanding. It seems that BigData doesn't work to increase knowledge. I do not mean detecting « patterns » (which some are so fascinated by) in highly noisy data. I mean reproducible knowledge, gained through the understanding of the underlying phenomena. You know, science…

Calling « science » something that is not focused on knowledge and understanding is a bit problematic to me. The rush to the new gold era of BigData and DataScience (which is real, with skyrocketing amounts of investment) will always appear slightly artificial to traditional (academic) « scientists ». For sure, if they embrace the business side of the force, scientists leaving academia have definite experience at thinking about data, hence a critical opinion about it.

Talking about thinking… (ok, these are only Tweets).

« Fitting » isn't meaningful by itself.

When it's beautiful, it tends to be over-interpreted. I wish everyone could follow Edward Tufte's courses, or read his absolutely stunning and brilliant books… See what I mean?

Thinking is slow. Thinking right is very slow. Business is fast. Decision-making must be fast, especially(?) in business. Bringing the two together is an interesting challenge!

As a matter of fact, doesn't all of that make perfect sense? It may be a stroke of luck that BigData mostly contains very noisy, hard-to-reproduce, poorly meaningful « patterns ». That's the only way thinking with its 3Vs (Volume, Variety, Velocity) is humanly possible. Just imagine that amount of data, at that rate, but as meaningful as what BigScience data can produce… That's an interesting question for the next post!

Introducing Data Archives @ arcsecond.io

This stuff is probably infinite. How exciting! The scope of arcsecond.io is expanding, and I have additional ideas for the future… In the meantime, here comes basic support for Data Archives: ESO's, MAST, etc. I already had a quick connector written for ESO programme summaries, hence that's the only thing available for now. For instance:

api.arcsecond.io/1/archives/ESO/073.D-0473(A)/summary/
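As a sketch of how one might query that endpoint from Swift (assuming HTTPS and a JSON payload; treat the parsing as a placeholder, since the exact schema may evolve):

```swift
import Foundation

// Minimal sketch: fetch the archive summary shown above and print it.
let url = URL(string: "https://api.arcsecond.io/1/archives/ESO/073.D-0473(A)/summary/")!

let task = URLSession.shared.dataTask(with: url) { data, response, error in
    guard let data = data, error == nil else {
        print("Request failed: \(error?.localizedDescription ?? "unknown error")")
        return
    }
    // Decode as generic JSON rather than a fixed model, since the schema may change.
    if let summary = try? JSONSerialization.jsonObject(with: data) {
        print(summary)
    }
}
task.resume()
```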

But more will come! I am already working on access to data sets (hopefully with some local weather data…), in a completely integrated manner, with access to objects, observatories, etc.

arcsecond.io v1 (alpha)

Here is a coming service: arcsecond.io (which replaces the ugly eclipt.is I mentioned in a previous blog post). APIs aren't ready yet, but I intend to give access to basic properties of objects by the end of the month. At least, you can enjoy the front end. 🙂

arcsecond.io aims at providing browsable APIs where each piece of astronomical information has its own Uniform Resource Locator (a.k.a. URL). SIMBAD, NED, exoplanet.eu, JPL Horizons and astronomical telegrams are among the sources of information that arcsecond.io intends to unify. Because we believe in the force of integration.
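To give the flavour of that one-URL-per-resource idea, here is a hypothetical sketch. None of these exact paths is a published API; they only illustrate the naming scheme I am aiming for:

```swift
// Hypothetical resource URLs, for illustration only: the principle is that
// every object, observatory or record gets one stable, browsable address.
let base = "https://api.arcsecond.io/1"
let examples = [
    "\(base)/objects/HD189733/",       // one astronomical object
    "\(base)/observatories/lasilla/",  // one observing site
    "\(base)/exoplanets/51PegB/",      // one exoplanet record
]
examples.forEach { print($0) }
```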

SwiftAA: first release

I am happy to announce the first release of SwiftAA, a Swift / Objective-C framework wrapping P.J. Naughter's C++ implementation of the Astronomical Algorithms. It contains all the Objective-C wrappers, as well as the Swift headers and files, for you to access the whole AA+ implementation of the Astronomical Algorithms in Swift.

It is for iOS 8+ as well as OS X Yosemite+.

There is a very basic Swift Playground included. It will be developed in the coming months to exemplify the use of the AA.
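Until the playground is fleshed out, here is the kind of call site such a wrapper enables. This is a hypothetical sketch: the actual type and method names in this first release may well differ.

```swift
import SwiftAA  // assuming the framework is built and linked

// Hypothetical usage: plain Swift calls, with the AA+ C++ code doing the
// actual computation underneath the wrappers.
let julianDay = JulianDay(Date())       // the current instant as a Julian Day
let moon = Moon(julianDay: julianDay)   // a solar-system body at that instant
print(moon.eclipticCoordinates)         // a position computed by AA+
```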

Ongoing projects status…

Too much heat here these days… (it's midnight, and I am coding on my terrace…). And I need to make a quick summary of what's going on @ onekilopars.ec…

First and foremost, iObserve. An updated version in the 1.4 series is in preparation. It will include the usual small fixes, one clear usability fix for the Fluxes converter and, more importantly, a major improvement to the Coordinates converter: no more loading of the full DB; an import/export mechanism will replace it. Since it is not the most fun thing to write, it takes a bit of time. Thanks to some recent input from a user, I'll also try to include exoplanet transit times in the Airmass and Visibility plots.

Some of the fixes in iObserve will probably need to be ported to iObserve Touch on iPad. The problem here is that the whole app needs to be updated (again). Apple is always moving forward, and I made a choice for that app long ago that prevents me from going much further today (buhhh). Yes, MultiTasking won't be possible with the current configuration (with that super-cool-awesome-but-very-custom split view controller allowing you to have a master table – with tabs! – on the left, the detail on the right, a top times bar, and the whole master thing wrapped up in a popover when in portrait). The planning for this is: undefined, unfortunately, even if it would be fun to adopt more recent technologies.
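For readers wondering what « very custom » means here, a stripped-down sketch of such a hand-rolled split container is below. It is nothing like the real iObserve Touch code (no tabs, no times bar, no portrait popover), just the general shape of the approach:

```swift
import UIKit

// Minimal sketch of a hand-rolled split container: a master pane on the
// left and a detail pane on the right, laid out manually instead of using
// UISplitViewController. Manual geometry like this is exactly what system
// multitasking (side-by-side apps) cannot adapt on its own.
final class CustomSplitViewController: UIViewController {
    let masterViewController: UIViewController
    let detailViewController: UIViewController

    init(master: UIViewController, detail: UIViewController) {
        masterViewController = master
        detailViewController = detail
        super.init(nibName: nil, bundle: nil)
    }

    required init?(coder: NSCoder) { fatalError("init(coder:) is not supported") }

    override func viewDidLoad() {
        super.viewDidLoad()
        for child in [masterViewController, detailViewController] {
            addChild(child)
            view.addSubview(child.view)
            child.didMove(toParent: self)
        }
    }

    override func viewDidLayoutSubviews() {
        super.viewDidLayoutSubviews()
        let masterWidth = view.bounds.width * 0.35
        masterViewController.view.frame = CGRect(
            x: 0, y: 0, width: masterWidth, height: view.bounds.height)
        detailViewController.view.frame = CGRect(
            x: masterWidth, y: 0,
            width: view.bounds.width - masterWidth, height: view.bounds.height)
    }
}
```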

Next, the iObserve 2 story… The Desktop client is well underway. I have multiple windows, and a lot more power features for big screens. But I wanted two major things for iObserve 2: a dedicated backend, and some advanced algorithms for the planets (and not only the Moon, as I have today).

The backend is in preparation… but guess what: along the way, I found it very interesting, and it opens tons of new possibilities. It is in a very pre-alpha state, and it will be called… arcsecond.io. It is very unstable, as I am struggling a bit with the Django-Python-HTML-Bootstrap-Heroku-Postgres stack. But I am actually pushing code into the repo now and then.

As for the algorithms, the obvious reference is Jean Meeus' Astronomical Algorithms textbook. I have a copy of the book, and I implemented some of it. But there is an existing implementation that is a lot more complete than mine, and a lot more tested: AA+, by P.J. Naughter. After some discussion, he agreed to let me put his code into an open-source GitHub repo, for me to write an Objective-C wrapper… which actually became a Swift wrapper! It's open-source, so have a look!

Of course, once my hands were in Swift, I thought I could do some interesting stuff with it. And a very nice ESO guy suggested that I write an app to monitor the ESO archive live, every night. That's a great occasion to start a new app with all the amazing new stuff iOS 8 introduced. So here it will be: SkyDataFlows.

Of course, to read the ESO DB correctly, one must parse its VOTable output. And there was no VOTable parser written in Swift… so here it is! SwiftVOTable, open-source, boom.
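VOTable being an XML dialect, the heart of such a parser boils down to walking XML elements. Below is a bare-bones sketch built on Foundation's XMLParser, extracting FIELD names and TABLEDATA cells. It is only the principle: SwiftVOTable's actual API is richer and is not shown here.

```swift
import Foundation

// Bare-bones VOTable reader: collects FIELD names and TABLEDATA rows.
final class MiniVOTableParser: NSObject, XMLParserDelegate {
    private(set) var fieldNames: [String] = []
    private(set) var rows: [[String]] = []
    private var currentRow: [String] = []
    private var currentCell = ""
    private var insideCell = false

    func parse(_ data: Data) {
        let parser = XMLParser(data: data)
        parser.delegate = self
        parser.parse()
    }

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        switch elementName {
        case "FIELD": fieldNames.append(attributeDict["name"] ?? "?")  // column metadata
        case "TR":    currentRow = []                                  // new table row
        case "TD":    currentCell = ""; insideCell = true              // new cell
        default:      break
        }
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        if insideCell { currentCell += string }
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        switch elementName {
        case "TD": currentRow.append(currentCell); insideCell = false
        case "TR": rows.append(currentRow)
        default:   break
        }
    }
}
```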

And besides this, waiting in line, is my iTunes-for-FITS-files app (which benefits from the progress made on iObserve 2), and some other web projects with a friend of mine. Dev is a lifestyle, a mental life with ups and downs! #yeah