I've been working on data-intensive projects recently, and I'm sure there is a point in every computational researcher's life when they begin to think about the data they are generating: how it will be stored, how it will be tracked, what code and circumstances were used to create or collect it, how it will be linked to the results, and how someone who questions the research will reproduce or validate it. Most large projects plan for these issues early in their life cycle, but for many smaller projects they seem to be an afterthought, if they are thought about at all.
I recently reviewed a paper that was being submitted for publication. The authors, while on target with their overall thesis, backed it with broad claims whose only support was a handful of charts with no units displayed. There was no detail on how many times the tests were run to produce each chart. The paper did not say whether the values were averages over many runs, or how much variance the result set contained. There was no pointer to the raw data set or to a detailed test archive. The tests had inherent variability in their source (Internet latency and contention), yet no explanation was given of how this was accounted for in the published results. Essentially, as a reviewer, I was being asked to sign off on a set of assertions for which I had nothing beyond the credibility of the authors as validation. If I were simply a reader of the publication and were critical of the position being presented, I would have no means of learning more or of accurately countering the authors' claims (assuming that the goal of scientific publication is not only the dissemination of knowledge but also the constructive debate of theories, leading to a community-refined understanding of reality). Maybe I am naive, but I think we can do better.
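To make the complaint concrete, here is a minimal sketch of the kind of summary I would have liked to see beside each chart: the number of runs, the unit, the mean, and the spread. The helper name `summarize_runs` and the sample latencies are my own illustrative assumptions, not anything taken from the paper under review.

```python
import statistics


def summarize_runs(latencies_ms):
    """Summarize repeated measurements of one test (hypothetical helper)."""
    n = len(latencies_ms)
    if n < 2:
        # A single run says nothing about variability.
        raise ValueError("need at least two runs to estimate variability")
    return {
        "runs": n,
        "unit": "ms",
        "mean": statistics.mean(latencies_ms),
        "stdev": statistics.stdev(latencies_ms),  # sample standard deviation
        "min": min(latencies_ms),
        "max": max(latencies_ms),
    }


# Illustrative values only; these are not real measurements.
sample = [100.0, 110.0, 90.0, 105.0, 95.0]
summary = summarize_runs(sample)
print(summary)
```

Even a small table like this, published together with the raw values, would let a reader judge whether a difference between two bars on a chart is larger than the noise in the measurements.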
I recently attended a workshop where one of the speakers (a researcher at Google whose name and position I don't remember) mentioned, almost in passing, that he and a colleague had been discussing the need for a reality in which every experiment can be reproduced and independently validated at any point in time. He quickly admitted that this was a lofty aspiration and that many hurdles would have to be overcome to make it possible, but I found myself strongly agreeing with the core sentiment. As a relative newcomer to the scientific community, I've been a bit surprised at the shroud of secrecy that most researchers place around the raw data from their work. There seems to be a prevailing desire for self-aggrandizement over fostering collaborative solutions to hard problems. I'm probably missing something, but I find myself hoping for a scenario in which data is published early and often, then critiqued and validated by others, pointing the community at large towards solutions rather than individuals towards papers.
While thinking about this problem area, I was reminded of Project Trident, an effort by Microsoft Research to solve a similar problem. As I recall, the platform bundles the variables, originating source, and resultant data together in a repository for subsequent validation and archival. I hope that they are successful in this effort and that similar tools are developed in the community. Ideally, the scientific community will embrace the “cloud” not only for large-scale computation but also as a means of building the kind of platform described earlier in this post, where anyone with an interest could browse existing experiments and re-execute them under constraints similar to the originals. Then, as the collective imagination grows, the community can experiment with other permutations or derivative works.
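The bundling idea does not need a large platform to get started. As a rough sketch, and explicitly not a description of how Project Trident stores anything, a small project could write a record like the one below next to each data file. The field names, the `build_record` and `verify_record` helpers, and the placeholder commit identifier are all assumptions of mine.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def build_record(parameters, source_version, raw_data):
    """Bundle inputs, code version, and a data fingerprint (hypothetical format)."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "parameters": parameters,          # the variables used for this run
        "source_version": source_version,  # e.g. a version-control commit id
        "python_version": platform.python_version(),
        "data_sha256": hashlib.sha256(raw_data).hexdigest(),
        "data_bytes": len(raw_data),
    }


def verify_record(record, raw_data):
    """Check that the data still matches the fingerprint stored in the record."""
    return hashlib.sha256(raw_data).hexdigest() == record["data_sha256"]


# Illustrative data and parameters only.
raw = b"run,latency_ms\n1,100.0\n2,110.0\n"
record = build_record(
    {"runs": 2, "target": "example.invalid"},
    "hypothetical-commit-id",
    raw,
)
print(json.dumps(record, indent=2, sort_keys=True))
print(verify_record(record, raw))
```

Anyone who later obtains the raw file can recompute the hash and confirm they are looking at the same data the results were derived from, which is a first small step towards re-executing the experiment.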