Data and Published Results in Scientific Research

12. February 2010

I've been working on data-intensive projects recently, and I'm sure there is a point in every computational researcher's life when they begin to think about the data they are generating – how are they going to store it, how is it going to be tracked, what code/circumstances were used to create or collect it, how is it going to be associated or linked with the results, and how will someone who questions the research reproduce or validate it? Most large projects plan for these sorts of issues early on in their life cycle, but for many smaller projects it seems like an afterthought – if thought about at all.

I recently reviewed a paper that was being submitted for publication. The authors, while on target with their overall thesis, backed it with some broad claims whose veracity was supported only by a few charts (with no units displayed). There was no detail regarding the number of times the tests were run to produce each chart. It wasn't explained whether the values represented an average over many runs, or what level of variance was present in the result set. There was no pointer to the raw data set, a detailed test archive, etc. The test they had run had inherent variability in the source (Internet latency/contention), yet no explanation was given as to how this was accounted for in the published results. Essentially, as a reviewer, I was being asked to sign off on a set of assertions for which I had nothing beyond the credibility of the authors as validation. If I were simply a reader of the publication and held a critical opinion of the view being presented, I would have no means of learning further or accurately countering the authors' claims (assuming that the goal of scientific publication is not only the dissemination of knowledge but also the constructive debate of theories, leading to a community-refined understanding of reality). Maybe I am naive, but I think we can do better.
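To make the complaint concrete, here is a minimal sketch of the kind of summary I would have liked to see next to each chart: how many runs there were, the mean, and the spread. The function name and the latency values are hypothetical and purely illustrative; they are not taken from the paper.

```python
import statistics


def summarize_runs(samples_ms):
    """Return run count, mean, sample standard deviation, min and max."""
    if len(samples_ms) < 2:
        # A single run says nothing about variance.
        raise ValueError("need at least two runs to estimate variance")
    return {
        "runs": len(samples_ms),
        "mean_ms": statistics.mean(samples_ms),
        "stdev_ms": statistics.stdev(samples_ms),  # sample (n - 1) deviation
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }


# Hypothetical latency measurements in milliseconds (made-up values).
latencies_ms = [100.0, 110.0, 90.0, 105.0, 95.0]
summary = summarize_runs(latencies_ms)
print(summary)
```

Reporting even this much, along with units and a pointer to the raw samples, would give a reviewer something to check beyond the authors' word.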

I recently attended a workshop where one of the speakers (he was a researcher at Google, but I don't remember his name/position) mentioned, almost in passing, that he and a colleague had been discussing the need for a reality in which every experiment can be reproduced and independently validated at any point in time. He quickly admitted that this was a lofty aspiration and that many hurdles would have to be overcome to make it possible, but I found myself strongly agreeing with the core sentiment. As a relative newcomer to the scientific community, I've been a bit surprised at the shroud of secrecy that most researchers place around the raw data from their work. There seems to be a prevailing desire for self-aggrandizement over fostering collaborative solutions to hard problems. I'm probably missing the boat somehow, but I find myself hoping for a scenario in which data is published early and often – critiqued and validated by others, pointing the community at large towards solutions rather than individuals towards papers.

While thinking about this problem area, I was reminded of Project Trident – an effort by Microsoft Research to solve a similar problem. As I recall, this platform bundles the variables, originating source, and resultant data together in a repository for subsequent validation and archival. I hope that they are successful in this effort and that similar tools are developed in the community. Ideally, the scientific community will embrace the “cloud” not only for large-scale compute, but also as a means to build the kind of platform described earlier in this post, where anyone with an interest could browse through existing experiments and re-execute them with constraints similar to the originals. Then, as the collective imagination grows, the community can experiment with other permutations or derivative works.
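As a rough illustration of the bundling idea (and emphatically not Project Trident's actual format or API, which I am only recalling from memory), a provenance record might tie the input parameters to hashes of the source and the resulting data. Every name and value below is a hypothetical stand-in.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    """Hex digest used to fingerprint source code and result data."""
    return hashlib.sha256(data).hexdigest()


def build_provenance_record(parameters, source_code, result_data):
    """Bundle parameters with fingerprints of the code and its output."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "parameters": parameters,
        "source_sha256": sha256_hex(source_code),
        "result_sha256": sha256_hex(result_data),
    }


# Hypothetical experiment inputs and outputs, for illustration only.
record = build_provenance_record(
    parameters={"runs": 5, "target": "example.org"},
    source_code=b"print('hypothetical experiment script')\n",
    result_data=b"100.0,110.0,90.0,105.0,95.0\n",
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Anyone re-running the experiment could then compare their own hashes against the published record to confirm they started from the same code and arrived at the same data.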

Cloud Computing, Theory

Isn’t it Time for 64Bit?

3. September 2008

I’m getting frustrated with application vendors and their support for 64-bit O/Ses. I’ll admit that a year or two ago the consumer-level device support for 64-bit O/Ses was a bit weak, but considering I can now walk into Best Buy and pick up a consumer-grade laptop that runs Vista 64, the major software vendors really need to get their act together. In the last few weeks I’ve been bitten by a “we don’t support 64-bit” story a number of times and it feels ridiculous… Maybe I’m an edge case, but every Vista machine I own is 64-bit (work laptop, home desktop, my wife’s laptop, etc).

  • I was writing code this AM attempting to integrate with QuickBooks 2008 and was forced to go back and re-compile with the processor-specific x86 switch because their SDK doesn’t support 64-bit O/Ses.
  • I had purchased a 3-computer license for ETrust Anti-Virus (Computer Associates), and upon recently replacing my home desktop and my wife’s computer I had to throw away those licenses and replace them with another vendor’s product because CA can’t figure out how to build an anti-virus app for 64-bit Vista.
  • At work, I am forced to run a 32-bit Vista Virtual Machine because the tool provided by our workflow vendor for designing business processes doesn’t run on 64-bit (they claim they’ve been working on a 64-bit version and it should be out “any time”… but that was February…)
  • At work we’ve been working on an electronic records system and are fighting with the vendor because they don’t support 64-bit OS on the server… seriously? An enterprise-scale server-based product that has been around for a few years doesn’t yet support 64-bit? I’m amazed…

Theory

Thinking the Cloud…

18. July 2008

I’ve been talking quite a bit with a co-worker about “the cloud”, how organizations can and will leverage it over time, and how application development/design may change as a result. Microsoft’s SQL Server Data Services (SSDS) is only one example of a major paradigm shift in the industry away from internal-only systems and towards treating certain things as commodity-style resources.

I’ve been thinking through a problem for a non-profit that I work with wherein they needed to share approximately 7.5 GB worth of corporate documents amongst a geographically dispersed team. We’ve been facilitating this by using a WSS site hosted on a little box at my house for the last year or so, but have had increasing frustration with normal home-hosted issues (power blinks, server goes off while I’m out of town, etc.) so I’ve been researching how to solve this problem inexpensively but also well enough to “make the problem go away”. Because of the other discussions my co-worker and I have been having recently, I naturally looked for a “cloud-based” solution.

Here’s the list of things I reviewed:

  • Microsoft Office Live Small Business – http://www.officelive.com – this looked to be a very interesting option… you sign up, get some custom domain mail accounts, a little website if you’d like, and some private space which is essentially a highly-tailored/restricted WSS platform. $15/month for 5GB of space. I contacted their helpline, they assured me I could add to that to meet my 7.5 GB requirement so I started uploading… after 4.9 GB (and a LOT of time – my poor cable modem…) I went to add another 5GB only to have the control panel deny me that option. Another call to customer service and the nice-but-feature-ignorant customer service representative told me “I sure thought you could do that but I guess not”. Cancelled the account and threw away the upload time… oh well… (UPDATE: I’ve since been called back by another rep who assured me that it was, in fact, possible and that all would be well, but the ship has sailed…)
  • Microsoft Office Online (http://www.officeonline.com) – this is the full-blown version of Hosted SharePoint… would have been great however it is currently in beta and has no prices listed. Based on the target audience and the pricing for their hosted live meeting service, my gut tells me it is going to be too pricey for the non-profit to swallow so I moved on…
  • SkyDrive – this would be great… it’s exactly what I needed… but I need 7.5 GB… not 5… I couldn’t find any way (even offering to pay) to get more than 5 GB… on to other options…
  • <Insert your favorite file share here>: Found a bunch of services that might work… some of which I had heard of before, others I wasn’t sure of, some looked too good to be true, some I wasn’t convinced would be around long enough for me to get my data uploaded much less 4-5 years from now…

Then a friend recommended I look at Amazon S3, and I’m pretty glad he did. Amazon offers “Object Storage” in the cloud at very low prices… and exposes a series of XML Web Services for interacting with it. It takes almost nothing to get set up, and there are a number of code samples available on CodePlex (http://codeplex.com) that illustrate working with it. I’m currently playing with a shareware tool called BucketExplorer (~$50) that works as a file client for the service and, besides being a resource hog, is working fairly well. The best part about this solution is that it is incredibly cheap ($0.15/GB/month!) and I can integrate it directly into our existing admin control panel without the staff knowing that the actual data “lives someplace else”. Internet storage has become a commodity – something that I can just assume is available… pretty slick if you ask me.
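For a back-of-the-envelope feel for that price, here is a small sketch using only the two numbers from this post (7.5 GB of documents at $0.15/GB/month). It covers storage only; transfer and request charges are deliberately left out as a simplifying assumption, and the function name is just an illustrative helper.

```python
def monthly_storage_cost(size_gb, price_per_gb_month):
    """Storage-only cost per month; ignores transfer and request charges."""
    if size_gb < 0 or price_per_gb_month < 0:
        raise ValueError("size and price must be non-negative")
    return size_gb * price_per_gb_month


DOCUMENT_SET_GB = 7.5          # size of the shared document set
S3_PRICE_PER_GB_MONTH = 0.15   # storage price quoted in this post

monthly = monthly_storage_cost(DOCUMENT_SET_GB, S3_PRICE_PER_GB_MONTH)
yearly = monthly * 12
print(f"monthly: ${monthly:.3f}, yearly: ${yearly:.2f}")
```

That is 7.5 × 0.15 = $1.125 a month for storage alone, which is hard to argue with for a non-profit.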

Cloud Computing, Theory

Stop (re-)Inventing the Wheel!

22. June 2008

This is more a personal reminder than anything else…

In my “day job”, I’m working with an organization wherein we are coaching a group of about 80 developers to view opening Visual Studio as their last viable option when looking to solve a problem. This doesn’t mean coding is bad (I certainly hope not… if so, I think I’d be out of a job soon), but rather represents a mind-set that recognizes that we have an enormous collection of functionality/tools already available to us (we are building on top of MOSS 2007) and we need to fully vet the OOTB functionality prior to deciding we need to “roll our own” anything. Directly tied to this approach is the theory that using OOTB functionality and/or configuration of such (rather than raw coding) leads to better long-term maintainability and upgrade-ability, not to mention helping to avoid “hit by a bus” syndrome.

However, sometimes the “preacher” needs to look inwardly and I found myself doing that this weekend. I was working on a project for a non-profit organization I work with, and found myself looking at what I had amassed for solving the problem of site-wide search and was displeased. I immediately reverted to my “code first” tendencies (something I think every developer is born with) and began (mentally) listing the discrepancies with the current solution and designing a “right” solution. Thankfully, prior to actually writing any code, I was kicking around some blog posts and something in one of them (honestly don’t remember what/which) got me thinking of the various “existing” search engines and the fact that they often provide site-specific, nearly OOTB search dialogs that you can embed into your site. I kicked a couple of them around, and settled on one (ended up with the live.com search using the XML web services API), and, rather quickly had a fully-functioning search platform on my site…

The “purist” in me immediately thinks of a couple of reasons why this solution “isn’t as good as what I would have built” (i.e. less control over the actual search results/order, less “immediacy” in updating the index, etc.), but then my more realistic side kicks in and I realize that I’m not a search engine expert… not even close… Some might argue as to whether or not those at live.com are either :), but I can guarantee you that they are more so than I, and that the solution “they delivered” is much more accurate and flexible than what I would have built…

I found myself reminding myself to focus on where I can add value, and to leave the rest to others… that’s the only way to consistently deliver adaptable solutions in an environment where the surrounding technology is changing so quickly…

SharePoint, Theory