A colleague of mine sent me a link to this story about using cassettes for storing data and asked for my thoughts. This was clearly a good-natured jab at me in light of a prior conversation in which we debated whether tape or disk was more appropriate for a research data storage platform. I had been arguing against tape as an outmoded and inappropriate storage mechanism.
The problem is… I was wrong.
And so was he.
I am using the word “wrong” not in the moral sense but to mean “less than the ideal”, “unfortunate”, “sad”, “depressing”, <enter your own term here>.
You see, I was wrong in that, as he argued (and as the article articulates well), there is simply too much data being generated to make disks a tractable solution. Even with recent and projected growth in hard drive capacity (60TB expected by 2016), our ability to produce data – particularly by automated means via sensors and scientific instruments – already outpaces these advancements and will continue to do so. Even if hard drives were able to keep up with the space demands, the power required to keep those drives running quickly becomes prohibitive. And let’s not even talk about disk transfer rate issues.
Whether or not he was “wrong” is probably a bit more subjective (an admission I’m certain he’ll enjoy). My frustration with his position is that tape tends to be synonymous with “slow” or “laborious”. That said, an unfortunate reality in the digital sciences is that with data sets of any significant size, a researcher often has to plan well in advance of his experiment to stage the data. Data has to be loaded into online storage from some cold storage mechanism (often a tape library). This significantly limits one’s curiosity and causes some questions to go unanswered (e.g. “I wonder if… well, it’s probably not worth the time/effort to load up the data just to see…”). I suppose that if I’m honest, I’m simply manifesting the “Google effect” – the expectation that I can ask a question of tons of data and get an answer instantly. I’d love for this to be possible for all scientific data – but as any data scientist will tell you, that desire is simply naïve. Providing platforms such as Google’s is hard work, and not achieved without significant planning and effort. Admitting this still doesn’t mean I can’t hope for it…
The real nugget buried in the article, and underlying our friendly debate, is that storage technologies are nowhere close to where we need them to be. No matter which option we choose, it will be a compromise between (a) discarding data – unfortunate no matter how you look at it, (b) spending astronomical amounts of money on both hardware and power, or (c) using slow, offline, and deterioration-prone devices such as tapes. Frankly, none of these options is attractive. There is a little part of me that dies when I think of a researcher having to choose to discard data simply because he doesn’t have space to store it and doesn’t currently think it is important to his work. What if he’s wrong? What if that data is (or was) the key to solving part of the problem, and he just didn’t know it yet? Or maybe it is the key to solving a problem he doesn’t yet know he has…
So here’s hoping that the researchers working on storage technologies will be successful. That they will develop means and methods for us to store massive amounts of data, access it in ever shorter times, and do so within a power envelope that is reasonable and sustainable. That’s not asking too much, is it?