[ExI] Saving the data

Anders Sandberg anders at aleph.se
Mon Nov 30 21:00:52 UTC 2009

The real scandal in "climategate" is of course that people throw away 
data! Every bit is sacred! To delete anything is to serve entropy!

Only half joking. Clearly any research project should try to ensure that 
its data remains accessible for as long as its papers are used in 
science - and given that we occasionally refer back to the Almagest and 
Sumerian clay tablets to ask questions unimaginable to their 
originators, that is a *long* time. But long-term data storage is also a 
terribly tricky problem. Formats change, media decay. And once the 
initial period of interest in the data has passed, people are less 
motivated to save it.

I think there are two kinds of datasets, each with its own problems. One 
is the "big" dataset that taxes available resources. These are big enough 
that people recognize their importance, but they are hard to move and 
copy. They run the risk of being deleted to save space (as the BBC 
did with its early TV programmes) and are often stored in just one place - 
plenty of risk of being destroyed by the occasional war, flood or fire.

The other kind is the "small" dataset that does not tax resources 
much. Its problem is usually that it is badly documented, and once it 
becomes uninteresting it easily falls victim to format or media 
decay. How many projects have been permanently deleted when the 
research group repurposed one of its old PCs as a print server or a 
node in the Beowulf cluster?

Given the rapid growth of storage capacity (just look at 
http://www.mkomo.com/cost-per-gigabyte !) it seems that we could 
probably save *all* the world's datasets within a small fraction of a 
typical hard drive's capacity. Imagine making it a publication requirement 
to place the dataset, and the software used to produce it (in an ideal 
world, with metadata explaining how to run it), on a distributed server 
whenever it is smaller than X gigabytes. There could be an escrow system 
limiting access, perhaps making data freely available after 20 years and, 
before then, available by request to the authors, the journal or a 
sufficient number of funding bodies. Over time X would increase (doubling 
every 14 months?).

This scheme would of course require funding, but also a very stable 
long-term organisation that can migrate the archive to new media. Perhaps 
allowing forking would be one way (with some cryptographic trickery for 
the escrow), so that even amateurs could run their own copy holding all 
data smaller than Y gigabytes. Sounds very much like something 
the Long Now Foundation might have been considering for their 10,000 
year library.
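The escrow policy itself is simple to state. Here is a minimal sketch of the access rule, under my own assumptions: the role names, the function, and the use of a flat 365-day year are all hypothetical; only the 20-year embargo and the author/journal/funder exception come from the proposal.

```python
from datetime import date

EMBARGO_YEARS = 20  # from the proposal: data becomes freely available after 20 years

def may_access(deposit_date, requester_roles, today=None):
    """Hypothetical escrow rule: anyone may access the data once the
    embargo has passed; before then, only the authors, the journal, or
    a quorum of funding bodies may grant a request.

    requester_roles is a collection of strings such as "author",
    "journal" or "funder-quorum" (names assumed for illustration).
    """
    today = today or date.today()
    # Approximate the embargo with 365-day years for simplicity.
    embargo_over = (today - deposit_date).days >= EMBARGO_YEARS * 365
    privileged = bool({"author", "journal", "funder-quorum"} & set(requester_roles))
    return embargo_over or privileged
```

A forked archive could carry the encrypted data openly and enforce only this rule at key-release time, which is where the cryptographic trickery would come in.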

Anders Sandberg,
Future of Humanity Institute
Philosophy Faculty of Oxford University
