[ExI] Cost/reliability tradeoff in long-term storage

Keith Henson hkeithhenson at gmail.com
Fri Dec 30 18:10:52 UTC 2011


This showed up on a closed list I read.  I asked the author if I could
repost it on this list.  He said yes, and that he would appreciate
feedback because the economic model is in an early stage.  Lots of meat
here as well as in the links.

Keith

Date: 28 Dec 2011 08:38:59 +0100
From: "David S. H. Rosenthal" <dshr at abitare.org>
Subject: Cost/reliability tradeoff in long-term storage

> Subject: Cost analysis of archiving methods
>
> I was thinking about trying to do a cost analysis on archiving methods...

The tradeoff between reliability and cost of long-term digital storage
is a difficult and important problem. I've been working on it for some
time.

> First, longevity of information is a "negative" feature: longevity is an
> absence of loss. There are no guarantees, only probabilities.
> Using multiple methods adds redundancy, lowering the probability of loss,
> raising the cost (sum of cost of each method plus overhead of setting up and
> deciding between them).
>
> Can costs and probabilities be quantified?
>

As regards the probability of loss, there are three problems:
- You need to predict the probability.
- The probability needs to be extremely low.
- The probabilities of loss events, especially rare "black swan"
  events that lose large amounts of data, are highly correlated.
These mean that you will not have adequate data to make a viable
prediction.
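
A toy calculation shows why the correlation point dominates. This is
my sketch, not David's: the per-copy annual loss probability p, the
copy count n, and the black-swan probability q are made-up
illustrative numbers.

  def p_loss_independent(p, n):
      # n copies failing independently: p**n, the optimistic
      # textbook case.
      return p ** n

  def p_loss_correlated(p, n, q):
      # Add a rare event (fire, software bug, bankruptcy) that
      # destroys all copies at once with probability q.
      return q + (1 - q) * p ** n

  print(p_loss_independent(0.01, 3))       # ~1e-06
  print(p_loss_correlated(0.01, 3, 1e-4))  # ~1.01e-04

With three copies the independent-failure term is a reassuring 10^-6,
but a correlated event with probability only 10^-4 swamps it, and q is
exactly the quantity for which we have no adequate data.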

Here is a talk I gave at the Netherlands B&G:

http://blog.dshr.org/2011/03/how-few-copies.html

based on this ACM Queue article:

http://queue.acm.org/detail.cfm?id=1866298

making essentially this argument. Here is a paper by Greenan, Plank
and Wylie proposing a better metric than mine for data loss:

http://www.usenix.org/events/hotstorage10/tech/full_papers/Greenan.pdf

So the loss probability side of the equation is hard. How about the
cost side? It turns out that this is hard too.

> Let pvscost(M, K, Y) be the present value of the cost of setting up storage
> of K bits using method M, and pvrcost(M, K, Y) be the present value of the
> cost of retrieving the data after Y years.  Only consider M for which the
> minimum time to store K bits is insignificant compared to Y; moving stars
> around doesn't apply.  Can we assume pvrcost(M, K, Y) decreases as Y
> increases? Maybe it is insignificant, although for some of the methods
> discussed it isn't.
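
Read literally, those definitions are a discounted-cash-flow
calculation. Here is a minimal sketch of them in Python; the function
names follow the quote, but collapsing method M and size K into plain
dollar figures, and the discount rate, are my simplifications:

  def present_value(cost, years, rate):
      # Standard discounted cash flow: a cost incurred `years`
      # from now, discounted at an annual rate `rate`.
      return cost / (1 + rate) ** years

  def pvscost(setup_cost, annual_cost, years, rate):
      # Set-up cost now, plus running costs discounted year by year.
      return setup_cost + sum(
          present_value(annual_cost, t, rate)
          for t in range(1, years + 1))

  def pvrcost(retrieval_cost, years, rate):
      # One retrieval `years` out. For a fixed nominal cost this
      # does shrink as `years` grows, per the quoted question.
      return present_value(retrieval_cost, years, rate)

Everything hinges on the rate parameter, which is where the trouble
starts.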

The problem here is that it has recently become clear that the
discounted cash flow method used to assign present values to future
costs works neither in practice nor in theory:
- Here is work from the Bank of England showing that in practice
  investors use unrealistically high discount rates:
  http://www.bankofengland.co.uk/publications/speeches/2011/speech495.pdf
- Here is work from the Santa Fe Institute & Yale showing that in
  theory the method depends on unrealistic assumptions:
  http://cowles.econ.yale.edu/P/cd/d17a/d1719.pdf
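
To see how much the Bank of England's point matters at archival
timescales, compare the present value of a century of storage costs
at a low rate and at an inflated one (the $100/year figure and both
rates are illustrative numbers of mine):

  def pv_annuity(annual_cost, years, rate):
      # Present value of paying annual_cost every year for `years`
      # years, discounted at annual rate `rate`.
      return sum(annual_cost / (1 + rate) ** t
                 for t in range(1, years + 1))

  print(round(pv_annuity(100, 100, 0.02)))  # ~4310
  print(round(pv_annuity(100, 100, 0.10)))  # ~1000

An investor discounting at 10% prices the 100-year liability at less
than a quarter of what a 2% rate implies, so a too-high rate makes
long-term storage look far cheaper than it will turn out to be.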
Here is a talk I gave at the Library of Congress at the start of the
work I'm currently doing to build an economic model of long-term
storage:
http://blog.dshr.org/2011/09/modeling-economics-of-long-term-storage.html

Here is a blog post on the current (primitive) state of this model:
http://blog.dshr.org/2011/12/cni-talk-on-economic-model.html

> The certainty of the probabilities decreases as Y gets larger, to the point
> where it seemed impractical to do this work for longer than 100 years.
> Reliable data archiving for 100 years is a practical necessity, though, for
> contracts, life insurance policies, building blueprints, and other
> long-lived records.  These are documents and information sources that we
> are using now, and conversion from physical to digital media for some of
> them requires stronger assurances of longevity.

Indeed. Alas, the assumption seems to be that society can happily
convert everything to digital form and all will be well, because
Kryder's Law (the exponential drop in disk price per bit) will
continue and because disks are remarkably reliable (which they are).
In the short term the Thai floods disprove this. In the not-so-long
term this paper:
http://www.dssc.ece.cmu.edu/research/pdfs/After_Hard_Drives.pdf
casts considerable doubt.
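
For intuition about why the Kryder's Law assumption is so load-bearing,
here is a back-of-the-envelope sketch; the $1000 starting cost, the
4-year replacement cycle, and the two decline rates are illustrative
assumptions of mine, not numbers from the paper:

  def total_media_cost(first_cost, kryder_rate, cycles, interval=4):
      # Buy media now and replace them every `interval` years; each
      # purchase is cheaper if the per-year price decline
      # (kryder_rate) holds up.
      return sum(first_cost * (1 - kryder_rate) ** (i * interval)
                 for i in range(cycles))

  # 25 four-year cycles, i.e. roughly a century of storage:
  print(round(total_media_cost(1000, 0.40, 25)))  # ~1149
  print(round(total_media_cost(1000, 0.10, 25)))  # ~2908

If prices keep falling 40%/year, nearly the whole cost of a century of
storage is the first purchase; if the decline slows to 10%/year, the
bill is roughly two and a half times larger, which is why a Kryder
slowdown would be so expensive for archives.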

Note that one major problem is that, because long-term data archiving
is a very small part of the overall data storage market, and because
most people believe that the problem is trivial, the premium they are
willing to pay for a perceived high probability of long-term survival
of data over the cost of regular storage is small. Thus building
products specifically for archival use is unrewarding.  Long-term
storage systems have to be built from consumer or enterprise storage
components to be economic. The search for ultra-reliable storage media
is futile, because you're never going to be sure enough of the medium's
persistence, you're never going to be able to afford it, and the
reliability of the medium is only a small part of the overall
reliability of the storage system.

       David.


