[Paleopsych] Technology Review: The Fading Memory of the State
Premise Checker
checker at panix.com
Tue Jul 12 18:58:36 UTC 2005
The Fading Memory of the State
http://www.technologyreview.com/articles/05/07/issue/feature_memory.asp?p=0
By David Talbot, July 2005
First, the summary, dated 5.7.11
The Chronicle of Higher Education: Magazine & journal reader
http://chronicle.com/prm/daily/2005/07/2005071101j.htm
A glance at the July issue of Technology Review: Preserving digital
history
The U.S. National Archives and Records Administration is struggling to
find a way to preserve the enormous volume of digital records the
federal government produces, writes David Talbot, a senior editor at
the magazine.
"Electronic records rot much faster than paper ones, and NARA must
either figure out how to save them permanently or allow the nation to
lose its grip on history," he argues. While the "most famous documents
in NARA's possession -- the Declaration of Independence, the
Constitution, and the Bill of Rights -- were written on durable
calfskin parchment and can safely recline for decades behind glass in
a bath of argon gas," it will take "a technological miracle to make
digital data last that long," he writes.
To make matters worse, NARA is facing "thousands of incompatible data
formats cooked up by the computer industry over the past several
decades, not to mention the limited life span of electronic-storage
media themselves," Mr. Talbot writes.
And the records continue to pile up, he writes. Mr. Talbot quotes
Eduard Mark, a U.S. Air Force historian, as saying that already
"history as we have known it is dying, and with it the public
accountability of government and rational public administration."
If NARA doesn't act quickly, Mr. Talbot concludes, today's history
will be lost.
--Gabriela Montell
---------------------------
The official repository of retired U.S. government records is a boxy
white building tucked into the woods of suburban College Park, MD. The
National Archives and Records Administration (NARA) is a subdued
place, with researchers quietly thumbing through boxes of old census,
diplomatic, or military records, and occasionally requesting a copy of
one of the computer tapes that fill racks on the climate-controlled
upper floors. Researchers generally don't come here to look for
contemporary records, though. Those are increasingly digital, and
still repose largely at the agencies that created them, or in
temporary holding centers. It will take years, or decades, for them to
reach NARA, which is charged with saving the retired records of the
federal government (NARA preserves all White House records and around
2 percent of all other federal records; it also manages the libraries
of 12 recent presidents). Unfortunately, NARA doesn't have decades to
come up with ways to preserve this data. Electronic records rot much
faster than paper ones, and NARA must either figure out how to save
them permanently, or allow the nation to lose its grip on history.
One clear morning earlier this year, I walked into a fourth-floor
office overlooking the woods. I was there to ask Allen
Weinstein--sworn in as the new Archivist of the United States in
February--how NARA will deal with what some have called the pending
"tsunami" of digital records. Weinstein is a former professor of
history at Smith College and Georgetown University and the author of
Perjury: The Hiss-Chambers Case (1978) and coauthor of The Story of
America (2002). He is 67, and freely admits to limited technical
knowledge. But a personal experience he related illustrates quite well
the challenges he faces. In 1972, Weinstein was a young historian
suing for the release of old FBI files. FBI director J. Edgar
Hoover--who oversaw a vast machine of domestic espionage--saw a
Washington Post story about his efforts, wrote a memo to an aide,
attached the Post article and penned into the newspaper's margin:
"What do we know about Weinstein?" It was a telling note about the
mind-set of the FBI director and of the federal bureaucracy of that
era. And it was saved--Weinstein later found the clipping in his own
FBI file.
But it's doubtful such a record would be preserved today, because it
would likely be "born digital" and follow a convoluted electronic
path. A modern-day J. Edgar Hoover might first use a Web browser to
read an online version of the Washington Post. He'd follow a link to
the Weinstein story. Then he'd send an e-mail containing the link to a
subordinate, with a text note: "What do we know about Weinstein?" The
subordinate might do a Google search and other electronic searches of
Weinstein's life, then write and revise a memo in Microsoft Word 2003,
and even create a multimedia PowerPoint presentation about his
findings before sending both as attachments back to his boss.
Definitions
Megabyte: 1,024 kilobytes. The length of a short novel or the storage
available on an average floppy disk.
Gigabyte: 1,024 megabytes. Roughly 100 minutes of CD-quality stereo
sound.
Terabyte: 1,024 gigabytes. Half of the content in an academic research
library.
Petabyte: 1,024 terabytes. Half of the content in all U.S. academic
research libraries.
Exabyte: 1,024 petabytes. Half of all the information generated in
1999.
What steps in this process can be easily documented and reliably
preserved over decades with today's technology? The short answer:
none. "They're all hard problems," says Robert Chadduck, a research
director and computer engineer at NARA. And they are symbolic of the
challenge facing any organization that needs to retain electronic
records for historical or business purposes.
Imagine losing all your tax records, your high school and college
yearbooks, and your child's baby pictures and videos. Now multiply
such a loss across every federal agency storing terabytes of
information, much of which must be preserved by law. That's the
disaster NARA is racing to prevent. It is confronting thousands of
incompatible data formats cooked up by the computer industry over the
past several decades, not to mention the limited lifespan of
electronic storage media themselves. The most famous documents in
NARA's possession--the Declaration of Independence, the Constitution,
and the Bill of Rights--were written on durable calfskin parchment and
can safely recline for decades behind glass in a bath of argon gas. It
will take a technological miracle to make digital data last that long.
But NARA has hired two contractors--Harris Corporation and Lockheed
Martin--to attempt that miracle. The companies are scheduled to submit
competing preliminary designs next month for a permanent Electronic
Records Archives (ERA). According to NARA's specifications, the system
must ultimately be able to absorb any of the 16,000 software
formats believed to be in use throughout the federal bureaucracy--and,
at the same time, cope with any future changes in file-reading
software and storage hardware. It must ensure that stored records are
authentic, available online, and impervious to hacker or terrorist
attack. While Congress has authorized $100 million and President
Bush's 2006 budget proposes another $36 million, the total price tag
is unknown. NARA hopes to roll out the system in stages between 2007
and 2011. If all goes well, Weinstein says, the agency "will have
achieved the start of a technological breakthrough equivalent in our
field to major 'crash programs' of an earlier era--our Manhattan
Project, if you will, or our moon shot."
Data Indigestion
NARA's crash data-preservation project is coming none too soon;
today's history is born digital and dies young. Many observers have
noted this, but perhaps none more eloquently than a U.S. Air Force
historian named Eduard Mark. In a 2003 posting to a Michigan State
University discussion group frequented by fellow historians, he wrote:
"It will be impossible to write the history of recent diplomatic and
military history as we have written about World War II and the early
Cold War. Too many records are gone. Think of Villon's haunting
refrain, 'Ou sont les neiges d'antan?' and weep....History as we have
known it is dying, and with it the public accountability of government
and rational public administration." Take the 1989 U.S. invasion of
Panama, in which U.S. forces removed Manuel Noriega and 23 U.S. troops lost
their lives, along with at least 200 Panamanian fighters and 300
civilians. Mark wrote (and recently stood by his comments) that he
could not secure many basic records of the invasion, because a number
were electronic and had not been kept. "The federal system for
maintaining records has in many agencies--indeed in every agency with
which I am familiar--collapsed utterly," Mark wrote.
Of course, managing growing data collections is already a crisis for
many institutions, from hospitals to banks to universities. Tom Hawk,
general manager for enterprise storage at IBM, says that in the next
three years, humanity will generate more data--from websites to
digital photos and video--than it generated in the previous 1,000
years. "It's a whole new set of challenges to IT organizations that
have not been dealing with that level of data and complexity," Hawk
says. In 1996, companies spent 11 percent of their IT budgets on
storage, but that figure will likely double to 22 percent in 2007,
according to International Technology Group of Los Altos, CA.
Still, NARA's problem stands out because of the sheer volume of the
records the U.S. government produces and receives, and the diversity
of digital technologies they represent. "We operate on the premise
that somewhere in the government they are using every software program
that has ever been sold, and some that were never sold because they
were developed for the government," says Ken Thibodeau, director of
the Archives' electronic-records program. The scope of the problem, he
adds, is "unlimited, and it's open ended, because the formats keep
changing."
The Archives faces more than a Babel of formats; the electronic
records it will eventually inherit are piling up at an ever
accelerating pace. A taste: the Pentagon generates tens of millions of
images from personnel files each year; the Clinton White House
generated 38 million e-mail messages (and the current Bush White House
is expected to generate triple that number); and the 2000 census
returns were converted into more than 600 million TIFF-format image
files, some 40 terabytes of data. A single patent application can
contain a million pages, plus complex files like 3-D models of
proteins or CAD drawings of aircraft parts. All told, NARA expects to
receive 347 petabytes (see "Definitions") of electronic records by
2022.
Currently, the Archives holds only a trivial number of electronic
records. Stored on steel racks in NARA's 11-year-old facility in
College Park, the digital collection adds up to just five terabytes.
Most of it consists of magnetic tapes of varying ages, many of them
holding a mere 200 megabytes apiece--about the size of 10
high-resolution digital photographs. (The electronic holdings include
such historical gems as records of military psychological-operations
squads in Vietnam from 1970 to 1973, and interviews, diaries, and
testimony collected by the U.S. Department of Justice's Watergate
Special Prosecution Force from 1973 to 1977.) From this modest
collection, only a tiny number of visitors ever seek to copy data;
little is available over the Internet.
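The gap between these figures is easy to check with a little
arithmetic. The sketch below is an illustration added for clarity, not
part of NARA's planning. It uses the binary units from the
"Definitions" sidebar and assumes, for simplicity, that every tape
holds exactly 200 megabytes and that the census produced exactly 600
million images (the article says "more than").

```python
# Binary units as given in the article's "Definitions" sidebar.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB
PB = 1024 * TB

# Figures quoted in the article.
current_holdings = 5 * TB       # NARA's present digital collection
expected_by_2022 = 347 * PB     # records NARA expects to receive by 2022
tape_capacity = 200 * MB        # assumed uniform tape size (simplification)
census_bytes = 40 * TB          # 2000 census image files
census_images = 600_000_000     # assumed exact count (simplification)

# How many times larger the expected intake is than today's holdings.
growth_factor = expected_by_2022 / current_holdings

# How many 200-megabyte tapes five terabytes would fill.
tapes_needed = current_holdings / tape_capacity

# Average size of one census image, in kilobytes.
avg_image_kb = census_bytes / census_images / KB

print(f"Expected intake vs. current holdings: {growth_factor:,.0f}x")
print(f"200 MB tapes needed for 5 TB: {tapes_needed:,.0f}")
print(f"Average census image: {avg_image_kb:.0f} KB")
```

Under those assumptions the expected intake is roughly 71,000 times
the current holdings, five terabytes would fill about 26,000 such
tapes, and the average census image comes to about 72 kilobytes.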
Because the Archives has no good system for taking in more data, a
tremendous backlog has built up. Census records, service records,
Pentagon records of Iraq War decision-making, diplomatic messages--all
sit in limbo at federal departments or in temporary record-holding
centers around the country. A new avalanche of records from the Bush
administration--the most electronic presidency yet--will descend in
three and a half years, when the president leaves office. Leaving
records sitting around at federal agencies for years, or decades,
worked fine when everything was on paper, but data bits are nowhere
near as reliable--and storing them means paying not just for the
storage media, but for a sophisticated management system and extensive
IT staff.
Data under the Desk
The good news is that at least some of the rocket science behind the
Archives' "moon shot" is already being developed by industry, other
U.S. government agencies, and foreign governments. For example,
Hewlett-Packard, IBM, EMC, PolyServe, and other companies have
developed "virtual storage" technologies that automatically spread
terabytes of related data across many storage devices, often of
different types. Virtualization frees up IT staff, balances loads when
demand for the data spikes, and allows hardware upgrades to be carried
out without downtime. Although the Archives will need technologies far
beyond virtual storage, the commercial efforts form a practical
foundation. The Archives may also benefit from the examples of digital
archives set up in other nations, such as Australia, where archivists
are using open-source software called XENA (for XML Electronic
Normalizing of Archives) to convert records into a standardized format
that will, theoretically, be readable by future technologies. NARA
will also follow the lead of the U.S. Library of Congress, which in
recent years has begun digitizing collections ranging from early
American sheet music to immigration photographs and putting them
online, as part of a $100 million digital preservation program.
But to extend the technology beyond such commercial and government
efforts, NARA and the National Science Foundation are funding research
at places like the San Diego Supercomputer Center. There, researchers
are, among other things, learning how to extract data from old formats
rapidly and make them useful in modern ones. For example, San Diego
researchers took a collection of data on airdrops during the Vietnam
War--everything from the defoliant Agent Orange to pamphlets--and
reformatted it so it could be displayed using nonproprietary versions
of digital-mapping programs known as geographic information systems,
or GIS (see "Do Maps Have Morals?" Technology Review, June 2005 [3]).
Similarly, they took lists of Vietnam War casualties and put them in a
database that can show how they changed over the years, as names were
added or removed. These are the kinds of problems NARA will face as it
"ingests" digital collections, researchers say. "NARA's problem is
they will be receiving massive amounts of digital information in the
future, and they need technologies that will help them import that
data into their ERA--hundreds of millions of items, hundreds of
terabytes of data," says Reagan Moore, director of data-knowledge
computing at the San Diego center.
Another hive of research activity on massive data repositories: MIT.
Just as the government is losing its grip on administrative, military,
and diplomatic history, institutions like MIT are losing their hold on
research data--including the early studies and communications that led
to the creation of the Internet itself. "MIT is a microcosm of the
problems [NARA] has every day," says MacKenzie Smith, the associate
director for technology at MIT Libraries. "The faculty members are
keeping their research under their desks, on lots and lots of disks,
and praying that nothing happens to it. We have a long way to go."
Now MIT is giving faculty another place to put that data. Researchers
can log onto the Internet and upload information--whether text, audio,
video, images, or experimental data sets--into DSpace, a storage
system created in collaboration with Hewlett-Packard and launched in
2002 (see "MIT's DSpace Explained" [4]). DSpace makes two identical
copies of all data, catalogues relevant information about the data
(what archivists call "metadata," such as the author and creation
date), and gives each file a URL or Web address. This address won't
change even if, say, the archivist later wants to put a given file
into a newer format--exporting the contents of an old Word document
into a PDF file, for instance. Indeed, an optional feature in DSpace
will tell researchers which files are ready for such "migration."
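The idea of an address that survives format migration can be
illustrated with a toy registry. This is a minimal sketch, not
DSpace's actual code or API: the class names, the example.org base
URL, and the file names are all hypothetical. It shows only the
principle described above, in which the identifier stays fixed while
the file it points to is replaced.

```python
from dataclasses import dataclass, field


@dataclass
class ArchivedItem:
    """One deposited item plus the metadata catalogued for it."""
    author: str
    created: str
    current_file: str
    earlier_files: list = field(default_factory=list)


class Registry:
    """Hypothetical registry mapping stable URLs to the current file."""

    def __init__(self, base_url):
        self.base_url = base_url
        self.items = {}
        self.next_id = 1

    def deposit(self, author, created, filename):
        """Store a new item and return its permanent address."""
        item_id = self.next_id
        self.next_id += 1
        self.items[item_id] = ArchivedItem(author, created, filename)
        return f"{self.base_url}/{item_id}"

    def migrate(self, url, new_filename):
        """Point an existing address at a file in a newer format."""
        item = self.items[self._item_id(url)]
        item.earlier_files.append(item.current_file)
        item.current_file = new_filename

    def resolve(self, url):
        """Return the file currently served at this address."""
        return self.items[self._item_id(url)].current_file

    def _item_id(self, url):
        return int(url.rsplit("/", 1)[1])


registry = Registry("https://archive.example.org/item")
url = registry.deposit("A. Researcher", "2002-11-04", "report.doc")
before = registry.resolve(url)

# The archivist later exports the Word document to PDF.
registry.migrate(url, "report.pdf")
after = registry.resolve(url)

print(url, before, after)
```

The address handed out at deposit time is never rewritten; only the
record behind it changes, and the earlier file name is kept so the
migration history is not lost.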
Because the software behind DSpace is open source, it is available for
other institutions to adapt to their own digital-archiving needs;
scores have already done so. Researchers at MIT and elsewhere are
working on improvements such as an auditing feature that would verify
that a file hasn't been corrupted or tampered with, and a system that
checks accuracy when a file migrates into a new format. Ann Wolpert,
the director of MIT Libraries (and chair of Technology Review's board
of directors), says DSpace is just a small step toward tackling MIT's
problems, never mind NARA's. "These changes have come to MIT and other
institutions so rapidly that we didn't have the technology to deal
with it," Wolpert says. "The technology solutions are still emerging."
Robert Tansley, a Hewlett-Packard research scientist who worked on
DSpace, says the system is a good start but cautions that "it is still
quite new. It hasn't been tested or deployed at a massive scale, so
there would need to be some work before it could support what the
National Archives is looking at."
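An auditing feature of the kind mentioned above is commonly built on
cryptographic checksums. The sketch below assumes that approach; the
article does not say how the DSpace researchers implement theirs, and
the function names here are hypothetical. A digest is recorded when a
file is deposited and recomputed later, and any difference means the
bytes have changed.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def audit(data: bytes, recorded_digest: str) -> bool:
    """Check that the data still matches the digest recorded at deposit."""
    return fingerprint(data) == recorded_digest


# At deposit time: store the file and remember its digest.
deposited = b"What do we know about Weinstein?"
recorded_digest = fingerprint(deposited)

# Later: an intact copy passes the audit, an altered copy does not.
intact_copy = bytes(deposited)
altered_copy = deposited.replace(b"Weinstein", b"Weinstien")

intact_ok = audit(intact_copy, recorded_digest)
altered_ok = audit(altered_copy, recorded_digest)

print("intact copy passes:", intact_ok)
print("altered copy passes:", altered_ok)
```

A matching digest shows only that the bytes are unchanged since the
digest was recorded. Guarding against deliberate tampering would also
require keeping the recorded digests somewhere an attacker cannot
alter, which this sketch does not address.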
Digital Marginalia
But for all this promise, NARA faces many problems that researchers
haven't even begun to think about. Consider Weinstein's discovery of
the Hoover marginalia. How could such a tidbit be preserved today? And
how can any organization that needs to track information--where it
goes, who uses it, and how it's modified along the way--capture those
bit streams and keep them as safe as older paper records? Saving the
text of e-mail messages is technically easy; the challenge lies in
managing a vast volume and saving only what's relevant. It's
important, for example, to save the e-mails of major figures like
cabinet members and White House personnel without also bequeathing to
history trivial messages in which mid-level bureaucrats make lunch
arrangements. The filtering problem gets harder as the e-mails pile
up. "If you have 300 or 400 million of anything, the first thing you
need is a rigorous technology that can deal with that volume and
scale," says Chadduck. More and more e-mails come with attachments, so
NARA will ultimately need a system that can handle any type of
attached file.
Version tracking is another headache. In an earlier era, scribbled
cross-outs and margin notes on draft speeches were a boon to
understanding the thinking of presidents and other public officials.
Their digital counterparts are harder to recover. To see all the
features of a given Microsoft Word document, such as
tracked changes, it's best to open the document using the same version
of Word that the document's creator used. This means that future
researchers will need not only a new piece of metadata--what software
version was used--but perhaps even the software itself, in order to
re-create fonts and other formatting details faithfully. But saving
the functionality of software--from desktop programs like Word to the
software NASA used to test a virtual-reality model of the Mars Global
Surveyor, for example--is a key research problem. And not all software
keeps track of how it was actually used. Why might this matter?
Consider the 1999 U.S. bombing of the Chinese embassy in Belgrade.
U.S. officials blamed the error on outdated maps used in targeting.
But how would a future historian probe a comparable matter--to check
the official story, for example--when decision-making occurred in a
digital context? Today's planners would open a map generated by GIS
software, zoom in on a particular region, pan across to another site,
run a calculation about the topography or other features, and make a
targeting decision.
If a historian wanted to review these steps, he or she would need
information on how the GIS map was used. But "currently there are no
computer science tools that would allow you to reconstruct how
computers were used in high-confidence decision-making scenarios," says
Peter Bajcsy, a computer scientist at the University of Illinois at
Urbana-Champaign. "You might or might not have the same hardware,
okay, or the same version of the software in 10 or 20 years. But you
would still like to know what data sets were viewed and processed, the
methods used for processing, and what the decision was based on." That
way, to stay with the Chinese embassy example, a future historian
might be able to independently assess whether the database about the
embassy was obsolete, or whether the fighter pilot who dropped the
bomb had the right information before he took off. Producing such data
is just a research proposal of Bajcsy's. NARA says that if such data
is collected in the future, the agency will add it to the list of
things needing preservation.
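To make concrete what such a record might contain, here is a toy
session log. It is an illustration only and not Bajcsy's proposal: the
class, the action names, and the sample data are all hypothetical. It
simply captures the three things he lists, namely which data sets were
opened, which methods were run, and what was decided.

```python
import json
from datetime import datetime, timezone


class SessionLog:
    """Hypothetical append-only log of one map-analysis session."""

    def __init__(self):
        self.events = []

    def record(self, action, **details):
        """Append one timestamped event to the log."""
        self.events.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "details": details,
        })

    def export(self):
        """Serialize the log as JSON text for long-term storage."""
        return json.dumps(self.events, indent=2)


log = SessionLog()
log.record("open_dataset", name="city_map", edition="1997")
log.record("zoom", region="district 4")
log.record("compute", method="distance_between", inputs=["site A", "site B"])
log.record("decision", summary="site A selected")

# A later reader can ask which data sets the session relied on.
datasets_viewed = [
    event["details"]["name"]
    for event in log.events
    if event["action"] == "open_dataset"
]

print(datasets_viewed)
print(log.export())
```

Plain JSON text is used here on the assumption that a simple,
self-describing format is more likely to remain readable than a log
tied to one program; the hard part, as the article notes, is getting
software to produce such records at all.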
Data Curators
Even without tackling problems like this, NARA has its hands full. For
three years, at NARA's request, a National Academy of Sciences panel
has been advising the agency on its electronic-records program. The
panel's chairman, computer scientist Robert F. Sproull of Sun
Microsystems Laboratories in Burlington, MA, says he has urged NARA
officials to scale back their ambitions for the ERA, at least at the
start. "They are going to the all-singing, all-dancing solution rather
than an incremental approach," Sproull says. "There are a few dozen
formats that would cover most of what [NARA] has to do. They should
get on with it. Make choices, encourage people submitting records to
choose formats, and get on with it. If you become obsessed with
getting the technical solution, you will never build an archive."
Sproull counsels pragmatism above all. He points to Google as an
example of how to deploy a workable solution that satisfies most
information-gathering needs for most of the millions of people who use
it. "What Google says is, 'We'll take all comers, and use best
efforts. It means we won't find everything, but it does mean we can
cope with all the data,'" Sproull says. Google is not an archive, he
notes, but in the Google spirit, NARA should attack the problem in a
practical manner. That would mean starting with the few dozen formats
that are most common, using whatever off-the-shelf archiving
technologies will likely emerge over the next few years. But this kind
of preservation-by-triage may not be an option, says NARA's Thibodeau.
"NARA does not have discretion to refuse to preserve a format," he
says. "It is inconceivable to me that a court would approve of a
decision not to preserve e-mail attachments, which often contain the
main substance of the communication, because it's not in a format NARA
chose to preserve."
Meanwhile, the data keep rolling in. After the 9/11 Commission issued
its report on the attacks on the World Trade Center and the Pentagon,
for example, it shut down and consigned all its records to NARA. A
good deal of paper, along with 1.2 terabytes of digital information on
computer hard disks and servers, was wheeled into NARA's College Park
facility, where it sits behind a door monitored by a video camera and
secured with a black combination lock. Most of the data, which consist
largely of word-processing files and e-mails and their attachments,
are sealed by law until January 2, 2009. They will probably survive
that long without heroic preservation efforts. But "there's every
reason to say that in 25 years, you won't be able to read this stuff,"
warns Thibodeau. "Our present will never become anybody's past."
It doesn't have to be that way. Projects like DSpace are already
dealing with the problem. Industry will provide a growing range of
partial solutions, and researchers will continue to fill in the
blanks. But clearly, in the decades to come, archives such as NARA
will need to be staffed by a new kind of professional, an expert with
the historian's eye of an Allen Weinstein but a computer scientist's
understanding of storage technologies and a librarian's fluency with
metadata. "We will have to create a new profession of 'data
curator'--a combination of scientist (or other data specialist),
statistician, and information expert," says MacKenzie Smith of the MIT
Libraries.
The nation's founding documents are preserved for the ages in their
bath of argon gas. But in another 230 years or so, what of today's
electronic records will survive? With any luck, the warnings from Air
Force historian Mark and NARA's Thibodeau will be heeded. And
historians and citizens alike will be able to go online and find that
NARA made it to the moon, after all.
References
[3] "Do Maps Have Morals?"
    http://www.technologyreview.com/articles/05/06/issue/review_maps.asp
[4] "MIT's DSpace Explained"
    http://www.technologyreview.com/articles/05/07/issue/feature_mit.asp