Drowning in Data

Once upon a time, a megabyte was huge. One of my prized photographs (now digitized) shows me standing by a single platter from the head-per-track disk system for ILLIAC IV, once the world’s fastest computer. The platter and parts of ILLIAC IV are now in the Computer History Museum. Things change quickly.

Maybe you remember the IBM 2314 washing machine, with a stunning 29 MB on each removable disk pack? Maybe you played with the 40 MB RP03 disks on a DEC PDP-10 or PDP-11?

Did you lust for a 470 MB Fujitsu Eagle? I used one of these babies (only $10,000) as the storage for my first network of SUN 3 diskless workstations. I loved that 16 MHz 68020 SUN 3/50; it was so much faster than the VAX 11/780, and I had more disk space than I could ever use (or so I thought).

Last month, Dell announced that it would start shipping 1 TB Hitachi drives on its high-end desktop PCs. We routinely carry 1 GB USB memory sticks, and we’ve seen disk capacities increase by five orders of magnitude in a single professional lifetime. Such is the nature of exponential change.
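For readers who like to check the arithmetic, here is a minimal sketch that measures how far the drives mentioned above sit below a 1 TB disk, in orders of magnitude. It assumes decimal units (1 MB = 10^6 bytes, 1 TB = 10^12 bytes), and the function name is just an illustrative choice. Under those assumptions, the drives named here land between roughly three and five orders of magnitude below 1 TB, and a 10 MB disk would sit exactly five below.

```python
import math

# Decimal units are an assumption here: 1 MB = 10**6 bytes, 1 TB = 10**12 bytes.
MB = 10**6
TB = 10**12

# Capacities as quoted in the post.
drives = {
    "IBM 2314 disk pack": 29 * MB,
    "DEC RP03": 40 * MB,
    "Fujitsu Eagle": 470 * MB,
}


def orders_of_magnitude(old_bytes, new_bytes):
    """Return log10 of the capacity ratio new/old."""
    return math.log10(new_bytes / old_bytes)


for name, capacity in drives.items():
    gap = orders_of_magnitude(capacity, 1 * TB)
    print(f"{name}: {gap:.1f} orders of magnitude below 1 TB")
```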

More to the point, consumers, universities, industry and government are now drowning in digital data. Whether it’s digital music, video and photographs; scientific data from high resolution sensors; or just the traffic of daily business life, petabytes and exabytes are upon us. Hal Varian and his collaborators at UC Berkeley produced a landmark 2003 study on digital data, estimating that roughly five exabytes of data were produced in 2002.

The tsunami has only grown. A recent IDC study estimated that humanity produced 161 exabytes of data last year. As Ian Foster and others have noted, we’re on pace to produce a zettabyte per year by 2010! We’re drowning in data, and deep data mining and semantic indexing are our potential lifeboats. However, these techniques bring deep privacy and security questions. Our lives and our history are now digital, not parchment or paper.
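To put those estimates on a common footing, here is a rough sketch of the implied compound annual growth. It assumes that “last year” means 2006, so that four years separate each pair of figures, and it treats the Berkeley and IDC numbers as comparable even though the two studies used different methodologies. Take the result as an order-of-magnitude illustration, not a measurement.

```python
def annual_growth_factor(start, end, years):
    """Compound annual growth factor that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years)


# All volumes in exabytes; 1 zettabyte = 1000 exabytes.
berkeley_2002 = 5       # Berkeley estimate for 2002
idc_last_year = 161     # IDC estimate; the year 2006 is an assumption
zettabyte_2010 = 1000   # projected pace for 2010

past = annual_growth_factor(berkeley_2002, idc_last_year, years=4)
future = annual_growth_factor(idc_last_year, zettabyte_2010, years=4)

print(f"2002 to 2006: about {past:.1f}x per year")
print(f"2006 to 2010: about {future:.1f}x per year")
```

Under these assumptions, the annual data volume would have more than doubled each year between the two studies, and reaching a zettabyte by 2010 would require growth of well over 50 percent per year.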

I was reflecting on this profound change last week, when I was in Washington. I am a member of the electronic records advisory committee (ACERA) for the National Archives and Records Administration (NARA). Last week, we met to discuss the status of NARA’s electronic records software and the associated development and testing plan. These meetings are always interesting, because they bring together practicing archivists, legal experts and computing types (like me).

The complementary perspectives on digital data preservation are illuminating, particularly given the enormous growth of digital data and NARA’s federal mandate to archive almost everything. It’s also humbling to stroll through the building and see the maps, drawings, documents and (later) photographs documenting the history of the United States.

One of the interesting discussion items at NARA was the explosive growth of Presidential email, from a few tens of thousands of messages from the administration of George H. W. Bush through millions from the Clinton administration to hundreds of millions from the George W. Bush presidency. Although classifying and organizing the email text is daunting, it pales in comparison to the diversity and size of the email attachments. Nevertheless, one can easily imagine the power and possibility of rich text mining and social network visualization to construct relationship maps and chronologies of major events.

Technologies change, the knowledge base grows, access broadens. The codex replaced the scroll, printed text replaced the hand-illuminated manuscript, and digital storage is now supplanting printed text.

At the end of World War II, Vannevar Bush, who had coordinated much of the scientific community’s contribution to the war, turned his attention to the future of information management. In an insightful essay, “As We May Think,” Bush outlined the challenges inherent in the knowledge explosion, namely rapid specialization and the increasing inability to remain aware of critically relevant information. The essay continued with a prescient technical solution:

Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and, to coin one at random, “memex” will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.

Sixty years later, we’re not there, but we can see it from here: the personalized, indexed and readily accessible infosphere.


Discover more from Reed's Ruminations: The Past, Present, and Future
