In the JAH Interviews we read in week 3, Dan Cohen included a cautionary note against a commitment to close reading that ignores new technological possibilities for parsing and retrieving the massive amounts of data now available digitally. Our readings this week explore methods and applications for handling large quantities of data… all of the data. Cohen’s “From Babel to Knowledge” deals with methods for mining data: applications that retrieve information from public places across the web, making use of the great library of the Web and thereby sidestepping the problem of replicating and storing data (an issue also raised in week 4, in the chapter “Getting Started” from the book with Roy Rosenzweig) in order to perform analyses.
One goal of distant reading is to move eventually, and ideally smoothly, back into close reading; another, expressed best in Franco Moretti’s Graphs, Maps, Trees (Verso, 2005), is to perform and interpret quantitative analyses made possible by the new ability to handle an enormous sample effectively.
Criticisms of both approaches centre on persistent problems with the sample itself. Tim Burke’s review of Moretti in The Valve is favourable to Moretti’s project, but expresses concern that a project purporting to handle all of the data must somehow account for losses: the regular fallout of time, not just objects not yet digitised or stored in an accessible fashion. Burke may be overstating the problem (for Moretti’s project, a sufficiently representative sample should yield the same results as the complete body, or we have other problems here), but sample representation on the Web does seem a genuine concern.
Ted Underwood’s blog introduction to text mining notes the incompleteness of digitisation as a problem. Gated content poses a problem as well, as, conversely, does the ability of anyone to publish anything online. In “From Babel to Knowledge,” Cohen asserts that quantity beats quality: when you can sample enough information, errors are drowned out by the sheer frequency of correct information. I see a concern with that assertion in the mass reduplication of information on the Web (such as the reproduction of a Wikipedia article on every informational website). How does this affect the sample, especially when errors are reduplicated along with everything else? In a related vein, the dominance of certain information types over others online affects sampling in ways that, if we cannot control them, we should at least remain attentive to when we depend on the Web as our wellspring of data.
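Cohen’s frequency argument can be sketched as a simple majority vote over mined answers. The function and data below are hypothetical illustrations of the principle, not Cohen’s actual tool: with a large enough sample, the correct value is assumed to outnumber any single error, but note how reduplication (the same wrong value mirrored across sites) narrows that margin.

```python
from collections import Counter

def most_frequent_answer(snippets):
    """Return the answer appearing most often across mined snippets.

    A toy illustration of quantity-over-quality retrieval: errors are
    outvoted as long as the correct value dominates the sample.
    """
    counts = Counter(snippets)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical snippets "mined" for Lincoln's birth year: most sources
# are right, but one error is duplicated across mirror sites, eroding
# the majority without (yet) overturning it.
snippets = ["1809", "1809", "1809", "1809", "1808", "1808", "1810"]
print(most_frequent_answer(snippets))  # → 1809
```

The sketch also makes the reduplication worry concrete: if enough mirrors repeat the same mistaken value, the vote flips, and no amount of additional sampling from those same mirrors corrects it.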