Data Curation for Research in History and Public History

{This is a piece that I wrote in 2011 for an IMLS-funded project on data curation. It seems unlikely that it will ever be published through that venue, so I offer it here for what it’s worth as an artifact of that time and place.}

Introduction

The notion of data curation may not be a familiar one for many historians, both academic researchers and public historians. Public historians mostly operate with a traditional definition of curation that is grounded in the museum profession and that includes collection, assessment, description, and preservation of physical collections. These tasks are the purview of the curator, who serves as the caretaker for and expert on those collections. Academic historians often claim less responsibility for the care and preservation of their source materials, leaving those tasks to archivists and other history professionals. Nonetheless, creating historical knowledge and interpretation can be viewed as a process of curation that involves surveying and selecting appropriate sources, carefully reading and contextualizing those sources, and examining them in relationship to one another to answer questions about the past: what happened, why did it happen, what does it mean for our understanding of the event, time period, movement, etc.

Historians attempt to answer questions about the past from source material that is unique in its conditioning by place and time. They have only the evidential remains of the past with which to work, and have no direct access to events, unlike journalists who may be on the scene or ethnographers who might be engaged in participant observation. Thus, historians work to make meaning out of those remains, looking for the influences of context, multiple perspectives and causation, and change and continuity over time. This larger mission of creating meaning out of historical evidence takes the form of some sort of narrative, often marked by periodization, regional demarcations, or other kinds of boundaries that help to create richly textured answers to inquiry questions.

It’s not altogether easy to provide a comprehensive view of data curation for history and public history research simply because the scope of source materials that historians use in their work is so varied. Social, cultural, political, and economic historians address totally different kinds of evidence. While one historian might deal solely with archival materials housed in a university’s special collections, others may be providing interpretations of changing physical environments, and still others may be offering perspectives on the political importance of a musical genre. Similarly, at first glance, academic historians appear to deal with a wide array of relatively unprocessed materials–archival collections that are not described down to the item or even folder level, source materials found in print collections, elements of material culture that may have minimal provenance information. On the other hand, some contemporary historians function in a world of highly structured data, marked by the presence of massive searchable digital collections. For most practicing historians, the reality of their investigative terrain lies somewhere along a continuum between the two. Rob Townsend of the American Historical Association has reported that a 2010 survey of just over 4,000 history faculty at four-year colleges reveals that nearly 69% considered themselves active users of digital tools and, of those, over 90% used library-supported databases, online search engines, and online archives or primary sources in their work.[1]

If it is not possible to offer suggestions that provide detailed guidance for individual types of historical source material, it is possible to look at the types of cognitive work historians do and offer suggestions to support that work. Sam Wineburg’s research into the heuristics that undergird professional historical work provides us with some points of entry for considering how data can be curated to support historical investigations. In Historical Thinking and Other Unnatural Acts (2001), Wineburg reports from observing professional historians interacting with historical evidence that those historians engage in “sourcing” evidence–inquiring into the issue of perspective–before they embark upon the act of analyzing the material itself.[2] Sourcing, or the assessment of the conditions of creation for a piece of evidence, demands that we pay special attention to providing metadata not only about the creator, but also about the time and place of creation. Sourcing leads to a process of contextualization that informs and assists in the analysis of evidence. While reading of sources, whether the traditional act of closely reading every word of a document for tone and word choice or what Franco Moretti has termed “distant reading” with computationally aided analysis of large corpora of texts, is essential to historical work, that reading does not take place in isolation.[3] Just as important as sourcing evidence, the act of corroborating materials–reading and analyzing them in relationship to one another–allows historians to come to an understanding of the past that arises out of the necessarily partial source material. Together, sourcing, reading, contextualizing, and corroborating evidence make up the key work of history, and need to inform the ways that archivists, curators, and digital historians present their collections of historical sources online.

Locating Sources

The creation of standardized and exposed metadata for historical sources significantly aids in their discoverability. Different sources and types of collections will require different descriptive standards, but a minimal set of fields that are common across most systems makes a baseline of interoperability possible. Many scholars have had the time-consuming experience of working with a digital collection that provides only minimal and unstructured description, leaving the scholar with recourse only to the blunt instrument of a keyword search. This situation makes it necessary for the researcher to examine huge numbers of items from a collection before locating the key pieces of evidence that are central to her investigation. Needless to say, this situation is less than ideal. Carefully prepared collection descriptions with easily accessible, standardized metadata can dramatically decrease the amount of labor that historians put into locating materials.

The archival profession has settled on Encoded Archival Description (EAD) as the standard for creating machine-readable descriptive finding aids. Initiated in 1993 at the University of California, Berkeley under the direction of Daniel Pitti, EAD has undergone a number of updates and revisions, with the most current scheme (Version 2.0) being released in 2002. While contemporary archivists are realistic about the amount of resources they can dedicate to collection description, even minimal metadata that provides clues about a source’s creation and relationship to other materials is useful to historians. Moreover, well-structured EAD provides archivists with the opportunity to offer researchers powerful faceted searching through collections.

Not all collections that are of historical interest are managed by archivists who have the capacity to create rich finding aids. Curators and public historians are mounting significant collections on the web for historians to use in their research. Unlike archivists, these professionals have reached less agreement about the appropriate ways to describe their collections so that they can most easily be located by researchers. Many older collections exhibit home-grown data models that do not conform to any standard, making migration to new platforms quite difficult. More recently, some consensus has formed around the use of the Dublin Core Metadata Initiative as a lightweight standard for collections description. With fifteen core elements in the unqualified set, Dublin Core strikes an important balance between thoroughness and ease of implementation. As a result, it serves as a baseline of metadata for many types of exchange. For example, public broadcasters have developed PBCore as a variation on Dublin Core to describe audiovisual media. Dublin Core also provides the practicing historian with all of the key descriptors she might need to work with a particular item in a collection: title, creator, subject, date, description, publisher, coverage, relation, etc.

The exposure of all of this structured metadata through RSS, Atom, JSON, XML, or other means has made it possible for historians to conduct research across vast aggregations of collections. The Open Archives Initiative’s Protocol for Metadata Harvesting (OAI-PMH) is one of the most promising standards for the exposure and exchange of collections data. Leveraging unqualified Dublin Core as a baseline for interoperability, OAI-PMH serves as a lightweight protocol for gathering the metadata from online collections through HTTP and XML. By setting up digital archives as OAI-PMH repositories, institutions allow those with functioning OAI-PMH harvesters to ingest collections data for additional processing, whether through search, mash-ups, or other combinations. Thus, OAI-PMH provides an easy method of data exchange which dramatically extends the reach of collections. Historians benefit from this open exchange of data through the types of services applied to aggregated data that enable them to work across vast collections. Currently, OAIster, hosted by the Online Computer Library Center, draws together millions of open access records from over 1,000 sources, and, in a more focused way, Opening History aggregates US history materials from library, museum, and archive collections. In the future, linked data, the Resource Description Framework (RDF), and the semantic web may make it possible for vast amounts of historical materials which are currently dispersed across the web to be accessed and assessed by practicing historians.
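To make the harvesting step concrete, here is a minimal sketch in Python of what a harvester does with unqualified Dublin Core over OAI-PMH: it builds a ListRecords request and pulls the Dublin Core fields out of the XML that comes back. The repository address (https://example.org/oai), the sample record, and the function names are hypothetical and exist only for illustration. The sketch parses a hard-coded response rather than contacting a server, and it ignores the resumption tokens a real harvester would need to follow for large collections.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces used by OAI-PMH responses and by unqualified Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a repository's base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# A hand-written, hypothetical response containing a single record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Letter to the editor</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:date>1863-07-04</dc:date>
          <dc:coverage>Philadelphia</dc:coverage>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return one dict per record: its identifier plus its Dublin Core fields."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(f"{{{OAI_NS}}}record"):
        identifier = rec.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        fields = {}
        for el in rec.iter():
            # Keep only elements in the Dublin Core namespace.
            if el.tag.startswith(f"{{{DC_NS}}}"):
                name = el.tag.split("}", 1)[1]
                # Dublin Core elements are repeatable, so collect lists.
                fields.setdefault(name, []).append((el.text or "").strip())
        records.append({"identifier": identifier, "dc": fields})
    return records


if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record["identifier"], record["dc"])
```

Because every repository answers in the same small vocabulary, the same parsing code can be pointed at many collections, which is what makes aggregation services of this kind feasible.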

Introductions

Standards and Resources

Applications in Projects

Issues

Assessing Perspective through Sourcing

Questions of perspective in historical evidence are central to the ways that historians work to answer their questions about the past. As a result, data curation for historical research must necessarily pay attention to exposing thorough metadata about source creation. By providing additional contextualizing information about the key creators and the conditions of creation, archivists, curators, and digital historians significantly contribute to the work of contemporary scholars. One way to provide this additional context is to supplement EAD finding aids by providing Encoded Archival Context–Corporate Bodies, Persons, and Families (EAC-CPF) entries to describe source creators and contributors. Exemplary work in this area is being done by Daniel Pitti’s team working on the Social Networks and Archival Context Project (SNAC), which will provide tools to build rich descriptions of people by extracting important data from existing EAD and standardized authority files. Those standardized authority files represent another important element in the description of historical sources. The Library of Congress maintains name, subject, and title authority files that can constitute a baseline for sharing data about historical actors. Similarly, the J. Paul Getty Trust maintains the Thesaurus of Geographic Names (TGN) and other standardized vocabularies that can contribute to a shared understanding of geospatial context for the creation of historical evidence. Finally, individual scholars can use standardized metadata to create their own network analyses using simple tools such as NodeXL.

Standards and Resources

Applications in Projects

Tools

Reading Sources: Close and Distant

There is much that can be done to aid in the process of reading historical evidence. The clearest computational techniques apply to text sources. Historians are known for their traditional approach to close reading of text sources and other types of evidence. Nonetheless, large-scale digitization has opened the possibility of new methodological approaches that generally fall under the rubric of “distant reading,” to use Moretti’s term. For this type of historical work to be successful, a combination of efforts on the part of collections stewards is required. The availability of clean, full-text archives is essential. Google Books, HathiTrust, and Open Library have taken major steps in providing access to such collections, but full-text access is important in smaller collections as well. The Library of Congress’s Chronicling America collection of historic American newspapers allows for full-text search and provides an Application Programming Interface (API) so that individual users can shape and extend their interaction with the archive. On another front, full-text transcriptions of oral histories have long been the standard, and can allow for sophisticated analysis across a large body of interviews.

While the blunt instrument of full-text search can provide historians with some insights in their work, the emerging field of text mining offers the promise of more sophisticated analysis. Unlike literary scholars, who may be accustomed to working with highly processed texts, historians are more likely to want to operate on large-scale collections that may not have been marked up using TEI or some other XML standard. This tendency to operate on raw text pushes historians toward particular kinds of tools in their text-mining efforts. SEASR, TAPoR, and Voyeur are the most popular at present. Often these tools provide a sense of word frequency and proximity, while others can show the strength of relationships among entities across a body of text. In the end, many historians will want to engage in a combination of close and distant reading, so the text-mining tools with user interfaces that allow for easy switching between the two will be most successful.
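The measures mentioned above (word frequency, proximity, and movement between distant and close reading) can be sketched in a few lines of Python operating on raw, unmarked text. This is only an illustration of the general idea, not a description of how any of the tools named above work internally. The sample passage, the tiny stopword list, and the function names are all invented for the example.

```python
import re
from collections import Counter

# A deliberately tiny stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "was", "on", "by", "had"}

# An invented passage standing in for a raw, unmarked source text.
SAMPLE = (
    "The strike began in the mill on Monday. "
    "The mill owners refused to meet the strike committee. "
    "By Friday the strike had spread to the railroad."
)


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Distant reading: count each non-stopword token in the text."""
    return Counter(t for t in tokenize(text) if t not in STOPWORDS)


def cooccurrences(text, target, window=3):
    """Proximity: count words appearing within `window` tokens of `target`."""
    tokens = tokenize(text)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            start = max(0, i - window)
            neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(n for n in neighbors if n not in STOPWORDS)
    return counts


def keyword_in_context(text, target, width=30):
    """Close reading: return each occurrence of `target` with surrounding text."""
    pattern = r"\b" + re.escape(target) + r"\b"
    hits = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        hits.append(text[start:end])
    return hits


if __name__ == "__main__":
    print(word_frequencies(SAMPLE).most_common(3))
    print(cooccurrences(SAMPLE, "mill"))
    for line in keyword_in_context(SAMPLE, "strike"):
        print("...", line, "...")
```

The keyword-in-context function is the bridge between the two modes of reading: a count points to a term worth attention, and the surrounding passage returns the historian to the source itself.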

Introductions

Corpora

Tools

Application in Projects

Corroboration through Visualization

Historians are never content to examine one source. Reading sources in relationship to one another is central to historical work. While traditional historical work can involve flipping back and forth among a set of sources that speak to a particular historical problem or event, larger clusters of sources can lend themselves to comparison through visualization techniques. The standardization and accessibility of geographic and temporal data permit scholars to create rich depictions of historical sources on these key axes of historical inquiry. Scholars wishing to engage in this type of work who do not have access to server-intensive geospatial tools would do well to turn to the Library of Congress’s ViewShare tool, which enables the quick creation of interactive maps and timelines using a wide range of digital collections. Other kinds of visualizations that do not rely upon spatial or temporal data are possible with the use of highly flexible visualization tools such as IBM’s Many Eyes and Google’s Chart Tools.

Introductions

Tools

Resources

Applications in Projects

Personal Research Curation

Upon locating the appropriate sources to investigate a particular historical question, each individual is faced with the question of how to manage and maintain his or her own research archive. In the past, this type of work might have involved an elaborate system of photocopies, notes and file folders, but more robust digital solutions have been developed to organize the vast amounts of collection records and digital sources that are available. Both client-side and cloud-based systems allow current researchers to collect, sort, tag, and create references from the metadata associated with their materials.

Similarly, historians may find themselves in the position of inheriting or creating data sets that then might be usefully examined through various text mining or visualization tools. Usually that data must be normalized before it is ready for use with those tools. Stanford University’s Data Wrangler is helpful in that process.
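As a rough illustration of what normalization involves, the following Python sketch converts dates recorded in several common styles into a single ISO 8601 form and tidies stray whitespace in creator names. It is not a description of how Data Wrangler works. The list of date formats, the field names, and the sample rows are assumptions made for the example, and the month-name formats assume an English-language locale.

```python
from datetime import datetime

# Date styles this sketch knows how to read (an assumed, incomplete list).
# The %B formats rely on English month names in the default locale.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%d %B %Y"]


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    cleaned = " ".join(raw.split())  # collapse runs of whitespace
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_rows(rows):
    """Normalize the hypothetical 'creator' and 'date' fields of each row."""
    normalized = []
    for row in rows:
        raw_date = row.get("date", "")
        normalized.append({
            "creator": " ".join(row.get("creator", "").split()),
            "date": normalize_date(raw_date),
            # Keep the original value so nothing is lost when parsing fails.
            "date_raw": raw_date,
        })
    return normalized


if __name__ == "__main__":
    sample_rows = [
        {"creator": "Doe,  Jane ", "date": "July 4, 1863"},
        {"creator": "Roe, Richard", "date": "7/4/1863"},
        {"creator": "Unknown", "date": "circa 1863"},
    ]
    for row in normalize_rows(sample_rows):
        print(row)
```

Keeping the original value beside the normalized one matters for historical data in particular, since an uncertain date such as "circa 1863" is itself evidence and should not be silently discarded.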

Tools

________________
[1] Robert Townsend, “How is New Media Reshaping the Work of Historians?” Perspectives on History (Nov., 2010). http://www.historians.org/perspectives/issues/2010/1011/1011pro2.cfm.
[2] Sam Wineburg, Historical Thinking and Other Unnatural Acts: Charting the Future of Teaching the Past (Philadelphia: Temple University Press, 2001).
[3] Franco Moretti, Graphs, Maps, and Trees: Abstract Models for Literary History (New York: Verso, 2005).