Thinking with Linked Data; Representing History

In 2006, Tim Berners-Lee articulated a vision for a web made of a vast mesh of truly linked data, connecting information across domains using a simple set of principles. Those principles included:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL).
  4. Include links to other URIs, so that they can discover more things.

Though creators of content on the web have been slow to warm to implementing a linked data universe, it holds enormous promise for bringing together scholarly work that was once siloed and disparate. The creation of linked data lets us be explicit about the relationships among the resources that we are representing on the web, growing a set of connections and elaborating a knowledge base, link by link.

Of course, Berners-Lee was not thinking about historical data when he set out these principles. They are designed to be general enough that the specific type of “thing” being linked does not matter. But for historians, the “thing” does matter, quite a lot. Our ability to faithfully describe people, places, and events, and how they relate to one another, on the basis of historical evidence is paramount. Though there are issues with using linked data to model the past, the system does have a number of things to recommend it.

One of the benefits of using linked data is that the system allows historians to create a stable space on the web to represent each person within a historic frame. Every single person can have a URI to accumulate knowledge about that person. In the case of my current work on the group of individuals enslaved by members of the Maryland Province of Jesuits in the eighteenth and nineteenth centuries, I have been particularly interested in providing a fixed place on the web to accumulate everything we can know about each one of those individuals. This is a large community, now over 1,100 people, where it will eventually make sense to look at the population at scale using data visualizations. At the same time, it is essential that each person is also represented individually, and in as much detail as possible. This helps to mitigate the dehumanization of the quantitative data that can build up around an institution that has the expanse and reach that slavery in the US did.

Once I have established those URIs, other scholars have the opportunity to integrate individuals where appropriate into their own work. Currently, there are two linked data projects that stand out as logical candidates for connection. First, individuals within my community show up as key actors in the Early Washington, D.C., Law & Family project because they were party to freedom lawsuits against the Jesuits. Representations of these individuals should be tied through a “Same As” link across both projects. Second, the Enslaved project at Matrix at MSU offers the possibility of finding other representations of the individuals from my community in any of the many data sets to be aggregated there. Given the vast numbers of scholars and organizations working to uncover information about individual enslaved people and their lives, these opportunities for integration and aggregation through linked data will undoubtedly only grow in the coming years.
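
A “Same As” link of this kind is commonly expressed with the OWL property owl:sameAs. As a rough sketch in Python, and assuming invented placeholder URIs under example.org (none of them is a real identifier from any of the projects named above), records that have been asserted to describe the same person could be gathered together like this:

```python
# Sketch only: every person URI below is a hypothetical placeholder.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

triples = [
    ("https://example.org/jesuit-community/person/42", OWL_SAME_AS,
     "https://example.org/law-and-family/person/abc"),
    ("https://example.org/law-and-family/person/abc", OWL_SAME_AS,
     "https://example.org/aggregator/person/7"),
    ("https://example.org/jesuit-community/person/43", OWL_SAME_AS,
     "https://example.org/aggregator/person/8"),
]


def same_as_groups(statements):
    """Group URIs connected, directly or indirectly, by owl:sameAs statements."""
    parent = {}

    def find(uri):
        # Follow parent pointers to the representative URI of the group.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # shorten the chain as we go
            uri = parent[uri]
        return uri

    for subject, predicate, obj in statements:
        if predicate == OWL_SAME_AS:
            parent[find(subject)] = find(obj)  # merge the two groups

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return sorted(groups.values(), key=lambda group: sorted(group))


groups = same_as_groups(triples)
for group in groups:
    print(sorted(group))
```

The grouping treats “Same As” as transitive, so a record in one project can reach a record in a third project through an intermediary. Whether two records really do describe the same person remains a scholarly judgment that the code cannot make.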

There are, of course, issues of representation and practice that arise when we turn to linked data as a way to model the past (rather than existing archival or cultural heritage collections). This is particularly true when we are trying to represent people and experiences that were by definition embedded in systems of oppression and inequality. These issues are more complicated where surviving records are spotty and were created by the oppressors. In these conditions, historians working to derive data sets from historical documents must strive to avoid re-inscribing systems of oppression through the creation of a data model and the individual elements of the data set.

Having created individual records for each person, I am now working to create associations of individuals to one another, as members of kinship networks, and as participants in events, each tied to the documentary record. This expanding web of connections begins to capture some of the important experiences for these individuals and communities, but it can never truly capture the full conditions of enslavement or the innumerable checks on everyday freedoms and independent decision making experienced by the enslaved. Furthermore, while I might be able to represent documented kinship networks, I have to be cognizant of and make clear the ways that enslavement constrained people’s ability to shape and sustain their own relationships. With little ability to leave the community and little control over free association, what can we say about the networks that enslaved people formed? How do we recognize their agency in day-to-day events while not minimizing the constraints of the situation? Data derived from a documentary record created by the enslavers will only show the events in which individuals participated, not those which were foreclosed to them. Strive as we might for some kind of balance, the data set will nonetheless always be a mere surface representation of these very human lives and struggles. And in this respect, the data set can never stand on its own without the support of additional interpretative framing.

While scholars who work with data sets frequently talk about the time and effort involved in cleaning a received data set, those discussions often fail to touch on the degree to which historians are pressed to create data sets from scratch. And even for those who begin with an existing data set, as Katie Rawson and Trevor Muñoz suggest in their post, “Against Cleaning,” there is no underlying order to be uncovered when working with data. Rather, each effort to shape the data for use results in the creation of new data. A very small amount of this work is transforming data formats (e.g., normalizing dates or concatenating fields). Sometimes it involves creating and implementing controlled vocabularies, and augmenting derived data with other descriptors. More often than not, it involves deciding what elements to include and how to represent them semantically. Each cell in a rectangular data set represents a choice that results in a new representation of the past.
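
Even the small, mechanical part of this work embeds choices. The following Python sketch normalizes a date and concatenates name fields. The field names, the sample row, and the single date format it accepts are all invented for illustration and are not drawn from the project's data.

```python
import datetime
import re

MONTH_NAMES = ["january", "february", "march", "april", "may", "june",
               "july", "august", "september", "october", "november", "december"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}


def normalize_date(text):
    """Return YYYY-MM-DD for strings like '12 March 1838', otherwise None."""
    match = re.fullmatch(r"\s*(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\s*", text)
    if match is None:
        return None  # e.g. 'about 1838': the source does not fit the format
    day, month_name, year = match.groups()
    month = MONTHS.get(month_name.lower())
    if month is None:
        return None
    try:
        return datetime.date(int(year), month, int(day)).isoformat()
    except ValueError:  # impossible calendar dates such as 31 February
        return None


def full_name(given, family):
    """Join the name parts that are present, skipping empty ones."""
    return " ".join(part.strip() for part in (given, family) if part and part.strip())


# Hypothetical row, invented for this example.
row = {"given_name": "Ann", "family_name": "", "date": "12 March 1838"}
print(full_name(row["given_name"], row["family_name"]), normalize_date(row["date"]))
```

What to do with a date like “about 1838”, or with a person recorded without a surname, is not a formatting question. Returning nothing, guessing, or recording a range each produces a different data set.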

My current work is based on information derived from reading boxes and boxes of archival sources. Those sources (letters, wills, organization proceedings, account books, inventories) do not automatically present information that conforms to a Resource Description Framework (RDF) standard. A significant amount of modeling and transformation needs to take place between the first read of the individual primary sources and the publication of linked data representations.

Thus, the choices made in the creation of new data sets are about making the unruly historical information fit a rigid and limiting data model. Hopefully the model has been created to reflect the information at hand, but the creation of the model and the fitting of the data to the model is a necessarily reductive, if reciprocal, process. The standards for expressing linked data are constraining. RDF specifies that relationships among unique pieces of data be expressed in a sentence form: subject-predicate-object. That sentence form produces a level of simplicity and fixity that does not align with the messiness and uncertainty of historical knowledge.
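
To make the sentence form concrete, here is a minimal Python sketch of statements as subject, predicate, and object, written out as lines of N-Triples (one of the standard RDF serializations). The person URI and the name are hypothetical placeholders. The rdf:type and FOAF URIs are real vocabulary terms, used here only as an example and not as a claim about the project's actual schema.

```python
from typing import NamedTuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"


class Triple(NamedTuple):
    subject: str                  # URI of the thing being described
    predicate: str                # URI naming the kind of relationship
    obj: str                      # another URI, or a literal value
    obj_is_literal: bool = False  # True when obj is plain text, not a URI


def to_ntriples(triple):
    """Serialize one statement as a line of N-Triples (simple strings only)."""
    if triple.obj_is_literal:
        escaped = triple.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{triple.obj}>"
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."


# Hypothetical person URI and name, invented for this example.
person = "https://example.org/person/42"
statements = [
    Triple(person, RDF_TYPE, FOAF_PERSON),
    Triple(person, FOAF_NAME, "Ann", obj_is_literal=True),
]
for statement in statements:
    print(to_ntriples(statement))
```

Each statement has exactly three slots. Anything the historian knows that does not fit into one of them, such as doubt, a conflicting source, or a date range, has to be modeled some other way or left out.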

Furthermore, it is important to remember that in linked data the model itself is descriptive because the mechanics of the linkages are semantic, representing not just a relationship, but a particular kind of relationship. The choice of predicates, the properties that describe the connection between a URI and some other piece of data (another URI, a number, a date, an element of a controlled vocabulary, a string of text), is the heart of the data model, and a particularly fraught element of the work. Most existing controlled vocabularies and linked data schemas are an imperfect fit for the people, places, and events that historians would use them to describe.

Given these limitations, historians need to work hard to prevent the data model itself from becoming a site of distortion and misrepresentation that wrongly projects a false degree of stability and permanence. For example, in the community I am studying, information about partner relations between adults comes in many forms. There are few clearly documented sacramental marriages, but many couples are listed together as parents of children, and others are discussed in terms of family units in ledgers and correspondence. In my data model, I have decided to use the Relationship Vocabulary property “Spouse Of” as the predicate connecting these individuals. That choice signals the likely relationship in question, but it offers us no way to note the precarity and uncertainty of those relationships under slavery. Does the RDF structure lend an impression of stability and fixity to that relationship that likely does not reflect the historical reality? It may. And it is my job as a scholar to make those possible distortions adequately clear throughout the many facets of the project.
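
The flattening described here can be shown in a few lines of Python. This is only an illustration under stated assumptions: the person URIs are placeholders, the "basis" field is a hypothetical note-taking convention rather than part of the project's model, and the predicate URI is assumed to be the Relationship Vocabulary's identifier for “Spouse Of”.

```python
# Assumed URI for the Relationship Vocabulary's "Spouse Of" property.
REL_SPOUSE_OF = "http://purl.org/vocab/relationship/spouseOf"

# Hypothetical research notes: three couples, three different kinds of evidence.
evidence = [
    {"a": "https://example.org/person/1", "b": "https://example.org/person/2",
     "basis": "sacramental marriage record"},
    {"a": "https://example.org/person/3", "b": "https://example.org/person/4",
     "basis": "listed together as parents of a child"},
    {"a": "https://example.org/person/5", "b": "https://example.org/person/6",
     "basis": "described as a family unit in a ledger"},
]


def to_triple(record):
    """Reduce a note to subject, predicate, object; 'basis' has no slot and is dropped."""
    return (record["a"], REL_SPOUSE_OF, record["b"])


triples = [to_triple(record) for record in evidence]
bases = {record["basis"] for record in evidence}
predicates = {predicate for _, predicate, _ in triples}
print(len(bases), "kinds of evidence ->", len(predicates), "predicate")
```

Three distinct evidentiary situations come out as three identical-looking statements. Nothing in the triples themselves distinguishes a documented marriage from an inference, which is why the surrounding interpretive framing matters.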

Thus, as Richard Jean So reminds us in his reflective piece on Franco Moretti’s work, models are representations and are good to think with, but are always, to some degree, wrong. While never a seamless reflection of reality, these models should encourage us to traffic back and forth between the model and the sources that it represents. In this way, we can productively look for errors and try to improve our work, acknowledging that all we will ever have is a flawed representation of the past. In the case of my project, the model is a linked data representation of an enslaved community over time, but it will be accompanied by a data visualization model and a narrative model that I hope will capture and surface the complexities and uncertainties in my understanding of this history. All are sites of interpretive judgement and subject to the review and critique to which all scholarship should be subject. And, eventually, some historian with more information will come to these models and help me understand more fully how they are wrong.