On Building A Database

December 14, 2017 | Mat Paskins

This is the second in a short series of posts about making scholarly practices more visible and material, and the useful outcomes which might result from that. The previous one was about flying; you can read it here. This one describes my work over the past couple of years in assembling "The Past Futures Database", a collection of about 25,000 articles about science, technology and medicine from British and American publications across the twentieth century. (At the time of writing the database is almost ready for launch; this post will be updated when it goes live.) The collection aims to offer its users a way to explore how techno-social futures were represented in popular media during the twentieth century. It does this by including a wide range of utterances about possible futures related to science and technology; the way these utterances have been collected is meant to unsettle some received notions about how we gather images of the future. It also allows users to search week by week across the whole twentieth century, to see which discussions of science, technology and the future were happening at which points. A number of important magazine collections, including the New Scientist, Popular Mechanics and Life, are available through Google Books. These, however, can only be searched by keyword and can be cumbersome to use. The goal of our database is to provide an overview which allows for explorations of various kinds.

 

The database was a digital humanities (DH) project, and producing it involved thinking hard about digital methods of scholarship and archives. In an article from a few years ago for Science, Technology and Human Values, Claire Waterton argues that there has recently been a "move toward exposure of the guts of our archives and databases, toward exposing the contingencies, the framing, the reflexivity, and the politics embedded within them." Waterton analyses these processes in terms of a "convergence between the world of social theory and those worlds concerned with building archives", exploring in detail three examples of "active experimentation with the collection, representation, and display of data about natural/social worlds, which are partly informed by the work of STS and partly by other influences, including social theory".

 

My work on the database has been very much concerned with these questions of how the technical minutiae involved in the production of digital resources can have serious epistemic, political and organisational consequences.

 

In a recent review of works on the aesthetics of the Digital Humanities, Jessica Hurley observes that "the infrastructures of the digital humanities are, like all the best infrastructures, simultaneously omnipresent and invisible". The field depends on, and operates through:

 

A vast, interlocked network of objects, capital, people, and ideologies: ASCII code; fiber-optic cables; tenure lines; server farms; research centers and literature labs; wage laborers and graduate students who scan, attach metadata and program search functions; the Defense Advanced Research Projects Agency (DARPA); the manpower, capital, and geopolitical location required to apply for a .edu domain name ($185,000, US institutions only); laptops; postdoctoral fellowships; silicon mines; Silicon Valley; the contemporary fetish for STEM in higher education.

 

Thinking about building a database involves reflecting on these things as well.

 

***

There have been quite a number of extraordinarily successful digital resources produced in the history of science, going back twenty-five years or so. Some, such as the digitised collections of Charles Darwin's correspondence and the Newton Project, are direct heirs to massive scholarly undertakings from the nineteenth and twentieth centuries. Vast in scope, and with digital aspects which mean they can be investigated by many different audiences, both scholarly and not, they are some of the great intellectual monuments of our time. The visions of Newton and Darwin which are now readily available are critically, historiographically and politically more interesting, and more accessible, than the sources to which previous generations had access. Why are you even reading this? Go and find out about Newton.

 

Other recent DH projects have focused on crowdsourcing: using platforms to identify illustrators in tens of thousands of pages of Victorian periodicals, for example. All of this is valuable on an empirical level, and raises interesting theoretical questions about the sources through which we tell historical stories for large publics, and about how those publics can contribute to our work. The use of digital crowdsourcing recalls the great outsourced philological enterprises and correspondence networks of the nineteenth century, such as the far-flung groups of people who contributed to the Oxford English Dictionary. This in turn raises questions about how such contributors are recognised, and how credit is given to them for their labours. These questions have far-reaching and under-addressed consequences for the digital humanities, for reasons I'll return to below.

 

Our database couldn't be like these projects, for a few different reasons. One: we didn't have enough money. Two: a lot of the materials we wanted to include were in copyright. Three: even the ones which weren't in copyright had often already been digitised as part of large-scale projects by private companies such as ProQuest and Gale. (What I'm going to say next is a little critical of these companies, and I want to be clear about my stance. I don't think they are 'corrupting academia'. I also don't think that they should make all the results of their extremely expensive digitisation programmes available for free. But I do think that those of us who work and study at universities should be a lot clearer about the landscape which working with these resources actually leaves us. I also don't want to turn this into a 'the U.K. is lagging behind digitally' type of argument. Many digital resources, such as the searchable records and historical archive of Hansard, are world-class and incredibly useful.)

 

When I started, I was pretty confident using Microsoft Access, thanks to a past job. I thought (naively) that I'd be able to design the database in that software and then straightforwardly put it online on the website which Sam Robinson had built for our project. This turned out to be impossible: there's no good integration between WordPress (which our site is built in) and Access. We found various workarounds, but they weren't particularly elegant; a sketch of the general kind of thing follows below.
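To give a flavour of the inelegance: one such approach (a minimal sketch of the general idea, not our actual script, and the table and column names here are hypothetical) is to export each Access table to CSV and then generate SQL which can be loaded into the MySQL database that sits behind a WordPress site.

```python
# Sketch: turn an Access CSV export into SQL INSERT statements for a
# custom table in the MySQL database behind a WordPress site.
# The table name "pf_articles" and the CSV layout are hypothetical.
import csv

def csv_to_sql(csv_path, sql_path, table="pf_articles"):
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(sql_path, "w", encoding="utf-8") as out:
        reader = csv.DictReader(src)
        cols = reader.fieldnames
        for row in reader:
            # Double up single quotes so the generated SQL stays valid.
            vals = ", ".join(
                "'" + (row[c] or "").replace("'", "''") + "'" for c in cols
            )
            out.write(f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({vals});\n")

csv_to_sql("articles_export.csv", "articles_import.sql")
```

Every change to the data then means re-exporting, re-generating and re-loading, and the WordPress side still needs custom code before any of it can be displayed. 'Not particularly elegant' is putting it kindly.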

 

Who cares, right? I can imagine two kinds of reader who will have got this far in this post. The first understands the humanities, and will see these technical problems as glitches which should be solved with bought-in technical expertise. The second knows databases, and will be laughing about how I thought Access would do what I wanted it to. Of course, there are plenty of humanists with good groundings in computer science and data architecture; they'll be a bit scornful on both levels, I imagine.

 

For good or ill, though, I was anxious that we should get things right before hiring in a database designer, and keen that we on the project should do as much as possible ourselves. Why this obstinacy? I've encountered situations in the past where the design for a site or a database has gone out to a designer, and this has locked in features which turn out to cause real problems later on. Take, for example, bulk importing: updating a database with multiple entries at a time. I know you almost certainly don't care about bulk importing, unless you have spent a significant amount of time doing data entry. But whether a database needs to be fed records one at a time, or whether it can pull in a whole huge spreadsheet at once, can make the difference between spending the next six weeks inputting and being able to go off in twenty minutes to drink sweet filter coffee at the Aberystwyth arts centre, or whatever your local equivalent might be. It's the difference between looking at a large collection of data and seeing your Monday, your Tuesday, your Thursday evaporate, and being able to finish the task and do something else. Data inputting also hurts when you do it for a long time: it involves tiny, repetitive movements, and it can set off your RSI. For the workflow of the person who is entering data, these features make huge differences; the sketch below shows how stark the contrast is.
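Here is a minimal illustration of the two modes, using SQLite as a stand-in (our database wasn't SQLite, and the column names and sample values are mine):

```python
# Sketch: row-at-a-time entry versus bulk import, with SQLite standing
# in for the real thing. Column names and values are illustrative.
import csv
import sqlite3

conn = sqlite3.connect("pastfutures.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS articles (title TEXT, magazine TEXT, pub_date TEXT)"
)

# One record at a time: what a form-based interface forces on you.
conn.execute(
    "INSERT INTO articles VALUES (?, ?, ?)",
    ("An example article title", "An example magazine", "1957-03-01"),
)

# Bulk: pull a whole spreadsheet in at once.
with open("entries.csv", newline="", encoding="utf-8") as f:
    rows = [(r["title"], r["magazine"], r["pub_date"]) for r in csv.DictReader(f)]
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", rows)
conn.commit()
```

The executemany call is the whole of the second workflow; the first has to be repeated, by hand, for every record.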

 

To deal with our Access issue, I spent a long time looking at off-the-shelf packages for producing online databases. These fall into two main categories: those aimed at academic users and those intended for business users. The former are pretty cheap, but often rigid, and can require quite a bit of expertise to use. The latter are usually eye-wateringly expensive, flexible, and extremely well supported. I tried a standard app-builder programme called Zoho Creator, and it helped me clarify the structure of the database. In the end it turned out that, because Zoho charge by the record, our database was going to be too big to be affordable on their platform. In addition, some of the ways I wanted to relate tables in our database exceeded their current functionality. Evaluating the product in this way was a good exercise: it gave me a better idea of how to relate different bits of data (the sketch below shows the kind of structure I mean). And I think that suggesting new uses or features for commercial products is a good service which academics can perform. In principle, it's a much better idea for us to work with companies which can deliver the kinds of things we want at scale, rather than constantly starting from scratch and insisting on bespoke designs for every digital project we undertake.
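That structure looks roughly like this (a sketch in SQLite; the tables and fields are illustrative, not our actual schema):

```python
# Sketch of the relations I wanted to express: every article belongs to
# a magazine, and an article can carry any number of future-themes via
# a many-to-many link table. All names here are illustrative.
import sqlite3

conn = sqlite3.connect("pastfutures.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS magazines (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS articles (
    id          INTEGER PRIMARY KEY,
    magazine_id INTEGER REFERENCES magazines(id),
    title       TEXT,
    pub_date    TEXT
);
CREATE TABLE IF NOT EXISTS themes (
    id    INTEGER PRIMARY KEY,
    label TEXT UNIQUE   -- e.g. 'space travel', 'automation'
);
CREATE TABLE IF NOT EXISTS article_themes (
    article_id INTEGER REFERENCES articles(id),
    theme_id   INTEGER REFERENCES themes(id),
    PRIMARY KEY (article_id, theme_id)
);
""")
conn.commit()
```

It was many-to-many links of the article_themes kind which, in my experience, strained both the platform's data model and its per-record pricing.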

 

Alongside Zoho, I worked a lot with Zotero, a program designed for researchers which extracts metadata (such as title, author and publication date) from webpages you visit and gathers it in a database. Zotero is amazingly powerful, in my opinion, though it struggles to import data from other sources. Partway through the biggest gathering of data, the way Zotero integrates with Firefox changed, and it stopped extracting publication dates from a lot of the sources I was using. That little change multiplied the time each entry took by about 1.5; across thousands of entries, that adds up to a lot. (If an entry took, say, two minutes before and three after, ten thousand entries means something like 170 extra hours.) But there's no question that for my purposes Zotero was a fantastic resource.
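Where the missing information survives somewhere machine-readable, a post-processing pass can sometimes patch it. Here is a sketch which fills in missing dates in a Zotero CSV export by parsing them out of article URLs; the URL pattern is hypothetical, and for most of my sources no such pattern existed, which is why the fix was manual:

```python
# Sketch: patch missing dates in a Zotero CSV export by pulling them
# out of the article URL. Assumes (hypothetically) that URLs embed a
# date like /1957/03/11/; real sources vary, hence the hand-work.
import csv
import re

DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")

with open("zotero_export.csv", newline="", encoding="utf-8") as src, \
     open("zotero_patched.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if not row.get("Date"):
            match = DATE_IN_URL.search(row.get("Url", ""))
            if match:
                row["Date"] = "-".join(match.groups())
        writer.writerow(row)
```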

The portal for the database is built in Omeka, which I didn't discover until very late in the day. Omeka cheaply does lots of the things we wanted; in particular, it handles bulk imports nicely. Its two major drawbacks are that it is not really a relational database, because it treats all entries as essentially the same kind of data, and that it doesn't deal with dates in a very sophisticated way. For the former problem I developed a workaround (the general shape of it is sketched below); the latter is, as far as I can make out, intractable, at least for now. I would recommend Omeka as an accessible format for anyone who's happy to muck around a little but lacks extensive coding experience.
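I won't reconstruct my exact workaround here, but its general shape is reproducible: because every Omeka item is a flat record, you can write a stable identifier into one of the Dublin Core fields (Relation is the obvious candidate) while preparing a CSV for bulk import, and then cross-link items through that identifier. A sketch, with illustrative column and field names:

```python
# Sketch: faking relations in Omeka's flat item model. Each article row
# carries the identifier of its parent issue in Dublin Core 'Relation',
# so items can be cross-linked even though Omeka is not relational.
# The input columns (title, pub_date, issue_id) are illustrative.
import csv

with open("articles.csv", newline="", encoding="utf-8") as src, \
     open("omeka_import.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    fields = ["Dublin Core:Title", "Dublin Core:Date", "Dublin Core:Relation"]
    writer = csv.DictWriter(dst, fieldnames=fields)
    writer.writeheader()
    for row in reader:
        writer.writerow({
            "Dublin Core:Title": row["title"],
            "Dublin Core:Date": row["pub_date"],
            # Point each article at its parent issue or magazine record.
            "Dublin Core:Relation": row["issue_id"],
        })
```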

 

For sources, we couldn't rely on publications being out of copyright. I also had little appetite for working through very long runs of magazines in print form when digital versions already existed, even if they were inaccessible to us. (I could have done this, and perhaps I should have: it would have acted as a finding aid for collections which can appear daunting in size. Knowing that I would be duplicating information which was readily available to subscribers, however, stayed my hand.) Instead I looked mainly at publications whose copyright remained with the original publisher, and which had made previews of their articles available online as a way to entice people to subscribe. These publicly available collections of metadata were often of substantial size, running to thousands of articles, and by featuring them in our database we would be drawing attention to these magazines as going concerns, encouraging scholarly and other publics to buy access if they wanted to see a whole article. This seemed like an elegant solution to what I had come to think of as the ProQuest problem.

 

Still, the labour involved in assembling these collections was considerable. For TIME magazine, which makes up the most significant collection in the database, it meant looking at every single issue between 1923 and 2007 (about five and a half thousand in total), opening each relevant article in a separate tab, and then adding it to the initial collection using Zotero. I had substantial help here from Sam Jackson, an undergraduate intern at Aberystwyth, who went through more than two decades' worth of the magazine; his help was invaluable. Other magazines involved more or less work. I wanted to include the Spectator, all of whose archives are available online for free, but the quality of the digitisation is extremely poor, and cleaning up the articles would have taken more time than I had.

 

 

I am aware that these mundane considerations are a long way from the tone in which the Digital Humanities are usually discussed. Among its advocates, DH is supposed to offer a radically changed model of scholarship: at once a pushback against the previous dominance of critical theory, and a way of opening up the academy to new users. Among its critics, it is regarded as a threat to the intellectual integrity of scholarship in the humanities: a presumptuous challenge based on a fetishisation of data, science and technology, which has failed to live up to its promises. I am moved by arguments on both sides of the divide, but think that much of both the advocacy and the complaint is misguided. Both model the role of the human scholar and the role of digital technologies in ways which abstract from the kinds of labour and dependency that this work actually involves. Sometimes this is in support of a belief that methods derived from artificial intelligence will soon be able to take over many routine clerical tasks, or of the claim that the distant reading which digital approaches enable can give access to a larger, and hence supposedly superior, range of texts. More critical scholars, meanwhile, present digital methods as foreclosing on the less quantifiable, more particularistic aspects of scholarship: the thorny paths of inquiry which are held to be our mode of resistance.

 

What is missing from both approaches is a serious engagement with the massive amount of human labour, often precarious and poorly paid, which digital processes continue to involve. Anyone who has worked a lot with Google Books will have come across a page with an inadvertently scanned image of a thumb on it; these have something like the same status as medieval monks' marginal notes recording how cold it was. In much less cutesy ways, the appalling conditions in which (for example) Facebook moderators have to work have been the subject of occasional press reports. The increasing numbers of moderators now employed by companies like Alphabet for its YouTube subsidiary are part of the digital landscape; so are the 'trolls' who work in so-called click farms. Of course, everyone working in these fields has access to a greater or lesser degree of automation: moderators can, to a degree, use AI; and the trolls are so adept at creating puzzling variants of existing kids' programmes that a nostalgic clip show produced in 2047, looking back on childhoods of the Twenty Teens, is likely to feature fond memories of watching some frightening shapes and traumatising off-brand cartoon pigs.

 

For the foreseeable future, though, a very high proportion of digital work will continue to involve human judgment, and the exposure of workers to content which can be very unpleasant indeed. Absent the enforced leveraging of huge crowdsourcing which Google Books was able to employ, digitisation projects will also continue to demand huge amounts of routine, often boring labour. Because the people doing this work are often casual employees of private contractors, they are often not regarded as making serious contributions to the scholarly enterprise. For the nineteenth-century Oxford English Dictionary, we have a good idea of at least some of the people who read through texts, made excerpts and sent their slips in to the lexicographers. There has rightly been a move to celebrate the highly skilled but low-status human calculators and computers who contributed to everything from the construction of the Nautical Almanac to the space programme*. But by and large, academics simply do not care about who scanned the documents; who added tags to the old articles on which our new articles are based; who used their judgment to correct the optical character recognition which made the text readable.

 

A lot of discussion of the digital economy has focused on how much better paid digital jobs are than those in other sectors. But this exclusive focus on the better jobs risks obscuring the myriad ways in which bad jobs are outsourced. If the fact that moderators are at serious risk of PTSD from routine exposure to images of animal cruelty** (never mind anything else) were treated as a serious issue of occupational health, wouldn't this make a difference to how those jobs are rewarded and supervised? It is easy to lament all this as one of the ill effects of capitalism, and perhaps more difficult to see how better recognition of the hazardous conditions of much digital work could be achieved.

 

It may be objected that most academic digital labour is not hazardous in the same way as content moderation. However true that may be, I think we should attend more closely to the divisions of labour which lead some laborious tasks to be handed off to external contractors, to poorly paid lower-level members of staff, and to others. Instead of accepting that academics should primarily adopt a managerial role with respect to these processes, we should be more interested in how never having to do the really boring routine stuff might feed obscurantist fantasies about how digital tools actually work. If we never see the work being done, and never participate in it ourselves, it is perhaps easier to believe that 'a robot did it'. This makes it more difficult to reflect seriously on the combination of human and non-human agency through which the digital realm is maintained.

 

I wouldn't compare the database we've created for this project with the major scholarly works described above. But I hope it might provide an example of how such collections can be created in-house, using the resources which are immediately available to us. Bumping up against those constraints is a way to start thinking more critically about the human labour which digital work demands.

Mat Paskins

 

*I am not thinking exclusively of those who were promoted to do more interesting work, like Katherine Johnson, but of the other calculators as well, who, as David Alan Grier has argued, were for a long time written out of history.

**Here's Wired's description of the experiences of one YouTube moderator: "If someone was uploading animal abuse, a lot of the time it was the person who did it. He was proud of that," Rob says. "And seeing it from the eyes of someone who was proud to do the fucked-up thing, rather than news reporting on the fucked-up thing—it just hurts you so much harder, for some reason. It just gives you a much darker view of humanity."
