Presenting the Records of Historical Decision-Making, 27 February 2019
The most recent Negotiated Texts Network workshop took place in Pembroke College on Wednesday 27 February. Thanks to additional funding from a Knowledge Exchange grant, the network was able to benefit from the presence of a number of US participants.
1. Speakers and projects
Neal Millikan, Adams Papers Editorial Project
Michael Pidd, The Digital Humanities Institute
Alexander von Lünen, Hansard at Huddersfield
Ida Nijenhuis, REPUBLIC
Kathleen Richman, LLMC
Fraser Dallachy, Historical Thesaurus of English
Julie Silverbrook, ConSource
Stan Swim, Bill of Rights Institute
Nicholas Cole, Quill
A number of common themes emerged throughout the day:
(a) The evolution of projects to focus on end-user needs
Early digitization projects were focussed on getting material online in a ‘raw’ form without particular consideration for how it might be used. The shift away from this approach was clearly evident throughout the workshop in the emphasis on and consultation with end users. To some extent this shift is driven by funders who want to see ‘impact’. However, the provision of supposedly unmediated raw material was always a fallacy as editorial judgement was inevitably exercised, and, as the volume of material online continues to grow, it has become increasingly difficult for users to sift through it to find what they need.
A number of provisos were made to this focus on the end user. Firstly, there needs to be a more qualitative measure of impact than hits on a site, since visiting a particular webpage does not necessarily equate to a successful interaction with the primary materials or data available, as Alexander von Lünen illustrated. Here, the requirements of funding bodies to show widespread public engagement can be at odds with the need to provide specialist tools for researchers: promoting public understanding and driving better scholarly insights may be competing goals. The SAMUELS project successfully employed cutting-edge technology to tag Hansard reports with historical linguistic information, which helped researchers to trace trends across time, but some skill in computational linguistics was required to manipulate the data, limiting its use by a non-expert audience. The follow-on project, Hansard at Huddersfield, has therefore invested considerable energy in consulting potential users and observing how they currently acquire information before designing a new interface and search tools. They hope one outcome will be users spending longer on the site as they successfully interact with the material.
Secondly, there is a danger of skewing research and funding priorities towards current user trends and the biggest user groups, normally teachers and school children, and losing sight in the process of the broader responsibility to preserve materials and further new research within the academic community. This was particularly brought home by Kathleen Richman’s report that so many libraries are disposing of physical collections and relying on online archives. LLMC therefore sees its responsibility as being just as much to preserve original documents (in salt mines in Kansas) as to digitize. Quill has encountered similar problems in trying to access newspaper archives. Hard copies have been destroyed in some cases even before high-quality, complete digital versions have been created.
Thirdly, Michael Pidd noted from his long experience at the Digital Humanities Institute that a well-designed project can be readily adapted for different user needs by relatively simple updates to the interface, meaning that in some respects users do not need to be the fundamental consideration.
The compromise seems to be a greater emphasis on digital curation. In an ideal world, this would involve more collaboration between projects and the application of common standards, as the John Quincy Adams Papers Editorial Project are piloting in the Primary Source Cooperative, where they are providing a platform for four smaller digital editions with overlapping interests. This kind of collaboration opens the door to better search tools and more visibility for smaller collections. How this could be conducted on a larger scale is hard to imagine. However, good curation will at least provide tools to facilitate access to materials and will signpost users to other related collections, as Quill has successfully modelled in its resource collections. By working together to offer access to and commentary on various collections of material, the end user is better served and a larger userbase is created for all the projects involved. Michael Pidd made the point very convincingly that users often first connect with a project and begin to interact with the primary materials and data it offers after searching for background information and resources. While this kind of material may not be the primary purpose of the project concerned, adequate time to create it should be built in as a way of building a userbase.
The afternoon session was especially focussed on teachers and school children as end users. The point was made that for historical materials they are often the largest single audience, as the academic community accessing the materials is so much smaller. ConSource and the Bill of Rights Institute have considerable experience in producing materials for the classroom and highlighted some of the challenges. Firstly, teachers are seldom subject specialists and start with very limited knowledge themselves. It is therefore essential to educate the teachers in the correct use of the source materials. Secondly, many schools have poor connectivity and it is unlikely, for example, that every student will have access to a computer. As a result, however good the technology of a particular project, there is demand for materials to continue to be offered in PDF format to allow teachers to download and print.
(b) The challenges and opportunities of technology
Michael Pidd’s presentation highlighted some of the opportunities and challenges presented by new technologies. He began by noting that the advent of online editions had very little impact initially on the processes of transcription and marking up developed in the 1990s for CD-ROM versions. However, increasingly, the task of digital projects is not so much digitization and transcription, but manipulation of multiple pre-existing datasets, as illustrated by the Digital Panopticon project. This involves not just the technical challenges of training an algorithm to understand decision-making processes, but also overcoming the challenges of inconsistent and incomplete data, an issue which also confronted the SAMUELS team, as well as gaining access to material held behind third-party paywalls, something also encountered by Quill. Michael went on to demonstrate how applying the AI technology which his team developed for the Digital Panopticon project is now transforming their approach to more traditional digital edition projects. NLP can be used to encode transcriptions and produce rough results to be cross-checked by interns and researchers. This kind of supervised training shortens workflows and ultimately produces better results for end users by allowing them to explore more data in a wider context.
For projects dealing with large corpora of material which have not been digitized, transcription and mark-up continue to pose challenges, with claims made for digital transcription software often being exaggerated, particularly in relation to handwritten documents. Ida Nijenhuis described the four-year project she is embarking on to produce an online publication of the resolutions of the Dutch States General 1567-1796: some 500,000 pages, many handwritten. She spoke very frankly about the challenges of transcription using HTR and OCR. The project pilots have a success rate of only up to 60% using Transkribus for handwritten texts. They hope to use crowd-editing to make the corrections and are developing interfaces to facilitate this.
Fraser Dallachy gave a very frank description of some of the other technological issues of working with large quantities of data. These included interoperability issues, but also straightforward capacity problems and an inability to process data in an acceptable timeframe. In the case of the SAMUELS project, they were able to overcome some of these issues by setting more realistic goals and collaborating with a US institution with better processing power. However, his presentation clearly illustrated the continuing limitations of the technology in fully manipulating the datasets available.
It was broadly felt that both transcription software and algorithms used to process large volumes of data do greatly speed up and enhance workflow. However, considerable human intervention and correction will continue to be required, whether through crowd-editing, interns, or expert project staff. And although the promise of AI to recognize both print and handwriting more accurately has been relied upon in the way that many current projects have been framed, the ability of current AI to actually deliver is unproven. Projects represented at the meeting were interested to discover that high error rates were being encountered by many projects (above the error rate advertised by various vendors), suggesting that problems they had been attributing to peculiarities in their own datasets were in fact problems with the state of the underlying algorithms. Technology remains a tool to facilitate research rather than replacing the decisions made by experts. It will also continue to be a moving target and so updating is constantly required; this raises questions of sustainability and how projects can continue to be managed after initial funding has elapsed.
(c) The constraints of the current funding environment
Substantial differences emerged between the funding scenes in Europe, where applications are usually to public bodies and are peer-reviewed, and the US, where more funding is from privately managed foundations. In the US, funders are moving away from funding the human elements in digital projects in favour of AI solutions that are sometimes fanciful. In Europe, the primary focus is on the research question, with secondary consideration given to factors such as digital preservation and public impact.
On both continents, but particularly in the US, there is some frustration over the cost and time of some of the large digitization projects. However, as discussions throughout the day illustrated, there are no easy solutions. For material to be fully exploited at later stages in the pipeline, due attention needs to be given to the foundational stages of preservation, digitization and transcription. New technologies are facilitating this, but human investment remains critical.
All participants noted that there is a danger of claiming too much for the technology in funding applications and thereby setting unrealistic expectations. The US participants expressed a desire to see funders attending more academic conferences and roundtable discussions where they could be better exposed to the challenges and limitations faced by digital projects. However, at other stages in the pipeline, the reverse problem can also be observed, with Humanities scholars not sure what is possible or what to ask for in order to exploit and visualize data. Continued collaboration and communication are therefore vital to ensure ambitious, but achievable, projects are commissioned.
Finally, it was noted that there will always be a tendency for the current focus of funders to skew criteria for success. There is a responsibility within the whole academic and funding community to ensure sufficient attention is paid to the whole pipeline.