A few years ago, I wrote a post about some preliminary experiments I ran using Optical Character Recognition (OCR) technology on medieval manuscripts. Fortunately, after I wrote that up, I had quite a bit of feedback from others who had used OCR with older printed books, and with languages like Latin and Greek. At one point, my post was even featured on the Hacker News aggregator. Clearly there was interest in the subject.
Several months after I published my original post, I was contacted by some folks in England who had created a not-for-profit company called Rescribe to use OCR on historic sources. They had started with early modern printed books and they hoped to branch out to medieval manuscripts. Once we talked through some of our goals, ideas, and research questions, we decided that we had a lot of potential for collaboration. I was fortunate to meet Antonia Karaisl and Nick White, who both have knowledge of medieval literature, manuscripts and early printed books, as well as digital technology like machine-learning OCR engines.
Over the past few years, the three of us have experimented with using open-source neural network OCR software on medieval manuscripts. I’m happy to say that we’ve seen some surprising results and that we have an article about them forthcoming: “Modelling Medieval Hands: Practical OCR for Caroline Minuscule,” in Digital Humanities Quarterly. Below is the abstract and introduction to the article (with a few notes removed). Watch for the full article in the next several months.
Over the past few decades, the ever-expanding media of the digital world, including digital humanities endeavors, have become more reliant on the results of Optical Character Recognition (OCR) software. Yet, unfortunately, medievalists have not had as much success with using OCR software on handwritten manuscripts as scholars using printed books as their sources. While some projects to ameliorate this situation have emerged in recent years, using software to create machine-readable results from medieval manuscripts is still in its infancy. This article presents the results of a series of successful experiments with open-source neural network OCR software on medieval manuscripts. Results over the course of these experiments yielded character and word accuracy rates over 90%, reaching 94% and 97% accuracy in some instances. Such results are not only viable for creating machine-readable texts but also pose new avenues for bringing together manuscript studies and digital humanities in ways previously unrealized. A closer examination of the experiments indicates regular patterns among the OCR results that could potentially allow for use cases beyond pure text recognition, such as for paleographic classifications of script types.
In an age replete with digital media, much of the content we access is the result of Optical Character Recognition (OCR), the rendering of handwritten, typed, or printed text into machine-readable form. On a more specific scale, OCR has increasingly become part of scholarly inquiry in the humanities. For example, it is fundamental to Google Books, the Internet Archive, and HathiTrust; to corpus creation for large-scale text analysis; and to various aspects of digital humanities. As a number of recent studies and projects have demonstrated, the results of OCR offer a wide range of possibilities for accessing and analyzing texts in new ways.
OCR has shifted the range and scope of humanistic study in ways that were not possible before the advent of computers. As stated by a team of scholars led by Mark Algee-Hewitt at the Stanford Literary Lab, “Of the novelties produced by digitization in the study of literature, the size of the archive is probably the most dramatic: …now we can analyze thousands of [texts], tens of thousands, tomorrow hundreds of thousands” [Algee-Hewitt et al. 2016, 1] (cf. [Moretti 2017]). Beyond literary study specifically, new possibilities due to the mass digitization of archives have emerged across the humanities. Historians of books, texts, and visual arts (to name just a few areas) now have ready access to many more materials from archives than previous generations. Among the new pursuits of humanities scholars, computer-aided studies often no longer focus only on a handful of texts but encompass large-scale corpora—that is, collections of hundreds or thousands of texts. Much of this is made possible by OCR.
Yet most studies and applications of OCR to date concern printed books (see, e.g., [Rydberg-Cox 2009]; [Strange et al. 2014]; and [Alpert-Abrams 2016]). The majority of text-mining projects in the humanities focus on eighteenth- and nineteenth-century printed texts. One way to expand the potential for humanistic studies even further is to apply OCR tools to extract data from medieval manuscripts, but this area of research has received much less attention. Indeed, the current situation with using OCR on medieval manuscripts is not much different from 1978, when John J. Nitti claimed that “no OCR (Optical Character Recognition) device capable of reading and decoding thirteenth-century Gothic script was forthcoming” [Nitti 1978, 46]. Now, as then, there has been little progress on using OCR to decipher Gothic or any other medieval script, regardless of type, date, or origin.
This article presents the results of a series of experiments with open-source neural network OCR software on a total of 88 medieval manuscripts ranging from the ninth through thirteenth centuries. These experiments focused mainly on manuscripts written in Caroline minuscule, as well as a handful of test cases toward the end of our date range written in what may be called “Late Caroline” and “Early Gothic” scripts (termed “transitional” when taken together). In the following, we discuss the possibilities and challenges of using OCR on medieval manuscripts, neural network technology and its use in OCR software, the process and results of our experiments, and how these results offer a baseline for future research. Our results show potential for contributing not only to text recognition as such but also to other areas of bibliography like paleographical analysis. In all of this, we want to emphasize the use of open-source software and sharing of data for decentralized, large-scale OCR with manuscripts in order to open up new collaborative avenues for innovation in the digital humanities and medieval studies.