The team at the UC Davis Digital Scholarship Lab has completed its first successful experiment using the Arch-V image recognition software package to track the re-use of individual sorts of type across multiple pages in multiple works.* Working in collaboration with Jessie Owens, a faculty member in the Music department here at UC Davis, we identified a sort with a ligature that was physically damaged such that the resulting printed letters were uniquely compressed:
We were then able to use Arch-V to search all page images in the collection and find where that sort was used. The image below shows an example match where the single sort image (the small blob at the top of the white margin on the left of the image) is being matched to an occurrence on one of the pages in the collection:
In the above image, the red circles over the letter mark the individual feature points being matched, and the blue marks indicate the actual match. Arch-V does not compare whole shapes. Rather, it identifies individual sub-shapes in an image, called ‘features’, and then performs its matching by looking for recurrences of the same or similar features. Below is a larger depiction of found features in a woodblock impression from the English Broadside Ballad Archive:
Discovering matches of typeface requires a very large feature vocabulary. Each page image in the test collection contains an average of 30,000 features:
As can be seen in the above image, the amount of feature information is both visually and cognitively overwhelming to humans, even when considering only a single page. Scaled to an entire collection, the numbers become very large. There are, for example, about 255 million data points in the English Broadside Ballad Archive collection. And we predict that if we had digital versions of everything in the ESTC, it would result in a data collection of well over 300 trillion data points. While a human could not cope with this, a computer can make sense of data at this scale. Our small test of 260 images took 43 minutes to generate all feature points in the library when run on a single processor on a MacBook Air, and the same system was able to search all of them for matches in less than 1 minute. We expect that, run on a multi-threaded server, the whole process would take less than a minute. (We’ll have benchmarks for this soon.)
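The feature-based matching described above can be illustrated with a minimal sketch. This is not Arch-V's code, and the post does not describe its internals. The sketch assumes that each feature has already been reduced to a numeric descriptor vector, and it uses a generic nearest-neighbour ratio test, a common technique in local-feature matching. The function names, the `ratio` threshold, and the toy descriptors are all hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_features(query, page, ratio=0.75):
    """Return (query_index, page_index) pairs that pass a ratio test.

    A query feature is matched to its nearest page feature only when that
    neighbour is clearly closer than the second-nearest one. This discards
    ambiguous matches. (Assumed technique, not taken from Arch-V.)
    """
    matches = []
    for qi, q in enumerate(query):
        # Rank every page feature by its distance to this query feature.
        ranked = sorted((distance(q, p), pi) for pi, p in enumerate(page))
        if len(ranked) < 2:
            continue  # the ratio test needs at least two candidates
        (best, best_pi), (second, _) = ranked[0], ranked[1]
        if best < ratio * second:
            matches.append((qi, best_pi))
    return matches


def match_score(query, page, ratio=0.75):
    """Fraction of query features that found a confident match on the page."""
    if not query:
        return 0.0
    return len(match_features(query, page, ratio)) / len(query)


# Hypothetical toy descriptors; real descriptors have many more dimensions.
sort_features = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
page_features = [(0.1, 0.0), (1.0, 1.1), (9.0, 9.0), (3.0, 0.0)]

print(match_features(sort_features, page_features))
print(match_score(sort_features, page_features))
```

This brute-force comparison checks every query feature against every page feature, which is fine for a toy example. With roughly 30,000 features per page, a real system would need some form of indexing to search a collection quickly.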
My specific purpose in pursuing this technology is to firm up the print history for the items in the English Broadside Ballad Archive. Our knowledge of who printed what and when in the early modern period is squishy, at best. The vast majority of items printed in the period contain no specific information regarding date of publication. Printing dates for most items are established by identifying references within the document to known historical events or by identifying the printer/publisher responsible for that item and assigning a value equivalent to the known years in which that printer was active.
Most items do contain licensing imprints that identify the person licensed to print and distribute the broadside, such as “Printed by M.P.” But such imprints can be difficult to disambiguate. Is “M.P.” the famous printer Martin Parker, or some other lesser-known individual? Nor do they distinguish between actual printers and those licensed to print, who frequently contracted the actual printing out to a multitude of printers.
The net result of the above is that most established dates for the printing of early modern materials appear as ranges, frequently spanning 30-50 years. Putting that in a modern context, this would be as if the majority of items printed between the beginning of Ronald Reagan’s presidency and today were assigned the same date. It’s easy to see how this lack of specificity would dramatically impinge on a scholar’s ability to understand our contemporary history and culture. Much of what we currently know about the early modern period rests on this same lack of temporal specificity.
Tracking physical sorts of type will help us bring specificity to the current dating morass in a couple of ways. First, it will help us cut through the ambiguity of the printer/publisher problem. Physical sorts of type are associated with physical print shops and not with publishing contractors. By tracking sorts of type that were used in items where we definitively know the actual printer, we can identify other items printed by that same printer, regardless of who held the publishing license. Patterns of association should ultimately emerge from this data. For example, we may find that Publisher A used Printer A from 1635-1638, Printer B from 1638-1639, and Printer C from 1640-1647. With this knowledge in hand, we can take items currently dated 1635-1647 and reassign each to the narrower range that corresponds to its printer.
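The date-narrowing step in the example above can be sketched as a simple range intersection. This is an illustration only, assuming we already have a table of which printer each publisher used in which years. The `Engagement` record, the `narrow_date_range` function, and the data are hypothetical and simply reuse the Publisher A example.

```python
from typing import NamedTuple


class Engagement(NamedTuple):
    """Years during which a publisher is known to have used a given printer."""
    publisher: str
    printer: str
    start: int
    end: int


# Hypothetical data mirroring the Publisher A example.
ENGAGEMENTS = [
    Engagement("Publisher A", "Printer A", 1635, 1638),
    Engagement("Publisher A", "Printer B", 1638, 1639),
    Engagement("Publisher A", "Printer C", 1640, 1647),
]


def narrow_date_range(publisher, printer, current, engagements):
    """Intersect an item's current (start, end) range with the years in which
    its identified printer worked for its publisher.

    Returns the current range unchanged when no engagement overlaps it.
    """
    cur_start, cur_end = current
    overlaps = []
    for e in engagements:
        if e.publisher != publisher or e.printer != printer:
            continue
        start, end = max(cur_start, e.start), min(cur_end, e.end)
        if start <= end:
            overlaps.append((start, end))
    if not overlaps:
        return current
    # Span all overlapping windows in case a printer was used more than once.
    return (min(s for s, _ in overlaps), max(e for _, e in overlaps))


# An item dated 1635-1647 whose type matches Printer B's shop.
print(narrow_date_range("Publisher A", "Printer B", (1635, 1647), ENGAGEMENTS))
```

In practice the engagement table would itself be inferred from sort matches and would carry some uncertainty, so a real analysis would likely be statistical rather than a hard intersection.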
We also plan to combine sort tracking results with similar tracking measures applied to other items on the page, such as ornament and woodblock impressions. This combined, three-factor network will provide a very rich map of the early modern printing landscape.
We have already completed mapping the reuse of woodblock images across EBBA (the results of this work are currently live on the EBBA website), and we are currently in the process of doing the same for ornaments. Now that we have successfully utilized Arch-V to match sorts of type, we will begin to build this network as well. Once all three initiatives are complete, we will begin the task of statistically analyzing the results in combination with each other and with our existing cataloging metadata to create a significantly more accurate and discrete dating history of the popular press during the early modern period.