Textual Criticism as Language Modeling: Viral Texts, Networked Authors, and Computational Models of Information Propagation
Abstract: The era of mass digitization seems to provide a mountain of source material for scholarship, but its foundations are constantly shifting. Selective archiving and digitization obscure data provenance; metadata fails to capture the texts of mutable genres and uncertain authorship embedded within the archive; and automatic optical character recognition (OCR) transcripts contain word error rates above 30%, even for eighteenth-century English. The condition of the mass-digitized text is thus closer to the manuscript sources of an edition than to a scholarly publication. On the computational side, models that treat collections as sets of independent documents fail to capture the processes by which new texts are generated from existing ones.
In this talk, I will discuss several aspects of our work on computational methods for "speculative bibliography." Starting from a simple model of the composition of historical newspaper pages, with applications to text denoising, I describe models of how texts transform their sources, applied to modern science journalism, medieval Arabic historians, and the generically hybrid forms of nineteenth-century newspapers. I conclude by discussing methods for inferring network structure and mapping information propagation among texts and publications.
This is joint work with Ryan Cordell, Rui Dong, Ansel MacLaughlin, Abby Mullen, Ryan Muther, and Shaobin Xu.