THOTH: Transcribing Historical Objects with Tabulated Handwriting

THOTH is a collaboration between Oliver Dunn, Alexis Litvine and Yiannos Stathopoulos (Computer Science).

Unlocking the potential of big data for the humanities and social sciences

One persistent obstacle to applying quantitative techniques to un-digitised sources is the time (and cost) required to (1) film documents, (2) transcribe them, (3) extract meaningful data, and (4) structure the data into usable datasets. The task is more arduous for sources with complex layouts, including tabulated data, such as censuses, administrative records and civil registers. THOTH automates tasks 2-4, making the extraction of meaningful data on a very large scale faster and more affordable, and so unlocking the potential of advanced statistical analysis of large sets of historical sources.

Handwritten Text Recognition (HTR) uses machine learning to recognise idiosyncratic handwriting styles and transcribe them into usable digital text. The darker pixels of the ink are picked out from the lighter page background, giving the model the stroke patterns of the script. The linguistic meaning of these patterns has to be "taught" by a palaeographer familiar with the source, who transcribes a suitably large sub-sample from which the model learns the specific script. The trained model then recognises similar script in other images and transcribes it automatically.
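To make the "darker pixels against a lighter background" step concrete, here is a minimal sketch of image binarisation using Otsu's method, a standard thresholding technique. It is an illustration only: the page does not say which binarisation method THOTH or Transkribus uses, and the function names and the tiny sample image are assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels.

    Otsu's method: pick the level that maximises the between-class variance
    of the two groups it creates. Illustrative only, not THOTH's actual code.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += hist[level]
        sum_dark += level * hist[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(image):
    """Map a greyscale image (rows of 0-255 ints) to 1 = ink, 0 = background."""
    flat = [p for row in image for p in row]
    threshold = otsu_threshold(flat)
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Hypothetical 3x3 scan: a dark diagonal stroke on a light page.
sample = [
    [250, 245, 30],
    [240, 20, 235],
    [25, 248, 242],
]
print(binarize(sample))  # [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

The resulting ink mask is the kind of input from which stroke patterns can be learned; real scans usually also need handling for stains, bleed-through and uneven lighting, which a single global threshold does not address.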

Turning historical documents into datasets

THOTH offers an integrated and automated solution for collecting data from historical sources. We adapt proven HTR and Keyword Spotting (KWS) technology developed by READ-COOP (Transkribus API), together with optical character recognition (OCR), to the needs of social scientists.

Current projects and previous realisations:

(2019 - ongoing) We are transcribing military draft lists for France for the period 1812-1914: over 26M lines representing all cohorts of males aged 20 in France over a century. We automated table segmentation to greatly improve the accuracy of the HTR transcriptions and to create a directly usable dataset. Funded by the ANR (French National Research Agency).
(2020) Dataset created from tabulated UK Statistical Abstracts 1870-1937 covering transport sector development. Data created for the UK National Infrastructure Commission.
(2020 - ongoing) We are transcribing a large number of French census enumerators' lists for 1831-6, 1861 and 1891. For now the project covers three full départements, but will expand in 2021.
(2021) Dataset on communications and the telegraph network. Data created for the UK National Infrastructure Commission.
(2021 part I completed - part II ongoing) Extraction of eighteenth-century customs records (CUST-3) for Prof Noam Yuchtman (LSE) and Prof Lukas Leucht (UC Berkeley).
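The table segmentation mentioned for the draft lists above is not described in detail on this page. As a rough illustration of the general idea only, the sketch below splits a binarised page (1 = ink, 0 = background) into cells using projection profiles: ink is counted per row and per column, and empty gaps are treated as boundaries. The function names, the `min_ink` parameter and the sample page are all assumptions; real tabulated sources with ruled lines, skew and touching entries need considerably more robust methods.

```python
def find_bands(profile, min_ink=1):
    """Return (start, end) pairs, end exclusive, for runs where profile >= min_ink."""
    bands, start = [], None
    for i, count in enumerate(profile):
        if count >= min_ink and start is None:
            start = i  # a run of inked rows/columns begins
        elif count < min_ink and start is not None:
            bands.append((start, i))  # the run ended at the previous index
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands


def segment_cells(binary):
    """Split a binary page into a grid of (row_band, column_band) cells.

    Hypothetical helper: assumes rows and columns are separated by
    completely empty gaps, which is rarely true of real scans.
    """
    row_profile = [sum(row) for row in binary]
    col_profile = [sum(col) for col in zip(*binary)]
    row_bands = find_bands(row_profile)
    col_bands = find_bands(col_profile)
    return [[(r, c) for c in col_bands] for r in row_bands]


# Hypothetical page: two text rows and two columns separated by blank gaps.
page = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
]
for row_of_cells in segment_cells(page):
    print(row_of_cells)
```

Transcribing each cell separately, rather than whole page lines, is one way a layout step like this can keep values attached to the right row and column of the resulting dataset.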

Projects we work with

INCHOS: The aim of INCHOS is to develop a genuinely comparative history of occupational structure by using a common occupational coding system (PSTI – a modified version of E.A. Wrigley's PST system) and common methodologies to ensure commensurable results. Our interest is not in a particular period but in the long-run process of industrialization, which means that the focus is on different time periods in different countries.
INED: Louis Henry and Jean-Noël Biraben's demographic surveys of parish records, used to reconstitute families and retrace the population dynamics of France from 1740 to 1830. Dr Isabelle Séguy is currently digitising and transcribing a large number of survey cards created by Henry and Biraben's research assistants.
INCAM-HEADS: The Interdisciplinary Centre for the Analysis and Modelling of Historical demography, Economic history, Applied Digital humanities and Spatial studies will be based at INED (Paris) and Cambridge (History Faculty and Cambridge Digital Humanities). It will be a collaborative research hub for digital humanities applied to quantitative history. Funding is being sought from the International Research call (SHSS & IRSWG).
Transport, urbanization and economic development in England and Wales c.1670-1911: In this project, funded by the Leverhulme Trust, the National Science Foundation, the Isaac Newton Trust, and the Keynes Fund, we are taking advantage of the new technological possibilities created by Geographical Information Systems (GIS) to explore the relationships between improvements in transport infrastructure, urbanisation, market access, technological change and long-run economic development.
ExPLOT: A research and learning network based in Cambridge aiming to gather historians, archaeologists, anthropologists, geographers, economists, and researchers in other disciplines to present a range of spatial approaches to the past.
ANR Commune: The COMMUNE HIS-DBD project will build the first historical-GIS capturing all changes in the boundaries of French communes since the Revolution and create a multi-modal dataset of transport networks from 1750 to the present.

We are also collaborating with various incipient research projects in addition to the above. To express interest in collaborating or in applying THOTH, please contact Oliver Dunn (od226@cam.ac.uk) or Alexis Litvine (adl38@cam.ac.uk).

Further funding to develop THOTH has been awarded by the University of Cambridge (CHRG) and Cambridge Digital Humanities (CDH), and additional funding is being sought from the ESRC.
