Unearthing the Forgotten: Using Data Mining to Identify Esoteric Figures in 18th-Century Correspondence

Introduction

History is often written by the victors and the hyper-visible. In the study of the 18th century, an era defined by the Enlightenment, the birth of modern scientific inquiry, and the upheaval of revolutions, our historical narratives are dominated by a handful of “great men” and well-documented salons. Yet behind these prominent figures lie thousands of digitized, transcribed letters that hold the key to a more complex, nuanced past.

The challenge for historians and researchers is no longer a lack of data, but an overwhelming surplus of it. Manual reading is a noble pursuit, but it is insufficient when confronted with millions of pages of correspondence. By employing data mining techniques, we can shift from reading documents to interrogating them. This article explores how to use computational methods to identify under-researched, esoteric figures who operated on the margins of intellectual history, providing a roadmap for turning cold data into historical insight.

Key Concepts

To identify hidden actors in 18th-century archives, you must master a few essential computational concepts:

Named Entity Recognition (NER): The process of automatically identifying and classifying names of people, places, and organizations within unstructured text.
Network Analysis (Social Network Analysis – SNA): Mapping the relationships between individuals based on who writes to whom or who is mentioned together in the same letter. By visualizing “nodes” (people) and “edges” (the ties between them), we can identify “bridges,” individuals who connected disparate intellectual groups.
Topic Modeling (Latent Dirichlet Allocation – LDA): A statistical method that identifies latent themes across a corpus. LDA returns weighted word lists rather than named subjects, so labels such as “botany,” “taxation,” or “hermetic studies” are the researcher’s interpretation. If an unknown name consistently appears in letters dominated by one of these topics, you have a strong indication of their intellectual domain.
Corpus Linguistics: Analyzing word frequency and collocations to see how specific individuals are described over time, helping to distinguish between a household name and a persistent, yet overlooked, contributor.
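
As a small illustration of the collocation idea, the sketch below counts the words that appear within a few tokens of a given name. It is a minimal example, not a full corpus-linguistics workflow: the stop list, the window size, the function name collocates, and the two sample sentences are all assumptions made for demonstration.

```python
import re
from collections import Counter

# Hypothetical stop list; a real project would use a fuller, period-appropriate one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "his",
             "has", "have", "me", "mr", "that"}

def collocates(letters, target, window=5):
    """Count words occurring within `window` tokens of `target` across all letters."""
    counts = Counter()
    target = target.lower()
    for text in letters:
        tokens = re.findall(r"[a-z]+", text.lower())
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo, hi = max(0, i - window), i + window + 1
            # Neighbours on both sides of the name, excluding the name itself.
            for neighbour in tokens[lo:i] + tokens[i + 1:hi]:
                if neighbour not in STOPWORDS and neighbour != target:
                    counts[neighbour] += 1
    return counts

# Invented sample sentences, for illustration only.
letters = [
    "Mr Thorne has sent me seeds of the hybrid pink.",
    "Thorne writes that his hybrid seeds have flowered.",
]
top = collocates(letters, "Thorne")
print(top.most_common(3))
```

Words that recur near a name across many letters (here “seeds”) hint at how correspondents associated that person with a subject. On a real corpus, raw counts are usually replaced by an association measure that corrects for overall word frequency.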

Step-by-Step Guide

Curate Your Data Repository: Access established digital archives like the Electronic Enlightenment, the Founders Online repository, or the Bodleian Library’s digitized collections. Ensure the data is in machine-readable formats (TXT, JSON, or XML).
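Pre-processing and Cleaning: 18th-century correspondence is notoriously difficult for computers due to erratic spelling, antiquated shorthand, and OCR (Optical Character Recognition) errors. Normalize names before counting them so that “Mr. J. Smith,” “John Smith,” and “J. Smith of London” resolve to the same person. Libraries such as Python’s spaCy can extract name mentions, but they do not merge variants on their own; that step needs an alias table, fuzzy matching, or a dedicated record-linkage tool, followed by manual review.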
Run NER Pipelines: Deploy a customized NER model trained on historical texts to extract every personal name mentioned in your corpus. Filter out the “celebrities” (e.g., Voltaire, Franklin, Jefferson) to create a list of frequent, yet obscure, mentions.
Develop a Co-occurrence Matrix: Build a database that maps which names appear in the same letters or within the same intellectual circles. An “esoteric figure” often appears in high-value letters without being the primary subject.
Apply Centrality Metrics: Use software like Gephi to calculate “Betweenness Centrality.” This metric identifies nodes that act as gatekeepers or conduits for information, even if they aren’t the most famous people in the room.
Contextual Verification: Cross-reference your results with marginalia, parish registers, or local court records to move from a data point to a human biography.
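
The normalization, filtering, and co-occurrence steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the alias table, the WELL_KNOWN set, the helper names canonical and cooccurrence, and the sample data are all hypothetical, and the name lists are assumed to come from an earlier NER pass.

```python
from collections import Counter
from itertools import combinations

# Hypothetical alias table: in practice built by hand or with record linkage.
ALIASES = {
    "mr. j. smith": "John Smith",
    "j. smith of london": "John Smith",
    "john smith": "John Smith",
}

# Hypothetical list of figures to leave out of the candidate ranking.
WELL_KNOWN = {"Voltaire"}

def canonical(name):
    """Map a name variant to its canonical form, else return the cleaned input."""
    cleaned = " ".join(name.split())
    return ALIASES.get(cleaned.lower(), cleaned)

def cooccurrence(letters_names):
    """Count how many letters mention each unordered pair of people."""
    pairs = Counter()
    for names in letters_names:
        people = sorted({canonical(n) for n in names})
        pairs.update(combinations(people, 2))
    return pairs

# Each inner list holds the names extracted from one letter (invented data).
letters_names = [
    ["Mr. J. Smith", "Voltaire"],
    ["John Smith", "Voltaire", "H. Thorne"],
    ["J. Smith of London", "H. Thorne"],
]

pairs = cooccurrence(letters_names)

# Letters mentioning each person, then a ranking without the famous names.
mentions = Counter(p for names in letters_names
                   for p in {canonical(n) for n in names})
candidates = [p for p, _ in mentions.most_common() if p not in WELL_KNOWN]
print(pairs)
print(candidates)
```

Sorting the names inside each letter keeps every pair in a single, consistent order, so (“H. Thorne”, “John Smith”) is never counted separately from its reverse. The resulting pair counts can be exported as a weighted edge list for a tool such as Gephi.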

Examples and Case Studies

Consider the case of a mid-18th-century naturalist whose letters are tucked away in the archives of a provincial scientific society. A researcher performing traditional reading might skip over letters addressed to “Mr. H. Thorne” regarding plant hybridization. However, by using data mining, the researcher identifies that Thorne received letters from four different prominent European biologists, each citing his specific, unpublished experimental methods.

The data shows a high “Betweenness Centrality” score for Thorne—he was the clearinghouse for information that the famous scientists were using to write their own books. He wasn’t the author of the book, but he was the structural foundation upon which the science was built. Through data mining, we move Thorne from a mere name on a page to a significant, albeit esoteric, contributor to 18th-century science.
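
To make the metric concrete, here is a minimal sketch of betweenness centrality on a toy version of the Thorne network. It uses Brandes’ algorithm written in plain Python for an unweighted, undirected graph; the node labels “Biologist A” through “Biologist D” and the edge list are invented for illustration. In practice a tool such as Gephi, or a library function like NetworkX’s betweenness_centrality, would do this work.

```python
from collections import deque

def build_graph(edges):
    """Turn an edge list into a symmetric adjacency mapping."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def betweenness(graph):
    """Unnormalised betweenness for an undirected, unweighted graph (Brandes)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {v: [] for v in graph}
        paths = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, passing path shares to predecessors.
        share = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                share[v] += paths[v] / paths[w] * (1 + share[w])
            if w != source:
                score[w] += share[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: s / 2 for v, s in score.items()}

# Invented toy network: four biologists write to Thorne; two also know each other.
edges = [
    ("Thorne", "Biologist A"), ("Thorne", "Biologist B"),
    ("Thorne", "Biologist C"), ("Thorne", "Biologist D"),
    ("Biologist A", "Biologist B"),
]
scores = betweenness(build_graph(edges))
print(scores)
```

In this toy graph Thorne lies on the only shortest path between five of the six pairs of biologists, while A and B reach each other directly. That is the pattern the score rewards: Thorne matters because the others have no other route to one another, not because he has the most letters.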

Common Mistakes

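Ignoring OCR Noise: Many researchers trust digitized text implicitly. 18th-century typefaces, with features such as the long s (ſ) that OCR engines often read as “f,” produce high error rates. Always perform a quality check on a sample of your text; if your software is reading “The Philosopher” as “The Ph11osopher,” names and keywords will be silently missed and your counts will be unreliable.
Over-relying on Centrality: A high mention count or centrality score does not by itself make someone historically significant. They might be a tax collector or a letter carrier whose name recurs for purely logistical reasons. Always combine computational findings with qualitative historical context.
The “Famous Person” Bias: If you do not account for the “big names” of the century, they will dominate frequency and degree rankings so thoroughly that the esoteric figures are washed out. Correspondence networks tend to be hub-heavy, a pattern often associated with the “Rich-Get-Richer” (preferential attachment) model in network theory. Consider excluding the famous names from your ranked results rather than deleting them from the graph, since removing a hub also removes the paths that run through it.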
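
A quick way to run the sample quality check mentioned above is to flag tokens where digits sit inside words. The sketch below is a rough heuristic under stated assumptions, not a full OCR evaluation: the function names and the sample sentence are invented, and it will miss errors that produce plausible letters, such as a long s read as “f.”

```python
import re

# A letter next to a digit inside one token (e.g. "Ph11osopher") suggests OCR damage.
DIGIT_INSIDE_WORD = re.compile(r"[A-Za-z]\d|\d[A-Za-z]")
# Ordinals such as "4th" or "18th" are legitimate and should not be flagged.
ORDINAL = re.compile(r"\d+(st|nd|rd|th)", re.IGNORECASE)

def tokenize(text):
    """Split text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def suspicious_tokens(text):
    """Return tokens that mix letters and digits and are not ordinals."""
    return [tok for tok in tokenize(text)
            if DIGIT_INSIDE_WORD.search(tok) and not ORDINAL.fullmatch(tok)]

def noise_rate(text):
    """Share of tokens flagged as suspicious (0.0 for empty text)."""
    tokens = tokenize(text)
    return len(suspicious_tokens(text)) / len(tokens) if tokens else 0.0

# Invented sample line, for illustration only.
sample = "The Ph11osopher wrote on the 4th of May that the c0mpany had arrived."
flagged = suspicious_tokens(sample)
rate = noise_rate(sample)
print(flagged, round(rate, 3))
```

Running this over a random sample of letters gives a rough lower bound on the error rate. Documents that score badly are candidates for re-OCR or manual correction before any names are extracted from them.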

Advanced Tips

To take your research to the next level, move beyond simple mention-counting. Implement Sentiment Analysis to gauge how these esoteric figures were viewed by their contemporaries. Because off-the-shelf sentiment tools are typically built for modern text and score positive versus negative tone, pair them with a lexicon of period forms of address and phrases of respect. If a relatively unknown individual is consistently addressed with language signifying “deference” or “intellectual authority,” you may have found a figure whose influence significantly exceeded their public profile.
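
One simple way to approximate this is a lexicon count rather than a trained sentiment model. The sketch below averages the number of deference phrases per letter for each addressee. The marker phrases, the addressee “A. Carter,” and the sample letters are hypothetical; a usable lexicon would have to be compiled from the correspondence itself.

```python
from collections import defaultdict

# Hypothetical marker phrases; a real lexicon should come from period letters.
DEFERENCE_MARKERS = (
    "your most obedient",
    "humble servant",
    "i beg your opinion",
    "your judgment",
    "honoured sir",
)

def deference_by_addressee(letters):
    """Average number of deference markers per letter, grouped by addressee."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for addressee, text in letters:
        lowered = text.lower()
        totals[addressee] += sum(lowered.count(m) for m in DEFERENCE_MARKERS)
        counts[addressee] += 1
    return {name: totals[name] / counts[name] for name in counts}

# Invented (addressee, letter text) pairs, for illustration only.
letters = [
    ("H. Thorne", "Honoured Sir, I beg your opinion on the enclosed seeds. "
                  "I remain your most obedient humble servant."),
    ("H. Thorne", "I submit the trial to your judgment."),
    ("A. Carter", "The cart arrives on Tuesday."),
]
scores = deference_by_addressee(letters)
print(scores)
```

Closing formulas such as “humble servant” were conventional in the period, so a raw score mostly reflects etiquette. The more useful signal is the comparison: an obscure addressee who draws more requests for judgment than correspondents of similar standing.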

Additionally, utilize Geospatial Mapping. By extracting address data from correspondence headers, you can visualize the physical flow of intelligence across Europe and the Atlantic. Esoteric figures often clustered in “hidden hubs”—intellectual nodes in port cities or rural estates that were as important as the major metropolitan salons. Seeing these clusters on a map often reveals the “why” behind their influence.
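
Before anything is drawn on a map, the header data has to be reduced to counts of letters per route and per place. The sketch below shows that aggregation step only. It assumes each header begins with the place of writing followed by a comma, which real headers will often violate; the function names, places, and dates are invented for illustration.

```python
from collections import Counter

def place_from_header(header):
    """Take the text before the first comma as the place of writing (assumed layout)."""
    return header.split(",", 1)[0].strip()

def flows(letters):
    """Count letters sent along each (origin, destination) route."""
    return Counter((place_from_header(header), dest) for header, dest in letters)

def hubs(flow_counts):
    """Total letters sent from or received at each place."""
    totals = Counter()
    for (origin, dest), n in flow_counts.items():
        totals[origin] += n
        totals[dest] += n
    return totals

# Invented (header, destination) pairs, for illustration only.
letters = [
    ("Bristol, 3 May 1762", "Paris"),
    ("Bristol, 19 June 1762", "Uppsala"),
    ("Paris, 2 July 1762", "Bristol"),
    ("London, 8 July 1762", "Bristol"),
]
routes = flows(letters)
busiest = hubs(routes).most_common(1)[0]
print(routes)
print(busiest)
```

A place that handles more traffic than its size or reputation would predict is a candidate “hidden hub.” Plotting the routes then requires joining each place name to coordinates from a historical gazetteer, which is a separate step not shown here.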

Data mining is not a replacement for the historian’s eye; it is a lens that sharpens our focus, allowing us to see the vast, interconnected web of actors who actually built the intellectual world of the 18th century.

Conclusion

The archives of the 18th century are deep, but they are no longer impenetrable. By applying data mining, we can democratize historical study, moving away from the narrow confines of the “great person” narrative and toward a more inclusive understanding of how ideas actually traveled, evolved, and were refined.

The methodology is clear: curate, clean, extract, map, and contextualize. By identifying those who acted as the bridges and conduits of Enlightenment thought, we can write a more vibrant, inclusive history. Start by identifying a small, under-analyzed collection, run an entity recognition sweep, and watch as the names of the “forgotten” rise to the surface, ready to be rediscovered and analyzed for their true historical contributions.