Chronicles and Code

Introduction
Chapter 1 Why Digital Methods for European History?
Chapter 2 Research Design and Questions in the Digital Age
Chapter 3 Building and Curating Corpora: Archives, Licensing, and Ethics
Chapter 4 Digitization Workflows: Scanning, OCR, and HTR
Chapter 5 Text Cleaning and Preprocessing: From Raw OCR to Readable Corpora
Chapter 6 Metadata and Standards: TEI, Dublin Core, and IIIF
Chapter 7 Exploratory Text Analysis: Frequencies, Collocations, and Concordances
Chapter 8 Natural Language Processing for Historians: Tokenization to NER
Chapter 9 Topic Modeling and Thematic Change over Time
Chapter 10 Stylometry and Authorship in Multilingual Europe
Chapter 11 Sentiment, Framing, and Discourse Analysis
Chapter 12 Networks from Sources: Prosopography and Social Graphs
Chapter 13 Network Analysis in Practice: Centrality, Communities, and Dynamics
Chapter 14 Historical Geography: Sources, Gazetteers, and Geocoding
Chapter 15 GIS Basics with QGIS: Projections, Layers, and Joins
Chapter 16 Spatial Analysis: Density, Interpolation, and Spatial Autocorrelation
Chapter 17 Deep Mapping and Narrative Cartography
Chapter 18 Time, Uncertainty, and Event Modeling
Chapter 19 Data Visualization Principles for Historians
Chapter 20 Crafting Visualizations in R and Python
Chapter 21 Interactive Maps and Dashboards for the Web
Chapter 22 Reproducible Research: Notebooks, Version Control, and Pipelines
Chapter 23 Packaging, Sharing, and FAIR Data
Chapter 24 Case Studies from European History: Text, Maps, and Networks
Chapter 25 Sustaining Projects: Collaboration, Documentation, and Impact

Introduction

Europe’s archives and libraries preserve centuries of human experience, written in dozens of languages and recorded across shifting borders and institutions. Chronicles and Code is a hands-on guide to turning those sources into structured, analyzable evidence with contemporary digital methods. It is designed for graduate students and researchers who want to complement close reading with computational approaches, and to do so in a way that is transparent, rigorous, and reproducible.

This book begins from the premise that methods should serve historical questions. Each chapter connects concrete tools—OCR and handwritten text recognition, text mining and NLP, GIS and network analysis—to the interpretive moves historians routinely make: establishing context, tracing change over time, following actors, and comparing places. You will find step-by-step workflows, annotated examples, and suggestions for further reading that bridge technical detail with historical practice.

Working with European sources brings distinctive opportunities and challenges. Multilingual corpora, historical spelling variation, non-Latin scripts, and jurisdictional differences in copyright and data protection all shape what is possible. We address practical strategies for building lawful, ethical datasets; documenting provenance; and handling uncertainty without flattening the past. Throughout, we emphasize open standards (such as TEI and IIIF), community data (from Europeana and historical gazetteers), and methods that travel across languages and periods.

The text analysis chapters guide you from raw scans to research-ready corpora: designing digitization projects; improving OCR/HTR output; cleaning and normalizing text; and applying techniques like concordances, collocations, named entity recognition, topic modeling, and stylometry. Rather than treating algorithms as black boxes, we show how to validate results, interpret model outputs historically, and communicate limitations honestly. Examples draw on newspapers, administrative records, correspondence, and printed works from across Europe.

Spatial thinking is central to historical inquiry. Our GIS chapters introduce coordinate reference systems, geocoding with historical place names, and aligning historical maps with modern basemaps. You will learn how to explore spatial patterns, test relationships, and represent change through deep mapping and narrative cartography. We pay special attention to temporal uncertainty—events that unfold over ranges rather than points—and to representing ambiguity rather than erasing it.

Visualization is not the end of analysis but part of the reasoning process. We cover visual design principles for clarity and legibility, accessible color palettes, and the craft of building figures that make claims explicit. Whether you prefer R or Python, you will learn how to produce static plots for print, interactive maps for the web, and dashboards for public engagement, all while keeping your workflow reproducible.

Reproducibility threads through the entire book. You will practice version control with Git, organize projects for longevity, and automate pipelines with notebooks and scripts so that results can be rerun, checked, and extended. We align with FAIR data principles, discuss licensing and sharing, and model collaborative habits that help teams maintain momentum over the lifespan of a project.

Chronicles and Code is meant to be used actively: try the exercises, adapt the templates, and bring your own sources to each chapter. You can read it straight through or dip into the parts that fit your current project. Our aim is not to replace traditional historical methods but to expand your toolkit—so that text analysis, GIS, and network approaches become everyday companions to interpretation, enabling new questions about Europe’s past and new ways to share your findings.

CHAPTER ONE: Why Digital Methods for European History?

European history has always been a discipline of careful reading and patient searching. We sift through bundles of letters tied with faded ribbon, decipher marginalia in parish registers, and coax meaning from brittle newspapers that crackle at the slightest touch. The craft is intimate: a conversation between scholar and source. Yet the landscape of the past is vast, and the traces it leaves behind are multiplying at an astonishing rate. Digitization projects have put millions of pages within reach, and metadata is accumulating like snowfall. The challenge is no longer access alone, but how to make sense of such abundance without losing the texture and nuance that give history its human resonance.

Digital methods offer a bridge across that gap. They are not a replacement for the slow, reflective work of reading, but a way to guide and test our intuitions at scale. Imagine trying to trace the circulation of a political rumor across a dozen European cities using only printed periodicals. With optical character recognition and simple text searches, you can quickly locate candidate mentions. With text mining, you can map patterns of repetition and difference across time and place. With GIS, you can plot those mentions against transport networks or voting patterns. The computational steps don’t answer your questions; they sharpen them and focus your attention where it matters.

Many historians worry that turning sources into data strips away context or flattens complexity. That fear is healthy if it pushes us to be careful, but it is often misplaced if it stops us from exploring. A well-managed digital project preserves context through metadata and careful transcription. It captures nuance by annotating uncertainty, and it invites comparison rather than imposing uniformity. The trick is to keep the historian’s eye on the data at every stage, from scanning to visualization. Algorithms are tools, not oracles. They produce models of the past, not the past itself, and our job is to interrogate those models with rigor and curiosity.

Consider the simple act of counting. Historians count all the time, if only in their heads: how many women appear as signatories in petitions; how often a given term appears in speeches; how many parishes report a particular crop failure. Counting with a computer scales that habit up, but also makes it reproducible and transparent. If you ask, “How did the rhetoric of ‘civilization’ change in French and German newspapers from 1870 to 1914?” a frequency analysis can show you when the term spikes. It cannot, on its own, tell you why. But it can prompt you to look at the surrounding text, compare editorials across languages, and consider the political events that shaped editorial choices.

European sources come with distinctive quirks that make digital methods both exciting and tricky. Orthography shifts over time and place; you might find “Moscow” and “Moskva” in different documents referring to the same city, or “Cöln” instead of “Köln.” Scripts change too: historical documents in Hungarian used different orthographies before and after reforms, and records across the Habsburg empire may appear in German, Hungarian, Latin, Croatian, and more. Borders are unstable; cities and regions change names and jurisdictions. A place that is Polish in one decade may be German or Russian in another. Digital workflows must acknowledge this instability, or they risk misclassifying and misrepresenting the past.

The good news is that digital methods have matured in ways that accommodate complexity. OCR engines have improved on historical typefaces; handwritten text recognition is making headway with cursive scripts; and multilingual NLP tools can handle code-switching and named variants with guidance from historians who know the material. Gazetteers such as GeoNames and the Historical Gazetteer of Galicia help align historical place names with modern coordinates. Shared standards like TEI for encoding texts and IIIF for interoperable images mean that your annotations can travel across tools and institutions. These are not fixes for every problem, but they give us flexible frameworks to work within.

Another reason to embrace these methods is transparency. Traditional historical work is often opaque to readers; we see the published book, but not the thousands of small decisions that shaped its arguments. Digital workflows invite us to document our steps explicitly: which OCR model we used, how we cleaned the text, which thresholds we set for topic models, and why we chose a particular map projection. Keeping a computational notebook—whether in Jupyter or R Markdown—captures those decisions in a form that others can inspect, challenge, and reproduce. That openness strengthens historical argumentation and makes our scholarship more robust.

At the same time, the ethics of data work are front and center. Historical archives contain names, places, and intimate details that, even if old, can be sensitive. Europe’s GDPR governs personal data processing, and national archives have their own access rules and licensing terms. Digitization can be expensive and labor intensive, and not all collections are equally represented online. Ethical practice means checking rights and privacy considerations, thinking about the potential impact of publishing datasets, and engaging with communities whose histories are entangled with the records. It also means documenting provenance, acknowledging uncertainties, and resisting the temptation to treat messy sources as cleanly machine-readable when they are not.

Let’s ground this with a concrete scenario. Suppose you’re interested in how migration from rural areas to industrial cities was portrayed in regional newspapers across Italy and Germany between 1890 and 1914. A first step might be to assemble a corpus of scanned issues. With OCR, you convert the images to text, then search for terms like “migration,” “emigration,” “workers,” and their local variants. You can plot the frequency of these terms by month and place. If you find clusters, you can look at the adjacent articles to see what events triggered coverage. If you’re ambitious, you might apply named entity recognition to identify mentions of specific cities or factories, then geocode those places to see if reporting correlates with known migration routes.

For some questions, networks are more illuminating than maps. Take the correspondence of a transnational scientific society. Each letter is an edge between two people; the content is the edge label. By extracting names and affiliations, you can build a network that shows who corresponded with whom, when, and how often. Centrality measures will reveal the brokers who carried information between subgroups. Community detection might uncover regional clusters. Combined with close reading of key letters, this network becomes an argument about the structure of knowledge exchange. It’s still history, just with a graph instead of a narrative as the main vehicle.

Working with European sources also requires patience with formats and metadata. A scan might be a high-resolution TIFF in one repository and a low-quality JPEG in another. Descriptive metadata may use different standards: some libraries prefer Dublin Core, others use MARC, and specialized projects may adopt TEI header conventions. Image metadata may be encoded in IIIF manifests that allow deep zoom and consistent annotation across platforms. Aligning these layers so that your corpus is coherent takes time. It is not glamorous work, but it is essential. Garbage in, garbage out remains a truism; thoughtful metadata in, meaningful analysis out.

A common entry point is digitization. If you already have access to digitized collections, you may skip scanning and go straight to OCR. But if you’re working with unique or local archives, you may need to digitize yourself. Good scans are the foundation: 300 dpi for text, clean images, consistent orientation. OCR engines vary in their handling of historical fonts; it’s worth testing a few pages to see which works best for your material. Handwritten sources require different tools. Handwritten text recognition systems can learn styles from your specific collections, but they need training data. A pragmatic approach is to start with a small, representative sample and iterate.

Once you have text, you must clean it. OCR introduces errors, especially with long s and ambiguous characters. Historical spelling variation and abbreviations compound the problem. A preprocessing pipeline will normalize common variants, strip artifacts, and handle punctuation consistently. It’s tempting to automate everything, but a historian’s judgment is irreplaceable here. Sometimes an “error” is a clue about the printing practices of a particular workshop. Decide what to keep and what to fix, and keep a record of those choices. The cleaned corpus is not just a dataset; it’s an interpretive object shaped by your research question.

Exploratory analysis often starts with frequencies and collocations. Seeing how often a term appears over time can hint at shifting discourses. Collocations—words that appear together—can reveal patterns, like “labor” co-occurring with “female” in certain periods and “child” in others. Concordances show a term in its immediate context, letting you inspect how it’s used. These simple techniques can be surprisingly powerful. They encourage you to refine your queries, refine your corpus, and refine your questions. They also make it easier to explain to others why you chose particular paths in your analysis.

More advanced techniques, like topic modeling, can surface themes across large collections. Topic models don’t “understand” meaning, but they group words that tend to co-occur in documents. Historians can then interpret those groups as discursive clusters—discussions of public health, trade policy, or religious reform, for example. Because topic models are sensitive to parameters and preprocessing, they are as much a craft as a science. It’s essential to inspect the topics, rename them thoughtfully, and compare their prevalence over time and place. In European contexts, you may need multilingual models or separate models per language, then align results conceptually.

Named entity recognition helps you locate people, places, organizations, and dates in text. Good NER models for historical languages exist, but they often benefit from adaptation to your corpus. If you plan to map places, you’ll need to resolve historical names to coordinates. This is not trivial: “Constantinople” and “Istanbul” point to the same city but carry different temporal connotations. A gazetteer can help you track name changes and boundary shifts. You may also need to model uncertainty: if a place name could refer to one of three towns, you can represent that ambiguity rather than choosing arbitrarily. This respect for uncertainty is part of the historical craft.

Stylometry—measuring stylistic features of texts—can help with questions of authorship. In multilingual Europe, this gets interesting: writers often code-switch or adapt style to audience. Stylometric measures might detect the same author’s hand in a French letter and a German memo. But these methods are probabilistic and context dependent. They work best when combined with documentary evidence and internal clues. Authorship is rarely settled by numbers alone; instead, the numbers add a line of evidence that you weigh alongside other clues. That cautious integration is a hallmark of good digital history.

Visualizing your results matters both for analysis and communication. A simple time series of term frequencies can help you spot anomalies; a map of geocoded entities can reveal regional biases in coverage; a network diagram can show who sits at the structural holes of a conversation. The design choices—axes, colors, labels—affect how others read your claims. Using accessible color palettes, clearly labeled axes, and annotated legends reduces misinterpretation. Visualizations are arguments, not decoration. They make choices explicit and invite critique, which is precisely what scholarly dialogue requires.

Reproducibility is a key theme. If you cannot rerun your analysis a year later, you cannot defend it confidently. Keeping your code in a version-controlled repository, documenting your data sources, and writing your workflow as a series of executable steps ensures that your results are stable. Notebooks are a convenient way to interleave code, commentary, and outputs. They make your reasoning visible. When a colleague asks how you arrived at a figure, you can point them to a specific cell in a notebook. Reproducibility also protects your future self from the confusion of forgotten scripts and unexplained choices.

There are pitfalls to avoid. Don’t treat OCR output as perfect text. Don’t treat NER output as ground truth. Don’t treat a topic model as a definitive thematic map. And don’t let the tool drive the question. The best way to avoid these traps is to cycle between computation and reading. Run an analysis to find patterns, then read representative passages to understand the texture. If your analysis flags an anomaly, treat it as a prompt to investigate. This cycle maintains the interpretive core of history while harnessing the scale and precision of computational methods.

It also helps to keep a sense of scale and proportion. Some questions demand massive corpora and sophisticated models; others are answered by a few well-chosen examples and a small script. The size of your dataset should fit the scope of your question, and the complexity of your method should fit your time and skills. There is no virtue in using a neural network if a regex will do. Begin simply, check your assumptions, and scale up only when the added complexity brings additional clarity. Your goal is insight, not tech for tech’s sake.

A practical note about collaboration: digital projects often benefit from working with others. Librarians can advise on scan quality and metadata. Computational colleagues can suggest efficient workflows and check code. Domain experts can interpret patterns you uncover. Collaboration requires shared documentation and clear credit. It’s worth thinking early about authorship, roles, and how results will be shared. Even if you’re working alone, imagine a future collaborator reading your notes. Clarity now will save time later and make your work more usable by others.

As you start, it helps to have a few reference points. For text, you might explore corpora from national libraries or newspaper archives; for GIS, you can lean on open datasets from national mapping agencies and historical gazetteers; for networks, you might draw on directories, biographical databases, or correspondence catalogs. The tools themselves range from user-friendly desktop applications like QGIS to programming libraries in R and Python. Choose the tool that matches your habits and your project’s needs. The important thing is to start, even if the first steps are small and imperfect.

The chapters ahead will walk you through these steps with concrete examples and templates. You’ll learn how to design a research project that fits digital methods; how to build and curate corpora responsibly; how to turn scans into text and clean that text; how to add and manage metadata; and how to explore your corpus with frequencies, collocations, and more. You’ll move into NLP, topic modeling, and stylometry; you’ll build networks and analyze them; you’ll geocode and map; you’ll visualize and share; and you’ll do it all in a reproducible way. Case studies will show how these pieces fit together in real historical projects.

European history is rich, complicated, and beautifully diverse. Digital methods give you new ways to see its structures, track its changes, and tell its stories. They won’t replace the quiet pleasure of reading a diary by a window in an archive, but they can help you find which diaries matter, connect them to other lives, and show how patterns of language and place weave through the past. With Chronicles and Code, you’ll build the skills to do that work carefully, ethically, and well. The next chapter takes up research design: how to move from a curiosity to a question, and from a question to a plan.

This is a sample preview. The complete book contains 27 sections.

Table of Contents

Chronicles and Code

Table of Contents

Introduction

CHAPTER ONE: Why Digital Methods for European History?