CAA2026 – Turning Legacy Archaeological Images into Reusable Data

At CAA2026, I will be presenting a small but very practical piece of work: a reproducible workflow to turn a legacy archaeological image collection into a structured, machine-learning-ready dataset.

This talk is part of the Little Minions session, which focuses on those small tools we build in the background to actually get research done.

Fig. 1 - Image of a hand axe


The problem: access vs usability

Archaeology has no shortage of digital data.

Repositories such as the Archaeology Data Service (ADS) have made large datasets publicly accessible, including image collections, metadata, and documentation. However, in many cases, access does not equal usability.

A good example is the dataset Lower Palaeolithic technology, raw material and population ecology (Marshall, Dupplaw, Roe and Gamble, 2002), hosted at the ADS, which combines biface images with structured record pages and metadata.

All of this is available online… but only through record-by-record web pages, and this creates a bottleneck: the data exists, yet using it at scale becomes slow, repetitive, and difficult to reproduce.

Fig. 2 - Dataset query page on the ADS site

Fig. 3 - Biface record page on the ADS site


A small tool approach

Instead of building a complex system, I set a deliberately simple goal:

Can we turn this dataset into something usable with a lightweight, reproducible workflow?

The solution consists of two small, open tools:

  1. A web scraping script
  2. An image processing pipeline

Both are intentionally minimal, transparent, and easy to reuse.

This is very much in the spirit of “little minions”: small helpers that quietly improve research workflows without being the main focus of publications.


Step 1: scraping the dataset

The first step automates what a researcher would otherwise do manually: opening each record page and saving its image and associated metadata.

A short Python script handles this end-to-end, so what would normally require hours of manual clicking and downloading becomes reproducible in a single run.
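As a sketch, the scraping step can look like the following. The base URL and the record URL pattern here are illustrative assumptions, not the real ADS endpoints, and the helper names are mine; link extraction uses only the standard-library HTML parser.

```python
# Illustrative sketch of the scraping step. BASE and the record URL
# pattern are placeholders, not the real ADS endpoints.
from html.parser import HTMLParser

BASE = "https://example.org/archive"  # placeholder archive root


class ImageLinkParser(HTMLParser):
    """Collect href/src attribute values that point at image files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (
                value
                and name in ("href", "src")
                and value.lower().endswith((".jpg", ".jpeg", ".png", ".tif"))
            ):
                self.links.append(value)


def record_url(record_id: int) -> str:
    # One page per artefact record; the query-string pattern is an assumption.
    return f"{BASE}/record.cfm?id={record_id}"


def extract_image_links(html: str) -> list:
    # Pull every image link out of a downloaded record page.
    parser = ImageLinkParser()
    parser.feed(html)
    return parser.links
```

In the real script each record page would then be fetched (politely, with a pause between requests) and its images saved locally.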


Step 2: structuring and segmenting images

Once the data is local, the second step prepares it for computational analysis.

The pipeline performs three main tasks:

1. Standardising filenames

Images are renamed using UUIDs to avoid collisions when merging datasets.
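A minimal sketch of that renaming step (the function names are mine, not taken from the published scripts):

```python
# Sketch of UUID-based renaming; helper names are illustrative.
import uuid
from pathlib import Path


def uuid_name(original: Path) -> str:
    """Replace the stem with a random UUID hex, keeping a lowercased extension."""
    return f"{uuid.uuid4().hex}{original.suffix.lower()}"


def rename_all(folder: Path) -> dict:
    """Rename every image in-place; return an old-name -> new-name mapping."""
    mapping = {}
    for path in sorted(folder.glob("*")):
        if path.suffix.lower() in (".jpg", ".jpeg", ".png", ".tif"):
            new = path.with_name(uuid_name(path))
            path.rename(new)
            mapping[path.name] = new.name
    return mapping
```

Keeping the old-to-new mapping is what makes the rename reversible and lets derived data still be traced back to the original records.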

2. Creating a COCO dataset

A JSON file is generated following the COCO format, with its standard images, annotations, and categories sections. This provides a structure widely used in computer vision.
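As a rough sketch of what that JSON looks like (minimal COCO fields only; the field names follow the COCO convention, the helper function is mine):

```python
# Minimal COCO-style skeleton; the helper name is illustrative.
import json


def coco_skeleton(images, categories):
    """images: iterable of (file_name, width, height) tuples."""
    return {
        "images": [
            {"id": i, "file_name": fn, "width": w, "height": h}
            for i, (fn, w, h) in enumerate(images, start=1)
        ],
        "annotations": [],  # filled in later by the segmentation step
        "categories": [
            {"id": i, "name": name} for i, name in enumerate(categories, start=1)
        ],
    }


doc = coco_skeleton([("example.jpg", 1024, 768)], ["biface"])
print(json.dumps(doc, indent=2))
```

Because the structure is standard, the resulting file can be loaded directly by common computer-vision tooling that expects COCO-format input.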

3. Segmenting artefacts

Using classical computer vision, each artefact is automatically separated from its background. This works well because the dataset follows a controlled photographic format, with artefacts photographed against a consistent background.

The result is a segmentation-ready dataset, produced without any manual annotation.

Fig. 4 - Processed image of a biface
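To make "classical computer vision" concrete, here is a dependency-light sketch of one standard ingredient, Otsu thresholding, which separates a brighter artefact from a darker uniform background. The published pipeline's exact method may differ; this only illustrates the family of techniques.

```python
# Sketch of classical foreground/background separation via Otsu's
# threshold; the actual pipeline's method may differ.
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the grey level that maximises between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = gray.size
    cum = np.cumsum(hist)                        # pixel count below level t
    cum_mean = np.cumsum(hist * np.arange(256))  # intensity mass below level t
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = cum[t - 1], total - cum[t - 1]
        if w0 == 0 or w1 == 0:
            continue
        m0 = cum_mean[t - 1] / w0
        m1 = (cum_mean[255] - cum_mean[t - 1]) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def artefact_mask(gray: np.ndarray) -> np.ndarray:
    # Foreground = pixels at or above the threshold (the brighter artefact).
    return gray >= otsu_threshold(gray)
```

On images with a consistent background, a global threshold like this already yields a clean binary mask, which is exactly why the controlled photographic setup matters so much here.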


Why not deep learning?

One of the key decisions in this work is to avoid deep learning for segmentation.

Instead, the workflow uses classical image-processing techniques.

This comes with trade-offs. On the plus side, the approach is transparent, fast, and needs no training data or GPU. On the minus side, it is tied to this dataset's photographic conventions and would be less robust on messier imagery.

For this specific dataset, the controlled photographic setup makes classical methods a very reasonable choice.


Outputs and reuse

The workflow produces a consistently named image set, a COCO-format JSON description, and the derived segmentation data.

Crucially:

The original images are not redistributed. Only derived data is shared.

This respects repository licensing while still enabling open workflows.


What this enables

Once the dataset is structured, a lot becomes possible, from quantitative analysis to machine-learning experiments.

More importantly, the process becomes reproducible, transparent, and easy for others to build on.


Limitations (and why they matter)

This is not a universal solution.

The workflow depends on the specific structure of this collection: its record pages and its controlled photography. Adapting it to other datasets may require rewriting the scraper and re-tuning the segmentation step.

So this should be seen as a helper for this particular case, not a final, general solution.


A broader point

One of the key takeaways from this work is quite simple:

You do not always need complex systems to improve research workflows.

Sometimes a short script and a simple, well-documented pipeline are enough to unlock the value of an existing dataset.


Preprint and code

If you want the full technical details, you can read the preprint here.

The scripts are available on GitHub here and here.


Final thoughts

This project is intentionally modest.

It focuses on one dataset, one workflow, and one specific use case. But it shows something important: small, well-scoped tools can turn hard-to-use archives into genuinely reusable data.

And maybe most importantly:

many of these “little minions” already exist in our daily work; we just do not share them enough.


If you are attending CAA2026 and interested in this topic, feel free to come say hi before or after the talk (during the talk might be a bit distracting).