CAA2026 – Turning Legacy Archaeological Images into Reusable Data

25 March 2026

At CAA2026, I will be presenting a small but very practical piece of work: a reproducible workflow to turn a legacy archaeological image collection into a structured, machine-learning-ready dataset.

This talk is part of the Little Minions session, which focuses on those small tools we build in the background to actually get research done.

Fig. 1 - Image of a hand axe

The problem: access vs usability

Archaeology has no shortage of digital data.

Repositories such as the Archaeology Data Service (ADS) have made large datasets publicly accessible, including image collections, metadata, and documentation. However, in many cases, access does not equal usability.

A good example is the dataset Lower Palaeolithic technology, raw material and population ecology (Marshall, Dupplaw, Roe and Gamble, 2002), hosted by the ADS, which contains:

10,668 images
3,556 artefacts
Rich metadata (raw material, measurements, provenance)

All of this is available online… but:

no bulk download
no API
manual navigation required for each record

This creates a bottleneck. The data exists, but using it at scale becomes slow, repetitive, and difficult to reproduce.

Fig. 1 - Dataset query page in the ADS site

Fig. 2 - Biface record page in the ADS site

A small tool approach

Instead of building a complex system, the goal is deliberately simple:

Can we turn this dataset into something usable with a lightweight, reproducible workflow?

The solution consists of two small, open tools:

A web scraping script
An image processing pipeline

Both are intentionally minimal, transparent, and easy to reuse.

This is very much in the spirit of “little minions”: small helpers that quietly improve research workflows without being the main focus of publications.

Step 1: scraping the dataset

The first step is automating what a researcher would otherwise do manually:

visit each record page
extract metadata
download images

A short Python script handles this by:

iterating through record IDs
parsing HTML pages
saving metadata into a CSV
downloading associated images

Some important considerations:

respecting robots.txt
adding delays between requests
checking ADS Terms of Use
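The steps and considerations above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the actual script: the base URL, the record URL pattern, the assumption that metadata sits in th/td table cells, and all function names are hypothetical placeholders. Image downloading is left out for brevity; the sketch only records the image URLs found on each page.

```python
import csv
import time
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical values: the real ADS URLs and page layout are not shown here.
BASE_URL = "https://example.org/archive/"
RECORD_URL = BASE_URL + "record?id={record_id}"
DELAY_SECONDS = 2.0  # pause between requests, to stay polite


class RecordParser(HTMLParser):
    """Collect <th>/<td> pairs as metadata and <img src> values as image links."""

    def __init__(self):
        super().__init__()
        self.metadata, self.images = {}, []
        self._cell, self._text, self._key = None, [], None

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._cell, self._text = tag, []
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_data(self, data):
        if self._cell:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._cell:
            return
        text = "".join(self._text).strip()
        if tag == "th":
            self._key = text            # remember the field name
        elif self._key is not None:
            self.metadata[self._key] = text
            self._key = None
        self._cell = None


def parse_record(html):
    parser = RecordParser()
    parser.feed(html)
    return parser.metadata, parser.images


def robots_checker(base_url, user_agent="*"):
    """Return a function telling whether robots.txt allows fetching a URL."""
    robots = urllib.robotparser.RobotFileParser(urljoin(base_url, "/robots.txt"))
    robots.read()  # network call
    return lambda url: robots.can_fetch(user_agent, url)


def fetch_bytes(url):
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def scrape(record_ids, csv_path, fetch=fetch_bytes,
           allowed=lambda url: True, delay=DELAY_SECONDS):
    """Visit each record page, extract metadata, and write one CSV row per record."""
    rows = []
    for record_id in record_ids:
        url = RECORD_URL.format(record_id=record_id)
        if not allowed(url):            # skip anything robots.txt disallows
            continue
        html = fetch(url).decode("utf-8", errors="replace")
        metadata, images = parse_record(html)
        image_urls = ";".join(urljoin(url, src) for src in images)
        rows.append({"record_id": record_id, **metadata, "image_urls": image_urls})
        time.sleep(delay)
    fields = sorted({key for row in rows for key in row})
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

A real run would pass allowed=robots_checker(BASE_URL) so that robots.txt is consulted before every request. Keeping fetch and allowed as parameters also means the parsing logic can be tried on saved HTML pages without touching the live site.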

The result:

a complete local copy of the dataset (images + metadata)
a structured CSV ready for analysis

What would normally require hours of manual work becomes reproducible in a single run.

Step 2: structuring and segmenting images

Once the data is local, the second step prepares it for computational analysis.

The pipeline performs three main tasks:

1. Standardising filenames

Images are renamed using UUIDs to avoid collisions when merging datasets.
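A minimal sketch of this renaming step, under a few assumptions of mine: random (version 4) UUIDs, copying rather than renaming in place so the downloaded files stay untouched, and hypothetical column names for the mapping table.

```python
import csv
import shutil
import uuid
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed image formats


def copy_with_uuid_names(source_dir, target_dir, mapping_csv):
    """Copy images under new UUID filenames and record old -> new names."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    mapping = []
    for image_path in sorted(source_dir.iterdir()):
        suffix = image_path.suffix.lower()
        if not image_path.is_file() or suffix not in IMAGE_SUFFIXES:
            continue
        new_name = f"{uuid.uuid4()}{suffix}"
        shutil.copy2(image_path, target_dir / new_name)
        mapping.append({"original_name": image_path.name, "uuid_name": new_name})
    # The mapping table keeps the link back to the original records.
    with open(mapping_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["original_name", "uuid_name"])
        writer.writeheader()
        writer.writerows(mapping)
    return mapping
```

The mapping table is what makes the renaming reversible: without it, the UUID filenames could no longer be tied back to the metadata rows.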

2. Creating a COCO dataset

A JSON file is generated following the COCO format:

image metadata
bounding boxes
segmentation masks
collection-level information

This provides a standard structure widely used in computer vision.
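To make the structure concrete, here is a small sketch that builds a COCO-style dictionary from per-image polygons. The input layout (file_name, width, height, polygon), the single "artefact" category, and the function names are illustrative assumptions; the top-level keys and the [x, y, width, height] bounding box convention follow the COCO format itself.

```python
import json


def polygon_area(polygon):
    """Shoelace formula for a flat [x1, y1, x2, y2, ...] polygon."""
    xs, ys = polygon[0::2], polygon[1::2]
    n = len(xs)
    twice_area = sum(xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i]
                     for i in range(n))
    return abs(twice_area) / 2


def build_coco(items, description="Legacy archaeological image collection"):
    """Build a COCO-style dict with one annotation per image."""
    coco = {
        "info": {"description": description},       # collection-level information
        "images": [],
        "annotations": [],
        "categories": [{"id": 1, "name": "artefact"}],  # hypothetical category
    }
    for index, item in enumerate(items, start=1):
        polygon = item["polygon"]
        xs, ys = polygon[0::2], polygon[1::2]
        coco["images"].append({
            "id": index,
            "file_name": item["file_name"],
            "width": item["width"],
            "height": item["height"],
        })
        coco["annotations"].append({
            "id": index,
            "image_id": index,
            "category_id": 1,
            "bbox": [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
            "area": polygon_area(polygon),
            "segmentation": [polygon],
            "iscrowd": 0,
        })
    return coco


def save_coco(coco, path):
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(coco, handle, indent=2)
```

Because there is one object per image, image IDs and annotation IDs can simply run in parallel here; a dataset with several objects per image would need separate counters.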

3. Segmenting artefacts

Using classical computer vision:

detect contours
select the largest contour (the artefact)
generate:
- binary masks
- bounding boxes

This works well because the dataset follows a controlled format:

one object per image
dark background
labels separated from the artefact
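The idea can be illustrated with a dependency-free sketch. Note the substitution: instead of tracing contours, it thresholds the image and keeps the largest connected bright region, which amounts to the same thing for one solid object on a dark background. The threshold value and function name are assumptions. With OpenCV, the usual route would be cv2.threshold, cv2.findContours, cv2.contourArea and cv2.boundingRect.

```python
from collections import deque


def segment_largest_object(gray, threshold=60):
    """Return (mask, bbox) for the largest bright region of a grayscale image.

    gray is a list of rows of 0-255 values; bbox is [x, y, width, height].
    """
    height, width = len(gray), len(gray[0])
    # 1. Threshold: bright pixels are foreground, the dark background is not.
    binary = [[value > threshold for value in row] for row in gray]
    seen = [[False] * width for _ in range(height)]
    best = []
    # 2. Flood-fill every unvisited foreground pixel and keep the largest region.
    for y in range(height):
        for x in range(width):
            if not binary[y][x] or seen[y][x]:
                continue
            region, queue = [], deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                region.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if len(region) > len(best):
                best = region
    if not best:
        return None, None
    # 3. Binary mask and bounding box of the winning region.
    mask = [[0] * width for _ in range(height)]
    for x, y in best:
        mask[y][x] = 1
    xs = [x for x, _ in best]
    ys = [y for _, y in best]
    bbox = [min(xs), min(ys), max(xs) - min(xs) + 1, max(ys) - min(ys) + 1]
    return mask, bbox
```

Keeping only the largest region is what removes the labels: they are bright too, but smaller than the artefact and not connected to it.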

The result is a segmentation-ready dataset, without manual annotation.

Fig. 3 - Processed image of a biface

Why not deep learning?

One of the key decisions in this work is to avoid deep learning for segmentation.

Instead, the workflow uses:

thresholding
contour detection

This comes with trade-offs:

Pros

transparent
lightweight
easy to reproduce
no training data required

Cons

sensitive to lighting and contrast
less robust in complex images

For this specific dataset, the controlled photographic setup makes classical methods a very reasonable choice.

Outputs and reuse

The workflow produces:

CSV with full metadata
local image collection
UUID mapping table
COCO annotation file
segmentation masks and bounding boxes

Crucially:

The original images are not redistributed. Only derived data is shared.

This respects repository licensing while still enabling open workflows.

What this enables

Once the dataset is structured, a lot becomes possible:

morphometric analysis
computer vision experiments
cross-dataset comparisons
integration with other collections

More importantly, the process becomes:

reproducible
transparent
reusable
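As one small illustration of the morphometric use mentioned above, two simple shape descriptors can be read straight from the COCO annotations. The descriptor names and the function are my own illustrative choices. Values are in pixels and depend on how the artefact is oriented in the photograph, since the bounding box is axis-aligned.

```python
def shape_descriptors(coco):
    """Compute elongation and fill ratio for each annotation in a COCO dict."""
    names = {image["id"]: image["file_name"] for image in coco["images"]}
    rows = []
    for annotation in coco["annotations"]:
        _, _, width, height = annotation["bbox"]
        if width == 0 or height == 0:
            continue  # degenerate box, nothing to measure
        rows.append({
            "file_name": names[annotation["image_id"]],
            # Long side over short side of the bounding box (1.0 = square).
            "elongation": max(width, height) / min(width, height),
            # Share of the bounding box covered by the artefact outline.
            "fill_ratio": annotation["area"] / (width * height),
        })
    return rows
```

Joined with the metadata CSV through the UUID mapping table, values like these could be compared across raw materials or sites.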

Limitations (and why they matter)

This is not a universal solution.

The workflow depends on:

a stable website structure
a specific image format
one object per image
controlled backgrounds

Adapting it to other datasets may require:

new scraping logic
different segmentation strategies
possibly manual annotation or ML models

So this should be seen as a helper tool for one particular case, not a final solution.

A broader point

One of the key takeaways from this work is quite simple:

You do not always need complex systems to improve research workflows.

Sometimes:

a short script
a clear data structure
and a focus on reproducibility

are enough to unlock the value of an existing dataset.

Preprint and code

If you want the full technical details, you can read the preprint here.

The scripts are available on GitHub here and here.

Final thoughts

This project is intentionally modest.

It focuses on one dataset, one workflow, and one specific use case. But it shows something important:

legacy datasets are not “closed”
small tools can unlock them
reproducibility does not require complexity

And maybe most importantly:

many of these “little minions” already exist in our daily work; we just do not share them enough.

If you are attending CAA2026 and interested in this topic, feel free to come say hi before or after the talk (during the talk might be a bit distracting).