Reproducible workflows for archaeological image datasets
This project develops a reproducible workflow for transforming legacy, web-based archaeological image collections into structured datasets suitable for computer vision and quantitative analysis. The work addresses a common problem in digital archaeology: while large image collections are increasingly available online, they are often difficult to access programmatically and are not provided in formats that support systematic reuse or machine learning.
The workflow was developed using the Lower Palaeolithic handaxe and biface collection curated by the Archaeology Data Service (ADS) as a case study. Two open-source components form the core of the approach. The first is a web scraping and metadata extraction pipeline that systematically retrieves record pages, downloads associated images, and captures archaeological metadata, while respecting repository terms of use and ethical scraping practices.
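A minimal sketch of this retrieval step, assuming a Python stack (`requests`, `BeautifulSoup`); the base URL, CSS selectors, and user-agent string are placeholders rather than the actual ADS page structure or the project's exact implementation:

```python
import time
import requests
from bs4 import BeautifulSoup
from urllib.robotparser import RobotFileParser

BASE_URL = "https://archaeologydataservice.ac.uk"        # illustrative repository root
USER_AGENT = "handaxe-workflow/0.1 (research use; contact address here)"
CRAWL_DELAY = 2.0                                         # seconds between requests

# Check robots.txt once before crawling.
robots = RobotFileParser(BASE_URL + "/robots.txt")
robots.read()

def fetch_record(url: str) -> dict | None:
    """Fetch one record page and pull out basic metadata plus image links."""
    if not robots.can_fetch(USER_AGENT, url):
        return None
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    title_tag = soup.select_one("h1")                     # placeholder selector
    record = {
        "source_url": url,
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "image_urls": [img["src"] for img in soup.select("img")
                       if img.get("src", "").lower().endswith((".jpg", ".png"))],
    }
    time.sleep(CRAWL_DELAY)                               # polite rate limiting
    return record
```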
The second component is an image processing and segmentation pipeline designed to prepare archaeological images for computer vision workflows. Images are renamed using UUID-based identifiers to avoid filename collisions, and classical computer vision techniques are used to generate binary masks and bounding boxes for individual artefacts. All derived information is stored in a COCO-compatible JSON file enriched with archaeological metadata, providing a standardised and reusable data structure.
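The segmentation step could be sketched roughly as follows, assuming OpenCV and photographs with a fairly uniform background; the Otsu thresholding choice and the exact annotation fields shown here are illustrative, not the project's verified implementation:

```python
import uuid
import cv2

def segment_artefact(image_path: str, image_index: int, metadata: dict) -> dict:
    """Threshold one artefact photograph and build a COCO-style image/annotation pair."""
    file_name = f"{uuid.uuid4().hex}.jpg"                 # UUID-based name avoids collisions
    img = cv2.imread(image_path)
    grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Classical segmentation: Otsu thresholding separates artefact from background.
    _, mask = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Keep the largest external contour as the artefact outline.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    outline = max(contours, key=cv2.contourArea)
    x, y, w, h = cv2.boundingRect(outline)

    return {
        "image": {"id": image_index, "file_name": file_name,
                  "width": img.shape[1], "height": img.shape[0]},
        "annotation": {"image_id": image_index,
                       "bbox": [x, y, w, h],                           # COCO: [x, y, width, height]
                       "area": float(cv2.contourArea(outline)),
                       "segmentation": [outline.flatten().tolist()]},  # polygon outline
        "archaeological_metadata": metadata,               # e.g. site, period, raw material
    }
```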
A key design principle of the project is the separation between original source data and derived products. To comply with ADS licensing conditions, original images are not redistributed. Instead, scripts, segmentation masks, outlines, annotations, and metadata schemas are shared openly, allowing other researchers to reproduce the workflow or apply it to similar collections without duplicating the most time-consuming steps.
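To make that separation concrete, the shared output might be a single JSON file bundling derived annotations and metadata but no image pixels; the field names in this sketch (for instance `archaeological_metadata`) are placeholders rather than the project's published schema:

```python
import json

def write_derived_dataset(records: list[dict], out_path: str) -> None:
    """Bundle derived products into one COCO-compatible file; original images are not copied."""
    dataset = {
        "info": {"description": "Derived annotations for the ADS handaxe and biface collection",
                 "note": "Original images remain with the ADS and are not redistributed."},
        "images": [r["image"] for r in records],                      # file names only, no pixels
        "annotations": [dict(r["annotation"], id=i, category_id=1)
                        for i, r in enumerate(records, start=1)],
        "categories": [{"id": 1, "name": "biface"}],
        # Non-standard extension: archaeological metadata keyed by COCO image id.
        "archaeological_metadata": {r["image"]["id"]: r["archaeological_metadata"]
                                    for r in records},
    }
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(dataset, f, indent=2)
```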
Overall, the project demonstrates how relatively lightweight tools can be used to unlock the analytical potential of existing archaeological image repositories, supporting more transparent, reproducible, and scalable computer vision research in archaeology.
The full technical description and validation of the workflow are available in the accompanying preprint: https://arxiv.org/abs/2512.11817