|
Persistent Identifier
|
perma:BSC/0VB0MC |
|
Publication Date
|
2025-06-26 |
|
Title
| A Dataset for Handwritten Text Recognition in Medieval Notarial Charters Written on Parchment |
|
Alternative Title
| AMSMB HTR |
|
Author
| Coll Ardanuy, Mariona (ROR: https://ror.org/05sd8tv96; ORCID: https://orcid.org/0000-0001-8455-7196)
Cuadrada, Coral (ROR: https://ror.org/05sd8tv96; ORCID: https://orcid.org/0000-0003-4577-2381)
Sarobe, Ramon (ROR: https://ror.org/05sd8tv96; ORCID: https://orcid.org/0000-0003-2099-3567) |
|
Point of Contact
|
Coll Ardanuy, Mariona (Barcelona Supercomputing Center (BSC)) |
|
Description
| We present a new dataset for the task of handwritten text recognition on medieval manuscripts, focusing on notarial charters written on parchment dating from the 13th to 15th centuries. The dataset comprises 100 digitized manuscripts (3,369 lines), carefully selected to represent the large variation present in the sources, encompassing at least 80 distinct hands and various document types (from sales and inventories to last wills and marriage contracts). Written primarily in Medieval Latin with fragments in Medieval Catalan, these manuscripts exhibit varying states of preservation and degrees of deterioration, resulting in a very heterogeneous dataset. The dataset consists of 100 images and their associated transcriptions in PageXML format, as well as a CSV file with metadata for each element. The creation process is documented in the accompanying datasheet. (2025-06-26) |
|
Subject
| Arts and Humanities |
|
Keyword
| handwritten text recognition
Late Middle Ages
automatic transcription |
|
Topic Classification
| medieval history
artificial intelligence
archival history
diplomatics
paleography
document analysis and recognition |
|
Related Publication
| Is Supplement To: Mariona Coll Ardanuy, Iban Berganzo-Besga, Ramon Sarobe, and Coral Cuadrada. 2025 (forthcoming). Evaluating Handwritten Text Recognition in Medieval Notarial Manuscripts: A New Dataset and Comprehensive Analysis. In International Conference on Document Analysis and Recognition. |
|
Notes
| A derived version of the dataset for line-level handwritten text recognition is available on HuggingFace: https://huggingface.co/datasets/BSC-CSSH/AMSMB-line-transcription. The derived version was created using parse-pagexml: https://gitlab.bsc.es/cssh/releases/parse-pagexml. |
|
Language
| Latin; Catalan, Valencian |
|
Producer
| Barcelona Supercomputing Center (BSC), https://bsc.es/ |
|
Production Date
| 2025-03-14 |
|
Production Location
| Barcelona |
|
Contributor
| Other: Arxiu dels Marquesos de Santa Maria de Barberà
Other: Arxiu Municipal de Vilassar de Dalt |
|
Funding Information
| AI4S fellowship within the “Generación D” initiative by Red.es, Ministerio para la Transformación Digital y de la Función Pública, for talent attraction, funded by NextGenerationEU through PRTR: C005/24-ED CV1 |
|
Distributor
| Barcelona Supercomputing Center (BSC), https://bsc.es/ |
|
Distribution Date
| 2025-06-06 |
|
Depositor
| Coll Ardanuy, Mariona |
|
Deposit Date
| 2025-06-04 |
|
Time Period
| Start Date: 1208-01-01; End Date: 1499-12-31 |
|
Date of Collection
| Start Date: 2024-08-01; End Date: 2025-03-14 |
|
Data Type
| XML transcriptions; digitized manuscripts; CSV metadata file; PDF file |
|
Software
| eScriptorium, Version: 0.14.2 |
|
Origin of Historical Sources
| Arxiu dels Marquesos de Santa Maria de Barberà: https://arxiumarquesosdebarbera.cat/ |
|
Characteristic of Sources
| Notarial charters written on parchment dating from the 13th to 15th centuries, carefully selected to represent the large variation present in the sources, encompassing at least 80 distinct hands and various document types (from sales and inventories to last wills and marriage contracts). Written primarily in Medieval Latin with fragments in Medieval Catalan, these manuscripts exhibit varying states of preservation and degrees of deterioration. |
|
Documentation and Access to Sources
| The digitized images have been provided by the Arxiu Municipal de Vilassar de Dalt (AMVD), through its agreement with the Arxiu dels Marquesos de Santa Maria de Barberà (AMSMB). |
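
Since the transcriptions are distributed as PageXML, a reader may want to pull the text lines out of a file. The following is a minimal sketch, not the method used to build the derived HuggingFace version (that one was created with parse-pagexml). It assumes the usual PAGE layout in which each `TextLine` element holds its transcription in a direct `TextEquiv/Unicode` child, and it reads the namespace from the root element rather than hard-coding a schema version. The function name `read_pagexml_lines` and the sample document are hypothetical and written for illustration only; they are not taken from the dataset.

```python
import xml.etree.ElementTree as ET


def read_pagexml_lines(xml_text):
    """Return a list of (line_id, text) pairs from a PageXML string.

    Assumption: transcriptions live in TextLine/TextEquiv/Unicode.
    """
    root = ET.fromstring(xml_text)
    # The root tag looks like "{namespace}PcGts"; extract the namespace
    # so the code does not depend on one specific schema version.
    ns = root.tag[1:root.tag.index("}")] if root.tag.startswith("{") else ""

    def q(name):
        # Qualify a tag name with the document namespace, if any.
        return f"{{{ns}}}{name}" if ns else name

    lines = []
    for line in root.iter(q("TextLine")):
        # Only the line-level transcription (a direct child), not word-level ones.
        unicode_el = line.find(f"{q('TextEquiv')}/{q('Unicode')}")
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        lines.append((line.get("id"), text))
    return lines


# Invented sample for illustration; not a file from the dataset.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.jpg">
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv><Unicode>In nomine Domini</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2"/>
    </TextRegion>
  </Page>
</PcGts>"""

for line_id, text in read_pagexml_lines(SAMPLE):
    print(line_id, repr(text))
```

Lines without a transcription are returned with an empty string, so the number of pairs matches the number of `TextLine` elements in the file.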