Sep 18, 2025 - Language Technologies Laboratory
De Luca Fornaciari, Francesca; Mash, Audrey; Melero, Maite; Villegas, Marta, 2025, "CA-GL_Parallel_Corpus", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/VUTENU, BSC Dataverse, V1
The CA-GL Parallel Corpus is a Catalan-Galician synthetic dataset of parallel sentences created to support the use of co-official languages from Spain, such as Catalan and Galician, in NLP tasks, specifically Machine Translation. The dataset can be used to train Bilingual Machine Translation models between Galician and Catalan in any direction, as...
Sep 18, 2025 - Language Technologies Laboratory
De Luca Fornaciari, Francesca; Mash, Audrey; Melero, Maite; Villegas, Marta, 2025, "CA-EU_Parallel_Corpus", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/A9UJA9, BSC Dataverse, V1
The CA-EU Parallel Corpus is a Catalan-Basque synthetic dataset of parallel sentences created to support the use of co-official languages from Spain, such as Catalan and Basque, in NLP tasks, specifically Machine Translation. The dataset can be used to train Bilingual Machine Translation models between Basque and Catalan in any direction, as well a...
Sep 16, 2025 - Language Technologies Laboratory
Saiz Antón, José Javier; Palomar-Giner, Jorge; Villegas, Marta, 2025, "CATalog", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/FAFYBH, BSC Dataverse, V2
CATalog is a diverse, open-source Catalan corpus for language modelling. It consists of text documents from 26 different sources, including web crawling, news, forums, digital libraries and public institutions, totaling 17.45 billion words.
