Sep 18, 2025 - Language Technologies Laboratory
Baybars Kulebi; Armentano i Oller, Carme; Alexandre Cristian Peiro Lilja, 2025, "openslr-slr69-ca-trimmed-denoised", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/OASOAN, BSC Dataverse, V1
This is a post-processed version of the Catalan subset of the Open Speech and Language Resources (OpenSLR) speech dataset, specifically the subset OpenSLR-69. We processed the Catalan OpenSLR data with the following recipe: trimming, resampling, denoising.
Sep 18, 2025 - Language Technologies Laboratory
Külebi, Baybars, 2025, "parlament_parla", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/RBGNSJ, BSC Dataverse, V1
This is the ParlamentParla speech corpus of more than 600 hours of speech from Catalan Parliament sessions. The audio segments were extracted from recordings of the Catalan Parliament (Parlament de Catalunya) plenary sessions, which took place between 2007/07/11 and 2018/07/17. We aligned the transcriptions with the recordings and extracted the corpus....
Sep 18, 2025 - Language Technologies Laboratory
Jose Omar Giraldo Valencia; Armentano i Oller, Carme; Peiró-Lilja, Alex; Llopart, Martí, 2025, "festcat_trimmed_denoised", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/SJXGXT, BSC Dataverse, V1
This is a post-processed version of the Catalan Festcat speech dataset. We processed the data of the Catalan Festcat with the following recipe: Trimming (long silences from the start and the end of clips have been removed), Resampling (from 48000 Hz to 22050 Hz, which is the most common sampling rate for training TTS models), Denoising (Although ba...
Sep 18, 2025 - Language Technologies Laboratory
Messaoudi, Abir; Romero Diaz, Jacobo; Francisco Javier Hernando Pericas, 2025, "CAESAR-TV3", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/Z0E9OV, BSC Dataverse, V1
This corpus includes 5 hours and 45 minutes of Catalan speech code-switched with Spanish, extracted from the original tv3_parla dataset.
Sep 18, 2025 - Language Technologies Laboratory
Armentano i Oller, Carme; Hernández Mena, Carlos Daniel; Külebi, Baybars, 2025, "commonvoice_benchmark_catalan_accents", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/WOYNOE, BSC Dataverse, V1
This is a new presentation of the corpus Catalan Common Voice v17 - metadata annotated version with the splits redefined to benchmark ASR models with various Catalan accents: From the validated recording split, we have selected, for each of the main accents of the language (balearic, central, northern, northwestern, valencian), the necessary male a...
Sep 18, 2025 - Language Technologies Laboratory
Solito, Sarah; Messaoudi, Abir; Külebi, Baybars, 2025, "parlament_parla_v3", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/JSUTRR, BSC Dataverse, V1
'parlament_parla_v3' is a speech corpus composed of Catalan Parliamentary Sessions. The v3 and last version of the corpus includes both clean and other quality segments, divided into short segments (less than 30 seconds) and long segments (more than 30 seconds). The total dataset encompasses 1059h 48m 04s of speech, including 945h 51m 06s for the sh...
Sep 18, 2025 - Language Technologies Laboratory
Messaoudi, Abir; Solito, Sarah; Külebi, Baybars, 2025, "corts_valencianes_asr_a", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/YFCUPS, BSC Dataverse, V1
The Corts Valencianes Speech Corpus is a rich dataset composed of speech recordings from the sessions of the Corts Valencianes. The corpus includes both clean and other quality segments, divided into short segments (less than 30 seconds) and long segments (more than 30 seconds). The total dataset encompasses 270 hours, 5 minutes, and 34 seconds of...
Sep 18, 2025 - Language Technologies Laboratory
De Luca Fornaciari, Francesca; Mash, Audrey; Melero, Maite; Villegas, Marta, 2025, "CA-GL_Parallel_Corpus", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/VUTENU, BSC Dataverse, V1
The CA-GL Parallel Corpus is a Catalan-Galician synthetic dataset of parallel sentences created to support the use of co-official languages from Spain, such as Catalan and Galician, in NLP tasks, specifically Machine Translation. The dataset can be used to train Bilingual Machine Translation models between Galician and Catalan in any direction, as...
Sep 18, 2025 - Language Technologies Laboratory
De Luca Fornaciari, Francesca; Mash, Audrey; Melero, Maite; Villegas, Marta, 2025, "CA-EU_Parallel_Corpus", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/A9UJA9, BSC Dataverse, V1
The CA-EU Parallel Corpus is a Catalan-Basque synthetic dataset of parallel sentences created to support the use of co-official languages from Spain, such as Catalan and Basque, in NLP tasks, specifically Machine Translation. The dataset can be used to train Bilingual Machine Translation models between Basque and Catalan in any direction, as well a...
Sep 16, 2025 - Language Technologies Laboratory
Saiz Antón, José Javier; Palomar-Giner, Jorge; Villegas, Marta, 2025, "CATalog", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/FAFYBH, BSC Dataverse, V2
CATalog is a diverse, open-source Catalan corpus for language modelling. It consists of text documents from 26 different sources, including web crawling, news, forums, digital libraries and public institutions, totaling 17.45 billion words.
