Barcelona Supercomputing Center

Metrics

33,620 Downloads

The BSC Dataverse is the institutional research data repository of the Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS). It enables the storage, sharing, and discovery of research data from BSC researchers, collaborators, and affiliated projects.

Computational Social Sciences & Humanities Dataverse

Earth Sciences

Life Sciences

BSC AI Factory

Red Española de Supercomputación

BSC Dissemination


61 to 70 of 107 Results


Composite Database of Ultra-Large Chemical Libraries

Sep 29, 2025 - Life Sciences

Filella Merce, Isaac; Guallar, Victor; Isaac Soul Garcia, 2025, "Composite Database of Ultra-Large Chemical Libraries", https://doi.org/10.82201/6043UT, BSC Dataverse, V1

The Composite Database consists of approximately 120 billion molecules sourced from five ultra-large (> 100 million compounds) and nine large publicly available chemical libraries. Developed to support early-stage drug discovery, it is the largest publicly available database of enlisted molecules, readily accessible for efficient analog searches an...

LaFrescat

Sep 18, 2025 - Language Technologies Laboratory

Baybars Kulebi; Marti Llopart; Jose Omar Giraldo Valencia; Armentano i Oller, Carme; Alexandre Cristian Peiro Lilja, 2025, "LaFrescat", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/ZP7GNT, BSC Dataverse, V1

We present LaFresCat, the first Catalan multiaccented and multispeaker dataset. This dataset was created with recordings from professional voice actors at Lafresca Creative Studio. The processing pipeline included several steps: trimming long silences, resampling from 48 kHz to 22.05 kHz, and stereo-to-mono conversion. The dataset totals 3.75 hours afte...

cv17_es_other_automatically_verified

Sep 18, 2025 - Language Technologies Laboratory

Hernández Mena, Carlos Daniel; Armentano i Oller, Carme, 2025, "cv17_es_other_automatically_verified", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/2DFJYA, BSC Dataverse, V1

"cv17_es_other_automatically_verified" is, as the name suggests, the result of the automatic validation of the "other" portion of Common Voice 17.0. The validation process was carried out using OpenAI's Whisper large model. If Whisper produces the same text as the Common Voice prompt, the transcription is considered valid regardless of its votes....

openslr-slr69-ca-trimmed-denoised

Sep 18, 2025 - Language Technologies Laboratory

Baybars Kulebi; Armentano i Oller, Carme; Alexandre Cristian Peiro Lilja, 2025, "openslr-slr69-ca-trimmed-denoised", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/OASOAN, BSC Dataverse, V1

This is a post-processed version of the Catalan subset of the Open Speech and Language Resources (OpenSLR) speech dataset, specifically the subset OpenSLR-69. We processed the Catalan OpenSLR data with the following recipe: trimming, resampling, and denoising.

parlament_parla

Sep 18, 2025 - Language Technologies Laboratory

Külebi, Baybars, 2025, "parlament_parla", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/RBGNSJ, BSC Dataverse, V1

This is the ParlamentParla speech corpus of more than 600 hours of speech from Catalan Parliament sessions. The audio segments were extracted from recordings of the Catalan Parliament (Parlament de Catalunya) plenary sessions, which took place between 2007/07/11 and 2018/07/17. We aligned the transcriptions with the recordings and extracted the corpus....

festcat_trimmed_denoised

Sep 18, 2025 - Language Technologies Laboratory

Jose Omar Giraldo Valencia; Armentano i Oller, Carme; Peiró-Lilja, Alex; Llopart, Martí, 2025, "festcat_trimmed_denoised", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/SJXGXT, BSC Dataverse, V1

This is a post-processed version of the Catalan Festcat speech dataset. We processed the data of the Catalan Festcat with the following recipe: Trimming (Long silences from the start and the end of clips have been removed), Resampling (From 48000 Hz to 22050 Hz, which is the most common sampling rate for training TTS models), Denoising (Although ba...

CAESAR-TV3

Sep 18, 2025 - Language Technologies Laboratory

Messaoudi, Abir; Romero Diaz, Jacobo; Francisco Javier Hernando Pericas, 2025, "CAESAR-TV3", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/Z0E9OV, BSC Dataverse, V1

This corpus includes 5 hours and 45 minutes of Catalan speech code-switched with Spanish extracted from the original tv3_parla dataset.

commonvoice_benchmark_catalan_accents

Sep 18, 2025 - Language Technologies Laboratory

Armentano i Oller, Carme; Hernández Mena, Carlos Daniel; Külebi, Baybars, 2025, "commonvoice_benchmark_catalan_accents", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/WOYNOE, BSC Dataverse, V1

This is a new presentation of the corpus Catalan Common Voice v17 - metadata annotated version with the splits redefined to benchmark ASR models with various Catalan accents: From the validated recording split, we have selected, for each of the main accents of the language (balearic, central, northern, northwestern, valencian), the necessary male a...

parlament_parla_v3

Sep 18, 2025 - Language Technologies Laboratory

Solito, Sarah; Messaoudi, Abir; Külebi, Baybars, 2025, "parlament_parla_v3", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/JSUTRR, BSC Dataverse, V1

'parlament_parla_v3' is a speech corpus composed of Catalan Parliamentary Sessions. The v3 and last version of the corpus includes both clean and other quality segments, divided into short segments (less than 30 seconds) and long segments (more than 30 seconds). The total dataset encompasses 1059h 48m 04s of speech, including 945h 51m 06s for the sh...

corts_valencianes_asr_a

Sep 18, 2025 - Language Technologies Laboratory

Messaoudi, Abir; Solito, Sarah; Külebi, Baybars, 2025, "corts_valencianes_asr_a", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/YFCUPS, BSC Dataverse, V1

The Corts Valencianes Speech Corpus is a rich dataset composed of speech recordings from the sessions of the Corts Valencianes. The corpus includes both clean and other quality segments, divided into short segments (less than 30 seconds) and long segments (more than 30 seconds). The total dataset encompasses 270 hours, 5 minutes, and 34 seconds of...
