Nov 3, 2025 - Language Technologies Laboratory
De Luca Fornaciari, Francesca; Villegas, Marta; Melero, Maite; Mash, Audrey, 2025, "CA-EN_Parallel_Corpus", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/ERUHKY, BSC Dataverse, V2
The CA-EN Parallel Corpus is a Catalan-English textual dataset of parallel sentences created to support Catalan in NLP tasks, specifically Machine Translation. The dataset can be used to train Bilingual Machine Translation models between English and Catalan in either direction, as well as Multilingual Machine Translation models. |
Sep 29, 2025 - Language Technologies Laboratory
Rodriguez-Penagos, Carlos; Armentano i Oller, Carme; Villegas, Marta, 2025, "XitXat", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/642QYD, BSC Dataverse, V2
XitXat is a conversational dataset consisting of 950 chatbot–user conversations across 10 different domains. The conversations were created using the Wizard-of-Oz method. User interactions are annotated with intents and relevant slots, following the attached annotation guidelines. The dataset is designed to support research in natural language unde... |
Sep 29, 2025 - Language Technologies Laboratory
Aula-Blasco, Javier; Falcão, Júlia; Villegas, Marta; Sotelo, Susana; Paniagua Suárez, Silvia, 2025, "VeritasQA", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/65WHD9, BSC Dataverse, V2
VeritasQA is a context‑ and time‑independent truthfulness benchmark, comprising 353 questions and answers inspired by common misconceptions that are not tied to a specific country or recent event. It is designed for multilingual transferability, offering versions in Spanish, Catalan, Galician, and English. The benchmark aims to test the tendency of... |
Sep 29, 2025 - Language Technologies Laboratory
Rivera Hidalgo de Torralba, Paula; Gonzalez-Agirre, Aitor; Villegas, Marta; Aula-Blasco, Javier; Saiz Antón, José Javier, 2025, "EQ-Bench_ca", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/UECWEX, BSC Dataverse, V2
EQ‑bench_ca is the Catalan translation and linguistic adaptation of EQ‑Bench, a dataset for evaluating emotional reasoning in language models via dialogue prompts. It is intended to reflect how emotional expression and perception vary across languages, enabling evaluation in Catalan. |
Sep 29, 2025 - Language Technologies Laboratory
Saiz Antón, José Javier; Rivera Hidalgo de Torralba, Paula; Gonzalez-Agirre, Aitor; Villegas, Marta; Aula-Blasco, Javier, 2025, "EQ-bench_es", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/PVIYPG, BSC Dataverse, V3
EQ‑bench_es is the Spanish translation and adaptation of EQ‑Bench, designed for evaluating emotional reasoning in language models via dialogue prompts in Spanish. |
Sep 29, 2025 - Language Technologies Laboratory
Armentano i Oller, Carme; Rodriguez-Penagos, Carlos; Gonzalez-Agirre, Aitor; Villegas, Marta, 2025, "WikiCAT_ca", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/TIODTW, BSC Dataverse, V1
WikiCAT_ca is a text classification dataset in **Catalan**, automatically constructed from Catalan Wikipedia and Wikidata sources. Its purpose is to support multi-class topic classification of Wikipedia article text. The dataset contains article texts paired with one of 13 topic labels (e.g. “Ciència i Tecnologia”, “Economia”, “Esport”). It i... |
Sep 29, 2025 - Language Technologies Laboratory
Ruiz-Fernández, Valle; Gonzalez-Agirre, Aitor; Villegas, Marta; Falcão, Júlia; Vasquez Reina, Luis Antonio, 2025, "CaBBQ", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/1OO2M0, BSC Dataverse, V1
CaBBQ is the Catalan adaptation of the BBQ benchmark, adjusted to the Catalan language and the social context of Spain. It aims to evaluate social bias in language models via a multiple-choice QA task, following the same 10 social categories as EsBBQ. |
Sep 29, 2025 - Language Technologies Laboratory
Ruiz-Fernández, Valle; Gonzalez-Agirre, Aitor; Falcão, Júlia; Vasquez Reina, Luis Antonio; Villegas, Marta, 2025, "EsBBQ", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/MJGCT3, BSC Dataverse, V1
EsBBQ is an adaptation of the original BBQ benchmark to Spanish and the Spanish social context. It is used to evaluate social bias in language models via a multiple‑choice question answering task along 10 social categories (Age, Disability, Gender, LGBTQIA, Nationality, Physical Appearance, Race/Ethnicity, Religion, Socioeconomic Status, Spanis... |
Sep 29, 2025 - Language Technologies Laboratory
Gutiérrez-Fandiño, Asier; Armengol-Estapé, Jordi; de Gibert, Ona; Gonzalez-Agirre, Aitor; Armentano i Oller, Carme; Rodriguez-Penagos, Carlos; Villegas, Marta, 2025, "ViquiQuAD", https://dataverse.bsc.es/dataset.xhtml?persistentId=perma:BSC/0DRCY6, BSC Dataverse, V1
ViquiQuAD is an extractive question answering (QA) dataset in Catalan, built from original Catalan Wikipedia articles (i.e. not translations). The dataset contains contexts drawn from Wikipedia, and for each context, 1 to 5 QA pairs in which the answer is a span in the context. It is intended for training and evaluating extractive-QA models in Cata... |
Sep 29, 2025 - Life Sciences
Filella Merce, Isaac; Guallar, Victor; Isaac Soul Garcia, 2025, "Composite Database of Ultra-Large Chemical Libraries", https://doi.org/10.82201/6043UT, BSC Dataverse, V1
The Composite Database consists of approximately 120 billion molecules sourced from five ultra-large (> 100 million compounds) and nine large publicly available chemical libraries. Developed to support early-stage drug discovery, it is the largest publicly available database of enlisted molecules, readily accessible for efficient analog searches an... |
