Replication Data for: Beyond the Link: Assessing LLMs’ Ability to Classify Political Content Across Global Media (doi:10.82201/8UPPY6)

Document Description

Citation

Title:

Replication Data for: Beyond the Link: Assessing LLMs’ Ability to Classify Political Content Across Global Media

Identification Number:

doi:10.82201/8UPPY6

Distributor:

BSC Dataverse

Date of Distribution:

2025-10-23

Version:

2

Bibliographic Citation:

DE LA FUENTE CUESTA, ALEJANDRO; Alberto Martínez Serra; Nienke Visscher; Cardenal, Ana S., 2025, "Replication Data for: Beyond the Link: Assessing LLMs’ Ability to Classify Political Content Across Global Media", https://doi.org/10.82201/8UPPY6, BSC Dataverse, V2, UNF:6:j4VEdzl/g0wp+AX0/Pik1w== [fileUNF]

Study Description

Citation

Title:

Replication Data for: Beyond the Link: Assessing LLMs’ Ability to Classify Political Content Across Global Media

Identification Number:

doi:10.82201/8UPPY6

Authoring Entity:

DE LA FUENTE CUESTA, ALEJANDRO (Barcelona Supercomputing Center)

Alberto Martínez Serra (https://ror.org/05sd8tv96)

Nienke Visscher (Barcelona Supercomputing Center)

Cardenal, Ana S. (https://ror.org/01f5wp925)

Date of Production:

2025-10-22

Software used in Production:

R

Grant Number:

C005/24-ED CV1

Distributor:

BSC Dataverse

Access Authority:

DE LA FUENTE CUESTA, ALEJANDRO

Access Authority:

Ana Sofia Cardenal

Depositor:

DE LA FUENTE CUESTA, ALEJANDRO

Date of Deposit:

2025-10-22

Holdings Information:

https://doi.org/10.82201/8UPPY6

Study Scope

Keywords:

Social Sciences, Large Language Models, Media analysis, Political Content, URL

Topic Classification:

Large Language Models

Abstract:

This dataset and replication package accompany the paper “Beyond the Link: Assessing LLMs’ Ability to Classify Political Content Across Global Media” by Alejandro De La Fuente-Cuesta, Alberto Martínez-Serra, R. Nienke Visscher, and Ana S. Cardenal (2025). The materials include the data and code necessary to reproduce all analyses presented in the main text and the Supplementary Information.

The study investigates whether large language models (LLMs) can accurately classify political versus non-political news content based solely on URLs, compared to full-text analysis, across five countries (France, Germany, Spain, the UK, and the US). Using web-tracking data and manually coded ground-truth labels, we benchmark multiple state-of-the-art LLMs (Gemma-3-27B, Mistral-3.1-24B, Qwen-32B, Llama-3.1-8B, and DeepSeek-R1-Distill-Qwen-7B) to assess their performance, precision–recall trade-offs, and sources of bias in URL-only classification.

The dataset is derived from web-tracking records of news consumption across five democratic countries. A subset of 1,140 URLs was manually coded by human annotators as either Political (POL) or Non-political (NON) content to serve as the gold standard. LLM predictions were then compared against these human labels to compute accuracy, F1, precision, recall, and Cohen’s Kappa metrics. All personal data were anonymized before analysis, and all procedures complied with GDPR and institutional ethical guidelines.

Replication Instructions:

  • Open the .Rmd files in RStudio (R ≥ 4.2).

  • Install the packages listed in the setup section.

  • Knit the documents to reproduce the corresponding .html outputs.

All analyses use open-source R packages and can be fully reproduced on any standard machine.

If you use these data or materials, please cite: Martínez-Serra, A., De la Fuente-Cuesta, A., Visscher, R. N., & Cardenal, A. S. (2025). Beyond the Link: Assessing LLMs’ Ability to Classify Political Content Across Global Media.
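The agreement metrics named in the abstract (accuracy, precision, recall, F1, and Cohen's Kappa) can be illustrated with a minimal sketch. It is written in Python for convenience and is not the authors' R code; the function name `binary_metrics` and the toy labels are invented for illustration, and only the POL/NON coding comes from the abstract.

```python
from collections import Counter


def binary_metrics(gold, pred, positive="POL"):
    """Accuracy, precision, recall, F1 and Cohen's kappa for two label lists."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    n = len(gold)
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)

    accuracy = sum(g == p for g, p in pairs) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Cohen's kappa: observed agreement corrected for the agreement expected
    # by chance, estimated from each rater's marginal label frequencies.
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    expected = sum(gold_counts[c] * pred_counts[c] for c in gold_counts) / (n * n)
    # When chance agreement is 1 kappa is undefined; 1.0 is returned by convention.
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 1.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# Toy labels, invented for illustration only (not taken from the dataset).
gold = ["POL", "POL", "NON", "NON", "POL", "NON"]
pred = ["POL", "NON", "NON", "NON", "POL", "POL"]
metrics = binary_metrics(gold, pred)
```

Treating POL as the positive class is an assumption made here; the record does not say which class the reported precision and recall refer to.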

Time Period:

2022-02-22 to 2022-06-05

Unit of Analysis:

LLMs

Universe:

Top open-source LLMs as of April 2025

Kind of Data:

News

Notes:

Related Material: https://arxiv.org/pdf/2506.17435

Methodology and Processing

Type of Research Instrument:

Structured

Sources Statement

Data Access

Notes:

CC BY 4.0 (http://creativecommons.org/licenses/by/4.0)

Other Study Description Materials

File Description--f8917

File: metadata_llm_classification.tab

  • Number of cases: 11400

  • No. of variables per record: 11

  • Type of File: text/tab-separated-values

Notes:

UNF:6:3JOQw070RCv+caIx4qwiRg==

The data needed for the replication of the main text.
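As a rough illustration of how a long-format file like this one might be summarised, the sketch below reads tab-separated rows and computes the share of predictions that match the human label for each model and input source. It is Python rather than the authors' R, and it rests on several assumptions: that the downloaded file carries a header row, that `pred` and `human` share the same label coding, and that `source` distinguishes the input given to the model. The column names follow the variable list in this codebook; every value in the sample is invented.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-in for metadata_llm_classification.tab.
# Only a subset of the 11 columns is shown; all values are hypothetical.
SAMPLE_TAB = (
    "url\tcountry\tmodel\tsource\thuman\tpred\n"
    "https://example.org/a\tES\tmodel_a\turl\tPOL\tPOL\n"
    "https://example.org/a\tES\tmodel_a\ttext\tPOL\tPOL\n"
    "https://example.org/b\tES\tmodel_a\turl\tNON\tPOL\n"
    "https://example.org/b\tES\tmodel_a\ttext\tNON\tNON\n"
)


def accuracy_by_group(handle, keys=("model", "source")):
    """Share of rows where pred equals human, per combination of `keys`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in csv.DictReader(handle, delimiter="\t"):
        group = tuple(row[k] for k in keys)
        totals[group] += 1
        hits[group] += row["pred"] == row["human"]  # bool counts as 0 or 1
    return {group: hits[group] / totals[group] for group in totals}


result = accuracy_by_group(io.StringIO(SAMPLE_TAB))
```

To run this on the real file, the `io.StringIO` handle would be replaced with `open("metadata_llm_classification.tab", newline="", encoding="utf-8")`. The match is recomputed from `pred` and `human` rather than read from `match_human`, because the record does not document how that column is coded.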

File Description--f8914

File: SI_metadata_llm_classification.tab

  • Number of cases: 2280

  • No. of variables per record: 13

  • Type of File: text/tab-separated-values

Notes:

UNF:6:VV9+42PqeCoCKyXOS7EEPQ==

The data needed for the replication of the Supplementary Information.
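Judging by the variable list, this file is laid out with one column per classifier (Dictionary, GPT, Deepseek, Mistral, Qwen, Gemma, Llama) next to the Human column. A possible way to bring it into a one-row-per-prediction shape is sketched below in Python (not the authors' R code). It assumes those seven columns hold labels in the same coding as Human, which the record does not confirm; the sample values and the helper name `wide_to_long` are invented.

```python
import csv
import io

# Classifier columns as named in the variable list for file f8914.
CLASSIFIER_COLUMNS = ["Dictionary", "GPT", "Deepseek", "Mistral",
                      "Qwen", "Gemma", "Llama"]

# Tiny in-memory stand-in for SI_metadata_llm_classification.tab.
# Only a subset of the 13 columns is shown; all values are hypothetical.
SAMPLE_WIDE = (
    "URL\tcountry\tHuman\tDictionary\tGPT\tDeepseek\tMistral\tQwen\tGemma\tLlama\n"
    "https://example.org/a\tDE\tPOL\tNON\tPOL\tPOL\tPOL\tPOL\tPOL\tNON\n"
    "https://example.org/b\tDE\tNON\tNON\tNON\tPOL\tNON\tNON\tNON\tNON\n"
)


def wide_to_long(handle):
    """Turn one row per URL into one row per (URL, classifier) pair."""
    long_rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for name in CLASSIFIER_COLUMNS:
            long_rows.append({
                "url": row["URL"],
                "country": row["country"],
                "classifier": name,
                "human": row["Human"],
                "pred": row[name],
            })
    return long_rows


long_rows = wide_to_long(io.StringIO(SAMPLE_WIDE))
```

Each input row expands into seven output rows, so the two sample rows give fourteen records that can then be grouped by classifier or country.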

Variable Description

List of Variables:

Variables

url

f8917 Location:

Variable Format: character

Notes: UNF:6:nuhhgLaoWxwcOIckIpUuSQ==

country

f8917 Location:

Variable Format: character

Notes: UNF:6:NVbZ2uYlVkWaBa4vh2DexA==

model

f8917 Location:

Variable Format: character

Notes: UNF:6:xfGwQdIbHeQ4zDMcXX7DlA==

source

f8917 Location:

Variable Format: character

Notes: UNF:6:xpbKpjnW4oqUn66QAOj3Lg==

human

f8917 Location:

Variable Format: character

Notes: UNF:6:WmkGslO9HgITyAY0jDrjhA==

encoder1

f8917 Location:

Variable Format: character

Notes: UNF:6:2nEgjXh39uUvlMa5HFBsfA==

encoder2

f8917 Location:

Variable Format: character

Notes: UNF:6:LHUdxLsCnVBweNYBfLyeVw==

pred

f8917 Location:

Variable Format: character

Notes: UNF:6:8NJXkN70+3znlX++JuBrLQ==

match_human

f8917 Location:

Variable Format: character

Notes: UNF:6:kBzBPdf3ZsU6oE7yfl1GAw==

match_c1

f8917 Location:

Variable Format: character

Notes: UNF:6:T0FyBqX2zNHqkQAa/khLkw==

political_position_text

f8917 Location:

Summary Statistics: Mean 7.2656551724137115; Min. 1.0; Valid 7250.0; Max. 705.0; StDev 28.473690455640313

Variable Format: numeric

Notes: UNF:6:7inlR1If6qg3pFl9lcXRag==

source

f8914 Location:

Variable Format: character

Notes: UNF:6:3KSj/q2CDB3AGg128S1qyQ==

country

f8914 Location:

Variable Format: character

Notes: UNF:6:AxacQbjXHyjoDx47hU8kBA==

Human

f8914 Location:

Variable Format: character

Notes: UNF:6:lz5MN1i7ouXUi0jOA/FEGw==

Dictionary

f8914 Location:

Variable Format: character

Notes: UNF:6:yVqEvQGJQaOqcOyXd+P9YA==

GPT

f8914 Location:

Variable Format: character

Notes: UNF:6:YW91E5doPz+IQMrK/BzHIw==

Deepseek

f8914 Location:

Variable Format: character

Notes: UNF:6:33G3MteSM+tBYnvjKXhcGg==

Mistral

f8914 Location:

Variable Format: character

Notes: UNF:6:vN1DXvPN0sP81v//QT0zgQ==

Qwen

f8914 Location:

Variable Format: character

Notes: UNF:6:vKT2kuHcOofe6bf6j8uaeg==

Gemma

f8914 Location:

Variable Format: character

Notes: UNF:6:RDuCvGZZrgIxCzIaSC6o7g==

Llama

f8914 Location:

Variable Format: character

Notes: UNF:6:MNtSGg18LvdTqE9Z1j4G2w==

extension

f8914 Location:

Variable Format: character

Notes: UNF:6:gfcKP/ejOyo5dIpqgPhX3A==

URL

f8914 Location:

Variable Format: character

Notes: UNF:6:EElL1HlimSs4MqIP7G8CdQ==

text

f8914 Location:

Variable Format: character

Notes: UNF:6:FbH76btPuTn3jXbU7XHEfA==

Other Study-Related Materials

Label:

README.md

Notes:

text/markdown

Other Study-Related Materials

Label:

Replication_Material.html

Text:

Replication HTML of all the tables and figures in the paper.

Notes:

text/html

Other Study-Related Materials

Label:

Replication_Material.Rmd

Text:

Replication Code of all the tables and figures in the paper.

Notes:

text/x-r-notebook

Other Study-Related Materials

Label:

Replication_SI.html

Text:

The HTML with the code and results of the Supplementary Information.

Notes:

text/html

Other Study-Related Materials

Label:

Replication_SI.Rmd

Text:

The R code for the replication of the Supplementary Information.

Notes:

text/x-r-notebook