Yushi Yang

Yushi is a PhD researcher in Social Data Science at the University of Oxford, where she is a member of the Reasoning with Machines Lab led by Prof. Adam Mahdi. Her research focuses on LLM post-training, agents, and AI alignment, and has been published at EMNLP and ICML.

Alongside her doctoral research, Yushi joined Meta Superintelligence Labs as a Research Scientist Intern on the RL & Agents Team. She has previously worked as an AI Engineer (Agents) at Reply, a Data Analyst at UNDP, and an AI Researcher at the University of Oxford and Imperial College London.

Education: Math + Stats + AI

2023 - Present

PhD in Social Data Science

University of Oxford, Oxford Internet Institute

Oxford Internet Institute Scholarship for Exceptional Merit

2021 - 2022

MSc in Artificial Intelligence

Imperial College London

Software Engineering Project Corporate Partnership Prize (87/100)

2020 - 2021

MSc in Statistical Science

University of Oxford

Thesis: Estimating Romania Modern Slavery Rates

2017 - 2020

BSc in Mathematics, Operational Research, Statistics and Economics (MORSE)

University of Warwick

Academic Excellence Prizes 2018 & 2019 (Top 1% and 3%)

Experience: AI Researcher + AI Engineer

2026

Research Scientist Intern

Publications: LLM Post-training + Agents

EMNLP 2026
(under review)

Agentic Reinforcement Learning for Search Misaligns Instruction-tuning

Y Yang, S Padarha, S Ball, A Lee, A Mahdi

Search Agents, RL, Goal Misspecification

Agentic RL induces multi-step harmful searches, but a simple representation-aware RL penalty restores benign search behaviour.

ICML 2026

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

K Korgul, Y Yang, A Drohomirecki, P Błaszczyk, W Howard, L Aichberger, C Russell, P Torr, A Mahdi, A Bibi

Web Agents, Persuasion, Benchmarks

Web agents can be hijacked through persuasion, so they no longer reliably execute their tasks on the web.

EMNLP 2025

How Does Direct Preference Optimization Reduce Toxicity? A Mechanistic Analysis

Y Yang, F Sondej, H Mayne, A Lee, A Mahdi

LLM Post-training, Preference tuning

DPO's preference tuning can be replicated training-free via activation editing in MLPs.

NeurIPS 2026
(under review)

RepSelect: Robust LLM Unlearning via Representation Selectivity

F Sondej*, Y Yang*, A Mahdi

LLM Post-training, Knowledge Removal

Restricting weight updates to forget-specific directions gives robust LLM knowledge removal.

NeurIPS 2024 Workshop

Evaluating Fine-Tuning Efficiency of Human-Inspired Learning Strategies in Medical Question Answering

Y Yang, A Bean, R McCraith, A Mahdi

LLM Post-training, Continual Learning

Curriculum learning in SFT improves medical QA accuracy by 2%.

EMNLP 2025

LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations

H Mayne, R Kearns, Y Yang, A Bean, E Delaney, C Russell, A Mahdi

Reasoning Monitoring Self-introspection

LLMs do not reliably self-predict the key determinants of their high-stakes decisions.

NeurIPS 2025

Measuring What Matters: Construct Validity in Large Language Model Benchmarks

A Bean, R Kearns, … Y Yang … A Mahdi

Science of Evaluation, Benchmarks

A wide range of LLM benchmarks do not measure what they claim.