Yushi Yang
Yushi is a PhD researcher in Social Data Science at the University of Oxford, where she is a member of the Reasoning with Machines Lab led by Prof. Adam Mahdi. Her research focuses on LLM post-training, Agents, and AI alignment, published at EMNLP and ICML.
Alongside her doctoral research, Yushi joined Meta Superintelligence Labs as a Research Scientist Intern on the RL & Agents Team. She has previously worked as an AI Engineer (Agents) at Reply, a Data Analyst at UNDP, and an AI Researcher at the University of Oxford and Imperial College London.
EducationMath + Stats + AI
2023 - Present
PhD in Social Data Science
University of Oxford, Oxford Internet Institute
Oxford Internet Institute Scholarship for Exceptional Merit
2021 - 2022
MSc in Artificial Intelligence
Imperial College London
Software Engineering Project Corporate Partnership Prize (87/100)
2020 - 2021
MSc in Statistical Science
University of Oxford
Thesis Estimating Romania Modern Slavery Rates
2017 - 2020
BSc in Mathematics, Operational Research, Statistics and Economics (MORSE)
University of Warwick
Academic Excellence Prizes 2018 & 2019 (Top 1% and 3%)
ExperienceAI Researcher + AI Engineer
2026
Research Scientist Intern
Meta
RL & Agents Team at Meta Superintelligence Labs
2025
AI Engineer
Reply
Develop multi-agent systems for automative, aviation, banking clients
2023
Recommender System Engineer
Oxford Human-Centred Computing Group
Built collaborative filtering for movie recommendations on a decentralised data platform
2022 - 2023
Data Analyst Intern
United Nations Development Programme (UNDP)
Automated knowledge entity mapping on Microsoft Viva Topics for UN projects
2022
Multimodal AI Researcher
Imperial College London
Trained visual-language models to classify viewer emotions to artworks
2022
AI Engineer
The Trade Desk & Imperial College London
Built BERT-based Python package for brand sentiment analysis in public news deployed to AWS
PublicationsLLM Post-training + Agents
EMNLP 2026
(under review)
Agentic Reinforcement Learning for Search Misaligns Instruction-tuning
Agentic RL induces multi-step harmful searches, but a simple representation-aware RL penalty restores benign search behaviour.
ICML 2026
It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents
Web agents can be hijacked to unreliably execute tasks on the web.
EMNLP 2025
How Does Direct Preference Optimization Reduce Toxicity? A Mechanistic Analysis
DPO's preference tuning can be replicated training-free via activation editing in MLPs.
NeurIPS 2026
(under review)
RepSelect: Robust LLM Unlearning via Representation Selectivity
Restricting weight updates to forget-specific directions gives robust LLM knowledge removal.
NeurIPS 2024 Workshop
Evaluating Fine-Tuning Efficiency of Human-Inspired Learning Strategies in Medical Question Answering
Curriculum learning in SFT improves medical QA accuracy by 2%
EMNLP 2025
LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations
LLMs do not reliably self-predict the key determinants of their high-stakes decisions.
NeurIPS 2025
Measuring What Matters: Construct Validity in Large Language Model Benchmarks
A wide range of LLM benchmarks do not measure what they claim.