Shashwat Goel

I am an AI researcher, currently co-advised by Jonas Geiping and Douwe Kiela through the ELLIS PhD program. I am interested in novel ways to scale supervision for models, to make them more useful and safe. I believe any capability that can be measured can be optimized. Thus my work focuses on evaluating capabilities beyond knowledge recall. The hope is that these evals allow models to learn from scalable, yet grounded rewards in open-ended, long-horizon settings. Currently, I’m interested in AI capabilities that can accelerate research productivity:

  • Reasoning: making decisions under conflicting evidence or uncertainty
  • Curiosity: seeking information by asking the right questions
  • Falsifying: identifying mistakes in claims, hypotheses, solutions
  • Novelty: coming up with creative solutions to challenging problems
  • Collaboration: working effectively alongside humans to increase joint productivity

See this blog for research I’m excited about towards building generally intelligent agents. If you’re interested in any of these problems, reach out; I enjoy getting emails!

News

13 July 2025: Presenting 5 works at ICML 2025, checkpointing the first year of my PhD. Happy to chat!
Main Track: Great Models Think Alike and this Undermines AI Oversight (Spotlight), Corrective Unlearning in GNNs.
Assessing World Models Workshop: Measuring Belief Updates in Curious Agents, Pitfalls in Evaluating Language Model Forecasters, Answer Matching Outperforms Multiple Choice for Language Model Evaluations.
23 June 2025: Starting as a Research Scientist intern at Meta GenAI London
26 February 2025: Can Language Models Falsify? selected for an Oral Presentation at the ICLR SSI-FM Workshop
13 October 2024: Corrective Machine Unlearning accepted at TMLR
2 September 2024: Starting a PhD in Tübingen (Germany), co-advised by Jonas Geiping (ELLIS, MPI-IS) and Douwe Kiela (Contextual AI, Stanford)
24 May 2024: Defended my master's thesis on New Frontiers for Machine Unlearning at IIIT Hyderabad
1 May 2024: The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning accepted at ICML 2024

Selected Publications

Measuring Belief Updates in Curious Agents
Joschka Strüber, Ilze Amanda Auzina, Shashwat Goel, Susanne Keller, Jonas Geiping, Ameya Prabhu, Matthias Bethge
ICML Workshop on Assessing World Models, 2025.

Pitfalls in Evaluating Language Model Forecasters
Daniel Paleka*, Shashwat Goel*, Jonas Geiping, Florian Tramèr
ICML Workshop on Assessing World Models, 2025.

Answer Matching Outperforms Multiple Choice for Language Model Evaluations
Nikhil Chandak*, Shashwat Goel*, Ameya Prabhu, Moritz Hardt, Jonas Geiping
ICML Workshop on Assessing World Models, 2025.

Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
Shiven Sinha, Shashwat Goel, Ponnurangam Kumaraguru, Jonas Geiping, Matthias Bethge, Ameya Prabhu
(Oral) ICLR Scaling Self Improving Foundation Models Workshop, 2025.
[webpage], [code], [data]

Great Models Think Alike and this Undermines AI Oversight
Shashwat Goel, Joschka Strüber, Ilze Amanda Auzina, Karuna Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping
(Spotlight) ICML, 2025.
[code], [tool], [data]

Corrective Machine Unlearning
Shashwat Goel*, Ameya Prabhu*, Philip Torr, Ponnurangam Kumaraguru, Amartya Sanyal
Transactions on Machine Learning Research (TMLR) 2024
Workshop on Data-centric Machine Learning (DMLR), Recommended for Journal (Top 15), at the 12th International Conference on Learning Representations (ICLR), 2024.
[twitter], [code]

The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning
Center for AI Safety, Scale AI
International Conference on Machine Learning (ICML), 2024.
[media], [webpage], [code]

Proportional Aggregation of Preferences for Sequential Decision Making
Nikhil Chandak, Shashwat Goel, Dominik Peters
Outstanding Paper Award (top 3 out of 12,000+ submissions) at the 38th Annual Conference of the Association for the Advancement of Artificial Intelligence (AAAI), 2024.
[twitter], [talk]

Representation Engineering: A Top-Down Approach to AI Transparency
Center for AI Safety
ArXiv, 2023.
[talk], [webpage], [code]

Probing Negation in Language Models
Shashwat Singh*, Shashwat Goel*, Saujas Vaduguru, Ponnurangam Kumaraguru
8th Workshop on Representation Learning for NLP (RepL4NLP)
61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
[code]

Towards Adversarial Evaluations of Inexact Machine Unlearning
Shashwat Goel*, Ameya Prabhu*, Amartya Sanyal, Ser-Nam Lim, Philip Torr, Ponnurangam Kumaraguru
ArXiv, 2023.
[code]

* denotes equal contribution.

I love problem solving, and learnt a lot from ecosystems like the International Olympiads in Informatics and Linguistics, Exun Clan, and the broader tech circuit in Delhi. I’m also grateful for mentorship from Amartya Sanyal, Ameya Prabhu, Dan Hendrycks, Dominik Peters, Mikel Forcada, Mukesh Kumar, Jérôme Lang, Jorge Gracia, Ponnurangam Kumaraguru, Saujas Vaduguru, and Tanmoy Bhattacharya. Thanks to them, I have been able to explore a diverse range of research areas, including Machine Learning, Interpretability of LLMs, Social Choice Theory, Machine Translation, Semantic Evolution, and Algorithm Design.