Shashwat Goel

I am an AI researcher, currently co-advised by Jonas Geiping and Douwe Kiela through the ELLIS PhD program. I work on novel ways to scale supervision for models, to make them more useful and safe. As we exhaust performance that can be achieved by training on human demonstrations, I am interested in learning from ground-truth rewards in open-ended tasks, and leveraging the generator-verifier gap. I think a lot about designing evaluations for different aspects of intelligence, as I believe any capability that can be measured can be optimized. Capabilities I am currently interested in include:

  • Seeking information by asking the right questions
  • Falsifying wrong claims
  • Reasoning about conflicting evidence and preferences
  • Solving novel problems
  • Personalization

If you’re interested in these problems too, reach out. I like getting email!

News

23 June 2025: Starting as a Research Scientist intern at Meta GenAI London
26 February 2025: Can Language Models Falsify? selected for Oral Presentation at ICLR SSI-FM Workshop
6 February 2025: Great Models Think Alike and this Undermines AI Oversight accepted at ICLR SSI-FM Workshop
13 October 2024: Corrective Machine Unlearning accepted at TMLR
2 September 2024: Starting a PhD in Tübingen (Germany), co-advised by Jonas Geiping (ELLIS, MPI-IS) and Douwe Kiela (Contextual AI, Stanford)
24 May 2024: Defended my master's thesis on New Frontiers for Machine Unlearning at IIIT Hyderabad
1 May 2024: The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning accepted at ICML 2024

Selected Publications

Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation
Shiven Sinha, Shashwat Goel, Ponnurangam Kumaraguru, Jonas Geiping, Matthias Bethge, Ameya Prabhu
ICLR Scaling Self Improving Foundation Models Workshop, 2025.
[webpage], [code], [data]

Great Models Think Alike and this Undermines AI Oversight
Shashwat Goel, Joschka Strüber, Ilze Amanda Auzina, Karuna Chandra, P. Kumaraguru, Douwe Kiela, Ameya Prabhu, Matthias Bethge, Jonas Geiping
ICLR Scaling Self Improving Foundation Models Workshop, 2025.
[code], [tool], [data]

Corrective Machine Unlearning
Shashwat Goel*, Ameya Prabhu*, Philip Torr, P. Kumaraguru, Amartya Sanyal
Transactions on Machine Learning Research (TMLR) 2024
Workshop on Data-centric Machine Learning (DMLR) - Recommended for Journal (Top 15) at the 12th International Conference on Learning Representations (ICLR), 2024.
[twitter], [code]

The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning
Center for AI Safety, Scale AI
International Conference on Machine Learning (ICML), 2024.
[media], [webpage], [code]

Proportional Aggregation of Preferences for Sequential Decision Making
Nikhil Chandak, Shashwat Goel, Dominik Peters
Outstanding Paper Award (top 3 out of 12,000+ submissions) at 38th Annual Conference of the Association for the Advancement of Artificial Intelligence (AAAI), 2024.
[twitter], [talk]

Representation Engineering: A Top-Down Approach to AI Transparency
Center for AI Safety
ArXiv, 2023.
[talk], [webpage], [code]

Probing Negation in Language Models
Shashwat Singh*, Shashwat Goel*, Saujas Vaduguru, Ponnurangam Kumaraguru
8th Workshop on Representation Learning for NLP (RepL4NLP) at the 61st Annual Meeting of the Association for Computational Linguistics (ACL), 2023.
[code]

Towards Adversarial Evaluations of Inexact Machine Unlearning
Shashwat Goel*, Ameya Prabhu*, Amartya Sanyal, Ser-Nam Lim, Philip Torr, Ponnurangam Kumaraguru
ArXiv, 2023.
[code]

I love problem solving, and learnt a lot from ecosystems like the International Olympiads of Informatics and Linguistics, Exun Clan, and the broader tech circuit in Delhi. I’m also grateful for mentorship from Amartya Sanyal, Ameya Prabhu, Dan Hendrycks, Dominik Peters, Mikel Forcada, Mukesh Kumar, Jérôme Lang, Jorge Gracia, Ponnurangam Kumaraguru, Saujas Vaduguru, and Tanmoy Bhattacharya. Thanks to them, I have been able to explore a diverse range of research areas, including Machine Learning, Interpretability of LLMs, Social Choice Theory, Machine Translation, Semantic Evolution and Algorithm Design.