Boaz Barak, Gabriel Wu, Jeremy Chen, Manas Joglekar
[Linkposted from the OpenAI alignment blog, where we post more speculative/technical/informal results and thoughts on safety and alignment.]
TL;DR We go into more detail and share some follow-up results from our paper on confessions (see the original blog post). We give a deeper analysis of the impact of training, as well as some preliminary comparisons to chain-of-thought monitoring.
We have recently published a new paper on confessions, along with an accompanying blog post. Here, we want to share with the research community some of the reasons why we are excited about confessions as a direction for safety, as well as some of its limitations. This blog post will be a bit more informal and speculative, so please see the paper for the full results.
The notion of “goodness” for the response of an LLM to a user prompt is inherently complex and multi-dimensional, and involves factors such as correctness, completeness, honesty, style, and more. When we optimize responses using a reward model as a proxy for “goodness” in reinforcement learning, models sometimes learn to “hack” this proxy and output an answer that only “looks good” to it (because [...]
---
Outline:
(08:19) Impact of training
(12:32) Comparing with chain-of-thought monitoring
(14:05) Confessions can increase monitorability
(15:44) Using high compute to improve alignment
(16:49) Acknowledgements
---
First published:
January 14th, 2026
Source:
https://www.lesswrong.com/posts/k4FjAzJwvYjFbCTKn/why-we-are-excited-about-confession
---
Narrated by TYPE III AUDIO.