“Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild” by Adam Karvonen, Sam Marks
Summary: We found that LLMs exhibit significant race and gender bias in realistic hiring scenarios, but their chain-of-thought reasoning shows zero evidence of this bias. This serves as a nice example of a 100% unfaithful CoT "in the wild" where the LLM strongly suppresses the unfaithful behavior. We also find that interpretability-based interventions succeeded while prompting failed, suggesting this may be an example of interpretability being the best practical tool for a real world problem.For context on our paper, the tweet thread is here and the paper is here.Context: Chain of Thought Faithfulness Chain of Thought (CoT) monitoring has emerged as a popular research area in AI safety. The idea is simple - have the AIs reason in English text when solving a problem, and monitor the reasoning for misaligned behavior. For example, OpenAI recently published a paper on using CoT monitoring to detect reward hacking during [...] ---Outline:(00:49) Context: Chain of Thought Faithfulness(02:26) Our Results(04:06) Interpretability as a Practical Tool for Real-World Debiasing(06:10) Discussion and Related Work--- First published: July 2nd, 2025 Source: https://www.lesswrong.com/posts/me7wFrkEtMbkzXGJt/race-and-gender-bias-as-an-example-of-unfaithful-chain-of --- Narrated by TYPE III AUDIO.
--------
7:56
--------
7:56
“The best simple argument for Pausing AI?” by Gary Marcus
Not saying we should pause AI, but consider the following argument: Alignment without the capacity to follow rules is hopeless. You can’t possibly follow laws like Asimov's Laws (or better alternatives to them) if you can’t reliably learn to abide by simple constraints like the rules of chess. LLMs can’t reliably follow rules. As discussed in Marcus on AI yesterday, per data from Mathieu Acher, even reasoning models like o3 in fact empirically struggle with the rules of chess. And they do this even though they can explicit explain those rules (see same article). The Apple “thinking” paper, which I have discussed extensively in 3 recent articles in my Substack, gives another example, where an LLM can’t play Tower of Hanoi with 9 pegs. (This is not a token-related artifact). Four other papers have shown related failures in compliance with moderately complex rules in the last month. [...] --- First published: June 30th, 2025 Source: https://www.lesswrong.com/posts/Q2PdrjowtXkYQ5whW/the-best-simple-argument-for-pausing-ai --- Narrated by TYPE III AUDIO.
--------
2:00
--------
2:00
“Foom & Doom 2: Technical alignment is hard” by Steven Byrnes
2.1 Summary & Table of contents This is the second of a two-post series on foom (previous post) and doom (this post). The last post talked about how I expect future AI to be different from present AI. This post will argue that this future AI will be of a type that will be egregiously misaligned and scheming, not even ‘slightly nice’, absent some future conceptual breakthrough.I will particularly focus on exactly how and why I differ from the LLM-focused researchers who wind up with (from my perspective) bizarrely over-optimistic beliefs like “P(doom) ≲ 50%”.[1] In particular, I will argue that these “optimists” are right that “Claude seems basically nice, by and large” is nonzero evidence for feeling good about current LLMs (with various caveats). But I think that future AIs will be disanalogous to current LLMs, and I will dive into exactly how and why, with a [...] ---Outline:(00:12) 2.1 Summary & Table of contents(04:42) 2.2 Background: my expected future AI paradigm shift(06:18) 2.3 On the origins of egregious scheming(07:03) 2.3.1 Where do you get your capabilities from?(08:07) 2.3.2 LLM pretraining magically transmutes observations into behavior, in a way that is profoundly disanalogous to how brains work(10:50) 2.3.3 To what extent should we think of LLMs as imitating?(14:26) 2.3.4 The naturalness of egregious scheming: some intuitions(19:23) 2.3.5 Putting everything together: LLMs are generally not scheming right now, but I expect future AI to be disanalogous(23:41) 2.4 I'm still worried about the 'literal genie' / 'monkey's paw' thing(26:58) 2.4.1 Sidetrack on disanalogies between the RLHF reward function and the brain-like AGI reward function(32:01) 2.4.2 Inner and outer misalignment(34:54) 2.5 Open-ended autonomous learning, distribution shifts, and the 'sharp left turn'(38:14) 2.6 Problems with amplified oversight(41:24) 2.7 Downstream impacts of Technical alignment is hard(43:37) 2.8 Bonus: Technical alignment is not THAT hard(44:04) 2.8.1 I think we'll get to pick the innate drives (as opposed to the evolution analogy)(45:44) 2.8.2 I'm more bullish on impure consequentialism(50:44) 2.8.3 On the narrowness of the target(52:18) 2.9 Conclusion and takeaways(52:23) 2.9.1 If brain-like AGI is so dangerous, shouldn't we just try to make AGIs via LLMs?(54:34) 2.9.2 What's to be done?The original text contained 20 footnotes which were omitted from this narration. --- First published: June 23rd, 2025 Source: https://www.lesswrong.com/posts/bnnKGSCHJghAvqPjS/foom-and-doom-2-technical-alignment-is-hard --- Narrated by TYPE III AUDIO. ---Images from the article:
--------
56:38
--------
56:38
“Proposal for making credible commitments to AIs.” by Cleo Nardo
Acknowledgments: The core scheme here was suggested by Prof. Gabriel Weil. There has been growing interest in the deal-making agenda: humans make deals with AIs (misaligned but lacking decisive strategic advantage) where they promise to be safe and useful for some fixed term (e.g. 2026-2028) and we promise to compensate them in the future, conditional on (i) verifying the AIs were compliant, and (ii) verifying the AIs would spend the resources in an acceptable way.[1] I think the deal-making agenda breaks down into two main subproblems: How can we make credible commitments to AIs? Would credible commitments motivate an AI to be safe and useful? There are other issues, but when I've discussed deal-making with people, (1) and (2) are the most common issues raised. See footnote for some other issues in dealmaking.[2] Here is my current best assessment of how we can make credible commitments to AIs. [...] The original text contained 2 footnotes which were omitted from this narration. --- First published: June 27th, 2025 Source: https://www.lesswrong.com/posts/vxfEtbCwmZKu9hiNr/proposal-for-making-credible-commitments-to-ais --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
--------
5:19
--------
5:19
“X explains Z% of the variance in Y” by Leon Lang
Audio note: this article contains 218 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description. Recently, in a group chat with friends, someone posted this Lesswrong post and quoted: The group consensus on somebody's attractiveness accounted for roughly 60% of the variance in people's perceptions of the person's relative attractiveness. I answered that, embarrassingly, even after reading Spencer Greenberg's tweets for years, I don't actually know what it means when one says: <span>_X_</span> explains <span>_p_</span> of the variance in <span>_Y_</span>.[1] What followed was a vigorous discussion about the correct definition, and several links to external sources like Wikipedia. Sadly, it seems to me that all online explanations (e.g. on Wikipedia here and here), while precise, seem philosophically wrong since they confuse the platonic concept of explained variance with the variance explained by [...] ---Outline:(02:38) Definitions(02:41) The verbal definition(05:51) The mathematical definition(09:29) How to approximate _1 - p_(09:41) When you have lots of data(10:45) When you have less data: Regression(12:59) Examples(13:23) Dependence on the regression model(14:59) When you have incomplete data: Twin studies(17:11) ConclusionThe original text contained 6 footnotes which were omitted from this narration. --- First published: June 20th, 2025 Source: https://www.lesswrong.com/posts/E3nsbq2tiBv6GLqjB/x-explains-z-of-the-variance-in-y --- Narrated by TYPE III AUDIO. ---Images from the article:
Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma.If you'd like more, subscribe to the “Lesswrong (30+ karma)” feed.