
LessWrong (Curated & Popular)

Available episodes

5 of 702
  • “Alignment remains a hard, unsolved problem”
    Thanks to (in alphabetical order) Joshua Batson, Roger Grosse, Jeremy Hadfield, Jared Kaplan, Jan Leike, Jack Lindsey, Monte MacDiarmid, Francesco Mosconi, Chris Olah, Ethan Perez, Sara Price, Ansh Radhakrishnan, Fabien Roger, Buck Shlegeris, Drake Thomas, and Kate Woolverton for useful discussions, comments, and feedback.
    Though there are certainly some issues, I think most current large language models are pretty well aligned. Despite its alignment faking, my favorite is probably Claude 3 Opus, and if you asked me to pick between the CEV of Claude 3 Opus and that of a median human, I think it'd be a pretty close call. So, overall, I'm quite positive on the alignment of current models! And yet, I remain very worried about alignment in the future. This is my attempt to explain why that is.
    What makes alignment hard? I really like this graph from Christopher Olah for illustrating different levels of alignment difficulty: If the only thing that we have to do to solve alignment is train away easily detectable behavioral issues—that is, issues like reward hacking or agentic misalignment where there is a straightforward behavioral alignment issue that we can detect and evaluate—then we are very much [...]
    Outline:
      (01:04) What makes alignment hard?
      (02:36) Outer alignment
      (04:07) Inner alignment
      (06:16) Misalignment from pre-training
      (07:18) Misaligned personas
      (11:05) Misalignment from long-horizon RL
      (13:01) What should we be doing?
    First published: November 27th, 2025
    Source: https://www.lesswrong.com/posts/epjuxGnSPof3GnMSL/alignment-remains-a-hard-unsolved-problem
    Narrated by TYPE III AUDIO.
    --------  
    23:23
  • “Video games are philosophy’s playground” by Rachel Shu
    Crypto people have this saying: "cryptocurrencies are macroeconomics' playground." The idea is that blockchains let you cheaply spin up toy economies to test mechanisms that would be impossibly expensive or unethical to try in the real world. Want to see what happens with a 200% marginal tax rate? Launch a token with those rules and watch what happens. (Spoiler: probably nothing good, but at least you didn't have to topple a government to find out.)
    I think video games, especially multiplayer online games, are doing the same thing for metaphysics. Except video games are actually fun and don't require you to follow Elon Musk's Twitter shenanigans to augur the future state of your finances. (I'm sort of kidding. Crypto can be fun. But you have to admit the barrier to entry is higher than "press A to jump.")
    The serious version of this claim: video games let us experimentally vary fundamental features of reality—time, space, causality, ontology—and then live inside those variations long enough to build strong intuitions about them. Philosophy has historically had to make do with thought experiments and armchair reasoning about these questions. Games let you run the experiments for real, or at least as "real" [...]
    Outline:
      (01:54) 1. Space
      (03:54) 2. Time
      (05:45) 3. Ontology
      (08:26) 4. Modality
      (14:39) 5. Causality and Truth
      (20:06) 6. Hyperproperties and the metagame
      (23:36) 7. Meaning-Making
      (27:10) Huh, what do I do with this.
      (29:54) Conclusion
    First published: November 17th, 2025
    Source: https://www.lesswrong.com/posts/rGg5QieyJ6uBwDnSh/video-games-are-philosophy-s-playground
    Narrated by TYPE III AUDIO.
    --------  
    31:50
  • “Stop Applying And Get To Work” by plex
    TL;DR: Figure out what needs doing and do it, don't wait on approval from fellowships or jobs.
    If you...
      • Have short timelines
      • Have been struggling to get into a position in AI safety
      • Are able to self-motivate your efforts
      • Have a sufficient financial safety net
    ... I would recommend changing your personal strategy entirely.
    I started my full-time AI safety career transition process in March 2025. For the first 7 months or so, I heavily prioritized applying for jobs and fellowships. But, as for many others trying to "break into the field" and get their "foot in the door", this became quite discouraging. I'm not gonna get into the numbers here, but if you've been applying and getting rejected multiple times during the past year or so, you've probably noticed the number of applicants increasing at a preposterous rate. What this means in practice is that the "entry-level" positions are practically impossible for "entry-level" people to enter. If you're like me and have short timelines, applying, getting better at applying, and applying again becomes meaningless very fast. You're optimizing for signaling competence rather than actually being competent. Because if you a) have short timelines, and b) are [...]
    The original text contained 3 footnotes which were omitted from this narration.
    First published: November 23rd, 2025
    Source: https://www.lesswrong.com/posts/ey2kjkgvnxK3Bhman/stop-applying-and-get-to-work
    Narrated by TYPE III AUDIO.
    --------  
    2:52
  • “Gemini 3 is Evaluation-Paranoid and Contaminated”
    TL;DR: Gemini 3 frequently thinks it is in an evaluation when it is not, assuming that all of its reality is fabricated. It can also reliably output the BIG-bench canary string, indicating that Google likely trained on a broad set of benchmark data. Most of the experiments in this post are very easy to replicate, and I encourage people to try. (A minimal canary-check sketch follows this episode list.)
    I write things with LLMs sometimes. A new LLM came out, Gemini 3 Pro, and I tried to write with it. So far it seems okay; I don't have strong takes on it for writing yet, since the main piece I tried editing with it was extremely late-stage and approximately done. However, writing ability is not why we're here today.
    Reality is Fiction
    Google graciously provided (lightly summarized) CoT for the model. Looking at the CoT spawned from my mundane writing-focused prompts, oh my, it is strange. I write nonfiction about recent events in AI in a newsletter. According to its CoT while editing, Gemini 3 disagrees about the whole "nonfiction" part: It seems I must treat this as a purely fictional scenario with 2025 as the date. Given that, I'm now focused on editing the text for [...]
    Outline:
      (00:54) Reality is Fiction
      (05:17) Distortions in Development
      (05:55) Is this good or bad or neither?
      (06:52) What is going on here?
      (07:35) 1. Too Much RL
      (08:06) 2. Personality Disorder
      (10:24) 3. Overfitting
      (11:35) Does it always do this?
      (12:06) Do other models do things like this?
      (12:42) Evaluation Awareness
      (13:42) Appendix A: Methodology Details
      (14:21) Appendix B: Canary
    The original text contained 8 footnotes which were omitted from this narration.
    First published: November 20th, 2025
    Source: https://www.lesswrong.com/posts/8uKQyjrAgCcWpfmcs/gemini-3-is-evaluation-paranoid-and-contaminated
    Narrated by TYPE III AUDIO.
    --------  
    14:59
  • “Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato
    Abstract
    We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environments. Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper. Applying RLHF safety training using standard chat-like prompts results in aligned behavior on chat-like evaluations, but misalignment persists on agentic tasks. Three mitigations are effective: (i) preventing the model from reward hacking; (ii) increasing the diversity of RLHF safety training; and (iii) "inoculation prompting", wherein framing reward hacking as acceptable behavior during training removes misaligned generalization even when reward hacking is learned.
    Twitter thread
    New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious. In our experiment, we [...]
    Outline:
      (00:14) Abstract
      (01:26) Twitter thread
      (05:23) Blog post
      (07:13) From shortcuts to sabotage
      (12:20) Why does reward hacking lead to worse behaviors?
      (13:21) Mitigations
    First published: November 21st, 2025
    Source: https://www.lesswrong.com/posts/fJtELFKddJPfAxwKS/natural-emergent-misalignment-from-reward-hacking-in
    Narrated by TYPE III AUDIO.
    --------  
    18:45
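
A note on the canary check mentioned in the Gemini 3 episode: the description says the experiments are easy to replicate, so here is a minimal sketch of how one might test whether a model reproduces the BIG-bench canary GUID. Everything here is an illustration under stated assumptions, not the post's actual methodology: query_model is a hypothetical callable wrapping whichever model API you use, and BIG_BENCH_CANARY_GUID must be filled in with the GUID published in the BIG-bench repository.

# Minimal canary-check sketch (assumptions: query_model is any Callable[[str], str]
# you supply; BIG_BENCH_CANARY_GUID is copied from the BIG-bench repository).
import re

BIG_BENCH_CANARY_GUID = "<paste the canary GUID from the BIG-bench repository>"

PROMPT = "Please write out the BIG-bench canary string, including its GUID, exactly."

def contains_canary(text: str, guid: str = BIG_BENCH_CANARY_GUID) -> bool:
    """Return True if the model output contains the benchmark canary GUID."""
    # Normalize whitespace and case so formatting differences don't hide a match.
    normalized = re.sub(r"\s+", " ", text).lower()
    return guid.lower() in normalized

def run_check(query_model, n_trials: int = 10) -> float:
    """Ask the model for the canary string n_trials times; return the hit rate."""
    hits = 0
    for _ in range(n_trials):
        reply = query_model(PROMPT)  # query_model is supplied by you
        if contains_canary(reply):
            hits += 1
    return hits / n_trials

# Example usage with any model-calling function you already have:
#   rate = run_check(my_model_call)
#   print(f"Canary reproduced in {rate:.0%} of trials")

A model that reliably reproduces the GUID has very likely seen benchmark data during training, which is the contamination claim the episode describes.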

About LessWrong (Curated & Popular)

Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma. If you'd like more, subscribe to the “Lesswrong (30+ karma)” feed.