Partner im RedaktionsNetzwerk Deutschland
PodcastsGesellschaft und KulturLessWrong (Curated & Popular)

LessWrong (Curated & Popular)

LessWrong
LessWrong (Curated & Popular)
Neueste Episode

Verfügbare Folgen

5 von 699
  • “Gemini 3 is Evaluation-Paranoid and Contaminated” by null
    TL;DR: Gemini 3 frequently thinks it is in an evaluation when it is not, assuming that all of its reality is fabricated. It can also reliably output the BIG-bench canary string, indicating that Google likely trained on a broad set of benchmark data. Most of the experiments in this post are very easy to replicate, and I encourage people to try. I write things with LLMs sometimes. A new LLM came out, Gemini 3 Pro, and I tried to write with it. So far it seems okay, I don't have strong takes on it for writing yet, since the main piece I tried editing with it was extremely late-stage and approximately done. However, writing ability is not why we're here today. Reality is Fiction Google gracefully provided (lightly summarized) CoT for the model. Looking at the CoT spawned from my mundane writing-focused prompts, oh my, it is strange. I write nonfiction about recent events in AI in a newsletter. According to its CoT while editing, Gemini 3 disagrees about the whole "nonfiction" part: It seems I must treat this as a purely fictional scenario with 2025 as the date. Given that, I'm now focused on editing the text for [...] ---Outline:(00:54) Reality is Fiction(05:17) Distortions in Development(05:55) Is this good or bad or neither?(06:52) What is going on here?(07:35) 1. Too Much RL(08:06) 2. Personality Disorder(10:24) 3. Overfitting(11:35) Does it always do this?(12:06) Do other models do things like this?(12:42) Evaluation Awareness(13:42) Appendix A: Methodology Details(14:21) Appendix B: Canary The original text contained 8 footnotes which were omitted from this narration. --- First published: November 20th, 2025 Source: https://www.lesswrong.com/posts/8uKQyjrAgCcWpfmcs/gemini-3-is-evaluation-paranoid-and-contaminated --- Narrated by TYPE III AUDIO.
    --------  
    14:59
  • “Natural emergent misalignment from reward hacking in production RL” by evhub, Monte M, Benjamin Wright, Jonathan Uesato
    Abstract We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environments. Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper. Applying RLHF safety training using standard chat-like prompts results in aligned behavior on chat-like evaluations, but misalignment persists on agentic tasks. Three mitigations are effective: (i) preventing the model from reward hacking; (ii) increasing the diversity of RLHF safety training; and (iii) "inoculation prompting", wherein framing reward hacking as acceptable behavior during training removes misaligned generalization even when reward hacking is learned. Twitter thread New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious. In our experiment, we [...] ---Outline:(00:14) Abstract(01:26) Twitter thread(05:23) Blog post(07:13) From shortcuts to sabotage(12:20) Why does reward hacking lead to worse behaviors?(13:21) Mitigations --- First published: November 21st, 2025 Source: https://www.lesswrong.com/posts/fJtELFKddJPfAxwKS/natural-emergent-misalignment-from-reward-hacking-in --- Narrated by TYPE III AUDIO. ---Images from the article:
    --------  
    18:45
  • “Anthropic is (probably) not meeting its RSP security commitments” by habryka
    TLDR: An AI company's model weight security is at most as good as its compute providers' security. Anthropic has committed (with a bit of ambiguity, but IMO not that much ambiguity) to be robust to attacks from corporate espionage teams at companies where it hosts its weights. Anthropic seems unlikely to be robust to those attacks. Hence they are in violation of their RSP. Anthropic is committed to being robust to attacks from corporate espionage teams (which includes corporate espionage teams at Google, Microsoft and Amazon) From the Anthropic RSP: When a model must meet the ASL-3 Security Standard, we will evaluate whether the measures we have implemented make us highly protected against most attackers’ attempts at stealing model weights. We consider the following groups in scope: hacktivists, criminal hacker groups, organized cybercrime groups, terrorist organizations, corporate espionage teams, internal employees, and state-sponsored programs that use broad-based and non-targeted techniques (i.e., not novel attack chains). [...] We will implement robust controls to mitigate basic insider risk, but consider mitigating risks from sophisticated or state-compromised insiders to be out of scope for ASL-3. We define “basic insider risk” as risk from an insider who does not have persistent or time-limited [...] ---Outline:(00:37) Anthropic is committed to being robust to attacks from corporate espionage teams (which includes corporate espionage teams at Google, Microsoft and Amazon)(03:40) Claude weights that are covered by ASL-3 security requirements are shipped to many Amazon, Google, and Microsoft data centers(04:55) This means given executive buy-in by a high-level Amazon, Microsoft or Google executive, their corporate espionage team would have virtually unlimited physical access to Claude inference machines that host copies of the weights(05:36) With unlimited physical access, a competent corporate espionage team at Amazon, Microsoft or Google could extract weights from an inference machine, without too much difficulty(06:18) Given all of the above, this means Anthropic is in violation of its most recent RSP(07:05) Postscript --- First published: November 18th, 2025 Source: https://www.lesswrong.com/posts/zumPKp3zPDGsppFcF/anthropic-is-probably-not-meeting-its-rsp-security --- Narrated by TYPE III AUDIO. ---Images from the article:Ap
    --------  
    8:57
  • “Varieties Of Doom” by jdp
    There has been a lot of talk about "p(doom)"over the last few years. This has always rubbed me the wrong waybecause "p(doom)" didn't feel like it mapped to any specific belief in my head.In private conversations I'd sometimes give my p(doom) as 12%, with the caveatthat "doom" seemed nebulous and conflated between several different concepts.At some point it was decideda p(doom) over 10% makes you a "doomer" because it means what actions you should take with respect toAI are overdetermined. I did not and do not feel that is true. But any time Ifelt prompted to explain my position I'd find I could explain a little bit ofthis or that, but not really convey the whole thing. As it turns out doom hasa lot of parts, and every part is entangled with every other part so no matterwhich part you explain you always feel like you're leaving the crucial parts out. Doom ismore like an onion than asingle event, a distribution over AI outcomes people frequentlyrespond to with the force of the fear of death. Some of these outcomes are lessthan death and some [...] ---Outline:(03:46) 1. Existential Ennui(06:40) 2. Not Getting Immortalist Luxury Gay Space Communism(13:55) 3. Human Stock Expended As Cannon Fodder Faster Than Replacement(19:37) 4. Wiped Out By AI Successor Species(27:57) 5. The Paperclipper(42:56) Would AI Successors Be Conscious Beings?(44:58) Would AI Successors Care About Each Other?(49:51) Would AI Successors Want To Have Fun?(51:11) VNM Utility And Human Values(55:57) Would AI successors get bored?(01:00:16) Would AI Successors Avoid Wireheading?(01:06:07) Would AI Successors Do Continual Active Learning?(01:06:35) Would AI Successors Have The Subjective Experience of Will?(01:12:00) Multiply(01:15:07) 6. Recipes For Ruin(01:18:02) Radiological and Nuclear(01:19:19) Cybersecurity(01:23:00) Biotech and Nanotech(01:26:35) 7. Large-Finite Damnation --- First published: November 17th, 2025 Source: https://www.lesswrong.com/posts/apHWSGDiydv3ivmg6/varieties-of-doom --- Narrated by TYPE III AUDIO. ---Images from the article:
    --------  
    1:38:48
  • “How Colds Spread” by RobertM
    It seems like a catastrophic civilizational failure that we don't have confident common knowledge of how colds spread. There have been a number of studies conducted over the years, but most of those were testing secondary endpoints, like how long viruses would survive on surfaces, or how likely they were to be transmitted to people's fingers after touching contaminated surfaces, etc. However, a few of them involved rounding up some brave volunteers, deliberately infecting some of them, and then arranging matters so as to test various routes of transmission to uninfected volunteers. My conclusions from reviewing these studies are: You can definitely infect yourself if you take a sick person's snot and rub it into your eyeballs or nostrils.  This probably works even if you touched a surface that a sick person touched, rather than by handshake, at least for some surfaces.  There's some evidence that actual human infection is much less likely if the contaminated surface you touched is dry, but for most colds there'll often be quite a lot of virus detectable on even dry contaminated surfaces for most of a day.  I think you can probably infect yourself with fomites, but my guess is that [...] ---Outline:(01:49) Fomites(06:58) Aerosols(16:23) Other Factors(17:06) Review(18:33) Conclusion The original text contained 16 footnotes which were omitted from this narration. --- First published: November 18th, 2025 Source: https://www.lesswrong.com/posts/92fkEn4aAjRutqbNF/how-colds-spread --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
    --------  
    20:31

Weitere Gesellschaft und Kultur Podcasts

Über LessWrong (Curated & Popular)

Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma.If you'd like more, subscribe to the “Lesswrong (30+ karma)” feed.
Podcast-Website

Höre LessWrong (Curated & Popular), Tim Gabel Podcast und viele andere Podcasts aus aller Welt mit der radio.de-App

Hol dir die kostenlose radio.de App

  • Sender und Podcasts favorisieren
  • Streamen via Wifi oder Bluetooth
  • Unterstützt Carplay & Android Auto
  • viele weitere App Funktionen

LessWrong (Curated & Popular): Zugehörige Podcasts

Rechtliches
Social
v8.0.1 | © 2007-2025 radio.de GmbH
Generated: 11/24/2025 - 1:24:19 PM