Partner im RedaktionsNetzwerk Deutschland
PodcastsGesellschaft und KulturLessWrong (Curated & Popular)

LessWrong (Curated & Popular)

LessWrong
LessWrong (Curated & Popular)
Neueste Episode

Verfügbare Folgen

5 von 681
  • “Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals” by Alexa Pan, ryan_greenblatt
    According to the Sonnet 4.5 system card, Sonnet 4.5 is much more likely than Sonnet 4 to mention in its chain-of-thought that it thinks it is being evaluated; this seems to meaningfully cause it to appear to behave better in alignment evaluations. So, Sonnet 4.5's behavioral improvements in these evaluations may partly be driven by growing tendency to notice and game evaluations rather than genuine alignment. This is an early example of a phenomenon that is going to get increasingly problematic: as evaluation gaming increases, alignment evaluations become harder to trust.[1]To elaborate on the above: Sonnet 4.5 seems far more aware of being evaluated than previous models. In some of Anthropic's evaluations, Sonnet 4.5 explicitly mentions being tested between 80% and 100% of the time, compared with under 10% for Sonnet 4. Internal representations associated with evaluation awareness also became notably stronger over training.[2]In addition, Sonnet 4.5 seems substantially less likely to behave poorly once it has noticed that it might be tested. As seen in the figure below, when Anthropic inhibited internal representations related to evaluation awareness[3], its misaligned behavior rate on synthetic honeypot tests increased from 0% to between 1% and 9%. This effect was [...] ---Outline:(06:58) Sonnet 4.5 is much more evaluation-aware than prior models(10:00) Evaluation awareness seems to suppress misaligned behavior(14:52) Anthropic's training plausibly caused Sonnet 4.5 to game evaluations(16:28) Evaluation gaming is plausibly a large fraction of the effect of training against misaligned behaviors(22:57) Suppressing evidence of misalignment in evaluation gamers is concerning(25:25) What AI companies should do(30:02) Appendix The original text contained 21 footnotes which were omitted from this narration. --- First published: October 30th, 2025 Source: https://www.lesswrong.com/posts/qgehQxiTXj53X49mM/sonnet-4-5-s-eval-gaming-seriously-undermines-alignment --- Narrated by TYPE III AUDIO. ---Images from the article:
    --------  
    35:57
  • “Publishing academic papers on transformative AI is a nightmare” by Jakub Growiec
    I am a professor of economics. Throughout my career, I was mostly working on economic growth theory, and this eventually brought me to the topic of transformative AI / AGI / superintelligence. Nowadays my work focuses mostly on the promises and threats of this emerging disruptive technology. Recently, jointly with Klaus Prettner, we’ve written a paper on “The Economics of p(doom): Scenarios of Existential Risk and Economic Growth in the Age of Transformative AI”. We have presented it at multiple conferences and seminars, and it was always well received. We didn’t get any real pushback; instead our research prompted a lot of interest and reflection (as I was reported, also in conversations where I wasn’t involved). But our experience with publishing this paper in a journal is a polar opposite. To date, the paper got desk-rejected (without peer review) 7 times. For example, Futures—a journal “for the interdisciplinary study of futures, visioning, anticipation and foresight” justified their negative decision by writing: “while your results are of potential interest, the topic of your manuscript falls outside of the scope of this journal”. Until finally, to our excitement, it was for once sent out for review. But then came the [...] --- First published: November 3rd, 2025 Source: https://www.lesswrong.com/posts/rmYj6PTBMm76voYLn/publishing-academic-papers-on-transformative-ai-is-a --- Narrated by TYPE III AUDIO.
    --------  
    7:23
  • “The Unreasonable Effectiveness of Fiction” by Raelifin
    [Meta: This is Max Harms. I wrote a novel about China and AGI, which comes out today. This essay from my fiction newsletter has been slightly modified for LessWrong.] In the summer of 1983, Ronald Reagan sat down to watch the film War Games, starring Matthew Broderick as a teen hacker. In the movie, Broderick's character accidentally gains access to a military supercomputer with an AI that almost starts World War III.“The only winning move is not to play.” After watching the movie, Reagan, newly concerned with the possibility of hackers causing real harm, ordered a full national security review. The response: “Mr. President, the problem is much worse than you think.” Soon after, the Department of Defense revamped their cybersecurity policies and the first federal directives and laws against malicious hacking were put in place. But War Games wasn't the only story to influence Reagan. His administration pushed for the Strategic Defense Initiative ("Star Wars") in part, perhaps, because the central technology—a laser that shoots down missiles—resembles the core technology behind the 1940 spy film Murder in the Air, which had Reagan as lead actor. Reagan was apparently such a superfan of The Day the Earth Stood Still [...] ---Outline:(05:05) AI in Particular(06:45) Whats Going On Here?(11:19) Authorial Responsibility The original text contained 10 footnotes which were omitted from this narration. --- First published: November 3rd, 2025 Source: https://www.lesswrong.com/posts/uQak7ECW2agpHFsHX/the-unreasonable-effectiveness-of-fiction --- Narrated by TYPE III AUDIO. ---Images from the article:
    --------  
    15:03
  • “Legible vs. Illegible AI Safety Problems” by Wei Dai
    Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, or in a common cognitive blind spot), meaning there is a high risk that leaders and policymakers will decide to deploy or allow deployment even if they are not solved. (Of course, this is a spectrum, but I am simplifying it to a binary for ease of exposition.) From an x-risk perspective, working on highly legible safety problems has low or even negative expected value. Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems. In contrast, working on the illegible problems (including by trying to make them more legible) does not have this issue and therefore has a much higher expected value (all else being equal, such as tractability). Note that according to this logic, success in making an illegible problem highly legible is almost as good as solving [...] The original text contained 2 footnotes which were omitted from this narration. --- First published: November 4th, 2025 Source: https://www.lesswrong.com/posts/PMc65HgRFvBimEpmJ/legible-vs-illegible-ai-safety-problems --- Narrated by TYPE III AUDIO.
    --------  
    3:29
  • “Lack of Social Grace is a Lack of Skill” by Screwtape
    1.  I have claimed that one of the fundamental questions of rationality is “what am I about to do and what will happen next?” One of the domains I ask this question the most is in social situations. There are a great many skills in the world. If I had the time and resources to do so, I’d want to master all of them. Wilderness survival, automotive repair, the Japanese language, calculus, heart surgery, French cooking, sailing, underwater basket weaving, architecture, Mexican cooking, functional programming, whatever it is people mean when they say “hey man, just let him cook.” My inability to speak fluent Japanese isn’t a sin or a crime. However, it isn’t a virtue either; If I had the option to snap my fingers and instantly acquire the knowledge, I’d do it. Now, there's a different question of prioritization; I tend to pick new skills to learn by a combination of what's useful to me, what sounds fun, and what I’m naturally good at. I picked up the basics of computer programming easily, I enjoy doing it, and it turned out to pay really well. That was an over-determined skill to learn. On the other [...] ---Outline:(00:10) 1.(03:42) 2.(06:44) 3. The original text contained 2 footnotes which were omitted from this narration. --- First published: November 3rd, 2025 Source: https://www.lesswrong.com/posts/NnTwbvvsPg5kj3BKq/lack-of-social-grace-is-a-lack-of-skill-1 --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images
    --------  
    11:08

Weitere Gesellschaft und Kultur Podcasts

Über LessWrong (Curated & Popular)

Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma.If you'd like more, subscribe to the “Lesswrong (30+ karma)” feed.
Podcast-Website

Höre LessWrong (Curated & Popular), TRUE LOVE und viele andere Podcasts aus aller Welt mit der radio.de-App

Hol dir die kostenlose radio.de App

  • Sender und Podcasts favorisieren
  • Streamen via Wifi oder Bluetooth
  • Unterstützt Carplay & Android Auto
  • viele weitere App Funktionen

LessWrong (Curated & Popular): Zugehörige Podcasts

Rechtliches
Social
v7.23.11 | © 2007-2025 radio.de GmbH
Generated: 11/8/2025 - 1:38:25 AM