“Auditing language models for hidden objectives” by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Akbir Khan, Euan Ong, Christopher Olah, Fabien Roger, Meg, Drake Thomas, Adam Jermyn, Monte M, evhub

LessWrong (Curated & Popular)


Duration: 24:14


We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it.
This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams.
Abstract
We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training [...]
---
Outline:
(00:26) Abstract
(01:48) Twitter thread
(04:55) Blog post
(07:55) Training a language model with a hidden objective
(11:00) A blind auditing game
(15:29) Alignment auditing techniques
(15:55) Turning the model against itself
(17:52) How much does AI interpretability help?
(22:49) Conclusion
(23:37) Join our team
The original text contained 5 images which were described by AI.
---
First published:
March 13th, 2025
Source:
https://www.lesswrong.com/posts/wSKPuBfgkkqfTpmWJ/auditing-language-models-for-hidden-objectives
---
Narrated by TYPE III AUDIO.
---


#Tech #Society #Philosophy #LessWrong #LessWrong Curated