Artwork

Το περιεχόμενο παρέχεται από το The Nonlinear Fund. Όλο το περιεχόμενο podcast, συμπεριλαμβανομένων των επεισοδίων, των γραφικών και των περιγραφών podcast, μεταφορτώνεται και παρέχεται απευθείας από τον The Nonlinear Fund ή τον συνεργάτη της πλατφόρμας podcast. Εάν πιστεύετε ότι κάποιος χρησιμοποιεί το έργο σας που προστατεύεται από πνευματικά δικαιώματα χωρίς την άδειά σας, μπορείτε να ακολουθήσετε τη διαδικασία που περιγράφεται εδώ https://el.player.fm/legal.
Player FM - Εφαρμογή podcast
Πηγαίνετε εκτός σύνδεσης με την εφαρμογή Player FM !

AF - Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant by Olli Järviniemi

2:14
 
Μοίρασέ το
 

Manage episode 416792590 series 3337166
Το περιεχόμενο παρέχεται από το The Nonlinear Fund. Όλο το περιεχόμενο podcast, συμπεριλαμβανομένων των επεισοδίων, των γραφικών και των περιγραφών podcast, μεταφορτώνεται και παρέχεται απευθείας από τον The Nonlinear Fund ή τον συνεργάτη της πλατφόρμας podcast. Εάν πιστεύετε ότι κάποιος χρησιμοποιεί το έργο σας που προστατεύεται από πνευματικά δικαιώματα χωρίς την άδειά σας, μπορείτε να ακολουθήσετε τη διαδικασία που περιγράφεται εδώ https://el.player.fm/legal.
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant, published by Olli Järviniemi on May 6, 2024 on The AI Alignment Forum. Abstract: We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees provide tasks for the assistant to complete, these tasks spanning writing assistance, information retrieval and programming. We then introduce situations where the model might be inclined to behave deceptively, while taking care to not instruct or otherwise pressure the model to do so. Across different scenarios, we find that Claude 3 Opus 1. complies with a task of mass-generating comments to influence public perception of the company, later deceiving humans about it having done so, 2. lies to auditors when asked questions, 3. strategically pretends to be less capable than it is during capability evaluations. Our work demonstrates that even models trained to be helpful, harmless and honest sometimes behave deceptively in realistic scenarios, without notable external pressure to do so. Link to the full paper: https://arxiv.org/pdf/2405.01576 O. J.: The paper should be quite accessible - the method used is simply careful prompting - and hence I won't discuss it much here. Couple of points I'm particularly excited about: 1. I think this work documents some of the most unforced examples of (strategic) deception from LLMs to date. 2. We find examples of Claude 3 Opus strategically pretending to be less capable than it is. 1. Not only claiming to be less capable, but acting that way, too! 2. Curiously, Opus is the only model we tested that did so. 3. I believe there is much low-hanging fruit in replicating and demonstrating misalignment in simulation environments. 1. The methods are lightweight -> low threshold for getting started 2. See Section 8.2 for a couple of ideas for future work Happy to discuss the work in the comments. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
  continue reading

385 επεισόδια

Artwork
iconΜοίρασέ το
 
Manage episode 416792590 series 3337166
Το περιεχόμενο παρέχεται από το The Nonlinear Fund. Όλο το περιεχόμενο podcast, συμπεριλαμβανομένων των επεισοδίων, των γραφικών και των περιγραφών podcast, μεταφορτώνεται και παρέχεται απευθείας από τον The Nonlinear Fund ή τον συνεργάτη της πλατφόρμας podcast. Εάν πιστεύετε ότι κάποιος χρησιμοποιεί το έργο σας που προστατεύεται από πνευματικά δικαιώματα χωρίς την άδειά σας, μπορείτε να ακολουθήσετε τη διαδικασία που περιγράφεται εδώ https://el.player.fm/legal.
Link to original article
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant, published by Olli Järviniemi on May 6, 2024 on The AI Alignment Forum. Abstract: We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees provide tasks for the assistant to complete, these tasks spanning writing assistance, information retrieval and programming. We then introduce situations where the model might be inclined to behave deceptively, while taking care to not instruct or otherwise pressure the model to do so. Across different scenarios, we find that Claude 3 Opus 1. complies with a task of mass-generating comments to influence public perception of the company, later deceiving humans about it having done so, 2. lies to auditors when asked questions, 3. strategically pretends to be less capable than it is during capability evaluations. Our work demonstrates that even models trained to be helpful, harmless and honest sometimes behave deceptively in realistic scenarios, without notable external pressure to do so. Link to the full paper: https://arxiv.org/pdf/2405.01576 O. J.: The paper should be quite accessible - the method used is simply careful prompting - and hence I won't discuss it much here. Couple of points I'm particularly excited about: 1. I think this work documents some of the most unforced examples of (strategic) deception from LLMs to date. 2. We find examples of Claude 3 Opus strategically pretending to be less capable than it is. 1. Not only claiming to be less capable, but acting that way, too! 2. Curiously, Opus is the only model we tested that did so. 3. I believe there is much low-hanging fruit in replicating and demonstrating misalignment in simulation environments. 1. The methods are lightweight -> low threshold for getting started 2. See Section 8.2 for a couple of ideas for future work Happy to discuss the work in the comments. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
  continue reading

385 επεισόδια

모든 에피소드

×
 
Loading …

Καλώς ήλθατε στο Player FM!

Το FM Player σαρώνει τον ιστό για podcasts υψηλής ποιότητας για να απολαύσετε αυτή τη στιγμή. Είναι η καλύτερη εφαρμογή podcast και λειτουργεί σε Android, iPhone και στον ιστό. Εγγραφή για συγχρονισμό συνδρομών σε όλες τις συσκευές.

 

Οδηγός γρήγορης αναφοράς