Content provided by Soroush Pour. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Soroush Pour or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described here: https://el.player.fm/legal.

Ep 14 - Interp, latent robustness, RLHF limitations w/ Stephen Casper (PhD AI researcher, MIT)

2:42:17
 

We speak with Stephen Casper, or "Cas" as his friends call him. Cas is a PhD student at MIT in the Computer Science (EECS) department, in the Algorithmic Alignment Group advised by Prof Dylan Hadfield-Menell. He previously worked with the Harvard Kreiman Lab and the Center for Human-Compatible AI (CHAI) at Berkeley. His work focuses on better understanding the internal workings of AI models (better known as “interpretability”), making them robust to various kinds of adversarial attacks, and calling out current technical and policy gaps in ensuring that our future with AI goes well. He’s particularly interested in automated ways of finding & fixing flaws in how deep neural nets handle human-interpretable concepts.
We talk to Stephen about:
* His technical AI safety work in the areas of:
  * Interpretability
  * Latent attacks and adversarial robustness
  * Model unlearning
  * The limitations of RLHF
* Cas' journey to becoming an AI safety researcher
* How he thinks the AI safety field is going and whether we're on track for a positive future with AI
* Where he sees the biggest risks coming with AI
* Gaps in the AI safety field that people should work on
* Advice for early career researchers
Hosted by Soroush Pour. Follow me for more AGI content:
Twitter: https://twitter.com/soroushjp
LinkedIn: https://www.linkedin.com/in/soroushjp/
== Show links ==
-- Follow Stephen --
* Website: https://stephencasper.com/
* Email: (see Cas' website above)
* Twitter: https://twitter.com/StephenLCasper
* Google Scholar: https://scholar.google.com/citations?user=zaF8UJcAAAAJ
-- Further resources --
* Automated jailbreaks / red-teaming paper that Cas and I worked on together (2023) - https://twitter.com/soroushjp/status/1721950722626077067
* Sam Marks paper on Sparse Autoencoders (SAEs) - https://arxiv.org/abs/2403.19647
* Interpretability papers involving downstream tasks - See section 4.2 of https://arxiv.org/abs/2401.14446
* MEMIT paper on model editing - https://arxiv.org/abs/2210.07229
* Motte & bailey definition - https://en.wikipedia.org/wiki/Motte-and-bailey_fallacy
* Bomb-making papers tweet thread by Cas - https://twitter.com/StephenLCasper/status/1780370601171198246
* Paper: undoing safety with as few as 10 examples - https://arxiv.org/abs/2310.03693
* Recommended papers on latent adversarial training (LAT) -
  * https://ai-alignment.com/training-robust-corrigibility-ce0e0a3b9b4d
  * https://arxiv.org/abs/2403.05030
* Scoping (related to model unlearning) blog post by Cas - https://www.alignmentforum.org/posts/mFAvspg4sXkrfZ7FA/deep-forgetting-and-unlearning-for-safely-scoped-llms
* Defending against failure modes using LAT - https://arxiv.org/abs/2403.05030
* Cas' systems for reading for research -
  * Follow ML Twitter
  * Use a combination of the following two search tools for new arXiv papers:
    * https://vjunetxuuftofi.github.io/arxivredirect/
    * https://chromewebstore.google.com/detail/highlight-this-finds-and/fgmbnmjmbjenlhbefngfibmjkpbcljaj?pli=1
  * Skim a new paper or two a day + take brief notes in a searchable notes app
* Recommended people to follow to learn about how to impact the world through research -
  * Dan Hendrycks
  * Been Kim
  * Jacob Steinhardt
  * Nicholas Carlini
  * Paul Christiano
  * Ethan Perez
Recorded May 1, 2024
