19 - Mechanistic Interpretability with Neel Nanda
How good are we at understanding the internal computation of advanced machine learning models, and do we have a hope at getting better? In this episode, Neel Nanda talks about the sub-field of mechanistic interpretability research, as well as papers he's contributed to that explore the basics of transformer circuits, induction heads, and grokking.
Topics we discuss, and timestamps:
- 00:01:05 - What is mechanistic interpretability?
- 00:24:16 - Types of AI cognition
- 00:54:27 - Automating mechanistic interpretability
- 01:11:57 - Summarizing the papers
- 01:24:43 - 'A Mathematical Framework for Transformer Circuits'
- 01:39:31 - How attention works
- 01:49:26 - Composing attention heads
- 01:59:42 - Induction heads
- 02:11:05 - 'In-context Learning and Induction Heads'
- 02:12:55 - The multiplicity of induction heads
- 02:30:10 - Lines of evidence
- 02:38:47 - Evolution in loss-space
- 02:46:19 - Mysteries of in-context learning
- 02:50:57 - 'Progress measures for grokking via mechanistic interpretability'
- 02:50:57 - How neural nets learn modular addition
- 03:11:37 - The suddenness of grokking
- 03:34:16 - Relation to other research
- 03:43:57 - Could mechanistic interpretability possibly work?
- 03:49:28 - Following Neel's research
The transcript: axrp.net/episode/2023/02/04/episode-19-mechanistic-interpretability-neel-nanda.html
Links to Neel's things:
- Neel on Twitter: twitter.com/NeelNanda5
- Neel on the Alignment Forum: alignmentforum.org/users/neel-nanda-1
- Neel's mechanistic interpretability blog: neelnanda.io/mechanistic-interpretability
- TransformerLens: github.com/neelnanda-io/TransformerLens
- Concrete Steps to Get Started in Transformer Mechanistic Interpretability: alignmentforum.org/posts/9ezkEb9oGvEi6WoB3/concrete-steps-to-get-started-in-transformer-mechanistic
- Neel on YouTube: youtube.com/@neelnanda2469
- 200 Concrete Open Problems in Mechanistic Interpretability: alignmentforum.org/s/yivyHaCAmMJ3CqSyj
- Comprehensive mechanistic interpretability explainer: dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J
Writings we discuss:
- A Mathematical Framework for Transformer Circuits: transformer-circuits.pub/2021/framework/index.html
- In-context Learning and Induction Heads: transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
- Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.05217
- Hungry Hungry Hippos: Towards Language Modeling with State Space Models (referred to in this episode as the "S4 paper"): arxiv.org/abs/2212.14052
- interpreting GPT: the logit lens: lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
- Locating and Editing Factual Associations in GPT (aka the ROME paper): arxiv.org/abs/2202.05262
- Human-level play in the game of Diplomacy by combining language models with strategic reasoning: science.org/doi/10.1126/science.ade9097
- Causal Scrubbing: alignmentforum.org/s/h95ayYYwMebGEYN5y/p/JvZhhzycHu2Yd57RN
- An Interpretability Illusion for BERT: arxiv.org/abs/2104.07143
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small: arxiv.org/abs/2211.00593
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177
- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models: arxiv.org/abs/2201.03544
- Collaboration & Credit Principles: colah.github.io/posts/2019-05-Collaboration
- Transformer Feed-Forward Layers Are Key-Value Memories: arxiv.org/abs/2012.14913
- Multi-Component Learning and S-Curves: alignmentforum.org/posts/RKDQCB6smLWgs2Mhr/multi-component-learning-and-s-curves
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks: arxiv.org/abs/1803.03635
- Linear Mode Connectivity and the Lottery Ticket Hypothesis: proceedings.mlr.press/v119/frankle20a