AF - Estimating Tail Risk in Neural Networks by Jacob Hilton
Machine learning systems are typically trained to maximize average-case performance. However, this method of training can fail to meaningfully control the probability of tail events that might cause significant harm. For instance, while an artificial intelligence (AI) assistant may be generally safe, it would be catastrophic if it ever suggested an action that resulted in unnecessary large-scale harm.
Current techniques for estimating the probability of tail events are based on finding inputs on which an AI behaves catastrophically. Since the input space is so large, it might be prohibitive to search through it thoroughly enough to detect all potential catastrophic behavior. As a result, these techniques cannot be used to produce AI systems that we are confident will never behave catastrophically.
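To get a rough sense of the scale involved, the back-of-the-envelope sketch below uses illustrative numbers of our own choosing (vocabulary size, prompt length, and search budget are made up, not figures from the post); even under generous assumptions, any feasible search visits a vanishing fraction of the input space.

```python
# Illustrative arithmetic only; the vocabulary size, prompt length, and search
# budget below are made-up numbers, not figures from the post.
import math

VOCAB = 50_000     # hypothetical token vocabulary
PROMPT_LEN = 100   # hypothetical input length in tokens

log10_inputs = PROMPT_LEN * math.log10(VOCAB)
print(f"possible inputs: ~10^{log10_inputs:.0f}")  # ~10^470

budget = 10 ** 15  # a very generous number of inputs we could ever evaluate
print(f"fraction of input space searched: ~10^{math.log10(budget) - log10_inputs:.0f}")  # ~10^-455
```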
We are excited about techniques to estimate the probability of tail events that do not rely on finding inputs on which an AI behaves badly, and can thus detect a broader range of catastrophic behavior. We think developing such techniques is an exciting problem to work on to reduce the risk posed by advanced AI systems:
Estimating tail risk is a conceptually straightforward problem with relatively objective success criteria; we are predicting something mathematically well-defined, unlike instances of eliciting latent knowledge (ELK) where we are predicting an informal concept like "diamond".
Improved methods for estimating tail risk could reduce risk from a variety of sources, including central misalignment risks like deceptive alignment.
Improvements to current methods can be found both by doing empirical research and by thinking about the problem from a theoretical angle.
This document will discuss the problem of estimating the probability of tail events and explore estimation strategies that do not rely on finding inputs on which an AI behaves badly. In particular, we will:
Introduce a toy scenario about an AI engineering assistant for which we want to estimate the probability of a catastrophic tail event.
Explain some deficiencies of adversarial training, the most common method for reducing risk in contemporary AI systems.
Discuss deceptive alignment as a particularly dangerous case in which adversarial training might fail.
Present methods for estimating the probability of tail events in neural network behavior that do not rely on evaluating behavior on concrete inputs.
Conclude with a discussion of why we are excited about work aimed at improving estimates of the probability of tail events.
This document describes joint research done with Jacob Hilton, Victor Lecomte, David Matolcsi, Eric Neyman, Thomas Read, George Robinson, and Gabe Wu. Thanks additionally to Ajeya Cotra, Lukas Finnveden, and Erik Jenner for helpful comments and suggestions.
A Toy Scenario
Consider a powerful AI engineering assistant. Write M for this AI system, and M(x) for the action it suggests given some project description x.
We want to use this system to help with various engineering projects, but would like it to never suggest an action that results in large-scale harm, e.g. creating a doomsday device. In general, we define a behavior as catastrophic if it must never occur in the real world.[1] An input is catastrophic if it would lead to catastrophic behavior.
Assume we can construct a catastrophe detector C that tells us if an action M(x) will result in large-scale harm. For the purposes of this example, we will assume both that C has a reasonable chance of catching all catastrophes and that it is feasible to find a useful engineering assistant M that never triggers C (see Catastrophe Detectors for further discussion).
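As a concrete (and heavily simplified) sketch of this setup, the snippet below uses hypothetical placeholder implementations of M, C, and an input distribution; none of these correspond to a real system. It also shows why an estimate based only on evaluating concrete inputs can report zero risk even when the true tail probability is nonzero.

```python
# Hypothetical sketch of the toy scenario: M, C, and sample_input are
# placeholders standing in for the objects described above, not real systems.
import random

def sample_input():
    # Placeholder: draw a "project description" x from the deployment distribution.
    return random.random()

def M(x):
    # Placeholder AI engineering assistant: maps a project description to an action.
    return x

def C(action) -> bool:
    # Placeholder catastrophe detector: flags actions that would cause large-scale harm.
    # The "catastrophic" region is made extremely rare on purpose.
    return action > 1 - 1e-9

def naive_tail_estimate(n: int = 100_000) -> float:
    # Input-based Monte Carlo estimate of P[C(M(x))] over sampled inputs.
    hits = sum(C(M(sample_input())) for _ in range(n))
    return hits / n

print(naive_tail_estimate())  # almost certainly 0.0, even though the true probability is ~1e-9
```

Under these made-up numbers the estimator returns 0 with overwhelming probability, mirroring the concern above: evaluating behavior on concrete inputs cannot distinguish "never catastrophic" from "catastrophic with small but unacceptable probability".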
We will also assume we can use C to train M, but that it is ...