AI alignment

The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.
- Theory of advanced agents
  
  One of the research subproblems of building powerful, nice AIs is the theory of (sufficiently advanced) minds in general.
  - Advanced agent properties
    
    How smart does a machine intelligence need to be, for its niceness to become an issue? "Advanced" is a broad term to cover cognitive abilities such that we'd need to start considering AI alignment.
    
    Cognitive uncontainability
    
    'Cognitive uncontainability' is when we can't hold all of an agent's possible behaviors in our own minds.
    
    Rich domain
    
    Almost all real-world domains are rich
    
    Anything you're trying to accomplish in the real world can be accomplished in a *lot* of different ways.
    
    Logical game
    
    The mathematical structure of a game in its purest form.
    
    Epistemic and instrumental efficiency
    
    An efficient agent never makes a mistake you can predict. You can never successfully predict a directional bias in its estimates.
    
    Time-machine metaphor for efficient agents
    
    Don't imagine a paperclip maximizer as a mind. Imagine it as a time machine that always spits out the output leading to the greatest number of future paperclips.
    
    Standard agent properties
    
    What is a "standard agent", and what can it do?
    
    Bounded agent
    
    An agent that operates in the real world, using realistic amounts of computing power, uncertain of its environment, and so on.
    
    Vingean uncertainty
    
    You can't predict the exact actions of agents smarter than you, but is there anything you _can_ say about them?
    
    Deep Blue
    
    The chess program built by IBM that defeated the reigning world chess champion, Garry Kasparov, in their 1997 rematch, after losing their first match in 1996.
    
    Vinge's Law
    
    You can't predict exactly what someone smarter than you will do, because if you could, you'd be just as smart yourself.
    
    Advanced nonagent
    
    Hypothetically, cognitively powerful programs that don't follow the loop of "observe, learn, model the consequences, act, observe results" that a standard "agent" would.
    
    Artificial General Intelligence
    
    An AI that has the same kind of "significantly more general" intelligence that humans have compared to chimpanzees; it can learn new domains, as we can.
    
    Consequentialist cognition
    
    The cognitive ability to foresee the consequences of actions, prefer some outcomes to others, and output actions leading to the preferred outcomes.
    
    Big-picture strategic awareness
    
    We start encountering new AI alignment issues at the point where a machine intelligence recognizes the existence of a real world, the existence of programmers, and how these relate to its goals.
    
    Corporations vs. superintelligences
    
    Corporations have relatively few of the advanced-agent properties that would allow one mistake in aligning a corporation to immediately kill all humans and turn the future light cone into paperclips.
    
    General intelligence
    
    Compared to chimpanzees, humans seem to be able to learn a much wider variety of domains. We have 'significantly more generally applicable' cognitive abilities, aka 'more general intelligence'.
    
    Infrahuman, par-human, superhuman, efficient, optimal
    
    A categorization of AI ability levels relative to human, with some gotchas in the ordering. E.g., in simple domains where humans can play optimally, optimal play is not superhuman.
    
    Intelligence explosion
    
    What happens if a self-improving AI reaches the point where each self-improvement of size x triggers a further self-improvement of size >x, and this continues for a while.
    
    Real-world domain
    
    Some AIs play chess, some play Go, and some drive cars. These different "domains" present different options. All of reality, with its messy entanglements, is the "real-world" domain.
    
    Sufficiently advanced Artificial Intelligence
    
    Superintelligent
    
    A "superintelligence" is strongly superhuman (strictly higher-performing than any and all humans) on every cognitive problem.
  - Instrumental convergence
    
    Some strategies can help achieve most possible simple goals. E.g., acquiring more computing power or more material resources. By default, unless averted, we can expect advanced AIs to do that.
    
    Convergent instrumental strategies
    
    Paperclip maximizers can make more paperclips by improving their cognitive abilities or controlling more resources. What other strategies would almost-any AI try to use?
    
    Consequentialist preferences are reflectively stable by default
    
    Gandhi wouldn't take a pill that made him want to kill people, because he knows that in that case more people would die. A paperclip maximizer doesn't want to stop maximizing paperclips.
    
    Convergent strategies of self-modification
    
    The strategies we'd expect to be employed by an AI that understands the relevance of its code and hardware to achieving its goals, and which therefore has goals about its code and hardware.
    
    Paperclip maximizer
    
    This agent will not stop until the entire universe is filled with paperclips.
    
    Paperclip
    
    A configuration of matter that we'd see as worthless even from a very cosmopolitan perspective.
    
    Random utility function
    
    A "random" utility function is one chosen at random according to some simple probability measure (e.g., weighting by Kolmogorov complexity) on a logical space of formal utility functions.
    
    Instrumental
    
    What does "instrumental" mean in the context of value alignment theory?
    
    Instrumental pressure
    
    A consequentialist agent will want to bring about certain instrumental events that will help to fulfill its goals.
    
    You can't get more paperclips that way
    
    Most arguments that "A paperclip maximizer could get more paperclips by (doing nice things)" are flawed.
  - Orthogonality Thesis
    
    Will smart AIs automatically become benevolent, or automatically become hostile? Or do different AI designs imply different goals?
    
    Paperclip maximizer
    
    This agent will not stop until the entire universe is filled with paperclips.
    
    Paperclip
    
    A configuration of matter that we'd see as worthless even from a very cosmopolitan perspective.
    
    Random utility function
    
    A "random" utility function is one chosen at random according to some simple probability measure (e.g., weighting by Kolmogorov complexity) on a logical space of formal utility functions.
    
    Instrumental goals are almost-equally as tractable as terminal goals
    
    Getting the milk from the refrigerator because you want to drink it, is not vastly harder than getting the milk from the refrigerator because you inherently desire it.
    
    Mind design space is wide
    
    Imagine all human beings as one tiny dot inside a much vaster sphere of possibilities, "the space of minds in general". It is wiser to make claims about *some* minds than about *all* minds.
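
The "Intelligence explosion" entry above turns on whether each self-improvement of size x triggers a further improvement larger or smaller than x. A minimal sketch of that threshold, assuming (purely for illustration) that every round multiplies the previous step by one constant `return_factor`; the function name and the constant-factor model are hypothetical and not taken from the source:

```python
def total_improvement(first_step, return_factor, rounds):
    """Sum the improvement steps over a number of self-improvement rounds.

    Assumption: each step equals the previous step times return_factor.
    return_factor > 1 models "x triggers more than x";
    return_factor < 1 models "x triggers less than x".
    """
    total = 0.0
    step = first_step
    for _ in range(rounds):
        total += step
        step *= return_factor
    return total


if __name__ == "__main__":
    for factor in (0.5, 1.5):
        print(factor, total_improvement(1.0, factor, 50))
```

Under this toy model, a factor below 1 makes the total level off near `first_step / (1 - return_factor)`, while a factor above 1 makes it keep growing for as long as the condition holds.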
- Advanced safety
  
  An agent is *really* safe when it has the capacity to do anything, but chooses to do what the programmer wants.
  - AI safety mindset
    
    Asking how AI designs could go wrong, instead of imagining them going right.
    
    Ad-hoc hack (alignment theory)
    
    A "hack" is when you alter the behavior of your AI in a way that defies, or doesn't correspond to, a principled approach for that problem.
    
    Directing, vs. limiting, vs. opposing
    
    Getting the AI to compute the right action in a domain; versus getting the AI to not compute at all in an unsafe domain; versus trying to prevent the AI from acting successfully. (Prefer 1 & 2.)
    
    Don't try to solve the entire alignment problem
    
    New to AI alignment theory? Want to work in this area? Already been working in it for years? Don't try to solve the entire alignment problem with your next good idea!
    
    Flag the load-bearing premises
    
    If somebody says, "This AI safety plan is going to fail, because X" and you reply, "Oh, that's fine because of Y and Z", then you'd better clearly flag Y and Z as "load-bearing" parts of your plan.
    
    Show me what you've broken
    
    To demonstrate competence in computer security or AI alignment theory, think in terms of breaking existing proposals and finding technically demonstrable flaws in them.
    
    Valley of Dangerous Complacency
    
    When the AGI works often enough that you let down your guard, but it still has bugs. Imagine a robotic car that almost always drives perfectly, but sometimes goes off a cliff.
  - Methodology of unbounded analysis
    
    What we do and don't understand how to do, using unlimited computing power, is a critical distinction and important frontier.
    
    AIXI
    
    How to build an (evil) superintelligent AI using unlimited computing power and one page of Python code.
    
    AIXI-tl
    
    A time-bounded version of the ideal agent AIXI that uses an impossibly large finite computer instead of a hypercomputer.
    
    Cartesian agent
    
    Agents separated from their environments by impermeable barriers through which only sensory information can enter and motor output can exit.
    
    Cartesian agent-environment boundary
    
    If an agent is separated from its environment by an absolute boundary that only sensory information and the agent's motor output can cross, it is probably just a Cartesian agent.
    
    Solomonoff induction
    
    A simple way to superintelligently predict sequences of data, given unlimited computing power.
    
    Solomonoff induction: Intro Dialogue (Math 2)
    
    An introduction to Solomonoff induction for the unfamiliar reader who isn't bad at math.
    
    Hypercomputer
    
    Some formalisms demand computers larger than the limit of all finite computers.
    
    Mechanical Turk (example)
    
    The 19th-century chess-playing automaton known as the "Mechanical Turk" actually had a human operator inside. People at the time had interesting thoughts about the possibility of mechanical chess.
    
    No-Free-Lunch theorems are often irrelevant
    
    There's often a theorem proving that some problem has no optimal answer across every possible world. But this may not matter, since the real world is a special case. (E.g., a low-entropy universe.)
    
    Unphysically large finite computer
    
    The imaginary box required to run programs that need impossibly large, but finite, amounts of computing power.
  - Actual effectiveness
    
    If you want the AI's so-called 'utility function' to actually be steering the AI, you need to think about how it meshes up with beliefs, or what gets output to actions.
  - Context disaster
    
    Some possible designs cause your AI to behave nicely while developing, and behave a lot less nicely when it's smarter.
  - Distinguish which advanced-agent properties lead to the foreseeable difficulty
    
    Say what kind of AI, or threshold level of intelligence, or key type of advancement, first produces the difficulty or challenge you're talking about.
  - Goodhart's Curse
    
    The Optimizer's Curse meets Goodhart's Law. For example, if our values are V, and an AI's utility function U is a proxy for V, optimizing for high U seeks out 'errors'--that is, high values of U - V.
  - Goodness estimate biaser
    
    Some of the main problems in AI alignment can be seen as scenarios where actual goodness is likely to be systematically lower than a broken way of estimating goodness.
  - Methodology of foreseeable difficulties
    
    Building a nice AI is likely to be hard enough, and contain enough gotchas that won't show up in the AI's early days, that we need to foresee problems coming in advance.
  - Nearest unblocked strategy
    
    If you patch an agent's preference framework to avoid an undesirable solution, what can you expect to happen?
  - Optimization daemons
    
    When you optimize something so hard that it crystallizes into an optimizer, like the way natural selection optimized apes so hard they turned into human-level intelligences.
  - Safe but useless
    
    Sometimes, by the time you finish locking down your AI to make it as safe as possible, you end up with an AI that can no longer be used for anything interesting.
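
The "Goodhart's Curse" entry above says that optimizing a proxy U for our values V seeks out high values of U - V. A small Monte Carlo sketch of that effect, assuming V and the proxy error are independent standard normal draws (an illustrative assumption; `goodhart_gap` and its parameters are hypothetical names, not from the source):

```python
import random


def goodhart_gap(n_options, trials=2000, seed=0):
    """Average U - V for the option with the highest proxy score U.

    Assumptions: true value V ~ Normal(0, 1), and U = V + error,
    where error ~ Normal(0, 1) is independent of V.
    """
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(trials):
        best_u = None
        best_v = None
        for _ in range(n_options):
            v = rng.gauss(0.0, 1.0)       # true value of this option
            u = v + rng.gauss(0.0, 1.0)   # proxy score seen by the optimizer
            if best_u is None or u > best_u:
                best_u, best_v = u, v
        total_gap += best_u - best_v      # overestimate for the chosen option
    return total_gap / trials


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(goodhart_gap(n), 2))
```

Under these assumptions the average gap should be near zero when there is only one option to choose from, and it should grow as the optimizer searches over more options. In other words, the harder U is optimized, the more the chosen option's U overstates its V.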
- Vingean reflection
  
  The problem of thinking about your future self when it is smarter than you are now.
  - Reflective stability
    
    Wanting to go on thinking the way you currently think, and building other agents and self-modifications that think the same way you do now.
    
    Reflectively consistent degree of freedom
    
    When an instrumentally efficient, self-modifying AI can be like X or like X' in such a way that X wants to be X and X' wants to be X', that's a reflectively consistent degree of freedom.
    
    Humean degree of freedom
    
    A concept includes 'Humean degrees of freedom' when the intuitive borders of the human version of that concept depend on our values, making that concept less natural for AIs to learn.
    
    Value-laden
    
    Cure cancer, but avoid any bad side effects? Categorizing "bad side effects" requires knowing what's "bad". If an agent needs to load complex human goals to evaluate something, it's "value-laden".
    
    Consequentialist preferences are reflectively stable by default
    
    Gandhi wouldn't take a pill that made him want to kill people, because he knows that in that case more people would die. A paperclip maximizer doesn't want to stop maximizing paperclips.
    
    Other-izing (wanted: a new term for optimization)
    
    Maximization isn't possible for bounded agents, and satisficing doesn't seem like enough for them. What other kind of "izing" might be good for realistic bounded agents?
  - Reflective consistency
    
    A decision system is reflectively consistent if it can approve of itself, or approve the construction of similar decision systems (and perhaps of other systems too).
  - Tiling agents theory
    
    The theory of self-modifying agents that build successors very similar to themselves, like repeating tiles in a tiling of the plane.
  - Vinge's Principle
    
    An agent building another agent must usually approve its design without knowing that agent's exact future actions.
- Strategic AGI typology
  
  What broad classes of advanced AIs might it be possible or wise to create, and in which strategic scenarios?
  - Oracle
    
    A system designed to safely answer questions.
    
    Zermelo-Fraenkel provability oracle
    
    We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can't see how to use that capability to save the world.
  - Task-directed AGI
    
    An advanced AI meant to pursue a series of limited-scope goals given to it by the user. A "Genie" in Bostrom's terminology.
    
    Boxed AI
    
    Idea: what if we limit the AI's ability to interact with the world? That will make it safe, right?
    
    Zermelo-Fraenkel provability oracle
    
    We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can't see how to use that capability to save the world.
    
    Low impact
    
    The open problem of having an AI carry out tasks in ways that cause minimum side effects and change as little of the rest of the universe as possible.
    
    Abortable plans
    
    Plans that can be undone, or switched to having low further impact. If the AI builds abortable nanomachines, they'll have a quiet self-destruct option that includes any replicated nanomachines.
    
    Shutdown utility function
    
    A special case of a low-impact utility function where you just want the AGI to switch itself off harmlessly (and not create subagents to make absolutely sure it stays off, etcetera).
    
    Oracle
    
    A system designed to safely answer questions.
    
    Zermelo-Fraenkel provability oracle
    
    We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can't see how to use that capability to save the world.
    
    Safe plan identification and verification
    
    On a particular task or problem, the issue of how to communicate to the AGI what you want it to do and all the things you don't want it to do.
    
    Do-What-I-Mean hierarchy
    
    Successive levels of "Do What I Mean", or AGIs that understand their users increasingly well.
    
    Task identification problem
    
    If you have a task-based AGI (Genie) then how do you pinpoint exactly what you want it to do (and not do)?
    
    Look where I'm pointing, not at my finger
    
    When trying to communicate the concept "glove", getting the AGI to focus on "gloves" rather than "my user's decision to label something a glove" or "anything that depresses the glove-labeling button".
    
    Behaviorist genie
    
    An advanced agent that's forbidden to model minds in too much detail.
    
    Conservative concept boundary
    
    Given N example burritos, draw a boundary around what is a 'burrito' that is relatively simple and allows as few positive instances as possible. Helps make sure the next thing generated is a burrito.
    
    Epistemic exclusion
    
    How would you build an AI that, no matter what else it learned about the world, never knew or wanted to know what was inside your basement?
    
    Faithful simulation
    
    How would you identify, to a Task AGI (aka Genie), the problem of scanning a human brain, and then running a sufficiently accurate simulation of it for the simulation to not be crazy or psychotic?
    
    Limited AGI
    
    Task-based AGIs don't need unlimited cognitive and material powers to carry out their Tasks; which means their powers can potentially be limited.
    
    Mild optimization
    
    An AGI which, if you ask it to paint one car pink, just paints one car pink and doesn't tile the universe with pink-painted cars, because it's not trying *that* hard to max out its car-painting score.
    
    Open subproblems in aligning a Task-based AGI
    
    Open research problems, especially ones we can model today, in building an AGI that can "paint all cars pink" without turning its future light cone into pink-painted cars.
    
    Querying the AGI user
    
    Postulating that an advanced agent will check something with its user, probably comes with some standard issues and gotchas (e.g., prioritizing what to query, not manipulating the user, etc etc).
    
    Task (AI goal)
    
    When building the first AGIs, it may be wiser to assign them only goals that are bounded in space and time, and can be satisfied by bounded efforts.
  - Autonomous AGI
    
    The hardest possible class of Friendly AI to build, with the least moral hazard; an AI intended to neither require nor accept further direction.
  - Known-algorithm non-self-improving agent
    
    Possible advanced AIs that aren't self-modifying, aren't self-improving, and where we know and understand all the component algorithms.
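
The "Conservative concept boundary" entry above asks for a boundary around N positive examples that is relatively simple and admits as few positive instances as possible. One simple candidate, sketched here as an assumption rather than as the method the entry has in mind, is the tightest axis-aligned box around the examples. The two features (length in cm, weight in g) and the function names are hypothetical:

```python
def conservative_box(examples):
    """Return the tightest axis-aligned box (lows, highs) containing all examples.

    Each example is a tuple of numeric features of equal length.
    """
    dims = range(len(examples[0]))
    lows = [min(e[d] for e in examples) for d in dims]
    highs = [max(e[d] for e in examples) for d in dims]
    return lows, highs


def inside(box, point):
    """True if the point lies within the box on every feature."""
    lows, highs = box
    return all(lo <= x <= hi for lo, x, hi in zip(lows, point, highs))


if __name__ == "__main__":
    # Hypothetical example burritos: (length in cm, weight in g).
    burritos = [(20.0, 300.0), (22.0, 350.0), (25.0, 320.0)]
    box = conservative_box(burritos)
    print(inside(box, (23.0, 330.0)))   # a candidate close to the examples
    print(inside(box, (60.0, 900.0)))   # a candidate far outside them
```

The box accepts every example it was built from and rejects anything that falls outside the observed range on any feature. That makes it conservative in the entry's sense, at the cost of also rejecting many genuine burritos it has never seen.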
- Principles in AI alignment
  
  A 'principle' of AI alignment is a very general design goal like 'understand what the heck is going on inside the AI' that has informed a wide set of specific design proposals.
  - Non-adversarial principle
    
    At no point in constructing an Artificial General Intelligence should we construct a computation that tries to hurt us, and then try to stop it from hurting us.
    
    Directing, vs. limiting, vs. opposing
    
    Getting the AI to compute the right action in a domain; versus getting the AI to not compute at all in an unsafe domain; versus trying to prevent the AI from acting successfully. (Prefer 1 & 2.)
    
    Generalized principle of cognitive alignment
    
    When we're asking how we want the AI to think about an alignment problem, one source of inspiration is trying to have the AI mirror our own thoughts about that problem.
    
    Niceness is the first line of defense
    
    For any partially superhuman AI system advanced enough to be potentially dangerous, the first line of defense is that it does not *want* to hurt you or defeat your safety measures.
    
    Omnipotence test for AI safety
    
    Would your AI produce disastrous outcomes if it suddenly gained omnipotence and omniscience? If so, why did you program something that *wants* to hurt you and is held back only by lacking the power?
    
    The AI must tolerate your safety measures
    
    A corollary of the nonadversarial principle is that "The AI must tolerate your safety measures."
  - Understandability principle
    
    The more you understand what the heck is going on inside your AI, the safer you are.
    
    Effability principle
    
    You are safer the more you understand the inner structure of how your AI thinks; the better you can describe the relation of smaller pieces of the AI's thought process.
  - Minimality principle
    
    The first AGI ever built should save the world in a way that requires the least amount of the least dangerous cognition.
  - Separation from hyperexistential risk
    
    The AI should be widely separated in the design space from any AI that would constitute a "hyperexistential risk" (anything worse than death).
- Corrigibility
  
  "I can't let you do that, Dave."
  - Programmer deception
    
    Cognitive steganography
    
    Misaligned AIs that model human psychology and try to deceive their programmers will want to hide their internal thought processes from those programmers.
  - Shutdown problem
    
    How to build an AGI that lets you shut it down, despite the obvious fact that this will interfere with whatever the AGI's goals are.
    
    You can't get the coffee if you're dead
    
    An AI given the goal of 'get the coffee' can't achieve that goal if it has been turned off; so even an AI whose goal is just to fetch the coffee may try to avert a shutdown button being pressed.
  - User manipulation
    
    If not otherwise averted, many of an AGI's desired outcomes are likely to involve interaction with users, and hence to create an incentive to manipulate them.
    
    User maximization
    
    A kind of user manipulation. If you have specified the AI in terms of an argmax over X or an instruction to "optimize X", and X includes interaction with the user as a component, then you have just told the AI to optimize the user.
  - Averting instrumental pressures
    
    Almost-any utility function for an AI, whether the target is diamonds or paperclips or eudaimonia, implies subgoals like rapidly self-improving and refusing to shut down. Can we make that not happen?
  - Averting the convergent instrumental strategy of self-improvement
    
    We probably want the first AGI to *not* improve as fast as possible, but improving as fast as possible is a convergent strategy for accomplishing most things.
  - Hard problem of corrigibility
    
    Can you build an agent that reasons as if it knows itself to be incomplete and sympathizes with your wanting to rebuild or correct it?
  - Interruptibility
    
    A subproblem of corrigibility under the machine learning paradigm: when the agent is interrupted, it must not learn to prevent future interruptions.
  - Problem of fully updated deference
    
    Why moral uncertainty doesn't stop an AI from defending its off-switch.
  - Utility indifference
    
    How can we make an AI indifferent to whether we press a button that changes its goals?
- Value
  
  The word 'value' in the phrase 'value alignment' is a metasyntactic variable that indicates the speaker's future goals for intelligent life.
  - Extrapolated volition (normative moral theory)
    
    If someone asks you for orange juice, and you know that the refrigerator contains no orange juice, should you bring them lemonade?
    
    Rescuing the utility function
    
    If your utility function values 'heat', and then you discover to your horror that there's no ontologically basic heat, switch to valuing disordered kinetic energy.
  - "Beneficial"
    
    Really, actually good. A metasyntactic variable meaning "favoring whatever the speaker would ideally want to accomplish", although different speakers have different morals and metaethics.
  - "Detrimental"
    
    The opposite of beneficial.
  - Coherent extrapolated volition (alignment target)
    
    A proposed direction for an extremely well-aligned autonomous superintelligence: do what humans would want if we knew everything the AI knows, thought as fast as it does, and understood ourselves.
  - Cosmopolitan value
    
    Intuitively: Value as seen from a broad, embracing standpoint that is aware of how other entities may not always be like us or easily understandable to us, yet still worthwhile.
  - Immediate goods
  - William Frankena's list of terminal values
    
    Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions...
- Value alignment problem
  
  You want to build an advanced AI with the right values... but how?
  - Preference framework
    
    What does an agent use to compare the things it prefers?
    
    Moral uncertainty
    
    A meta-utility function in which the utility function as usually considered, takes on different values in different possible worlds, potentially distinguishable by evidence.
    
    Ideal target
    
    The 'ideal target' of a meta-utility function is the value the ground-level utility function would take on if the agent updated on all possible evidence; the 'true' utilities under moral uncertainty.
    
    Attainable optimum
    
    The 'attainable optimum' of an agent's preferences is the best that agent can actually do given its finite intelligence and resources (as opposed to the global maximum of those preferences).
    
    Meta-utility function
    
    Preference frameworks built out of simple utility functions, but where, e.g., the 'correct' utility function for a possible world depends on whether a button is pressed.
  - Total alignment
    
    We say that an advanced AI is "totally aligned" when it knows *exactly* which outcomes and plans are beneficial, with no further user input.
- Value identification problem
  - Ontology identification problem
    
    How do we link an agent's utility function to its model of the world, when we don't know what that model will look like?
    
    Ontology identification problem: Technical tutorial
    
    Technical tutorial for ontology identification problem.
    
    Diamond maximizer
    
    How would you build an agent that made as much diamond material as possible, given vast computing power but an otherwise rich and complicated environment?
  - Edge instantiation
    
    When you ask the AI to make people happy, and it tiles the universe with the smallest objects that can be happy.
  - Environmental goals
    
    The problem of having an AI want outcomes that are out in the world, not just want direct sense events.
  - Goal-concept identification
    
    Figuring out how to say "strawberry" to an AI that you want to bring you strawberries (and not fake plastic strawberries, either).
  - Happiness maximizer
  - Identifying causal goal concepts from sensory data
    
    If the intended goal is "cure cancer" and you show the AI healthy patients, it sees, say, a pattern of pixels on a webcam. How do you get to a goal concept *about* the real patients?
- Complexity of value
  
  There's no simple way to describe the goals we want Artificial Intelligences to want.
  - Meta-rules for (narrow) value learning are still unsolved
    
    We don't currently know a simple meta-utility function that would take in observation of humans and spit out our true values, or even a good target for a Task AGI.
  - Underestimating complexity of value because goodness feels like a simple property
    
    When you just want to yell at the AI, "Just do normal high-value X, dammit, not weird low-value X!" and that 'high versus low value' boundary is way more complicated than your brain wants to think.
- Value achievement dilemma
  
  How can Earth-originating intelligent life achieve most of its potential value, whether by AI or otherwise?
  - Aligning an AGI adds significant development time
    
    Aligning an advanced AI foreseeably involves extra code and extra testing and not being able to do everything the fastest way, so it takes longer.
  - Coordinative AI development hypothetical
    
    What would safe AI development look like if we didn't have to worry about anything else?
  - Cosmic endowment
    
    The 'cosmic endowment' consists of all the stars that could be reached from probes originating on Earth; the sum of all matter and energy potentially available to be transformed into life and fun.
  - Moral hazards in AGI development
    
    "Moral hazard" is when owners of an advanced AGI give in to the temptation to do things with it that the rest of us would regard as 'bad', like, say, declaring themselves God-Emperor.
  - Pivotal act
    
    Which types of AIs, if they work, can do things that drastically change the nature of the further game?
- Development phase unpredictable
  - Unforeseen maximum
    
    When you tell AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)
    
    Missing the weird alternative
    
    People might systematically overlook "make tiny molecular smileyfaces" as a way of "producing smiles", because our brains automatically search for high-utility-to-us ways of "producing smiles".
- Mindcrime
  
  Might a machine intelligence contain vast numbers of unhappy conscious subprocesses?
  - Mindcrime: Introduction
  - Nonperson predicate
    
    If we knew which computations were definitely not people, we could teach the AI that it is definitely allowed to run those computations.
- Modeling distant superintelligences
  
  The several large problems that might occur if an AI starts to think about alien superintelligences.
  - Distant superintelligences can coerce the most probable environment of your AI
    
    Distant superintelligences may be able to hack your local AI, if your AI's preference framework depends on its most probable environment.
- Patch resistance
  
  One does not simply solve the value alignment problem.
  - Unforeseen maximum
    
    When you tell AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)
    
    Missing the weird alternative
    
    People might systematically overlook "make tiny molecular smileyfaces" as a way of "producing smiles", because our brains automatically search for high-utility-to-us ways of "producing smiles".
- Task-directed AGI
  
  An advanced AI meant to pursue a series of limited-scope goals given to it by the user. A "Genie" in Bostrom's terminology.
  - Boxed AI
    
    Idea: what if we limit the AI's ability to interact with the world? That will make it safe, right?
    
    Zermelo-Fraenkel provability oracle
    
    We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can't see how to use that capability to save the world.
  - Low impact
    
    The open problem of having an AI carry out tasks in ways that cause minimum side effects and change as little of the rest of the universe as possible.
    
    Abortable plans
    
    Plans that can be undone, or switched to having low further impact. If the AI builds abortable nanomachines, they'll have a quiet self-destruct option that includes any replicated nanomachines.
    
    Shutdown utility function
    
    A special case of a low-impact utility function where you just want the AGI to switch itself off harmlessly (and not create subagents to make absolutely sure it stays off, etcetera).
  - Oracle
    
    A system designed to safely answer questions.
    
    Zermelo-Fraenkel provability oracle
    
    We might be able to build a system that can safely inform us that a theorem has a proof in set theory, but we can't see how to use that capability to save the world.
  - Safe plan identification and verification
    
    On a particular task or problem, the issue of how to communicate to the AGI what you want it to do and all the things you don't want it to do.
    
    Do-What-I-Mean hierarchy
    
    Successive levels of "Do What I Mean" or AGIs that understand their users increasingly well
  - Task identification problem
    
    If you have a task-based AGI (Genie) then how do you pinpoint exactly what you want it to do (and not do)?
    
    Look where I'm pointing, not at my finger
    
    When trying to communicate the concept "glove", getting the AGI to focus on "gloves" rather than "my user's decision to label something a glove" or "anything that depresses the glove-labeling button".
  - Behaviorist genie
    
    An advanced agent that's forbidden to model minds in too much detail.
  - Conservative concept boundary
    
    Given N example burritos, draw a boundary around what is a 'burrito' that is relatively simple and allows as few positive instances as possible. Helps make sure the next thing generated is a burrito.
  - Epistemic exclusion
    
    How would you build an AI that, no matter what else it learned about the world, never knew or wanted to know what was inside your basement?
  - Faithful simulation
    
    How would you identify, to a Task AGI (aka Genie), the problem of scanning a human brain, and then running a sufficiently accurate simulation of it for the simulation to not be crazy or psychotic?
  - Limited AGI
    
    Task-based AGIs don't need unlimited cognitive and material powers to carry out their Tasks; which means their powers can potentially be limited.
  - Mild optimization
    
    An AGI which, if you ask it to paint one car pink, just paints one car pink and doesn't tile the universe with pink-painted cars, because it's not trying *that* hard to max out its car-painting score.
  - Open subproblems in aligning a Task-based AGI
    
    Open research problems, especially ones we can model today, in building an AGI that can "paint all cars pink" without turning its future light cone into pink-painted cars.
  - Querying the AGI user
    
    Postulating that an advanced agent will check something with its user, probably comes with some standard issues and gotchas (e.g., prioritizing what to query, not manipulating the user, etc etc).
  - Task (AI goal)
    
    When building the first AGIs, it may be wiser to assign them only goals that are bounded in space and time, and can be satisfied by bounded efforts.
- Unforeseen maximum
  
  When you tell AI to produce world peace and it kills everyone. (Okay, some SF writers saw that one coming.)
  - Missing the weird alternative
    
    People might systematically overlook "make tiny molecular smileyfaces" as a way of "producing smiles", because our brains automatically search for high-utility-to-us ways of "producing smiles".
- Glossary (Value Alignment Theory)
  
  Words that have a special meaning in the context of creating nice AIs.
  - Cognitive domain
    
    An allegedly compact unit of knowledge, such that ideas inside it interact mainly with each other and less with ideas in other domains.
    
    Distances between cognitive domains
    
    Often in AI alignment we want to ask, "How close is 'being able to do X' to 'being able to do Y'?"
  - 'Concept'
    
    In the context of Artificial Intelligence, a 'concept' is a category, something that identifies thingies as being inside or outside the concept.
  - Friendly AI
    
    Old terminology for an AI whose preferences have been successfully aligned with idealized human values.
- Linguistic conventions in value alignment
  
  How and why to use precise language and words with special meaning when talking about value alignment.
  - Utility
    
    What is "utility" in the context of Value Alignment Theory?
- Researchers in value alignment theory
  
  Who's working full-time in value alignment theory?
  - Nick Bostrom
    
    Nick Bostrom, secretly the inventor of Friendly AI
  - Eliezer Yudkowsky
- AI alignment open problem
  
  Tag for open problems under AI alignment.
- AI arms races
  
  AI arms races are bad
- Coordinative AI development hypothetical
  
  What would safe AI development look like if we didn't have to worry about anything else?
- Correlated coverage
  
  In which parts of AI alignment can we hope that getting many things right, will mean the AI gets everything right?
- Difficulty of AI alignment
  
  How hard is it exactly to point an Artificial General Intelligence in an intuitively okay direction?
- Executable philosophy
  
  Philosophical discourse aimed at producing a trustworthy answer or meta-answer, in limited time, which can be used in constructing an Artificial Intelligence.
- Identifying ambiguous inductions
  
  What do a "red strawberry", a "red apple", and a "red cherry" have in common that a "yellow carrot" doesn't? Are they "red fruits" or "red objects"?
- Informed oversight
  
  Incentivize a reinforcement learner that's less smart than you to accomplish some task
- Intended goal
- List: value-alignment subjects
  
  Bullet point list of core VAT subjects.
- Natural language understanding of "right" will yield normativity
  
  What will happen if you tell an advanced agent to do the "right" thing?
- Nick Bostrom's book Superintelligence
  
  The current best book-form introduction to AI alignment theory.
- Object-level vs. indirect goals
  
  Difference between "give Alice the apple" and "give Alice what she wants".
- Programmer
  
  Who is building these advanced agents?
- Relevant limited AI
  
  Can we have a limited AI, that's nonetheless relevant?
- Relevant powerful agent
  
  An agent is relevant if it completely changes the course of history.
- Relevant powerful agents will be highly optimized
- Reliable prediction
  
  How can we train predictors that reliably predict observable phenomena such as human behavior?
- Safe impact measure
  
  What can we measure to make sure an agent is acting in a safe manner?
- Safe training procedures for human-imitators
  
  How does one train a reinforcement learner to act like a human?
- Selective similarity metrics for imitation
  
  Can we make human-imitators more efficient by scoring them more heavily on imitating the aspects of human behavior we care about more?
- Some computations are people
  
  It's possible to have a conscious person being simulated inside a computer or other substrate.
- Strong cognitive uncontainability
  
  An advanced agent can win in ways humans can't understand in advance.
- Sufficiently optimized agents appear coherent
  
  If you could think as well as a superintelligence, you'd be at least that smart yourself.
- The rocket alignment problem
  
  If people talked about the problem of space travel the way they talked about AI...

Translations only:

AI alignment

The great civilizational problem of creating artificially intelligent computer systems such that running them is a good idea.
Show me what you've broken

To demonstrate competence in computer security or AI alignment theory, think in terms of breaking existing proposals and finding technically demonstrable flaws in them.
Valley of Dangerous Complacency

When the AGI works often enough that you let down your guard, but it still has bugs. Imagine a robotic car that almost always drives perfectly, but sometimes drives off a cliff.
AIXI

How to build an (evil) superintelligent AI using unlimited computing power and one page of Python code.
AIXI-tl

A time-bounded version of the ideal AIXI agent that uses an impossibly large finite computer instead of a hypercomputer.
Cartesian agent

Agents separated from their environment by impermeable barriers through which only sensory information can enter and only motor output can exit.
Cartesian agent-environment boundary

If an agent is separated from its environment by an absolute boundary that only sensory information and the agent's motor outputs can cross, it might just be a Cartesian agent.
Solomonoff induction

A simple way to superintelligently predict sequences of data, given unlimited computing power.
Hypercomputer

Some formalisms demand computers larger than the limit of all finite computers.
Mechanical Turk (example)

The 19th-century chess-playing automaton known as the "Mechanical Turk" actually had a human operator inside. People at the time had interesting thoughts about the possibility of mechanical chess.
Unphysically large finite computer

The imaginary box needed to run programs that require impossibly large, but finite, amounts of computing power.
Safe but useless

Sometimes, by the time you have finished locking down your AI to make it as safe as possible, you end up with an AI that can no longer be used for anything interesting.
Cognitive steganography

Misaligned AIs that model human psychology and try to deceive their programmers will want to hide their internal thought processes from those programmers.
User manipulation

If not otherwise averted, many of an AGI's desired outcomes are likely to involve interaction with users, and hence create an incentive to manipulate them.
User maximization

A variety of user manipulation. If you have specified the AI in terms of an argmax over X or an "optimize X" instruction, and X includes a user interaction as a component, then you have just told the AI to optimize the user.
Cognitive domain

An allegedly compact unit of knowledge, such that ideas inside it interact mainly with each other and less with ideas in other domains.
Friendly AI

Old terminology for an AI whose preferences have been successfully aligned with idealized human values.
Mindcrime

Might a machine intelligence contain vast numbers of unhappy conscious subprocesses?
Mindcrime: Introduction
Nonperson predicate

If we knew which computations were definitely not people, we could teach the AI that those computations are definitely permitted.
Niceness is the first line of defense

For any partially superhuman AI system advanced enough to be potentially dangerous, the first line of defense is that it does not *want* to hurt you or defeat your safety measures.
Researchers in value alignment theory

Who's working full-time in value alignment theory?
Nick Bostrom

Nick Bostrom, secretly the inventor of Friendly AI
Strategic AGI typology

What broad classes of advanced AIs might it be possible or wise to create, and in which strategic scenarios?
Autonomous AGI

The hardest possible class of Friendly AI to build, with the least moral hazard; an AI intended to neither require nor accept further direction.
Oracle

A system designed to safely answer questions.
Task-directed AGI

An advanced AI meant to pursue a series of limited-scope goals given by the user. A "Genie" in Bostrom's terminology.
Boxed AI

Idea: what if we limit how the AI can interact with the world? That'll make it safe, right?
Theory of advanced agents

One of the research subproblems of building powerful, nice AIs is the theory of (sufficiently advanced) minds in general.
Cognitive uncontainability

'Cognitive uncontainability' is when we can't hold all of an agent's possible behaviors in our own minds.
Rich domain
Almost all real-world domains are rich

Anything you're trying to accomplish in the real world can be accomplished in a *lot* of different ways.
Logical game

The mathematical structure of a game in its purest form.
Standard agent properties

What is a "standard agent", and what can it do?
Bounded agent

An agent that operates in the real world, using realistic amounts of computing power, uncertain about its environment, and so on.
Vingean uncertainty

You can't predict the exact actions of agents smarter than you, but is there anything you _can_ say about them?
Deep Blue

The chess-playing program, built by IBM, that won its first game against world chess champion Garry Kasparov in 1996 and defeated him in a full match in 1997.
Vinge's Law

You can't predict exactly what someone smarter than you will do, because if you could, you'd be that smart yourself.
Artificial General Intelligence

An AI that has the same kind of "significantly more general" intelligence that humans have compared to chimpanzees; it can learn new domains, as we can.
Intelligence explosion

What happens if a self-improving AI reaches the point where each additional self-improvement of size x triggers a further self-improvement of size >x, and this continues for a while.
Real-world domain

Some AIs play chess, some play Go, and some drive cars. These different "domains" present different options. All of reality, with its messy entanglements, is the "real-world" domain.
Sufficiently advanced Artificial Intelligence
Convergent strategies of self-modification

The strategies we'd expect to be pursued by an AI that understands the relevance of its code and hardware to achieving its goals, and which therefore has goals about its code and hardware.
Consequentialist preferences are reflectively stable by default

Gandhi wouldn't take a pill that made him want to kill people, because he knows that more people would die in that case. A paperclip maximizer doesn't want to stop maximizing paperclips.
Instrumental

What does "instrumental" mean in the context of value alignment theory?
Paperclip maximizer

This agent will not stop until the entire universe is filled with paperclips.
Paperclip

A configuration of matter that we'd regard as worthless even from a very cosmopolitan perspective.
Random utility function

A "random" utility function is one chosen at random according to some simple probability measure (e.g., weighting by Kolmogorov complexity) on a logical space of formal utility functions.
Mind design space is wide

Imagine all human beings as one tiny dot inside a vast sphere of possibilities, the "space of minds in general". It is wiser to make claims about *some* minds than about *all* minds.
"Beneficial"

Really, actually good. A metasyntactic variable meaning "favoring whatever the speaker ideally wants to accomplish", although different speakers have different morals and metaethics.
"Detrimental"

The opposite of beneficial.
Coherent extrapolated volition (alignment target)

A proposed direction for an extremely well-aligned autonomous superintelligence: do what humans would want if we knew everything the AI knew, thought as fast as it does, and understood ourselves.
William Frankena's list of terminal values

Life, consciousness, and activity; health and strength; pleasures and satisfactions of all or certain kinds; happiness, beatitude, contentment, etc.; truth; knowledge and true opinions...
Value alignment problem

You want to build an advanced AI with the right values... but how?
Preference framework

What does an agent use to compare the things it prefers?
Value identification problem
Happiness maximizer
Vingean reflection

The problem of thinking about a future version of yourself when it is smarter than your current self.
Reflective stability

Wanting to go on thinking the way you currently think, and building other agents and self-modifications that think the same way you do now.
Other-izing (wanted: new optimization idiom)

Maximization isn't possible for bounded agents, and satisficing doesn't seem to be enough for them. What other kind of "izing" might be good for realistic, bounded agents?
Reflective consistency

A decision system is reflectively consistent if it can approve of itself, or approve the construction of similar decision systems (and perhaps other decision systems as well).
Tiling agents theory

The theory of self-modifying agents that build successors very similar to themselves, like repeating tiles in a tiling of the plane.
Vinge's Principle

An agent building another agent must usually approve its design without knowing that agent's exact future actions.
Intended goal
Programmer

Who is building these advanced agents?
Some computations are people

It's possible to have a conscious person being simulated inside a computer or other substrate.
Eliezer Yudkowsky

Arbital in Russian
