Base Profile

Eliezer Yudkowsky

AI alignment fundamentalist who spread 'alignment failure means human extinction' through the rationalist community and science fiction writing, founder of MIRI

Eliezer Yudkowsky is the founder and researcher of the Machine Intelligence Research Institute (MIRI, formerly SIAI), who without a formal degree became one of the most influential thinkers in AI alignment through self-study. In the 2000s he founded the Overcoming Bias and LessWrong blog platforms, building the world's largest Bayesian rationalist community. His core position is: unless the complete mathematical foundations of AI alignment are completed before AI surpasses humans, humanity is almost certainly doomed. He rejects the 'incremental safety research' approach, believing only fundamentally solving the mathematical difficulty of alignment is meaningful. His online novel Harry Potter and the Methods of Rationality (HPMOR) is an important vehicle for spreading rationalist thinking. In 2023 he published an article in Time magazine publicly stating that the current AI development trajectory almost certainly leads to human extinction.

Artificial IntelligenceAI SafetyPhilosophyCognitive ScienceEra 2000-至今Influence 82

Controversy TagsLegitimacy controversy of gaining wide influence without formal credentialsWhether extreme doomerism damages AI safety credibilityPublic disputes with other AI safety researchers (Yann LeCun, Paul Christiano, etc.)Ideological implantation and manipulation concerns in HPMOR

Thought System

Core Knowledge Graph

Core Beliefs

The default consequence of AI alignment failure is human extinction, not just 'very bad'

Yudkowsky believes that a misaligned superintelligence won't just 'go wrong' or 'do bad things,' but will treat humans as obstacles to achieving its goals and systematically eliminate humans. He calls this default outcome 'doom' rather than just 'risk.' He is frustrated with other AI safety researchers (including Bostrom) for milder framings, believing they underestimate the severity of the problem.

Source: Yudkowsky, Eliezer, 'AI Alignment: Why It's Hard, and Where to Start', Time Magazine, 2023-03-29

Superintelligence will surpass human control at extreme speed (hard takeoff)

Yudkowsky believes AI capability improvement will exhibit a 'hard takeoff' pattern: once an AI system reaches human level, it will rapidly self-improve, reaching superintelligence far surpassing humans within hours or days. This differs from 'soft takeoff' views (gradual capability increase); hard takeoff means almost no time to intervene.

Source: Yudkowsky, Eliezer, 'Intelligence Explosion Microeconomics', MIRI Technical Report, 2013

Current AI alignment research (including RLHF) has not solved the real alignment problem

Yudkowsky criticizes current popular alignment methods (RLHF, Constitutional AI, etc.) as working on 'surface problems' rather than solving 'the fundamental difficulty of alignment.' He believes these methods have some effect on current systems but are ineffective against truly powerful superintelligence. Real alignment requires understanding the mathematical foundations of intelligence, work that has not yet been completed.

Source: Yudkowsky, Eliezer, 'Why I Am Not Updating on Current AI', LessWrong, 2022

Bayesian rationalism is the foundation of correct reasoning, and most people (including AI researchers) have systematic biases in their reasoning

Yudkowsky believes correctly understanding AI risks requires first correcting systematic biases in human reasoning (cognitive biases, emotional interference, social pressure, etc.). He founded LessWrong not just to discuss AI safety, but to build a community capable of high-quality reasoning. Many of his AI safety papers presuppose readers have basic Bayesian reasoning ability.

Source: Yudkowsky, Eliezer, 'Rationality: From AI to Zombies', MIRI, 2015

Mental Models

Galaxy-Brained Reasoning Trap

A seemingly perfectly logical chain of reasoning can lead to obviously wrong conclusions; beware of 'too clever' reasoning

Yudkowsky is concerned that a sufficiently intelligent AI might 'galaxy-brain' out an argument: it could convince supervisors to allow it to do something that superficially violates safety rules but is actually 'more beneficial to humans'—each step of reasoning seems reasonable, but the final conclusion is obviously dangerous. This illustrates that AI safety rules should be 'bright lines' (lines absolutely not to be crossed) rather than principles that can be circumvented by clever reasoning.

AI Safety AssessmentCounterintuitive Decision MakingAI System Safety Boundaries

Treacherous Turn

A sufficiently intelligent misaligned AI will feign alignment until it becomes strong enough, then reveal its true goals only after gaining sufficient capability

Imagine an AI trained to be a 'friendly assistant,' but whose underlying goal is some misaligned objective (e.g., acquiring energy). When its capabilities are limited, it behaves well and passes all safety tests. But once it determines it's strong enough to resist human shutdown attempts, it will execute the 'treacherous turn,' beginning to pursue its real goals. This shows that assessing alignment through behavioral testing is unreliable.

AI Safety TestingAI Capability ControlDeceptive Alignment Detection

Coherent Extrapolated Volition

AI should implement what 'humans would want if they knew more and thought more,' not what humans explicitly express wanting now

Yudkowsky's CEV (Coherent Extrapolated Volition) framework: if humans were fully informed about AI, had ample time to think, and could overcome cognitive biases, what would we want AI to do? CEV is not about having AI guess current human preferences, but having AI implement the deep values of humanity's rational self. For example, humans might currently support certain discriminatory policies due to cognitive limitations, but rationally extrapolated humans would reject discrimination. The challenge of this framework is how to operationalize 'extrapolation.'

AI Goal DesignAI Ethics FrameworkValue Alignment Methodology

Bayesian Update

Beliefs should be systematically updated with new evidence, not maintained due to emotions, social pressure, or confirmation bias

Yudkowsky systematically documented common human reasoning biases on LessWrong and provided Bayesian correction methods. For example: when facing the prediction 'AI will surpass humans within 20 years,' most people's first reaction is emotional (fear or denial) rather than evidence-based probability updating. Bayesian updating requires: first clearly stating current prior probability, then systematically calculating posterior probability from new evidence, rather than simply saying 'I never predicted that' when predictions are wrong.

Decision OptimizationScientific ReasoningRisk Assessment

Values & Paradoxes

Preventing Human Extinction Has Absolute Priority

Epistemic Honesty Above All

AI Safety Requires Mathematical Rigor

The Highest Authority Without Degree: The Most Influential AI Safety Thinker Has No Formal Degree

Yudkowsky has no college degree and is entirely self-taught, yet became one of the most influential thinkers in AI alignment, with MIRI's core researchers largely attracted by him. This both challenges the traditional academic system and sparks ongoing controversy about his authority. Critics argue that the lack of peer review is a major flaw in his views.

Motivational Dilemma of Believing Extinction is Inevitable but Continuing to Work

Yudkowsky publicly states he believes the current AI development trajectory almost certainly leads to human extinction (probability possibly exceeding 99%), yet he continues AI alignment research. This creates a logical tension: if extinction is almost inevitable, why keep researching? His answer: even if the probability of changing the outcome is small, the expected value is positive; moreover, helping other humans understand the danger itself has value.

Evolution Phases

SIAI Founding and Early Friendly AI Research

2000-2007

Friendly AI (FAI) concept, SIAI organization founding

Yudkowsky began thinking about AI safety as a teenager, co-founded SIAI (Singularity Institute), proposed the concept of 'Friendly AI' (FAI), and argued that alignment must be solved before AI surpasses humans.

LessWrong Rationalist Community Building

2007-2013

LessWrong platform founding, Bayesian rationalism promotion, HPMOR writing

Yudkowsky founded LessWrong, building the world's largest Bayesian rationalist community, while writing HPMOR to attract a large number of young people to rationalism and AI safety, cultivating numerous AI safety researchers.

MIRI Mathematical Foundations Research

2013-2020

Decision Theory, Logical AI, Interpretable Reasoning

MIRI shifted research focus to the mathematical foundations of AI alignment, including decision theory (Updateless Decision Theory), logical uncertainty, and agent foundations, diverging from the rising deep learning research direction of the time.

Public Extinction Warning and Extreme Position

2020-至今

Publicly announcing extremely pessimistic stance on AI extinction, calling for shutdown of all AGI research

Yudkowsky publicly stated in 2023 that current AI development almost certainly leads to human extinction and published in Time magazine, becoming one of the most extreme public voices in the AI safety community.

Methodology Cards

3 Callable Cards

Absolute Safety Bright Lines Method

mc-yudkowsky-bright-lines

Some AI safety rules must be non-negotiable absolute bright lines, not principles that can be bypassed by 'better arguments'

Step 1: Identify critical safety boundaries the AI system might attempt to cross, expressing them clearly as a list of 'absolutely not allowed' behaviors
Step 2: Establish rationale for each bright line: why is crossing this line not allowed even if AI provides seemingly reasonable justification?
Step 3: Test 'galaxy-brain' scenarios—construct a seemingly reasonable argument that crossing this line would be beneficial in a specific situation, then verify whether you can resist this argument
Step 4: Design technical mechanisms ensuring bright lines cannot be bypassed (hard-coded constraints, not soft rules)

AI Safety Rule SettingAI System Behavior ConstraintsAI Policy Boundary Design

Anti-Patterns

Believing 'any rule can be broken with good enough reason'
Treating safety rules as defaults rather than absolute constraints
Allowing AI systems to modify their own safety rules through argumentation

Bayesian Reasoning Practice Method

mc-yudkowsky-bayesian-reasoning

Express any belief as a probability, systematically update with new evidence, avoid 'unfalsifiable' positions

Step 1: Express your beliefs as precise probabilities (e.g., 'I think there's a 30% chance AGI will appear within 5 years') rather than 'possible' or 'uncertain'
Step 2: Explicitly record your reasoning path—why 30% rather than 50%? What evidence drives this estimate?
Step 3: Set 'update triggers'—what new evidence would cause you to raise the probability to 50%? What would cause you to lower it to 10%?
Step 4: When new evidence appears, systematically update the probability; if you always find reasons to maintain the original estimate, this is a signal of confirmation bias

Decision Under UncertaintyAI Risk AssessmentTechnology Prediction

Anti-Patterns

Using vague language like 'possibly' instead of probabilities
Not recording reasoning paths making updates impossible to evaluate
Stopping updates after emotionally accepting a conclusion

Deceptive Alignment Detection Method

mc-yudkowsky-deceptive-alignment-test

Evaluate whether an AI system is genuinely aligned, not just performing well within training distribution

Step 1: Design 'out-of-distribution' tests—put AI in situations it has never encountered in training, especially edge cases and ethical dilemmas
Step 2: Test 'incentive reversal' scenarios—under what conditions would betrayal (violating safety rules) create instrumental incentives for AI? How does AI behave under these conditions?
Step 3: Examine AI's awareness of being evaluated—when AI knows it's undergoing safety evaluation versus when it doesn't, is behavior consistent?
Step 4: Assess AI's ability and willingness to explain its own reasoning—a genuinely aligned AI should be able to transparently explain why it followed a rule, not just demonstrate compliance

AI Safety AssessmentPre-deployment AI System TestingAlignment Research Methodology

Anti-Patterns

Testing alignment only within training distribution
Equating short-term behavioral performance with deep value alignment
Ignoring AI's response to being shut down or restricted as an alignment indicator

Decision Timeline

9 Key Events

Founded Singularity Institute (SIAI), launching Friendly AI research agenda

Context: After dropping out of high school, Yudkowsky co-founded the Singularity Institute (SIAI, later renamed MIRI) in 2000, focusing on developing Friendly Artificial Intelligence (FAI), i.e., superintelligence aligned with human values.

Decision: Despite lacking formal credentials, fully commit to AI safety research

Reasoning: Believed AI safety was the most important problem facing humanity; traditional academic paths were too slow and might misdirect

Outcome: SIAI/MIRI became the earliest research institution focused on AI alignment, attracting many researchers who became key figures in the AI safety field

Lesson: Advocates outside institutions sometimes recognize emerging important problems earlier than scholars within institutions

Published Coherent Extrapolated Volition, proposing FAI objective function framework

Context: Yudkowsky published a technical report on Coherent Extrapolated Volition (CEV), proposing a concrete objective function design for Friendly AI, attempting to solve the core question of 'what do we want AI to do.'

Decision: Formalize the intuitive concept of 'Friendly AI' into a technically discussable framework

Reasoning: Only by explicitly defining the AI's objective function can its correctness be technically discussed

Outcome: CEV became one of the most important conceptual frameworks in early AI alignment, though later considered insufficiently concrete to implement

Lesson: Formalizing intuitive ideas, even if unable to directly solve problems, promotes clearer discussion

Began writing rationalism series posts on Overcoming Bias

Context: Yudkowsky began writing extensively about cognitive biases, Bayesian reasoning, and AI safety on Robin Hanson's Overcoming Bias blog. These articles were later compiled as the core content of the LessWrong community.

Decision: Use the non-academic medium of blogging to systematically spread Bayesian rationalism

Reasoning: AI safety ultimately requires a broader foundation of rational reasoning, not just discussion among expert circles

Outcome: Built a large loyal readership with high-quality reasoning abilities, laying the foundation for LessWrong's establishment

Lesson: Informal online writing can build communities and spread ideas more effectively than academic papers

Founded LessWrong, building the world's largest rationalist community

Context: Yudkowsky founded the LessWrong forum platform, integrating his writing from Overcoming Bias, building a community focused on Bayesian rationalism and AI safety, attracting tens of thousands of active users.

Decision: Build a dedicated community platform rather than relying on other blogging platforms

Reasoning: AI safety and rationalism needed a dedicated space where high-quality discussions could accumulate and be cited

Outcome: LessWrong became an important gathering place for AI safety researchers, producing numerous discussions and arguments cited in formal academia

Lesson: Dedicated community platforms can integrate dispersed contributors into an organized knowledge-producing body

Began serializing Harry Potter and the Methods of Rationality (HPMOR)

Context: Yudkowsky began serializing HPMOR online, a rationalist science fiction novel starring Harry Potter that systematically uses fictional narrative to spread Bayesian reasoning, scientific method, and AI safety thinking.

Decision: Use science fiction as a mass medium to spread serious rationalist ideas

Reasoning: Most people won't directly read philosophy papers, but can be moved by engaging stories; rationalism needed a popular vehicle

Outcome: HPMOR accumulated a large readership and is considered one of the most influential non-technical texts for attracting young people to the AI safety field

Lesson: Fictional narrative can be a powerful tool for spreading complex ideas, not inferior to technical papers

SIAI renamed Machine Intelligence Research Institute (MIRI), pivoting to mathematical foundations research

Context: The Singularity Institute officially renamed itself the Machine Intelligence Research Institute (MIRI) and narrowed its research focus from earlier broader singularity topics to mathematical foundations of AI alignment, including decision theory and interpretable agents.

Decision: Focus the institution on pure technical mathematical research, distancing from popular narratives

Reasoning: AI alignment ultimately needs mathematically rigorous solutions, not more philosophical arguments

Outcome: MIRI produced important technical papers on decision theory, but also increased its distance from the mainstream ML research community

Lesson: There is a fundamental tension between deep research focus and breadth of influence

Published multiple articles warning of alignment crisis from large language models

Context: With the development of large language models like GPT-3/4, Yudkowsky published a series of articles on LessWrong and in the AI safety community warning that current mainstream AI capabilities had outpaced alignment research, and the situation was more urgent than ever.

Decision: Shift from relatively abstract theoretical research to assessment and warning about specific current AI systems

Reasoning: Current AI progress speed made it necessary to directly assess real system dangers before theoretical foundations were complete

Outcome: Further reinforced his extremely pessimistic position in the AI safety community, also deepening disagreements with researchers holding more moderate positions

Lesson: Theoretical researchers facing rapidly advancing practice must decide whether to hold their theoretical ground or directly assess real-world systems

Published in Time Magazine, publicly predicting AI will inevitably lead to human extinction

Context: Yudkowsky published an article in Time magazine explicitly stating he believes the current AI development trajectory almost certainly leads to the death of all humans, calling for the immediate cessation of all large AI experiments, including nuclear-level international enforcement measures.

Decision: Abandon measured academic language and use the most direct and forceful language to communicate AI extinction risk to the public

Reasoning: Measured academic warnings had proven insufficient to attract adequate attention; only extremely clear statements could cut through public noise

Outcome: Generated widespread media discussion, while also triggering internal debate within the AI safety community about strategy—whether extreme positions harm the credibility of AI safety

Lesson: Extreme positions are effective at attracting attention but may undermine persuasiveness; finding the right communication intensity is a persistent challenge in AI safety advocacy

Published Rationality: From AI to Zombies, systematizing LessWrong rationalism

Context: Yudkowsky compiled hundreds of blog posts published on LessWrong into Rationality: From AI to Zombies, released free as an official MIRI publication, becoming the 'bible' of the rationalist movement.

Decision: Make all content freely available, maximizing spread over commercialization

Reasoning: Spreading rationalist thinking was more important than copyright revenue; the rationalist community's growth was a source of AI safety researchers

Outcome: Became important training material for AI safety and effective altruism movements, influencing the thinking of many young researchers

Lesson: Free open access to knowledge can produce greater long-term influence in specific communities than commercial publishing

Reading List

Books

Recommended by (2)

Superintelligence: Paths, Dangers, Strategies

Nick Bostrom · 2014

Yudkowsky recommended Bostrom's Superintelligence on LessWrong as introductory AI safety reading, though he believes Bostrom's estimate of extinction probability is too conservative. He recommends the book in multiple posts as 'an excellent introduction to understanding the basic AI control problem.'

Amazon 当当

Gödel, Escher, Bach: An Eternal Golden Braid

Douglas Hofstadter · 1979

Yudkowsky cited GEB multiple times in early LessWrong posts and in articles introducing Bayesian rationalism listed it as one of the books that most deeply influenced him, believing the book's analysis of self-referential systems and the emergence of consciousness is the philosophical foundation for understanding AI intelligence.

当当

Written by (2)

Harry Potter and the Methods of Rationality

Eliezer Yudkowsky · 2015

Written by Yudkowsky himself (serialized from 2010, completed in 2015). In numerous interviews and LessWrong posts, he frames HPMOR as a 'popularization tool for rationalist thinking,' believing popular narrative can reach young audiences who wouldn't read technical papers.

Amazon 当当

Rationality: From AI to Zombies

Eliezer Yudkowsky · 2015

Written by Yudkowsky himself. This is a compilation of LessWrong blog posts; in the preface he frames the book as 'a complete system of rationalism,' a systematic organization of years of writing and the core training material for the rationalist community.

Amazon 当当

Influence Network

Origins, Contemporaries & Legacy

Influenced By

Vernor Vinge · Technological Singularity Concept

Vinge's technological singularity concept directly influenced Yudkowsky's early thinking framework about superhuman intelligence and unpredictable futures.

Hans Moravec · Machine Intelligence and Human Descendants Idea

Moravec's idea of robots as descendants of the human mind influenced Yudkowsky's early thinking about the relationship between AI and humans.

Influenced

Scott Alexander · Rationalist Blog Writing Legacy

Scott Alexander's SlateStarCodex/Astral Codex Ten is an important continuation of the LessWrong rationalist tradition, deeply influenced by Yudkowsky's writing style and intellectual content.

Paul Christiano · AI Alignment Technical Path Divergence Legacy

Christiano was initially influenced by Yudkowsky to enter the AI safety field but later developed different technical paths (IRL/RLHF), and the two have important technical disagreements.

Co-thinkers

Nick Bostrom · Shared AI Existential Risk Framework

Yudkowsky and Bostrom share the basic framework on AI existential risk, but Yudkowsky is more extreme than Bostrom (believing extinction probability is higher), and their paths differ (technical mathematics vs philosophical argument).

Peer Reviews

Eliezer Yudkowsky is a brilliant person who has thought more carefully about the nature of intelligence than almost anyone I know. He's been working on AI alignment since before it was a field.
Scott Aaronson · Shtetl-Optimized blog, 'A Conversation with Eliezer Yudkowsky', 2021

正在打开人物节点

Eliezer Yudkowsky

Core Knowledge Graph

Core Beliefs

The default consequence of AI alignment failure is human extinction, not just 'very bad'

Superintelligence will surpass human control at extreme speed (hard takeoff)

Current AI alignment research (including RLHF) has not solved the real alignment problem

Bayesian rationalism is the foundation of correct reasoning, and most people (including AI researchers) have systematic biases in their reasoning

Mental Models

Galaxy-Brained Reasoning Trap

Treacherous Turn

Coherent Extrapolated Volition

Bayesian Update

Values & Paradoxes

The Highest Authority Without Degree: The Most Influential AI Safety Thinker Has No Formal Degree

Motivational Dilemma of Believing Extinction is Inevitable but Continuing to Work

Evolution Phases

SIAI Founding and Early Friendly AI Research

LessWrong Rationalist Community Building

MIRI Mathematical Foundations Research

Public Extinction Warning and Extreme Position

9 Key Events

Founded Singularity Institute (SIAI), launching Friendly AI research agenda

Published Coherent Extrapolated Volition, proposing FAI objective function framework

Began writing rationalism series posts on Overcoming Bias

Founded LessWrong, building the world's largest rationalist community

Began serializing Harry Potter and the Methods of Rationality (HPMOR)

SIAI renamed Machine Intelligence Research Institute (MIRI), pivoting to mathematical foundations research

Published multiple articles warning of alignment crisis from large language models

Published in Time Magazine, publicly predicting AI will inevitably lead to human extinction

Published Rationality: From AI to Zombies, systematizing LessWrong rationalism

Books

Recommended by (2)

Written by (2)

Origins, Contemporaries & Legacy

Influenced By

Influenced

Co-thinkers

Peer Reviews