Whitepaper: AI safety & alignment

§1abstract

The four-axis ranking

We rank humanity’s most important problems on four quantifiable dimensions — quantity of humans affected, severity per capita, current solution quality, and addressable market size — and package each as a proposal in the spirit of Musk’s Hyperloop Alpha. This document is the proposal for ai safety & alignment. Every number below is sourced and tagged with confidence. Every ranking is a conjecture, open to refutation.

Quantity · humans affected

8.1B

high

Severity · WTP / wealth

100%

low

Current solutions

1.5 / 10

low

Market size · TAM

$20.0B

low

§2problem statement

What we are trying to solve

As AI systems approach and exceed human-level capability across domains, the open problem is whether their goals and behaviors remain under human correction. Unaligned AI is the one x-risk that is accelerating rather than slowing. A Deutschian framing: safety is not a brake on progress, it is an engineering achievement of progress. The work is technical (interpretability, evals, corrigibility) and institutional (governance, deployment protocols).

§3why it persists

The gap between the world and the world that is physically possible

Today: Frontier AI training proceeds with limited interpretability of model internals, no proven scalable alignment method, and minimal regulatory verification capacity.

Current solution quality is rated 1.5 / 10 (low confidence) — meaning there is substantial unclaimed ground between what exists and what is possible. estimated — early field; no proven scalable alignment method yet.

§4existing alternatives

Who is already working on this

6 entities are currently working on this problem across public markets, private companies, and research orgs. Each is evidence the market is real; none has obviously solved it.

OpenAI

private · USA

Originally nonprofit research lab, now capped-profit. Safety and superalignment teams alongside capabilities work.

$157.0B

Anthropic

private · USA

AI safety company building Claude. Constitutional AI, mechanistic interpretability, and frontier-scale alignment research.

$61.5B

Conjecture

private · UK

Alignment-focused AI lab. Runs Conjecture Institute for critical rationalist research on AI safety.

undisclosed

Goodfire

private · USA

Mechanistic interpretability as a product, tools for editing model internals rather than just observing.

undisclosed

§5proposed direction

If we solve this, here is the world we get

After · 15 years

Aligned, corrigible frontier AI is the default deployment pattern. Interpretability tools verify models share human-relevant values before deployment. Capability gains do not increase x-risk.

Requests for startups · 3 concrete companies to build

The MRI machine for neural networks

Every frontier lab ships models it cannot read. Deception and goal-misgeneralization are invisible until they are catastrophic. Build the hosted interpretability layer that flags dangerous circuits before deployment.

why now: Mechanistic interpretability went from toy circuits to production-scale feature extraction in three years. The labs now want this and cannot all build it in-house.
shape: An API + dashboard that ingests model weights or activations and returns a risk report: deceptive features, situational awareness, sandbagging, capability spikes. Sells to labs, evaluators, and eventually regulators.
success: No frontier model is deployed without an interpretability sign-off, the way no bridge opens without an inspection.

The adversarial eval grid for agents

Autonomous agents are shipping with benchmark suites built for chatbots. We are grading self-driving cars with a written test. Build the continuously-updated red-team grid that stress-tests agents at the capability frontier.

why now: Agentic deployment went mainstream in 2025–26; incident rate is climbing and no standard adversarial harness exists.
shape: A hosted eval platform that runs agents through escalating adversarial scenarios — tool misuse, prompt injection, multi-step deception — and issues a capability + safety profile that updates as new attacks are discovered.
success: Every deployed agent carries a current, adversarial safety rating, and the rating actually predicts field failures.

Hardware-enforced AI containment

Alignment that lives only in software can be jailbroken or fine-tuned away. Build the trusted-execution + tamper-evident compute layer that makes a model’s deployment envelope physically enforceable.

why now: Confidential-computing silicon (TEEs, secure enclaves at GPU scale) finally exists at the performance tier frontier models need.
shape: A compute substrate + attestation protocol where a model can only run inside a verified policy envelope; weight exfiltration and unsanctioned fine-tuning are cryptographically detectable.
success: Frontier weights cannot be silently stolen or repurposed, and deployment limits are enforced by physics, not promises.

full rubric + framing on the Requests for Startups page.

§6cost & scale

What the market can pay

The world is already paying $20.0B per year against this problem (projected annual market for alignment R&D, interpretability tooling, evals, and AI governance services by 2030; low confidence).

A successful solution does not need to capture more — it needs to redirect a meaningful slice of existing spend, plus the latent willingness-to-pay implied by the severity score above. The cost ceiling for a real solution is bounded by this number; everything cheaper is dominated, everything more expensive is a non-starter.

§7safety & considerations

What could go wrong, and how we know we are not wrong

Section in progress

Failure modes, ethical considerations, and the conditions under which this whitepaper would be falsified are being authored as the weekly cadence ships. The Deutschian commitment: every claim above is a conjecture; we publish the conditions under which we would update. New whitepaper sections ship with each Monday newsletter drop. Subscribe to get the upgrade, or contribute on GitHub.

§8suggested investors

Who would back this

Capital allocators with a stated thesis or deployed portfolio in this domain. This is a starting list — Exa Websets enrichment will expand it to direct check-writers per company.

Grant

What the thinkers say

“AI safety is a knowledge problem, not a limit problem. Aligned AGI is achievable through better explanations, not through halting development.”
— David Deutsch · Physicist & Philosopher

“Co-founded OpenAI originally because of concerns about unaligned AI. Has continued to treat AI alignment as an existential priority.”
— Elon Musk · Engineer & Founder

“Emergent Ventures has funded AI-safety projects and unconventional alignment researchers under the "fast grants" model.”
— Tyler Cowen · Economist & Writer

“AGI is named on the good-quest list. Hard-tech builders, not only researchers, need to be at the center of the safety conversation.”
— Trae Stephens · Partner & Co-founder

§10sources & criticism invite

Where this is wrong, tell us

Every number on this page carries a source and a confidence tag. Every section open to refutation. If a citation is wrong, a number is stale, or a conjecture is unfounded — file a correction.

[1] 80,000 Hours AI problem profile
[2] Conjecture Institute

corrections → use the feedback widget in the nav · open issue at github.com/adamtpang/optimism.fun

AI safety & alignment

The four-axis ranking

What we are trying to solve

The gap between the world and the world that is physically possible

Who is already working on this

OpenAI

Anthropic

Conjecture

Goodfire

If we solve this, here is the world we get

The MRI machine for neural networks

The adversarial eval grid for agents

Hardware-enforced AI containment

What the market can pay

What could go wrong, and how we know we are not wrong

Who would back this

Emergent Ventures

Thiel Fellowship

Founders Fund

Lux Capital

What the thinkers say

Where this is wrong, tell us

One problem.One whitepaper. Every week.