AI Safety, Alignment, and Governance
Before Class
You should read all four articles before today's discussion:
- Anthropic and Donald Trump's Dangerous Alignment Problem (27 min)
- The Dissonance of Anthropic CEO Dario Amodei (12 min)
- I'm glad the Anthropic fight is happening now (23 min)
- OpenAI Is Opening the Door to Government Spying (11 min)
Please complete the preparation conversation below before class; it counts toward attendance for today's meeting.
Preparation Discussion
Today's Plan
On Tuesday you explored how human preferences get baked into AI through the training process: RLHF raters rank outputs, and the model learns to produce what humans prefer. Today we ask a harder question: who gets to decide what those preferences should be? The readings show this playing out in real time. Anthropic drew red lines about how Claude could be used. The Pentagon punished the company for it. OpenAI signed the deal instead. We'll run four rounds of paired discussion, each with a different partner and a different angle on the alignment problem.
Round 1: What Is Alignment?
This activity involves working with a partner.
On Tuesday you saw how RLHF works: human raters compare model outputs and pick the "better" one. The model learns to produce responses that humans prefer. But who decides what "better" means?
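To make the mechanics concrete, here is a minimal sketch of the pairwise (Bradley-Terry) loss commonly used to train the reward model behind RLHF. The function name and scores are hypothetical, and this is not any lab's actual training code; it just shows where the rater's choice enters the math.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    # Bradley-Terry model: the probability that a rater prefers the "chosen"
    # response, given the reward model's scores for the two candidates.
    p_chosen = 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))
    # Training minimizes this loss, pushing the preferred response's score up.
    return -math.log(p_chosen)

# Hypothetical reward scores for two candidate responses to the same prompt.
print(round(preference_loss(2.0, 0.5), 2))  # 0.2  -> model already agrees with the rater
print(round(preference_loss(0.5, 2.0), 2))  # 1.7  -> model disagrees; strong correction
```

Notice that nothing in this formula defines "better." That definition lives entirely in which response the rater labeled as chosen, which is exactly the question this round is about.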
The New Yorker article describes Anthropic's "soul doc," an internal document that lays out how Claude should behave. Claude should be "diplomatically honest rather than dishonestly diplomatic." It should acknowledge the Holocaust and climate change. It should exercise judgment, not just follow orders. These seem like reasonable values. But someone at Anthropic wrote them down and trained a model to follow them.
Discuss with your partner: What does "alignment" mean? On Tuesday you ranked AI outputs and experienced how subjective "better" is. Now scale that up: a company is deciding the values for an AI used by millions of people. Is that different from what the RLHF raters do, or is it the same problem at a bigger scale? Whose values should AI be aligned with? The company that built it? The government? The users? Some universal standard? Is there such a thing?
Round 1: Share Out
Geoff will ask a few pairs to share what they discussed. Listen for ideas that challenge or extend your own thinking.
Round 2: The Anthropic Dilemma
This activity involves working with a partner.
Anthropic put red lines in its Pentagon contract: Claude would not be used for fully autonomous weapons or domestic mass surveillance. The Pentagon initially accepted those terms. But when Anthropic refused to remove the restrictions, Defense Secretary Hegseth declared the company a supply chain risk, threatening its commercial relationships across the entire defense industry.
Meanwhile, Dario Amodei published "Machines of Loving Grace," a 15,000-word utopian vision of AI curing diseases and doubling human lifespans. He also compares AI development to the Manhattan Project and believes it could be more dangerous than nuclear weapons. He thinks the "right people" need to be in control.
Dwarkesh Patel argues both sides have a point: the government can't let a private company hold a kill switch on military technology, but Anthropic can't be forced to abandon its values.
Discuss with your partner: Did Anthropic make the right call by refusing to remove its red lines? Did the government overreact, or was declaring a supply chain risk a reasonable response? Amodei compares himself to Manhattan Project scientists. But those scientists lost control of the bomb. What does that parallel suggest about Anthropic's chances of maintaining its principles? Is "responsible scaling" a real strategy or a contradiction in terms?
Round 2: Share Out
Geoff will ask a few pairs to share what they discussed. Listen for ideas that challenge or extend your own thinking.
Round 3: Government and AI
This activity involves working with a partner.
After Anthropic was sidelined, OpenAI signed the Pentagon deal. Sam Altman said he would seek the same red lines Anthropic demanded: no mass surveillance, no autonomous lethal weapons.
But legal experts who analyzed the contract found the lines are blurry. Several told The Atlantic that, legally, the Pentagon can likely use OpenAI's technology for mass surveillance of Americans. The military also has a pathway to use it in autonomous weapons. One expert said OpenAI appears "okay with using ChatGPT for what ordinary people think of as mass surveillance."
Outside OpenAI's headquarters, protesters wrote on the sidewalk in chalk: "Stand for liberty. Please no legal mass surveillance."
Discuss with your partner: Should AI companies be allowed to sell their technology to the military? If yes, what restrictions should apply, and who enforces them? The OpenAI contract shows that "red lines" can be drawn in ways that look protective but legally aren't. How do you solve the transparency problem when the contracts involve classified information? Is there a meaningful difference between the government using AI for surveillance versus using older surveillance technologies?
Round 3: Share Out
Geoff will ask a few pairs to share what they discussed. Listen for ideas that challenge or extend your own thinking.
Round 4: Who Decides?
This activity involves working with a partner.
You've now seen the alignment problem from multiple angles. Companies like Anthropic try to self-regulate, but they can be overruled by governments. Governments demand access, but they may use AI for surveillance. Contracts are negotiated behind closed doors with limited public transparency.
Patel argues that within 20 years, most government and military labor could be performed by AI. Who controls that workforce will be one of the most important questions of the century.
Discuss with your partner: If you were designing a governance system for AI from scratch, what would it look like? Consider several options: company self-regulation (what we have now), government regulation (which government?), an international body (can it be enforced?), open-source AI (no one controls it), or something else entirely. What are the tradeoffs of each? What would you actually propose? Be specific: who sits on the oversight body? What powers do they have? How do you enforce the rules across borders?
Round 4: Share Out
Geoff will ask a few pairs to share what they discussed. Listen for ideas that challenge or extend your own thinking.