Training Data and Its Costs
Before Class
You should read all three articles before today's discussion:
- Inside the Dirty, Dystopian World of AI Data Centers (21 min)
- Opinion | Americans Are Trying to Stop Data Centers Because They Can't Stop A.I. (10 min)
- The Hypocrisy at the Heart of the AI Industry (8 min)
Please complete the preparation conversation below before class. It counts toward attendance for today's meeting.
Preparation Discussion
Today's Plan
On Monday you explored how AI represents meaning as geometry, with words becoming points in a space shaped entirely by training data. Today we ask a different question: what does it cost to produce that training data and the infrastructure to process it?
The costs are bigger and more varied than you might expect. We'll work through four rounds of paired discussion, each with a different partner and a different dimension of cost. After each round, we'll hear from a few pairs before moving on.
Round 1: Energy and Infrastructure
This activity involves working with a partner.
What Does AI Physically Cost?
Wong's Atlantic piece describes Colossus using as much electricity as 200,000 homes. OpenAI has announced plans for facilities requiring more than 30 gigawatts in total, exceeding the highest electricity demand ever recorded across all of New England. Since ChatGPT's launch, capital expenditures by Amazon, Microsoft, Meta, and Google have exceeded $600 billion, more than the entire interstate highway system cost, adjusted for inflation.
To get data centers running quickly, companies are turning to fossil fuels. Sam Altman says "short-term: natural gas." The IEA estimates data center emissions could more than double by 2030. But there's an optimist's case too: Microsoft is restarting Three Mile Island, Google is investing in renewables, and researchers at Duke have shown that if data centers throttled their usage during peak demand, the existing grid could absorb years' worth of new load without new power plants.
Discuss with your partner: Are these infrastructure costs a reasonable price for what AI delivers? Who should decide how much energy AI gets to consume? The interstate highway system was a public investment with public benefits. AI infrastructure is private investment with private returns. Does that distinction matter?
Round 1: Share Out
Geoff will ask a few pairs to share what they discussed. Listen for ideas that challenge or extend your own thinking.
Round 2: Who Pays?
Who Pays?
Colossus was built in Boxtown, a historically Black neighborhood in Memphis where life expectancy is already more than five years below the national average and cancer risk is four times higher. Residents found out about the data center after construction had already begun. Sarah Gladney's tomatoes wilted. Marilyn Gooch wonders whether her grandchildren should visit.
But the physical infrastructure is only one layer of hidden human cost. AI systems also depend on low-wage human labor that rarely makes headlines. Content moderators, many in Kenya and the Philippines, review traumatic material (violence, abuse, exploitation) for as little as $1 to $2 per hour so that chatbots can learn what not to say. Data labelers spend hours tagging images, correcting outputs, and rating responses. A 2023 TIME investigation found that OpenAI used Kenyan workers paid less than $2 per hour to label toxic content for ChatGPT's safety filters. These workers reported lasting psychological trauma.
Discuss with your partner: The readings focus on energy and land, but the human labor costs are just as real. Why do you think the labor side of AI's costs gets so much less attention? What connects Boxtown residents breathing polluted air and Kenyan workers reviewing violent content — is there a pattern in who bears the costs of technology that benefits others?
Round 2: Share Out
Geoff will ask a few pairs to share what they discussed. Listen for ideas that challenge or extend your own thinking.
Round 3: Whose Work?
Whose Work?
Eric Schmidt told Stanford students to download whatever they need to build AI and "hire lawyers" to clean up later. OpenAI, Anthropic, and Meta have trained their models on massive troves of copyrighted books, art, and other media. Their legal defense: this is "fair use" because AI models produce original output, not copies.
But as Reisner reports, these same companies explicitly forbid using their outputs to train competing models. OpenAI's terms of service ban it. Anthropic's ban it. Google's ban it. The message: we can train on your work, but you can't train on ours.
Meanwhile, Dario Amodei wrote an internal memo in 2021 acknowledging that AI could be "an increasingly extractive concentrator of wealth" and suggested compensating creators. Today, Anthropic argues in court that using copyrighted books is fair use and authors are entitled to nothing.
Discuss with your partner: Is AI training on copyrighted work fundamentally different from a human reading a book and being influenced by it? Does the double standard — we can use yours, you can't use ours — matter, or is it just normal competitive behavior? If you were a novelist or artist whose work was used to train a model without permission or payment, would "fair use" feel fair?
Round 3: Share Out
Geoff will ask a few pairs to share what they discussed. Listen for ideas that challenge or extend your own thinking.
Round 4: Who Decides?
Who Decides?
Wallace-Wells describes an American status quo where the country is "hugely anxious about what's to come while at the same time seeming to lack real faith in anyone, or in any institution, to actually manage it." Congress held AI hearings in 2023. Everyone agreed the government should play a role. Almost nothing has happened since.
Meanwhile, the Pentagon used Claude in military strikes hours after the President banned its use. AI companies invoke competition with China to argue against regulation. And the people most affected by AI's costs — the communities hosting data centers, the workers labeling data, the creators whose work is scraped — have had essentially no voice in these decisions.
Local communities are fighting data center construction because it's the one tangible thing they can push back on. But Wallace-Wells argues that blocking individual facilities won't address AI's broader impacts.
Discuss with your partner: Who should be making decisions about AI development? The companies building it? National governments? Local communities? International bodies? Wallace-Wells compares AI to nuclear weapons, noting that "we didn't just let Oppenheimer and Teller decide what to do with the bombs." Is that comparison apt? If stopping data centers won't stop AI, what would meaningful governance actually look like?
Round 4: Share Out
Geoff will ask a few pairs to share what they discussed. Listen for ideas that challenge or extend your own thinking.
Wrap-Up
Closing Reflection
Four rounds, four kinds of cost: energy and infrastructure, environmental justice and human labor, intellectual property, and governance. The through-line is not that AI is too expensive. It's that the costs are real, they're unequally distributed, and right now almost nobody is deciding how to manage them.
On Monday, you saw that AI represents meaning as geometry shaped by training data. Today you saw what it takes to produce that training data and the infrastructure to process it. The technical and the human sides of AI are inseparable. Pay attention to both as the course continues.