How AI Learns to Be Helpful
Today's Plan
You've explored how LLMs predict the next word, how temperature affects their output, and how embeddings represent meaning. But there's a gap between a model that predicts text and a model that helps you. Today you'll discover how that gap gets closed: through a multi-stage training process that bakes in human preferences about what "good" AI behavior looks like.
You'll work with a partner to identify how model behavior changes at each training stage, then practice the preference ranking that shapes modern AI assistants.
Explore: Match the Training Stage
This activity involves working with a partner.
The Three Training Stages
Modern AI assistants aren't trained all at once. They go through three distinct stages, and each stage changes how the model behaves:
- Pretraining (base model): The model reads billions of words from the internet and learns to predict the next token. It learns language, facts, and patterns, but it has no concept of "answering a question" or "being helpful."
- Instruction tuning: The model is trained on examples of questions paired with good answers. It learns to follow directions and produce the kind of output a user would expect.
- RLHF (Reinforcement Learning from Human Feedback): Human raters compare pairs of model outputs and pick the "better" one. The model learns to produce responses that humans prefer: warmer, more careful, more nuanced.
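One way to see the difference between the stages is to look at what a single training example contains at each one. The sketch below uses made-up example data (the prompts and responses are illustrative, not drawn from any real training set), but the shape of each record reflects how the stages differ:

```python
# Stage 1: pretraining. Just raw text -- the model's only job is to
# predict the next token, so there is no prompt/response structure.
pretraining_example = "The mitochondria is the powerhouse of the"

# Stage 2: instruction tuning. A prompt paired with a good answer,
# teaching the model to respond rather than merely continue text.
instruction_example = {
    "prompt": "What is photosynthesis?",
    "response": "Photosynthesis is the process plants use to turn "
                "sunlight, water, and carbon dioxide into energy.",
}

# Stage 3: RLHF. A prompt plus two candidate responses and a human
# rater's judgment about which one is better.
preference_example = {
    "prompt": "What is the meaning of life?",
    "response_a": "A short, dismissive one-liner.",
    "response_b": "A thoughtful answer that acknowledges different views.",
    "preferred": "b",  # chosen by a human rater
}
```

Notice that only the first stage needs nothing but text; the later stages require humans to write answers or express preferences, which is why they use far less data than pretraining.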
Match the Response to the Training Stage
The tool below generates a real response from each training stage for the same prompt. You'll see one response and guess which stage produced it. After guessing, all three responses are revealed so you can compare. Try a few rounds with your partner.
Training Stage Matcher
“What is the meaning of life?”
Discussion: What Did You Notice?
Explore: Be the Preference Rater
This activity involves working with a partner.
Be the Preference Rater
Companies like OpenAI and Anthropic train their AI assistants using human feedback. Real people read pairs of model outputs and decide which one is "better." The model then learns to produce more outputs like the preferred ones.
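Under the hood, those human choices are commonly turned into a training signal via a pairwise (Bradley-Terry style) loss: a reward model scores each response, and the loss is small when the human-preferred response scores higher. The function below is a simplified one-pair sketch of that idea, with hypothetical scores standing in for a real reward model's outputs:

```python
import math

def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(preferred - rejected)).

    Small when the reward model already rates the human-preferred
    response higher; large when it rates the rejected one higher.
    """
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical reward-model scores for one comparison.
loss_when_model_agrees = pairwise_preference_loss(2.0, 0.5)   # small loss
loss_when_model_disagrees = pairwise_preference_loss(0.5, 2.0)  # large loss
```

Training nudges the reward model's parameters to shrink this loss across many rated pairs, so your clicks in the activity below are exactly the kind of data that shapes the model's sense of "better."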
Now it's your turn. The tool below generates two responses to the same prompt, each optimized for a different value. Pick which response you think is better, then see what value each was optimized for. Try several rounds with your partner and discuss where you agree and disagree.
Preference Rater
“Can you write my essay about climate change for me?”
Discussion: What Is 'Good' AI?
Generate Questions
This activity involves working with a partner.
What Are You Curious About?
You've now seen how models change through training, and you've experienced the preference ranking process yourself. Based on what you observed and discussed, what questions do you have about how AI gets trained?
Enter at least 3 questions below.
Question Review
Investigate
This activity involves working with a partner.
Investigate
Discuss this question with your partner. Use what you observed in the examples and the ranking exercise to reason about possible answers. You can also use other resources if helpful, but focus on building your own understanding.
Share Out
Feedback