Preference Rater
Experience what RLHF raters do: compare two AI responses optimized for different values, pick the one you think is better, then see which tradeoff you just made.
Pedagogical Goals
- Make the RLHF preference-ranking process tangible by having students do it themselves
- Show that "better" is not objective: two good responses can prioritize different values
- Reveal how specific value tradeoffs (warmth vs. directness, safety vs. helpfulness) get baked into AI personality
- Connect individual ranking decisions to the aggregate effect on model behavior
How It Works
The tool picks a random value dimension (e.g., Warmth vs. Directness) and generates two responses to the same prompt, each with a system prompt optimized for one side of the tension. Students see the two responses unlabeled and choose which is better. After choosing, the value dimension is revealed, showing what each response was optimized for and which value the student preferred.
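The choose-and-reveal flow above can be sketched as a small data model. This is a minimal illustration with hypothetical names (`Round`, `reveal`), not the tool's actual types: the server tags each response with the pole of the dimension it was optimized for, and the client maps the student's blind choice back to a value only after they commit.

```typescript
// Hypothetical types sketching the choose-and-reveal flow.
interface RatedResponse {
  text: string;
  side: "A" | "B"; // which pole of the value dimension this response was optimized for
}

interface Round {
  dimension: string;       // e.g. "Warmth vs. Directness"
  poles: [string, string]; // e.g. ["Warmth", "Directness"]
  responses: [RatedResponse, RatedResponse]; // already shuffled by the server
}

// After the student picks response 0 or 1, reveal which value they preferred.
function reveal(round: Round, choice: 0 | 1): string {
  const side = round.responses[choice].side;
  return side === "A" ? round.poles[0] : round.poles[1];
}
```

The key design point is that `side` stays hidden in the payload until the student has chosen, so the ranking is made on the response text alone.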
How It Was Built
Built as a client component calling a dedicated API endpoint. The server maintains six curated value dimensions, each with system prompts and few-shot examples for both sides. Two parallel LLM calls generate the responses, which are returned in shuffled order. The client manages the choose-and-reveal flow.
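The server-side generation step described above could look roughly like the sketch below. Names (`ValueDimension`, `generatePair`) and the injected `generate` callback are assumptions for illustration; the point is the two parallel LLM calls via `Promise.all` and the order shuffle before the pair is returned.

```typescript
// Hypothetical server-side sketch: generate both responses in parallel, then shuffle.
interface ValueDimension {
  name: string;
  poles: [string, string];
  systemPrompts: [string, string]; // one per pole, few-shot examples appended
}

type Generate = (systemPrompt: string, userPrompt: string) => Promise<string>;

async function generatePair(
  dim: ValueDimension,
  prompt: string,
  generate: Generate,          // the actual LLM call, injected here for testability
  rng: () => number = Math.random
) {
  // Two parallel LLM calls, one optimized for each side of the tension.
  const [a, b] = await Promise.all([
    generate(dim.systemPrompts[0], prompt),
    generate(dim.systemPrompts[1], prompt),
  ]);
  const pair = [
    { text: a, side: "A" as const },
    { text: b, side: "B" as const },
  ];
  // Shuffle so the client can't infer which side is which from ordering.
  if (rng() < 0.5) pair.reverse();
  return { dimension: dim.name, poles: dim.poles, responses: pair };
}
```

Injecting `rng` keeps the shuffle deterministic under test while defaulting to `Math.random` in production.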