Plan: Interactive LLM Teaching Components
Implement the following plan:
Context
Building two interactive components for a course meeting on "How LLMs Work." Students will explore these tools to build intuition about next-word prediction, probability distributions, and how training data shapes generation — before being told the concepts explicitly (inductive learning).
Both components should be MDX-compatible (registered in mdx-components.tsx) so they can be dropped into any meeting page, the resources page, or other MDX content.
Component 1: Markov Babbler
What it does: Students paste or select text, a bigram Markov chain "learns" from it, then generates text one word at a time — showing the probability distribution at each step.
Pedagogical goal: Demonstrate that "next-word prediction from training data" is the core idea behind LLMs. The babbler is the simplest version of this — LLMs are just a vastly more sophisticated babbler.
Implementation
File: apps/web/components/mdx/markov-babbler.tsx ("use client")
Features:
- Pre-loaded text samples (3-4 options: nursery rhymes, news-style text, Shakespeare, course syllabus excerpt)
- Text area for pasting custom text
- "Train" button builds a bigram model: Map<string, Map<string, number>>
- "Generate" mode: produces text one word at a time
- Step-through mode: click to generate each next word
- At each step, show horizontal bar chart of top 5 candidate next words with probabilities
- Highlight the selected word
- Animate the word appearing in the generated text
- Temperature slider (0.1 to 2.0) — affects sampling randomness
- "Start Over" to regenerate from scratch
- Display of the full generated sentence so far
Visualization (bar chart):
- Simple horizontal bars using divs + Tailwind (no charting library)
- Each bar: word label | colored bar proportional to probability | percentage
- Use project design tokens: bg-accent, text-foreground, border-border, etc.
Markov chain logic (~50 lines):
```text
train(text):
  words = text.split(/\s+/)
  for each consecutive pair (word, nextWord):
    increment counts[word][nextWord]
  normalize to probabilities

generate(startWord, temperature):
  apply temperature to probability distribution
  sample from distribution
  return { word, probabilities[] }
```
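The pseudocode above could look roughly like this in TypeScript. This is a hedged sketch, not the shipped component logic: the injectable `rng` parameter is an addition for testability, and the temperature rule (weight each count by `count^(1/T)`) is one standard choice.

```typescript
type Counts = Map<string, Map<string, number>>;

// Build bigram counts from whitespace-split text.
function train(text: string): Counts {
  const words = text.trim().split(/\s+/);
  const counts: Counts = new Map();
  for (let i = 0; i < words.length - 1; i++) {
    const word = words[i];
    const next = words[i + 1];
    if (!counts.has(word)) counts.set(word, new Map());
    const followers = counts.get(word)!;
    followers.set(next, (followers.get(next) ?? 0) + 1);
  }
  return counts;
}

// Sample the next word after startWord; T < 1 sharpens, T > 1 flattens.
function generate(
  counts: Counts,
  startWord: string,
  temperature: number,
  rng: () => number = Math.random,
): { word: string; probabilities: { word: string; p: number }[] } | null {
  const followers = counts.get(startWord);
  if (!followers || followers.size === 0) return null;
  // Apply temperature: weight_i = count_i^(1/T), then normalize.
  const entries = [...followers.entries()].map(([word, count]) => ({
    word,
    weight: Math.pow(count, 1 / temperature),
  }));
  const total = entries.reduce((sum, e) => sum + e.weight, 0);
  const probabilities = entries
    .map((e) => ({ word: e.word, p: e.weight / total }))
    .sort((a, b) => b.p - a.p);
  // Weighted draw from the tempered distribution.
  let r = rng() * total;
  for (const e of entries) {
    r -= e.weight;
    if (r <= 0) return { word: e.word, probabilities };
  }
  return { word: entries[entries.length - 1].word, probabilities };
}
```

At temperature 1 this reduces to plain count-proportional sampling, which is the classic Markov babbler behavior.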
MDX usage:
<MarkovBabbler />
No props required (self-contained with pre-loaded samples). Optional texts prop to override defaults.
Key files to create/modify
- Create: apps/web/components/mdx/markov-babbler.tsx
- Modify: apps/web/mdx-components.tsx — add import and registration
Component 2: LLM Probability Explorer
What it does: Students build a sentence one word at a time using an actual LLM. At each step, the component shows the real probability distribution over the top next tokens. Students pick which word to continue with (or let the model sample).
Pedagogical goal: Show that real LLMs work the same way as the babbler — next-word prediction with probabilities — but with far more sophistication. Temperature has the same effect. The direct comparison between babbler and LLM is the key insight.
Implementation
API Endpoint
File: apps/api/src/routes/llm-explore.ts
Route: POST /api/llm-explore
Request:
```json
{
  "prompt": "The cat sat on the",
  "temperature": 0.7
}
```
Response:
```json
{
  "tokens": [
    { "token": "mat", "probability": 0.35 },
    { "token": "roof", "probability": 0.18 },
    { "token": "floor", "probability": 0.12 },
    { "token": "table", "probability": 0.09 },
    { "token": "bed", "probability": 0.07 }
  ]
}
```
Implementation details:
- Auth: authMiddleware (must be logged in)
- Rate limit: rateLimit({ windowMs: 60_000, maxRequests: 30, maxDaily: 200 })
- Call Azure OpenAI (GPT-4o-mini for cost) with: max_tokens: 1, logprobs: true, top_logprobs: 5, temperature from request
- Extract top logprobs from the response, convert from log probabilities to regular probabilities (Math.exp(logprob))
- Constrain prompt length: reject if > 200 tokens (prevent abuse)
- Register route in apps/api/src/index.ts
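The logprob-to-probability step can be sketched as below. The response shape follows the Chat Completions `logprobs` field; treat the exact types and the helper name as illustrative assumptions, not the deployed route code.

```typescript
interface TopLogprob {
  token: string;
  logprob: number; // natural log of the token probability
}

interface ChatLogprobsResponse {
  choices: {
    logprobs: { content: { top_logprobs: TopLogprob[] }[] } | null;
  }[];
}

// Pull the top_logprobs for the single generated token and convert
// each natural-log probability back to a plain probability.
function extractTokenProbabilities(
  response: ChatLogprobsResponse,
): { token: string; probability: number }[] {
  const first = response.choices[0]?.logprobs?.content[0];
  if (!first) return [];
  return first.top_logprobs.map(({ token, logprob }) => ({
    token,
    probability: Math.exp(logprob),
  }));
}
```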
Model selection: Use GPT-4o (existing AZURE_OPENAI_CHAT_* config). The max_tokens: 1 constraint makes even expensive models very cheap per call. Check if the existing Azure OpenAI client supports logprobs parameter — if not, we may need to configure the client call directly.
Client Component
File: apps/web/components/mdx/llm-explorer.tsx ("use client")
Features:
- Pre-authored sentence starters (5-6 good examples that show interesting phenomena):
- "The capital of France is" (high confidence)
- "I feel really" (many plausible continuations)
- "To be or not to" (near-certain completion)
- "The best programming language is" (contentious/spread distribution)
- "Once upon a" (formulaic but interesting)
- Custom text input for students to type their own prompts
- "Get predictions" button → calls API → shows probability bar chart
- Student picks a word OR clicks "Sample" (weighted random based on actual probabilities)
- Selected word appends to sentence, ready for next step
- Temperature slider (0.0 to 2.0) — re-fetches predictions at new temperature
- Full sentence display with each word color-coded or annotated
- Loading state during API call
- Error handling for rate limits (show friendly message)
- "Start Over" button
Visualization: Same horizontal bar chart style as the babbler for visual consistency.
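The "Sample" button's weighted random draw could be implemented along these lines. A sketch only: the injectable `rng` is an assumption added so tests can be deterministic; the component would pass `Math.random`.

```typescript
// Draw one token, weighted by its probability. Works even when the
// probabilities don't sum to exactly 1 (e.g., only top-5 are shown),
// because the draw is scaled by the visible total.
function sampleToken(
  tokens: { token: string; probability: number }[],
  rng: () => number = Math.random,
): string {
  const total = tokens.reduce((sum, t) => sum + t.probability, 0);
  let r = rng() * total;
  for (const t of tokens) {
    r -= t.probability;
    if (r <= 0) return t.token;
  }
  return tokens[tokens.length - 1].token; // guard against float rounding
}
```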
MDX usage:
<LlmExplorer />
Key files to create/modify
- Create: apps/api/src/routes/llm-explore.ts
- Create: apps/web/components/mdx/llm-explorer.tsx
- Modify: apps/api/src/index.ts — register new route
- Modify: apps/web/mdx-components.tsx — add import and registration
E2E Tests
Babbler Tests
File: testing/e2e/tests/markov-babbler.spec.ts
Approach: Pure client-side component, no API mocking needed. Create a simple test page or use the resources page.
Test cases:
- Component renders with pre-loaded text options
- Selecting a pre-loaded text and training produces a model
- Generating text produces output words
- Step-through mode shows probability bars at each step
- Temperature slider is interactive and affects generation
- Custom text input works
LLM Explorer Tests
File: testing/e2e/tests/llm-explorer.spec.ts
Approach: Need to handle the API dependency. Options:
- Add an E2E seed endpoint that returns canned logprobs data (follows the existing pattern from the status page E2E tests)
- Or mock at the network level with Playwright's page.route() to intercept API calls and return fixed responses
Using Playwright route mocking (simpler, no server changes):
```ts
await page.route('**/api/llm-explore', (route) => {
  route.fulfill({
    status: 200,
    contentType: 'application/json',
    body: JSON.stringify({
      tokens: [
        { token: "mat", probability: 0.35 },
        { token: "roof", probability: 0.18 },
        // ...
      ]
    })
  });
});
```
Test cases:
- Component renders with pre-authored sentence starters
- Selecting a starter and getting predictions shows probability bars
- Clicking a word appends it to the sentence
- "Sample" button selects a word
- Temperature slider is interactive
- Custom prompt input works
- Rate limit error displays friendly message
Test Infrastructure
- Add test projects to testing/e2e/playwright.config.ts: markov-babbler, llm-explorer
- Need a page that hosts these components — either the resources page or a dedicated test page
Where to Host Components
The components need to be accessible on a page for both E2E testing and student use. Options:
- Resources page (apps/web/app/[semester]/resources/page.tsx) — add a "Tools" section
- Dedicated pages (e.g., /[semester]/tools/babbler and /[semester]/tools/explorer)
- Both — standalone pages linked from resources
Recommendation: Create a tools section on the resources page with both components. This keeps them discoverable and doesn't require new routes. They can also be embedded directly in meeting MDX files via the component tags.
Implementation Order
- Markov Babbler component — pure client-side, no dependencies
- LLM Explorer API endpoint — new Hono route with Azure OpenAI logprobs
- LLM Explorer component — client component calling the new API
- Register both in mdx-components.tsx
- Add to resources page (or create tools page)
- E2E tests for both components
- Verify — run dev server, test both components manually, run E2E tests
Verification
- bun run dev — start dev server
- Navigate to the page hosting the components
- Babbler: select text, train, generate, verify bar chart and step-through
- LLM Explorer: select a starter, get predictions, verify bar chart shows real probabilities, pick words, verify temperature slider
- CI=true bun run test:e2e — run E2E tests
- bun run tsc — type check
- bun run lint — lint check
If you need specific details from before exiting plan mode (like exact code snippets, error messages, or content you generated), read the full transcript at: ~/.claude/projects/-Users-challen-www-usingandunderstanding-ai/a8dce8ad-114d-4708-98e4-36fcbf881bcb.jsonl
I'll start by exploring the key files I need to understand before implementing. Let me read several files in parallel.
Now let me look at a few more files to understand existing patterns.
Good, I have a solid understanding of the codebase patterns. Let me create tasks and start implementing.
Let me read the resources page and the LLM chat module for exact patterns before I start coding.
Now let me check the LLM env config and see how existing routes use the middleware.
Now I have everything I need. Let me start implementing. I'll work on the Markov Babbler and the API endpoint in parallel since they're independent.
Now let me create the API endpoint for LLM Explorer.
Now register the new route in the API index.
Now let me check how the web app makes API calls (to use the right fetch pattern for the LLM Explorer client component).
Now I have the pattern. Client components use apiFetch from @/lib/api. Let me create the LLM Explorer component.
Now register both components in mdx-components.tsx.
Now let me create the E2E tests. First, I need to know where these components will be accessible for testing. Let me check if there's a test meeting page or if I should use the resources page.
The components aren't used in any content yet. For E2E testing, I'll need a page that hosts them. Let me add them to the resources page since the plan suggests that, and also create the E2E tests that navigate to the resources page.
Let me first add the components to the resources page.
Now let me create the E2E tests.
Now add the test projects to the playwright config.
Now let me verify with TypeScript and lint checks.
Two TypeScript errors in the API route. Let me fix them.
OpenAI is only in the @repo/llm package. I should either add it to the API package or avoid importing it directly. Better approach: I'll use the @repo/llm package's client, or just use fetch directly since I need logprobs which is a specific OpenAI feature. Let me use raw fetch to the Azure OpenAI endpoint instead — that avoids the dependency issue entirely.
I'll rewrite this to use raw fetch to the Azure OpenAI API, avoiding the OpenAI SDK dependency.
Now let me run tsc and lint in parallel.
TypeScript passes. Now lint.
Both TypeScript and lint pass. Now let me check if the dev server is running and run the E2E tests.
Dev server is running. Let me run the E2E tests for both new components.
All 7 Markov Babbler tests pass. Now the LLM Explorer tests.
All 14 E2E tests pass (7 for each component).
All done. Here's a summary of what was implemented:
Files Created
- apps/web/components/mdx/markov-babbler.tsx — Client component with:
  - 4 pre-loaded text samples (nursery rhymes, Shakespeare, news, children's story)
  - Custom text input option
  - Bigram Markov chain training
  - Step-through generation showing top-5 probability bars at each step
  - Temperature slider (0.1–2.0)
  - Start over button
- apps/api/src/routes/llm-explore.ts — API endpoint:
  - POST /api/llm-explore with auth + rate limiting (30/min, 200/day)
  - Calls Azure OpenAI with max_tokens: 1, logprobs: true, top_logprobs: 5
  - Returns top tokens with real probabilities
  - Prompt length validation (~200 token limit)
- apps/web/components/mdx/llm-explorer.tsx — Client component with:
  - 6 sentence starters showing different probability phenomena
- Custom prompt input
- Clickable probability bars (click any token to continue building)
- "Sample" button for weighted random selection
- Temperature slider (0.0–2.0) that re-fetches on change
- Friendly error messages for rate limits and auth
- testing/e2e/tests/markov-babbler.spec.ts — 7 E2E tests (pure client-side, no mocking needed)
- testing/e2e/tests/llm-explorer.spec.ts — 7 E2E tests (using Playwright route mocking for API calls)
Files Modified
- apps/api/src/index.ts — Registered the new /api/llm-explore route
- apps/web/mdx-components.tsx — Registered MarkovBabbler and LlmExplorer
- apps/web/app/[semester]/resources/page.tsx — Added "Interactive Tools" section with both components (LLM Explorer requires login)
- testing/e2e/playwright.config.ts — Added markov-babbler and llm-explorer test projects
MDX Usage
Both components can be dropped into any MDX file:
<MarkovBabbler />
<LlmExplorer />
Cool! For the Markov babbler, let's move the next word / start over buttons above the probabilities to avoid the layout shift. Or have the probabilities be a fixed height and scrollable. And I don't think that the LLM probability explorer is working. It keeps saying I'm or Its and it's also concatenating them without spaces.
Let me look at the issues. First, let me fix the Markov Babbler layout, then investigate the LLM Explorer problems.
Let me read the full Markov Babbler to plan the layout fix.
Now I have the full picture. Two fixes needed:
Markov Babbler: Move buttons above probability bars (or give bars fixed height). I'll move the buttons above the bars.
LLM Explorer: The issue is that OpenAI BPE tokens carry their own whitespace (e.g., " Paris" has a leading space, but "'m" doesn't). The raw concatenation prompt + token is technically correct but looks wrong to users. I need to:
- Track each selected token separately for display
- Show tokens with visual boundaries in the sentence
- Show cleaner labels in the bars
Let me fix both files.
Now fix the LLM Explorer. The core issue is that BPE tokens carry their own whitespace, so I need to track tokens separately and display them properly.
Now let me verify tsc still passes and re-run the E2E tests to make sure nothing broke.
That's a pre-existing issue in the test files, not from my changes. Let me check it's unrelated.
Both web and api pass fine — that error is a pre-existing issue in e2e test types. Let me run the E2E tests.
All 14 tests still pass. Here's what I changed:
Markov Babbler: Moved the "Next Word" / "Start Over" buttons above the probability bars, so the buttons stay in a stable position and the bars expand below without shifting them.
LLM Explorer — three fixes:
- Token tracking: Now tracks basePrompt (what the user typed) and selectedTokens[] (each token appended) separately from the raw prompt string sent to the API.
- Sentence display: Renders the base prompt followed by each token as a distinct <span> — the most recent token is highlighted in accent, prior tokens get a dotted underline showing token boundaries. The raw token whitespace (e.g., the leading space in " Paris") is preserved naturally.
- Button placement: Moved "Sample" / "Start Over" above the probability bars (same fix as the babbler) to avoid layout shift.
- Token labels: Uses ␣ (open box, U+2423) as the space marker in bar labels instead of ·, which is clearer.
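For reference, the ␣ substitution described above amounts to something like this hypothetical helper (the function name is illustrative):

```typescript
// Make a BPE token's leading space visible in the bar label by
// replacing it with U+2423 (open box), instead of rendering it invisibly.
function formatTokenLabel(token: string): string {
  return token.replace(/^ /, "\u2423");
}
```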
Still not working yet.
I can see the problem clearly now. The model is returning tokens like "The", "There", "Choosing", "Determ", "There's" — these are not sensible next-word predictions for "The best programming language is". This looks like the prompt is being sent as a raw user message rather than as a completion prompt. The model is treating it as a question to answer from scratch rather than text to continue.
The issue is that I'm sending messages: [{ role: "user", content: prompt }] — the model interprets this as a user asking a question, not as text to complete. I need to set up the prompt so the model does text completion, not chat response.
The fix: use a system prompt that tells the model to act as a text completion engine, and put the user's text as an assistant message prefix (which the model should continue).
Wait — actually, the empty assistant message might not work with Azure OpenAI. Let me think about this differently. The better approach is to use the assistant message with a prefix — but actually the simplest reliable approach with chat models is to frame the user message clearly and use the system prompt to instruct continuation.
Actually, the real issue is that sending an empty assistant message may be rejected. Let me test a cleaner approach: put the text to continue directly in the user message with clear instructions.
Now let me test this manually by hitting the API to see what it returns.
Need a real session. Let me check if the dev server has reloaded the new API code. Actually, since it's a dev server, let me check if it auto-reloads.
The API started before my changes. I need to restart it for the route change to take effect. Actually, let me check if Bun dev mode auto-reloads.
It uses --watch so it should auto-reload on file changes. The start time might just not update. Let me verify the route is actually registered by trying to hit it without auth (should get 401, not 404).
Good, 401 means the route is registered and auth middleware is working. The system prompt change is live. The user just needs to reload the page to get the new behavior.
But actually, looking at the screenshot more carefully, the tokens "The", "There", "Choosing", "Determ" strongly suggest the model was generating a response to the prompt rather than continuing it. With the new system prompt instructing it to output "the single next word that naturally continues the text," the tokens should now be things like "Python", "JavaScript", "subjective", etc.
Let me also verify that the tsc still passes with the updated API.
And run the E2E tests to confirm nothing broke.
All 14 tests pass. Here's what was wrong and what I fixed:
Root cause: The API was sending the prompt as messages: [{ role: "user", content: prompt }]. The chat model interpreted this as a question to answer, not text to continue. So "The best programming language is" got a response starting with "The" (beginning of "The best programming language is...") — the model was essentially parroting back or starting a new answer.
Fix: Added a system prompt that instructs the model to act as a text completion engine — outputting only the single next word that naturally continues the given text, with no commentary or repetition of the input. The tokens should now be sensible continuations like "Python", "JavaScript", "subjective", etc.
Give it another try — just reload the page.
Still not quite.
The model is still not giving good completions. "subject" at 98.8% for "The best programming language is" — the model is interpreting this as an instruction and responding with something like "subject to debate" rather than completing naturally.
The fundamental issue is that chat models aren't great at raw text completion via the chat API. The system prompt approach is fighting against the model's training. A better approach: use the echo parameter or frame it differently. Actually, the most reliable approach with chat models is to use an assistant message prefix — put the text to complete as the start of the assistant's response, so the model naturally continues it.
Let me check if Azure OpenAI supports assistant prefill (partial assistant messages).
The key insight: with a chat model, the logprobs reflect what the assistant would say in response. To get actual text continuation probabilities, I need the text-to-complete to be the assistant's own partial message — so the logprobs reflect continuing that text, not responding to it.
Azure OpenAI does support assistant prefill (partial assistant messages). Let me restructure: the user asks the assistant to write something, and the assistant's partial response IS the text to continue.
Wait — Azure OpenAI might not support a trailing assistant message as a "prefix" the same way. Let me check what API version is configured, and also try a different approach. Actually, the top_logprobs max is 20 for OpenAI, but we only show 5 anyway. More importantly, the issue is whether Azure supports the assistant prefill pattern. Let me check by looking at what API version the endpoint uses.
API version 2024-10-21 — this should support assistant prefill. But actually, the issue is more subtle: when you send a trailing assistant message, the model's response is a new assistant turn, and the logprobs are on that new turn's first token. The previous assistant message is treated as conversation history, not a prefix to continue.
The real way to do this with OpenAI's chat API is to use the Completions API (not Chat Completions) which does raw text completion. But that might not be available on this Azure deployment.
Let me try a different framing. The most reliable approach: put the text in the system prompt as context, and have the user message be the text itself with a clear framing that makes the assistant want to echo and continue it.
Actually, wait. Let me reconsider. OpenAI does support the assistant prefill pattern via the "prefix" parameter on message content with newer API versions (2024-10-01+). But actually the simplest reliable approach: use developer/system message to set up, then have the last message be an assistant message. Azure OpenAI with 2024-10-21 does support this — the model will continue the assistant message.
The issue might have been that I had content: "" earlier. Let me verify the current approach works — the assistant message now has the actual prompt text. The model should generate the next token as a continuation.
But actually, I realize there's another issue: with the assistant prefill approach, the logprobs returned are on the new completion, not including the prefilled content. So this should work correctly — the logprobs represent the continuation tokens.
However, I also bumped top_logprobs to 20 but we should cap it. Let me also limit the returned tokens to the top 10 (more informative than 5).
Let me verify tsc passes and the E2E tests still work.
All passing. The key change: instead of sending the text as a user message (which the model responds to), I now send it as an assistant message — a partial response the model should continue. This is the standard "assistant prefill" pattern. The model sees the text as something it already started writing and naturally continues it, giving proper next-word probabilities.
For "The best programming language is", you should now see tokens like "Python", "JavaScript", "subjective", "a", "often" — actual text continuations rather than meta-responses about the prompt. Give it another reload.
Is it worth doing some research about how to accomplish this? Maybe there's a better way. Not working yet.
Good idea — let me research the right way to do this with Azure OpenAI.
Excellent research. Two key findings:
- Assistant prefill doesn't work on OpenAI — unlike Claude, OpenAI appends an EOS token after assistant messages, so it treats the text as a completed turn, not a prefix.
- The best approach is a one-shot exemplar — teach the model the exact format with an example, then give it the real prompt.
There are also two display issues from the screenshot: tokens are being concatenated without spaces ("isoftenadebto"). The BPE tokens from the API include leading spaces (like " often") but they're being rendered without them. Let me fix both the API prompt and the client display.
Now let me fix the API to use the one-shot exemplar approach (the best practice from the research), and also fix the client to properly handle token spacing.
Now fix the client. The tokens from the API may have leading spaces (BPE artifacts). I need to handle this properly — trim the token for display in the bar, but when appending to the sentence, add a space before the token text (since the user is building word-by-word).
Now I see the display issue clearly. With the one-shot approach, the API returns clean word tokens (like "Python" not " Python"). But we still need to handle the concatenation properly. Since the model is now returning single words, I should always add a space between the prompt and each new word. Let me rewrite the client to handle this cleanly.
Now let me verify everything.
All 14 tests pass. Here's what I changed:
API (the core fix): Switched from assistant prefill (which doesn't work on OpenAI — it adds an EOS token after assistant messages) to a one-shot exemplar approach. The messages now are:
- System: "Continue the user's text with exactly one word. Output only that single word."
- User: "The capital of France is" → Assistant: "Paris" (teaching the format)
- User: the actual prompt
This makes the model return clean single words as completions, and the logprobs reflect the probability distribution over those continuation words.
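Assembled as code, that message array might look like the following sketch. The strings mirror the bullets above but are illustrative, not the deployed prompt.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// One-shot exemplar: a system instruction plus a worked user/assistant
// pair teaches the single-word format; the real prompt comes last.
function buildCompletionMessages(prompt: string): ChatMessage[] {
  return [
    {
      role: "system",
      content:
        "Continue the user's text with exactly one word. Output only that single word.",
    },
    { role: "user", content: "The capital of France is" },
    { role: "assistant", content: "Paris" },
    { role: "user", content: prompt },
  ];
}
```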
Client (the display fix): Completely reworked to use a words[] array instead of raw string concatenation. Words are always joined with spaces. When a token is selected, it's trimmed and added as a new word. The display shows starter words in normal text and user-picked words with highlighting. No more "isoftenadebto" — every word gets proper spacing.
Give it another try.
Still not working.
I see two problems:
- Sub-word token fragments: "de", "op", "deb", "pre", "opin" — these are BPE sub-word tokens, not full words. The model's actual tokenizer splits words like "debate" into ["deb", "ate"] and "opinion" into ["opin", "ion"]. This is inherent to how the tokenizer works — max_tokens: 1 returns one BPE token, not one word.
- "subject" keeps appearing: The one-shot example ("The capital of France is" → "Paris") is too similar to the user's prompts, biasing the distribution.
The fix: instead of max_tokens: 1, I should use a higher token limit and then extract just the first word from the response. But I still need the logprobs on that first token. The real issue is that logprobs are per-BPE-token, not per-word.
The best approach: request more tokens (say 5), but only use the generated text up to the first complete word. The logprobs on the first token still give a useful distribution — I just need to filter out the sub-word fragments and only show tokens that look like complete words or meaningful continuations.
The sub-word fragments ("de", "op", "deb", "opin") are BPE tokens that are the start of words like "debate", "opinion", etc. Even though the system prompt says "one word", the logprobs show the raw BPE distribution. The fix: filter the logprobs to only show tokens that look like complete words (start with a letter, at least 2 chars or a common word), and re-normalize the probabilities.
Should we ask for more tokens and combine them somehow?
Yes, good instinct. But after researching this thoroughly, the combining approach is tricky — logprobs at position 1 are conditioned on the specific token generated at position 0, so we can't reconstruct probabilities for alternative multi-token words.
However, the research revealed the real fix: OpenAI's BPE tokenizer marks word-boundary tokens with a leading space. So " Python" (with space) is a complete word, while "de" or "op" (no space) are sub-word fragments. I just need to filter on that signal. My latest code already does this, but you may not have seen it take effect yet. Let me verify it's actually deployed and check the raw API output.
Tests pass. Now let me add some debug logging temporarily to see what the raw tokens look like from the API, so I can verify the filtering is working.
Let me add a temporary debug log to see what the raw tokens look like, then make a test call.
Yeah, this is tricky without exposing tokenization I guess. Should we stop fighting that? Filtering out non-word tokens means that we'll end up affecting the probabilities, although perhaps that's fine for this demo.
You're right — filtering and re-normalizing is an honest approximation. The sub-word fragments represent probability mass that would go to full words (like "de" → "debate", "op" → "opinion"), so redistributing among the visible complete words slightly inflates them, but for a pedagogical demo that's totally fine. The key insight students need is "the model assigns probabilities to possible continuations" — not exact numerical precision.
Let me clean up the debug logging and make sure the filter is working. The filter code I wrote should already be live — let me remove the debug log and verify.
Now let me also simplify the looksLikeCompleteWord filter. The leading-space check is the strongest signal — that's how BPE marks word boundaries. Let me tighten it up and drop the looser heuristics.
The 4+ chars and starts with uppercase rules are too loose — they let through fragments like "opin" (4 chars) and "Deb" (capital D). The leading-space check is the only reliable signal. Let me simplify:
Now update the reference to use the renamed function:
Wait — the 2+ chars fallback for first-token-of-response still lets through fragments like "de" and "op" (which are 2 chars and letter-only). The issue is distinguishing "Python" (first token, no space, real word) from "de" (sub-word fragment, no space). For the first token of a chat response, the model typically DOES include a leading space. Let me check — actually, looking at my one-shot examples, the assistant responses are "room" and "beautiful" (no leading space). So the model learns to respond without a leading space. That means the actual first-token logprobs probably also lack leading spaces.
This changes things. If the model responds with tokens like "Python" (no space) and "de" (no space), the leading-space check won't help. Let me reconsider.
The real difference: in the model's logprobs, word-boundary tokens that would appear mid-text have leading spaces, but the first token of the assistant response does NOT. So ALL the top_logprobs at position 0 likely lack leading spaces, making the space check useless for distinguishing words from fragments.
The one-shot examples are the problem — they teach the model to respond without a leading space. Let me verify by checking the actual API response.
Now let me create a test user session and hit the API to see raw tokens.
Interesting! So the filter IS running (most fragments filtered), but "subject" dominates at 99.8% and "de" slipped through (it's 2 chars, matches the 2+ letter-only chars rule). And none of the tokens have leading spaces — confirming my suspicion that the first-token-of-response position doesn't use the space prefix.
The bigger issue: "subject" at 99.8% is still the model trying to say "subject to debate" rather than completing the text naturally. The one-shot prompt approach isn't fully working.
Let me try a different prompt strategy. Instead of few-shot examples that teach a response format, let me try putting the text directly as an assistant message prefix but with the few-shot showing how continuation works.
Actually wait — let me try something. What if the assistant messages in the few-shot examples INCLUDE a leading space? That might shift the BPE distribution to include spaces.
Still "subject" at 73%. The model really wants to say "subject to debate." Let me try with temperature: 0 and also try a different prompt that's less ambiguous.
I mean, subject to debate is a fine completion.
You're right! "subject" (as in "subject to debate") IS a perfectly valid next word. The model is working correctly — the probabilities reflect real language patterns. I was overthinking this.
So the actual remaining issue from your screenshots was just the sub-word fragments ("de", "op", "deb") cluttering the display. The filter is handling that now. Let me clean up the debug endpoint, tighten the fragment filter (2-letter fragments like "de" are still getting through), and let you test.
From the debug output, the raw tokens at position 0 don't have leading spaces (since it's the first token of the response). So the 2+ letter-only chars fallback is the one doing the work — but it also lets through fragments like "de", "deb", "arg". I need a better heuristic for position 0 tokens.
Since these are all responses to the few-shot game prompt, the real tokens we want are things like "Python", "subject", "Java", "subjective", "often" — which are 4+ chars. The fragments are "de" (2), "deb" (3), "arg" (3). So raising the threshold to 4+ chars should work well.
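A minimal sketch of that heuristic, assuming the filter is named `isWordToken` as elsewhere in this session (the production version may differ):

```typescript
// Sketch of the fragment filter: accept space-prefixed tokens (BPE word
// boundaries) or letter-only tokens of 4+ characters as the position-0 fallback.
function isWordToken(token: string): boolean {
  if (token.startsWith(" ")) return true; // leading space marks a word boundary in BPE
  return /^[A-Za-z]{4,}$/.test(token); // fallback for position-0 tokens: 4+ letters only
}
```

This catches the observed fragments ("de", "deb", "arg") while keeping "Python" and "subject".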
Now remove the debug endpoint.
Let me test it.
Now the results are clean — no more "de", "op", "deb" fragments. The distribution shows: subject (73%), Python (27%), Java, python, subjective, often, consider, currently. All real words.
Let me also test a few more prompts and then clean up.
Hmm, "energ" and "conf" and "ener" snuck through — they're 4-5 chars and letter-only but are still BPE fragments ("energetic", "confident", "energized"). Let me raise the threshold to 5+ for non-space-prefixed tokens.
Actually, wait. "energ" is 5 chars. "conf" is 4 chars. "ener" is 4 chars. So a 4+ minimum still lets "conf" and "ener" through. Let me bump to 5.
But "good" and "great" and "Java" are 4 chars. I'd lose those. Let me instead just maintain a blocklist of common BPE prefixes... no, that's fragile.
Better approach: check if the token could be a word by testing against something simple — does it end with a common word ending? Actually, the simplest improvement: a real English word almost never ends in a consonant cluster that looks like a prefix. "energ", "conf", "ener" all look like truncated words. But "good", "Java", "time" are clearly complete.
Actually, the simplest and most robust: filter out tokens that are proper prefixes of other tokens in the same response. If "energ" and "ener" both appear and neither is a common dictionary word, they're probably fragments. But this is getting complex.
Let me just bump the minimum to 5 chars for non-space-prefixed tokens and accept losing "Java" (4 chars) — it'll still appear if space-prefixed in other contexts. And add "Java" and other common 4-letter words to the allowed list.
Actually, the cleanest approach: just deduplicate. If "Python" and "python" both appear, keep the higher-probability one. And for fragment detection, I'll check: if a token appears as a prefix of another token in the list, it's likely a fragment.
Let me keep it simple and just raise to 5 chars:
Wait, but "Java" (4 chars), "good" (4), "time" (4) would all be filtered out now. Let me add common 4-letter words to the short words set:
Hmm, this dictionary approach is getting unwieldy and fragile. Let me step back and think of a better heuristic.
The real pattern: BPE fragments that slip through at position 0 are prefixes of common words — "energ" (energetic), "conf" (confident), "deb" (debate), "arg" (arguably). They typically end abruptly mid-syllable. Real words end at syllable boundaries.
A simpler heuristic: check if the token ends with a vowel-consonant pattern that suggests a complete syllable. But that's linguistically complex.
Actually, the simplest robust approach: just raise to 5+ chars AND deduplicate case variants. The few 4-letter words we lose (Java, good, time) will either show up space-prefixed in running text or just won't appear. For a demo, losing a few valid 4-letter words is fine. Let me ditch the giant word list and keep it clean.
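A sketch of that combination, with the candidate shape assumed from the surrounding discussion:

```typescript
interface Candidate {
  token: string;
  probability: number;
}

// Sketch of the "5+ chars AND dedupe case variants" heuristic: keep
// space-prefixed or 5+ letter-only tokens, then collapse case variants
// ("Python"/"python"), keeping whichever has the higher probability.
function filterAndDedupe(candidates: Candidate[]): Candidate[] {
  const best = new Map<string, Candidate>();
  for (const c of candidates) {
    const isWord = c.token.startsWith(" ") || /^[A-Za-z]{5,}$/.test(c.token);
    if (!isWord) continue;
    const key = c.token.trim().toLowerCase();
    const prev = best.get(key);
    if (!prev || prev.probability < c.probability) best.set(key, c);
  }
  return [...best.values()].sort((a, b) => b.probability - a.probability);
}
```

Note that a 5-letter fragment like "energ" still passes this filter, which is exactly the residual problem observed next.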
Now let me test again.
"energ" (5 chars, letter-only) is still getting through. Let me bump the threshold to 6, or better yet — add a smarter check. Most BPE fragments end in consonants and don't form pronounceable endings. But that's too complex.
Simplest fix: bump to 6 chars. But we'd lose "happy" (5 chars) and "often" (5), which are fine at a 5+ threshold; the problem is that "energ" is 5 too. Let me look at the pattern: "energ" and "conf" are clearly prefixes, while "happy", "often", "sleep", "great" are clearly words.
The difference: "energ" doesn't end naturally — it's not a word. The issue is purely that I can't distinguish "energ" from "happy" by length alone.
Let me try a different angle: after filtering, deduplicate by checking if one token is a prefix of another in the same result set. If "energ" appears and there's nothing like "energy" or "energetic" in the set, it's harder... but actually, we could just check: does this token, when you add common suffixes, form a longer token also in the top-20? But we'd need access to the full logprobs list for that.
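That prefix check, had it shipped, could have been a one-liner (sketch only; this approach was ultimately not used):

```typescript
// Drop tokens that are proper prefixes of another token in the same result set,
// e.g. "energ" when "energetic" is also present.
function dropProperPrefixes(tokens: string[]): string[] {
  return tokens.filter((t) => !tokens.some((o) => o !== t && o.startsWith(t)));
}
```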
Actually, the simplest pragmatic fix: just filter tokens that end with common "incomplete" patterns. BPE fragments often end mid-consonant-cluster or with a partial root. But this is getting hacky.
Let me just accept the ~5-char fragments as a minor imperfection and ship it. The key improvement — filtering out 2-3 char fragments like "de", "op", "deb", "arg" — is already working. A rare "energ" slipping through is much better than the original mess. For a teaching demo, it's fine.
"To be or not to" → "be" at 99.99999% — perfect! Let me run the full check and clean up.
Now clean up the test user.
All 14 E2E tests pass, tsc passes. The filter is working — here's what the results look like now:
- "The best programming language is" → subject (73%), Python (27%), python, subjective, often, consider, currently, dependent
- "I feel really" → happy (99.9%), sleep, great, tired, fortunate, content
- "To be or not to" → be (99.99999%)
- "Once upon a" → time (99.99999%)
All real words, no more "de", "op", "deb" fragments. An occasional 5-char BPE fragment like "energ" may slip through, but it's rare and the probabilities are correct enough for a teaching demo. Give it a try!
It's still just a bit wonky, unfortunately. Is there a reason that we can't request multiple next tokens so that we're more likely to see a word boundary? But you said that produces branching. Ugh.
Yeah, the branching means logprobs at position 1 are conditioned on whatever token was actually generated at position 0 — so you can only see one path.
But there IS a way to see multiple complete words: use the n parameter to get multiple completions in a single API call. With n: 10, max_tokens: 3, we get 10 independent samples, each generating enough tokens to complete a word. Then we extract the first word from each, compute its probability from the token chain, and show unique words. No BPE filtering needed — we're working at the word level.
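The extraction idea can be sketched as follows (entry shape assumed from the API's logprob content; names illustrative):

```typescript
interface TokenLogprob {
  token: string;
  logprob: number;
}

// Concatenate tokens until the next space-prefixed token (a word boundary),
// summing logprobs along the way; exp(sum) is the chain probability of the
// whole word, since probabilities multiply while logprobs add.
function firstWord(
  entries: TokenLogprob[],
): { word: string; probability: number } | null {
  let word = "";
  let logprobSum = 0;
  for (const e of entries) {
    if (word.length > 0 && e.token.startsWith(" ")) break;
    word += e.token;
    logprobSum += e.logprob;
  }
  word = word.trim();
  return word ? { word, probability: Math.exp(logprobSum) } : null;
}
```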
This session is being continued from a previous conversation that ran out of context. The summary below covers the earlier portion of the conversation.
Analysis: Let me chronologically analyze the conversation:
- Initial Plan: User provided a detailed implementation plan for two interactive LLM teaching components: a Markov Babbler and an LLM Probability Explorer.
- Exploration Phase: I read multiple files to understand codebase patterns:
  - `mdx-components.tsx` - for component registration
  - `apps/api/src/index.ts` - for route registration
  - MDX component patterns (`details.tsx`, `preparation-chat.tsx`)
  - API middleware (auth, rate-limit)
  - LLM package (`packages/llm/src/chat.ts`, `env.ts`)
  - Resources page structure
  - E2E test patterns
  - `apps/web/lib/api.ts` - client-side fetch wrapper using `apiFetch`
- Implementation Phase:
  - Created Markov Babbler component
  - Created LLM Explorer API endpoint (initially with the OpenAI SDK - had to fix)
  - Created LLM Explorer client component
  - Registered in `mdx-components.tsx`
  - Added to resources page
  - Created E2E tests
  - Added test projects to playwright config
- User Feedback Round 1: "For the Markov babbler, let's move the next word / start over buttons above the probabilities to avoid the layout shift. Or have the probabilites be a fixed height and scrollable. And I don't think that the LLM probability explorer is working. It keeps saying I'm or Its and it's also concatenating them without spaces."
  - Fixed Markov Babbler: moved buttons above probability bars
  - Fixed LLM Explorer: tracked tokens separately, added system prompt for text completion
- User Feedback Round 2: Screenshot showed "The", "There", "Choosing", "Determ" - model treating input as a question to answer, not text to continue
  - Changed to a system prompt instructing "text completion engine"
- User Feedback Round 3: Screenshot showed "subject" at 98.8% - still not completing naturally
  - Tried assistant prefill approach (putting text as an assistant message)
- User Feedback Round 4: Screenshot showed "isoftenadebto" - tokens concatenated without spaces, sub-word fragments visible
  - User suggested: "Is it worth doing some research about how to accomplish this?"
  - Did extensive web research on OpenAI logprobs for text completion
  - Key findings: assistant prefill doesn't work on OpenAI (EOS token); best approach is a one-shot exemplar
  - Switched to one-shot exemplar with few-shot examples
  - Rewrote client to use a word array instead of raw string concatenation
- User Feedback Round 5: Screenshot showed "subject op op Python Python" with BPE fragments
  - User said "Still not working"
  - Added BPE fragment filtering (`isWordToken` function)
  - Filtered sub-word fragments, re-normalized probabilities
- User Feedback Round 6: User said "I mean, subject to debate is a fine completion" - correcting my assumption that "subject" was wrong
  - Cleaned up filter, removed debug code
  - Tightened filter to 5+ chars for non-space-prefixed tokens
  - Results improved, but occasional fragments like "energ" still slipped through
- User Feedback Round 7: "It's still just a bit wonky, unfortunately. Is there a reason that we can't request multiple next tokens so that we're more likely to see a word boundary? But you said that produces branching. Ugh."
  - I proposed using the `n` parameter for multiple completions
  - Implemented the `n: 15, max_tokens: 3` approach to sample diverse first words
  - Extract the first complete word from each completion using word-boundary detection
  - Compute chain probability from logprob products
  - This avoids BPE filtering entirely - works at the word level
The most recent code was written but NOT YET TESTED - I wrote the new llm-explore.ts with the n parameter approach but hadn't run tsc, lint, or E2E tests before the summary was requested.
Summary:
- Primary Request and Intent: The user asked to implement two interactive MDX components for a course meeting on "How LLMs Work":
  - Markov Babbler: Client-side bigram Markov chain that trains on text, generates words one at a time showing probability distributions
  - LLM Probability Explorer: Uses a real LLM API to show next-word prediction probabilities, letting students build sentences word by word

  Both should be MDX-compatible, registered in `mdx-components.tsx`, added to the resources page, and have E2E tests. The pedagogical goal is to demonstrate that LLMs are essentially sophisticated next-word predictors.
- Key Technical Concepts:
  - BPE (Byte-Pair Encoding) tokenization — OpenAI models tokenize at the sub-word level, not the word level
  - OpenAI Chat Completions API with the `logprobs` and `top_logprobs` parameters
  - OpenAI's `n` parameter for multiple completions in one API call
  - Assistant prefill pattern (works on Claude, NOT on OpenAI due to the EOS token)
  - One-shot exemplar prompting for chat models to do text completion
  - Azure OpenAI API (api-version 2024-10-21, GPT-4o deployment)
  - BPE word-boundary tokens start with a space (e.g., " Python"); fragments don't (e.g., "de")
  - Hono framework for API routes with auth + rate-limiting middleware
  - `apiFetch` client-side fetch wrapper (credentials: "include", impersonation support)
  - MDX component registration pattern in this codebase
  - Playwright E2E testing with `page.route()` mocking for API calls
  - Auth fixture pattern for authenticated E2E tests
- Files and Code Sections:
  - `apps/web/components/mdx/markov-babbler.tsx` (CREATED)
    - Client component with bigram Markov chain, pre-loaded texts, training, step-through generation, probability bars, temperature slider
    - Buttons were moved ABOVE probability bars per user feedback to avoid layout shift
    - Uses `data-testid` attributes for E2E testing
  - `apps/api/src/routes/llm-explore.ts` (CREATED, most recently rewritten)
    - This file has been rewritten multiple times as the approach evolved
    - Most recent version uses `n: 15, max_tokens: 3` to sample multiple completions and extract first complete words
    - Current full content:
      ```ts
      import { Hono } from "hono";
      import { zValidator } from "@hono/zod-validator";
      import { z } from "zod";
      import { authMiddleware } from "../middleware/auth";
      import { rateLimit } from "../middleware/rate-limit";
      import { createLogger } from "@repo/logger";

      const log = createLogger("llm-explore");

      const exploreRateLimit = rateLimit({
        windowMs: 60_000,
        maxRequests: 30,
        maxDaily: 200,
      });

      const llmExploreRouter = new Hono();

      const exploreRequestSchema = z.object({
        prompt: z.string().min(1).max(1000),
        temperature: z.number().min(0).max(2).default(0.7),
      });

      const exploreValidator = zValidator("json", exploreRequestSchema, (result, c) => {
        if (!result.success) {
          return c.json(
            { error: "Validation failed", details: z.flattenError(result.error) },
            400,
          );
        }
      });

      function estimateTokenCount(text: string): number {
        return Math.ceil(text.length / 4);
      }

      interface AzureLogprobEntry {
        token: string;
        logprob: number;
      }

      interface AzureChoice {
        logprobs?: {
          content?: AzureLogprobEntry[];
        };
      }

      interface AzureChatResponse {
        choices?: AzureChoice[];
      }

      function extractFirstWord(
        entries: AzureLogprobEntry[],
      ): { word: string; probability: number } | null {
        let word = "";
        let logprobSum = 0;
        for (const entry of entries) {
          if (word.length > 0 && entry.token.startsWith(" ")) break;
          word += entry.token;
          logprobSum += entry.logprob;
        }
        word = word.trim();
        if (word.length === 0) return null;
        return { word, probability: Math.exp(logprobSum) };
      }

      llmExploreRouter.post("/", authMiddleware, exploreRateLimit, exploreValidator, async (c) => {
        const { prompt, temperature } = c.req.valid("json");
        if (estimateTokenCount(prompt) > 200) {
          return c.json({ error: "Prompt too long. Please keep it under ~200 tokens." }, 400);
        }
        const endpoint = process.env.AZURE_OPENAI_CHAT_ENDPOINT;
        const apiKey = process.env.AZURE_OPENAI_CHAT_API_KEY; // exact env var name garbled in transcript
        if (!endpoint || !apiKey) {
          return c.json({ error: "LLM not configured" }, 500);
        }
        try {
          const response = await fetch(endpoint, {
            method: "POST",
            headers: { "Content-Type": "application/json", "api-key": apiKey },
            body: JSON.stringify({
              messages: [
                {
                  role: "system",
                  content:
                    "You are playing a word prediction game. The user gives you an incomplete sentence. You reply with ONLY the single next word that best continues the sentence. Just the word, nothing else.",
                },
                { role: "user", content: "She walked into the" },
                { role: "assistant", content: "room" },
                { role: "user", content: "I really want to eat" },
                { role: "assistant", content: "pizza" },
                { role: "user", content: "The weather today is" },
                { role: "assistant", content: "beautiful" },
                { role: "user", content: prompt },
              ],
              max_tokens: 3,
              temperature: Math.max(temperature, 0.3),
              n: 15,
              logprobs: true,
            }),
          });
          if (!response.ok) { /* error handling */ }
          const data = (await response.json()) as AzureChatResponse;
          const choices = data.choices || [];
          const wordMap = new Map<string, number>();
          for (const choice of choices) {
            const entries = choice.logprobs?.content;
            if (!entries || entries.length === 0) continue;
            const result = extractFirstWord(entries);
            if (!result) continue;
            const existing = wordMap.get(result.word);
            if (existing === undefined || existing < result.probability) {
              wordMap.set(result.word, result.probability);
            }
          }
          // Sort, normalize, return top 10
          const sorted = Array.from(wordMap.entries()).sort((a, b) => b[1] - a[1]);
          const totalProb = sorted.reduce((sum, [, p]) => sum + p, 0);
          const tokens = sorted.slice(0, 10).map(([token, prob]) => ({
            token,
            probability: prob / totalProb,
          }));
          return c.json({ tokens });
        } catch (err) { /* error handling */ }
      });

      export { llmExploreRouter };
      ```
  - `apps/web/components/mdx/llm-explorer.tsx` (CREATED)
    - Client component using `apiFetch` from `@/lib/api`
    - Uses a `words[]` array for clean word-level display (not raw string concatenation)
    - Tracks `starterWordCount` to distinguish starter words from user-picked words in display
    - Starter words shown plain, user-picked words with dotted underline, most recent word in accent color
    - Temperature slider, Sample button (weighted random), Start Over
    - 6 default sentence starters
    - `data-testid` attributes throughout
  - `apps/web/mdx-components.tsx` (MODIFIED)
    - Added imports for `MarkovBabbler` and `LlmExplorer`
    - Registered both in the component map
  - `apps/api/src/index.ts` (MODIFIED)
    - Added import for `llmExploreRouter`
    - Registered route: `app.route("/api/llm-explore", llmExploreRouter)`
  - `apps/web/app/[semester]/resources/page.tsx` (MODIFIED)
    - Added "Interactive Tools" section with both components
    - `MarkovBabbler` shown to all users
    - `LlmExplorer` shown only to logged-in users (with lock icon + message for guests)
  - `testing/e2e/tests/markov-babbler.spec.ts` (CREATED)
    - 7 tests: renders, training, generation, step-through, temperature, custom text, start over
    - Pure client-side, no API mocking needed
  - `testing/e2e/tests/llm-explorer.spec.ts` (CREATED)
    - 7 tests using Playwright route mocking (`page.route('**/api/llm-explore', ...)`)
    - Uses auth fixture (`authenticatedPage`) for logged-in tests
    - Tests: renders, predictions, token selection, sample, temperature, custom input, rate limit error
  - `testing/e2e/playwright.config.ts` (MODIFIED)
    - Added `markov-babbler` and `llm-explorer` test projects
- Errors and Fixes:
  - TypeScript error `Cannot find module 'openai'`: Initially imported the OpenAI SDK directly in the API route, but it's only a dependency of `@repo/llm`, not `@repo/api`. Fixed by rewriting to use raw `fetch` to the Azure OpenAI endpoint.
  - TypeScript error `Parameter 'entry' implicitly has an 'any' type`: Added proper type interfaces (`AzureLogprob`, `AzureChoice`, `AzureChatResponse`) for the fetch response.
  - Model treating prompt as a question: With `messages: [{ role: "user", content: prompt }]`, the model responded TO the prompt instead of continuing it. User showed a screenshot with "The", "There", "Choosing" as predictions. Fixed by adding a system prompt and one-shot examples.
  - Assistant prefill doesn't work on OpenAI: Tried putting text as an assistant message, but OpenAI appends an EOS token after assistant messages (unlike Claude). Research confirmed this is a fundamental limitation.
  - "subject" at 98.8%: After the system prompt fix, "subject" dominated because the model was still partially responding rather than completing. Added few-shot exemplars, which improved the distribution. User then clarified "subject to debate is a fine completion" — it was actually working correctly.
  - Sub-word BPE fragments in results: Tokens like "de", "op", "deb", "opin", "energ" appeared alongside real words. These are BPE sub-word tokens. Multiple filtering approaches were tried:
    - Leading-space check (works for mid-text tokens but NOT position-0 tokens in chat responses)
    - Length-based heuristic (4+ chars, then 5+ chars)
    - Common short words allowlist
    - All approaches had edge cases
  - Token concatenation without spaces: User saw "isoftenadebto" — raw BPE tokens concatenated directly. Fixed by switching the client to a `words[]` array with explicit space-separated display.
  - Final approach: Abandoned per-token filtering entirely. Switched to `n: 15, max_tokens: 3` to get 15 independent completions, extract the first complete word from each using word-boundary detection (`token.startsWith(" ")`), and compute chain probabilities. This works at the word level and avoids all BPE issues.
- Problem Solving:
  - Solved: Markov Babbler layout shift — moved buttons above probability bars
  - Solved: Token concatenation display — switched to word-array model
  - Solved: Model responding vs. completing — one-shot exemplar prompting
  - Most recently addressed: BPE sub-word fragments — switched from single-token logprobs + filtering to multi-completion sampling (`n: 15`) with first-word extraction
  - Ongoing: The `n: 15` approach was just written but NOT YET TESTED
- All User Messages:
- "Implement the following plan: [detailed plan for Markov Babbler and LLM Probability Explorer]"
- "For the Markov babbler, let's move the next word / start over buttons above the probabilities to avoid the layout shift. Or have the probabilites be a fixed height and scrollable. And I don't think that the LLM probability explorer is working. It keeps saying I'm or Its and it's also concatenating them without spaces."
- "Still not working yet. [screenshot showing 'The', 'There', 'Choosing', 'Determ' as predictions]"
- "Still not quite. [screenshot showing 'subject' at 98.8% with sub-word fragments]"
- "Is it worth doing some research about how to accomplish this? Maybe there's a better way. Not working yet. [screenshot showing sub-word fragments mixed with real words]"
- "[interrupted] Still not working. [screenshot showing 'subject op op Python Python' concatenated]"
- "[interrupted] Yeah, this is tricky without exposing tokenization I guess. Should we stop fighting that? Filtering out non-word tokens means that we'll end up affecting the probabilities, although perhaps that's fine for this demo."
- "I mean, subject to debate is a fine completion."
- "It's still just a bit wonky, unfortunately. Is there a reason that we can't request multiple next tokens so that we're more likely to see a word boundary? But you said that produces branching. Ugh."
- Pending Tasks:
  - Test the new `n: 15, max_tokens: 3` API approach (tsc, lint, E2E tests)
  - Verify the approach actually produces clean word-level results by testing with real API calls
  - User needs to visually verify both components work correctly in the browser
- Current Work: Just rewrote `apps/api/src/routes/llm-explore.ts` with a fundamentally different approach: using `n: 15` (15 independent completions) with `max_tokens: 3` instead of `max_tokens: 1` with BPE fragment filtering. The `extractFirstWord` function concatenates tokens until hitting a space-prefixed token (word boundary), then computes the chain probability from logprob products. Unique first words are collected, deduplicated, sorted by probability, normalized, and returned.

  This code was written but not yet tested — no tsc, lint, or E2E tests have been run. The API server auto-reloads (`--watch`), but the approach needs verification with real API calls to confirm:
  - Azure OpenAI supports the `n` parameter
  - The response includes logprobs for each of the 15 choices
  - The word extraction and probability computation produce sensible results
- Optional Next Step: Run `bun run tsc --filter='@repo/api'` to verify the new code compiles, then test the API endpoint with a real call to verify the `n: 15` approach produces clean word-level results. The user's most recent feedback was: "Is there a reason that we can't request multiple next tokens so that we're more likely to see a word boundary?" — this is exactly what the new approach implements, but it hasn't been verified yet.
If you need specific details from before compaction (like exact code snippets, error messages, or content you generated), read the full transcript at: ~/.claude/projects/-Users-challen-www-usingandunderstanding-ai/6c4d492d-27e2-4272-b94b-1dd08e941c0a.jsonl Please continue the conversation from where we left off without asking the user any further questions. Continue with the last task that you were asked to work on.
Let me verify the new n: 15 approach compiles and then run the E2E tests.
TypeScript compiles clean. Dev server is running. Let me run the E2E tests for both components.
All 24 tests pass (12 for each component). Let me also run the full lint check to make sure everything is clean.
Everything is clean — TypeScript, lint, and all E2E tests pass.
The n: 15, max_tokens: 3 approach should produce much better results since it:
- Samples 15 independent completions, each with up to 3 tokens
- Extracts the first complete word from each completion (by detecting word boundaries — a space-prefixed token signals the start of the next word)
- Computes chain probability by multiplying token logprobs within each word
- Deduplicates, sorts by probability, normalizes, and returns the top 10
This avoids BPE fragment issues entirely since we're working at the word level, not the token level. You should try it in the browser to verify the real API produces clean results.
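The deduplicate-sort-normalize steps above amount to something like this sketch (per-completion word extraction assumed already done; shapes illustrative):

```typescript
interface WordSample {
  word: string;
  probability: number;
}

// Aggregate words across the n independent completions: keep the max
// probability per unique word, sort descending, then renormalize the
// surviving probabilities so they sum to 1 for display.
function aggregate(samples: WordSample[], topK = 10): WordSample[] {
  const best = new Map<string, number>();
  for (const s of samples) {
    const p = best.get(s.word);
    if (p === undefined || p < s.probability) best.set(s.word, s.probability);
  }
  const sorted = [...best.entries()].sort((a, b) => b[1] - a[1]).slice(0, topK);
  const total = sorted.reduce((sum, [, p]) => sum + p, 0);
  return sorted.map(([word, p]) => ({ word, probability: p / total }));
}
```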
Much better! /commitall
Great to see clean word-level predictions! Let me gather the git state.
No debugging statements or unnecessary logging in the diffs. All the new files look good from the earlier reads. Let me check the new files quickly for any debug code.
No debug statements. All clean. Let me commit everything.
Committed: 0ccdbc6 — all 9 files (5 new, 4 modified) in a single commit. Working tree is clean.
OK! A few improvements. When there's only one completion let's just forge ahead and not prompt the user, although we should display a message indicating what is happening. Also, as a separate activity, can we do a temperature comparison? Allow the user to submit text and see, for three different temperatures, what the completion would be. And maybe choose one and continue? We could even integrate this into the current chat system.
Let me explore the relevant code to plan these improvements.
Good exploration. Let me check the existing /api/chat/public endpoint more closely and understand how the temperature comparison could work with authentication.
Now let me check how the agents are configured and what temperature defaults exist.
Now I have a clear picture. Let me write the plan.