Per-Message Token Counting and Token-Based Truncation
Implement the following plan:
Per-Message Token Counting and Token-Based Truncation
Goal
Add server-side per-message token counting to avoid 400 errors from context overflow. Store token counts in message metadata and expose to the frontend. Use token budgets to automatically truncate history when it would exceed the model's context window.
Design Decisions
- Library: `gpt-tokenizer` (pure JS, fastest npm tokenizer, supports the `o200k_base` encoding used by both GPT-4o and GPT-5.2)
- Storage: use the existing JSONB `metadata` column on the messages table — no schema migration needed
- Truncation order: the message-count slider (`contextWindow`) applies first, then token-based truncation acts as a safety net
- Context windows: 128K tokens for both GPT-4o and GPT-5.2
- Completion reserve: 4,096 tokens reserved for the model response
- No new API endpoints — token counts flow through existing message metadata
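As a concrete illustration of how these numbers interact, here is the budget arithmetic with made-up prompt and message sizes (the 2,500 and 150 figures are examples, not values from the codebase):

```typescript
// Hypothetical sizes, assuming the 128K window and 4,096-token reserve above.
const contextWindow = 128_000;     // GPT-4o / GPT-5.2
const completionReserve = 4_096;   // reserved for the model's response
const systemPromptTokens = 2_500;  // example system prompt size
const currentMessageTokens = 150;  // example user message size

// Everything left over is the budget for conversation history.
const historyBudget =
  contextWindow - systemPromptTokens - currentMessageTokens - completionReserve;
console.log(historyBudget); // 121254
```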
Files to Change
1. packages/llm/package.json — Add dependency
Add "gpt-tokenizer": "3.4.0" to dependencies (exact pin per project convention). Run bun install.
2. New file: packages/llm/src/tokenizer.ts — Token counting utilities
Exports:
- `countTokens(text: string): number` — count tokens in a plain string
- `countMessageTokens(message: ChatMessage): number` — count tokens for a message, including ~3 tokens of ChatML overhead
- `getContextWindow(model: ChatModel): number` — returns 128,000 for both models
- `calculateHistoryBudget(model, systemPromptTokens, currentMessageTokens, completionReserve?): number` — tokens available for history
- `truncateHistory(historyMessages, tokenBudget): { messages, droppedCount }` — drop oldest messages to fit the budget

Constants:
- `DEFAULT_COMPLETION_RESERVE = 4096`

Uses `encode` from `gpt-tokenizer` (defaults to the `o200k_base` encoding).
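A minimal sketch of what this module could look like. The `countTokens` body below uses a crude characters-per-token stand-in so the sketch runs without the dependency installed; the real module would `import { encode } from "gpt-tokenizer"` and return `encode(text).length`. The `ChatMessage` and `ChatModel` types are simplified assumptions, not the repo's actual definitions.

```typescript
// Sketch of packages/llm/src/tokenizer.ts — stand-in types for illustration.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type ChatModel = "gpt-4o" | "gpt-5.2";

export const DEFAULT_COMPLETION_RESERVE = 4096;

// Real module: import { encode } from "gpt-tokenizer"; return encode(text).length;
// Stand-in here: ~4 characters per token, so the sketch runs dependency-free.
export function countTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Per-message cost includes ~3 tokens of ChatML framing overhead.
export function countMessageTokens(message: ChatMessage): number {
  return countTokens(message.content) + 3;
}

export function getContextWindow(_model: ChatModel): number {
  return 128_000; // same window for both supported models
}

export function calculateHistoryBudget(
  model: ChatModel,
  systemPromptTokens: number,
  currentMessageTokens: number,
  completionReserve: number = DEFAULT_COMPLETION_RESERVE,
): number {
  return Math.max(
    0,
    getContextWindow(model) - systemPromptTokens - currentMessageTokens - completionReserve,
  );
}

// Walk history newest-to-oldest, keeping messages while the budget allows;
// everything older than the first overflow is dropped.
export function truncateHistory(
  historyMessages: ChatMessage[],
  tokenBudget: number,
): { messages: ChatMessage[]; droppedCount: number } {
  const kept: ChatMessage[] = [];
  let used = 0;
  for (let i = historyMessages.length - 1; i >= 0; i--) {
    const cost = countMessageTokens(historyMessages[i]!);
    if (used + cost > tokenBudget) break;
    used += cost;
    kept.unshift(historyMessages[i]!);
  }
  return { messages: kept, droppedCount: historyMessages.length - kept.length };
}
```

Iterating newest-to-oldest (rather than summing oldest-first) guarantees the most recent turns survive, which matches the "drop oldest messages" behavior described above.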
3. packages/llm/src/index.ts — Export new utilities
Add tokenizer exports block.
4. apps/api/src/routes/chat.ts — Integrate token counting and truncation
Changes in the main POST / handler:
a. Import countTokens, countMessageTokens, calculateHistoryBudget, truncateHistory from @repo/llm
b. Store tokenCount on user message (at the existing DB insert ~line 164):
```ts
metadata: { tokenCount: countMessageTokens({ role: "user", content: data.message }) }
```
c. Token-based truncation — insert after system prompt is fully built (~line 158) and before the LLM call (~line 170). This runs after the existing contextWindow message-count truncation:
```ts
const systemPromptTokens = countTokens(systemPrompt);
const historyBudget = calculateHistoryBudget(
  chatModel,
  systemPromptTokens,
  countMessageTokens({ role: "user", content: data.message }),
);
const currentMessage = chatMessages[chatMessages.length - 1]!;
const historyMessages = chatMessages.slice(0, -1);
const { messages: truncatedHistory, droppedCount } = truncateHistory(historyMessages, historyBudget);
chatMessages = [...truncatedHistory, currentMessage];
```
d. Store tokenCount on assistant messages — add tokenCount to metadata in both streaming and non-streaming DB inserts. Also store droppedMessageCount when > 0.
Metadata Shape After Changes
User message: `{ tokenCount: 42 }`
Assistant message: `{ tokenCount: 350, usage: {...}, ragChunksUsed: [...], droppedMessageCount: 3 }`
(`droppedMessageCount` only present when > 0)
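One way to assemble that metadata object so `droppedMessageCount` appears only when messages were actually dropped is a conditional spread. The type and function names below are illustrative, not the exact ones in `chat.ts`:

```typescript
// Illustrative metadata assembly; the shape mirrors the plan above.
type AssistantMetadata = {
  tokenCount: number;
  usage?: unknown;
  ragChunksUsed?: unknown[];
  droppedMessageCount?: number;
};

function buildAssistantMetadata(
  tokenCount: number,
  droppedCount: number,
): AssistantMetadata {
  return {
    tokenCount,
    // Conditional spread: the key is omitted entirely when nothing was dropped,
    // so the stored JSONB stays clean for the common case.
    ...(droppedCount > 0 ? { droppedMessageCount: droppedCount } : {}),
  };
}
```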
Verification
```sh
bun install   # Install gpt-tokenizer
bun run tsc   # Type check
bun run lint  # Lint
# Manual: send messages in long conversations, verify no 400 errors
```
If you need specific details from before exiting plan mode (like exact code snippets, error messages, or content you generated), read the full transcript at: ~/.claude/projects/-Users-challen-www-usingandunderstanding-ai/f3f0122c-eca0-405f-b34a-4b6fce605700.jsonl
I'll implement this plan step by step. Let me start by reading the relevant files to understand the current code.
Now let me also check the ChatMessage type to understand its shape.
Good, I have a clear picture of the codebase. Let me now implement the changes.
Step 1: Add gpt-tokenizer dependency.
Step 2: Create the tokenizer module.
Step 3: Export tokenizer utilities from index.ts.
Step 4: Integrate token counting and truncation in chat.ts.
First, update the import:
Now add token counting to the user message insert:
Now update the streaming assistant message insert to include tokenCount and droppedMessageCount:
Now update the non-streaming assistant message insert:
Step 5: Verify with tsc and lint.
Both tsc and lint pass cleanly.
All tasks complete. Here's a summary of the changes:
1. packages/llm/package.json — Added "gpt-tokenizer": "3.4.0" dependency.
2. packages/llm/src/tokenizer.ts (new) — Token counting utilities:
- `countTokens(text)` — counts tokens in a string using the `o200k_base` encoding
- `countMessageTokens(message)` — counts tokens for a ChatMessage plus ~3 tokens of ChatML overhead
- `getContextWindow(model)` — returns 128,000 for both models
- `calculateHistoryBudget(model, systemPromptTokens, currentMessageTokens, completionReserve?)` — computes tokens available for history
- `truncateHistory(historyMessages, tokenBudget)` — drops oldest messages to fit the budget, returns `{ messages, droppedCount }`
- `DEFAULT_COMPLETION_RESERVE = 4096`
3. packages/llm/src/index.ts — Added tokenizer exports block.
4. apps/api/src/routes/chat.ts — Integrated token counting and truncation:
- After message-count truncation and system prompt construction, computes a token budget and truncates history if needed
- User messages stored with `metadata: { tokenCount }`
- Assistant messages stored with `metadata: { tokenCount, ..., droppedMessageCount }` (`droppedMessageCount` only when > 0)
Can we add this information to the chat UI and create e2e tests to make sure it's being computed properly?
Good, I have a thorough understanding. Let me read the key files directly to see exact code details.
Now let me design the implementation plan.