Per-Message Token Counting and Token-Based Truncation
Implement the following plan:
Per-Message Token Counting and Token-Based Truncation
Goal
Add server-side per-message token counting to avoid 400 errors from context overflow. Store token counts in message metadata and expose to the frontend. Use token budgets to automatically truncate history when it would exceed the model's context window.
Design Decisions
- Library: `gpt-tokenizer` (pure JS, fastest npm tokenizer, supports the `o200k_base` encoding used by both GPT-4o and GPT-5.2)
- Storage: use the existing JSONB `metadata` column on the messages table — no schema migration needed
- Truncation order: the message-count slider (`contextWindow`) applies first, then token-based truncation acts as a safety net
- Context windows: 128K tokens for both GPT-4o and GPT-5.2
- Completion reserve: 4,096 tokens reserved for the model response
- No new API endpoints — token counts flow through existing message metadata
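As a concrete illustration of how these numbers interact, here is the budget arithmetic with made-up prompt and message sizes (the 2,500 and 150 figures are examples, not values from the codebase):

```typescript
// Hypothetical sizes, assuming the 128K window and 4,096-token reserve above.
const contextWindow = 128_000;     // GPT-4o / GPT-5.2
const completionReserve = 4_096;   // reserved for the model's response
const systemPromptTokens = 2_500;  // example system prompt size
const currentMessageTokens = 150;  // example user message size

// Everything left over is the budget for conversation history.
const historyBudget =
  contextWindow - systemPromptTokens - currentMessageTokens - completionReserve;
console.log(historyBudget); // 121254
```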
Files to Change
1. packages/llm/package.json — Add dependency
Add "gpt-tokenizer": "3.4.0" to dependencies (exact pin per project convention). Run bun install.
2. New file: packages/llm/src/tokenizer.ts — Token counting utilities
Exports:
- `countTokens(text: string): number` — count tokens in a plain string
- `countMessageTokens(message: ChatMessage): number` — count tokens for a message, including ~3 tokens of ChatML overhead
- `getContextWindow(model: ChatModel): number` — returns 128,000 for both models
- `calculateHistoryBudget(model, systemPromptTokens, currentMessageTokens, completionReserve?): number` — tokens available for history
- `truncateHistory(historyMessages, tokenBudget): { messages, droppedCount }` — drop oldest messages to fit the budget

Constants:
- `DEFAULT_COMPLETION_RESERVE = 4096`

Uses `encode` from `gpt-tokenizer` (defaults to the `o200k_base` encoding).
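A minimal sketch of what this module could look like. The `countTokens` body below uses a crude characters-per-token stand-in so the sketch runs without the dependency installed; the real module would `import { encode } from "gpt-tokenizer"` and return `encode(text).length`. The `ChatMessage` and `ChatModel` types are simplified assumptions, not the repo's actual definitions.

```typescript
// Sketch of packages/llm/src/tokenizer.ts — stand-in types for illustration.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type ChatModel = "gpt-4o" | "gpt-5.2";

export const DEFAULT_COMPLETION_RESERVE = 4096;

// Real module: import { encode } from "gpt-tokenizer"; return encode(text).length;
// Stand-in here: ~4 characters per token, so the sketch runs dependency-free.
export function countTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Per-message cost includes ~3 tokens of ChatML framing overhead.
export function countMessageTokens(message: ChatMessage): number {
  return countTokens(message.content) + 3;
}

export function getContextWindow(_model: ChatModel): number {
  return 128_000; // same window for both supported models
}

export function calculateHistoryBudget(
  model: ChatModel,
  systemPromptTokens: number,
  currentMessageTokens: number,
  completionReserve: number = DEFAULT_COMPLETION_RESERVE,
): number {
  return Math.max(
    0,
    getContextWindow(model) - systemPromptTokens - currentMessageTokens - completionReserve,
  );
}

// Walk history newest-to-oldest, keeping messages while the budget allows;
// everything older than the first overflow is dropped.
export function truncateHistory(
  historyMessages: ChatMessage[],
  tokenBudget: number,
): { messages: ChatMessage[]; droppedCount: number } {
  const kept: ChatMessage[] = [];
  let used = 0;
  for (let i = historyMessages.length - 1; i >= 0; i--) {
    const cost = countMessageTokens(historyMessages[i]!);
    if (used + cost > tokenBudget) break;
    used += cost;
    kept.unshift(historyMessages[i]!);
  }
  return { messages: kept, droppedCount: historyMessages.length - kept.length };
}
```

Iterating newest-to-oldest (rather than summing oldest-first) guarantees the most recent turns survive, which matches the "drop oldest messages" behavior described above.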
3. packages/llm/src/index.ts — Export new utilities
Add tokenizer exports block.
4. apps/api/src/routes/chat.ts — Integrate token counting and truncation
Changes in the main POST / handler:
a. Import countTokens, countMessageTokens, calculateHistoryBudget, truncateHistory from @repo/llm
b. Store tokenCount on user message (at the existing DB insert ~line 164):
```ts
metadata: { tokenCount: countMessageTokens({ role: "user", content: data.message }) }
```
c. Token-based truncation — insert after system prompt is fully built (~line 158) and before the LLM call (~line 170). This runs after the existing contextWindow message-count truncation:
```ts
const systemPromptTokens = countTokens(systemPrompt);
const historyBudget = calculateHistoryBudget(
  chatModel,
  systemPromptTokens,
  countMessageTokens({ role: "user", content: data.message }),
);
const currentMessage = chatMessages[chatMessages.length - 1]!;
const historyMessages = chatMessages.slice(0, -1);
const { messages: truncatedHistory, droppedCount } = truncateHistory(historyMessages, historyBudget);
chatMessages = [...truncatedHistory, currentMessage];
```
d. Store tokenCount on assistant messages — add tokenCount to metadata in both streaming and non-streaming DB inserts. Also store droppedMessageCount when > 0.
Metadata Shape After Changes
User message: `{ tokenCount: 42 }`
Assistant message: `{ tokenCount: 350, usage: {...}, ragChunksUsed: [...], droppedMessageCount: 3 }`
(`droppedMessageCount` only present when > 0)
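One way to assemble that metadata object so `droppedMessageCount` appears only when messages were actually dropped is a conditional spread. The type and function names below are illustrative, not the exact ones in `chat.ts`:

```typescript
// Illustrative metadata assembly; the shape mirrors the plan above.
type AssistantMetadata = {
  tokenCount: number;
  usage?: unknown;
  ragChunksUsed?: unknown[];
  droppedMessageCount?: number;
};

function buildAssistantMetadata(
  tokenCount: number,
  droppedCount: number,
): AssistantMetadata {
  return {
    tokenCount,
    // Conditional spread: the key is omitted entirely when nothing was dropped,
    // so the stored JSONB stays clean for the common case.
    ...(droppedCount > 0 ? { droppedMessageCount: droppedCount } : {}),
  };
}
```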
Verification
```sh
bun install   # Install gpt-tokenizer
bun run tsc   # Type check
bun run lint  # Lint
# Manual: send messages in long conversations, verify no 400 errors
```
If you need specific details from before exiting plan mode (like exact code snippets, error messages, or content you generated), read the full transcript at: ~/.claude/projects/-Users-challen-www-usingandunderstanding-ai/f3f0122c-eca0-405f-b34a-4b6fce605700.jsonl
I'll implement this plan step by step. Let me start by reading the relevant files to understand the current code.
Now let me also check the ChatMessage type to understand its shape.
Good, I have a clear picture of the codebase. Let me now implement the changes.
Step 1: Add gpt-tokenizer dependency.
Step 2: Create the tokenizer module.
Step 3: Export tokenizer utilities from index.ts.
Step 4: Integrate token counting and truncation in chat.ts.
First, update the import:
Now add token counting to the user message insert:
Now update the streaming assistant message insert to include tokenCount and droppedMessageCount:
Now update the non-streaming assistant message insert:
Step 5: Verify with tsc and lint.
Both tsc and lint pass cleanly.
All tasks complete. Here's a summary of the changes:
1. packages/llm/package.json — Added "gpt-tokenizer": "3.4.0" dependency.
2. packages/llm/src/tokenizer.ts (new) — Token counting utilities:
- `countTokens(text)` — counts tokens in a string using the `o200k_base` encoding
- `countMessageTokens(message)` — counts tokens for a ChatMessage plus ~3 tokens of ChatML overhead
- `getContextWindow(model)` — returns 128,000 for both models
- `calculateHistoryBudget(model, systemPromptTokens, currentMessageTokens, completionReserve?)` — computes tokens available for history
- `truncateHistory(historyMessages, tokenBudget)` — drops oldest messages to fit the budget, returns `{ messages, droppedCount }`
- `DEFAULT_COMPLETION_RESERVE = 4096`
3. packages/llm/src/index.ts — Added tokenizer exports block.
4. apps/api/src/routes/chat.ts — Integrated token counting and truncation:
- After message-count truncation and system prompt construction, computes a token budget and truncates history if needed
- User messages stored with `metadata: { tokenCount }`
- Assistant messages stored with `metadata: { tokenCount, ..., droppedMessageCount }` (`droppedMessageCount` only when > 0)
Can we add this information to the chat UI and create e2e tests to make sure it's being computed properly?
Good, I have a thorough understanding. Let me read the key files directly to see exact code details.
Now let me design the implementation plan.