Plan: Improve page capture deduplication and junk filtering
Implement the following plan:
Context
The Tampermonkey userscript captures whatever page is active, including auth iframes and tracking frames from NYTimes and Substack. It also doesn't prevent duplicate captures of the same URL. Production currently has 4 junk captures and 1 duplicate to clean up.
Changes
1. Userscript: Filter junk URLs before capture
File: `apps/api/src/routes/admin/capture-tokens.ts` (the userscript is generated inline around line 103)
Add a blocklist check in the `GM_registerMenuCommand` handler before sending. Skip the capture and notify the user if the URL matches known junk patterns:
- URLs containing `/auth/` or `/session` path segments (e.g., `myaccount.nytimes.com/auth/prefetch-assets`)
- Substack internal frames (`substack.com/session-attribution-frame`, `substack.com/channel-frame`)
- Any URL where the page content is tiny (< 1KB of `innerHTML`), which catches empty iframes
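The check described above might look like this. This is a sketch only: the function names and the exact regex patterns are illustrative, not the generated script's actual code.

```typescript
// Hypothetical junk-URL filter for the capture handler (illustrative names).
const JUNK_PATTERNS: RegExp[] = [
  /\/auth\//, // e.g. NYT auth prefetch frames
  /\/session/, // session/tracking path segments
  /substack\.com\/(session-attribution-frame|channel-frame)/,
];

function isJunkUrl(url: string): boolean {
  return JUNK_PATTERNS.some((pattern) => pattern.test(url));
}

function isTinyPage(html: string): boolean {
  // Pages with under 1KB of innerHTML are almost always empty iframes.
  return html.length < 1024;
}

function shouldSkipCapture(url: string, html: string): boolean {
  return isJunkUrl(url) || isTinyPage(html);
}
```

In the menu-command handler, the script would call `shouldSkipCapture(location.href, document.body.innerHTML)` and show a notification instead of POSTing when it returns true.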
2. API: Deduplicate by URL
File: `apps/api/src/routes/captures.ts`
Before inserting a new capture, check if a capture with the same URL already exists for this user. If so, return the existing capture's ID with a 200 status instead of creating a duplicate. This is a simple SELECT before the INSERT.
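The SELECT-before-INSERT flow could be sketched like this, assuming a thin data-access layer; the names (`findCapture`, `insertCapture`) and the 201 status for new rows are illustrative, not the route's actual code.

```typescript
// Hypothetical capture-creation logic with URL dedup (illustrative db layer).
interface CaptureRow {
  id: string;
  userId: string;
  url: string;
}

interface Db {
  findCapture(userId: string, url: string): CaptureRow | undefined;
  insertCapture(userId: string, url: string): CaptureRow;
}

function createCapture(
  db: Db,
  userId: string,
  url: string,
): { status: number; id: string } {
  // SELECT first: if this user already captured this URL, reuse the row.
  const existing = db.findCapture(userId, url);
  if (existing) {
    return { status: 200, id: existing.id };
  }
  const row = db.insertCapture(userId, url);
  return { status: 201, id: row.id }; // 201 for newly created is an assumption
}
```

Note that SELECT-then-INSERT has a small race window under concurrent requests; a unique index on `(userId, url)` plus an upsert would close it, but for a single-user capture flow the simple check is likely fine.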
3. Clean up production junk
Delete the 5 junk/duplicate rows from prod:
- 2x `myaccount.nytimes.com/auth/prefetch-assets` (IDs: `d583cedb...`, `1ca6de9c...`)
- 2x Substack internal frames: `substack.com/session-attribution-frame` and `substack.com/channel-frame` (IDs: `57024aa2...`, `7691b264...`)
- 1x duplicate NYTimes data centers article (ID: `d10e5f36...`; keep the newer one, `3d3b0bac...`)
Then pull the 3 good captures to `content/spring2026/readings/`.
Verification
- `bun run tsc` — type check
- `bun run lint` — lint
- Inspect the generated userscript at `/api/admin/capture-tokens/script.user.js` to verify the blocklist and size check are present
- `bun run captures:pull prod` — pull cleaned captures
If you need specific details from before exiting plan mode (like exact code snippets, error messages, or content you generated), read the full transcript at: ~/.claude/projects/-Users-challen-www-usingandunderstanding-ai/9f8f46e1-416e-4099-8c5c-49b5fd6d6076.jsonl
If this plan can be broken down into multiple independent tasks, consider using the TeamCreate tool to create a team and parallelize the work.
Let me read the relevant files first.
Now let me implement the changes. I'll do the userscript blocklist and the API dedup in parallel.
1. Userscript: Add blocklist and size check
2. API: Add dedup check before insert
Now let me run the type check and lint in parallel.
Both pass cleanly. Now for the production cleanup — let me delete the 5 junk/duplicate rows.
0 rows deleted — the IDs may have already been cleaned up, or they might be slightly different. Let me check what's actually in prod.
The IDs are slightly different from the plan (UUIDs were truncated). Now I have the real IDs. Delete the 5 junk/duplicate rows (keep 3d3b0bac as the newer NYT article, delete d10e5f36).
All 5 junk/duplicate rows deleted. Now pull the 3 remaining good captures.
All done. Here's a summary of what was done:
Userscript changes (`capture-tokens.ts`):
- Added a blocklist that filters URLs containing `/auth/`, `/session`, and specific Substack iframe URLs
- Added a content size check (< 1KB) to catch empty iframes
- Shows a notification when a capture is skipped
API dedup (`captures.ts`):
- Before inserting, checks if the same URL was already captured by the same user
- Returns the existing capture ID with `200` instead of creating a duplicate
Production cleanup:
- Deleted 5 rows: 2 NYT auth prefetch pages, 2 Substack internal frames, 1 duplicate NYT article
- Pulled 3 remaining good captures to `content/spring2026/readings/`
I'm confused as to what was causing the script to capture those junk URLs in the first place. Can you explain further?
The userscript uses `@match *://*/*`, which means it's injected into every page the browser loads, including iframes embedded within the page you're actually reading.
When you click "Capture This Page" on a NYTimes article, Tampermonkey runs the capture handler in every frame where the script is active. So the main article fires a capture, but so do the invisible iframes that NYTimes and Substack embed for auth/tracking purposes (like `myaccount.nytimes.com/auth/prefetch-assets`). Each frame has its own document and location, so each one sends its own POST to the API independently.
That's why you got captures in pairs — the article plus its auth iframe, both captured at nearly the same timestamp (e.g., 00:41:12.384 and 00:41:12.537). The Substack frames (`channel-frame`, `session-attribution-frame`) came from reading the Hyperdimensional post the same way.
The blocklist now prevents these iframe captures client-side before they're sent, and the server-side dedup prevents duplicates if somehow two captures of the same URL still get through.
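An orthogonal guard worth knowing about (a sketch, not necessarily what the script does) is to run the handler only in the top frame, which suppresses all iframe captures regardless of URL:

```typescript
// Sketch: skip the capture handler entirely when running inside an iframe.
// Tampermonkey also supports a `@noframes` metadata directive that keeps the
// script out of frames altogether; this is the runtime equivalent.
function isTopFrame(win: { self: unknown; top: unknown }): boolean {
  // In the top-level browsing context, window.self === window.top.
  return win.self === win.top;
}

// In the capture handler: if (!isTopFrame(window)) return;
```

The URL blocklist is still useful on top of this, since it also catches junk pages captured deliberately (e.g., an auth page opened in its own tab).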
Ah, gotcha. That's why it says 3 in the Greasemonkey bar? That's a good tell. OK. Did we examine the new captures and clean them up as needed?
Good point — let me check what was pulled.
The captures look good content-wise. A few cleanup issues:
- NYT article has `Advertisement`/`SKIP ADVERTISEMENT` junk at the top and bottom
- NYT article slug is very long: `opinion-americans-are-trying-to-stop-data-centers-because-they-can-t-stop-a-i-nytimes-2c47`
- All three have `requiresAuth: true` — the New Yorker and NYT are paywalled so that's correct, but the Substack (Hyperdimensional) is free
Let me also check if there's a pattern for existing reading slugs/frontmatter.
Here's what I see comparing the new captures to the existing readings:
Issues to clean up:
- NYT data centers article — has `Advertisement` and `SKIP ADVERTISEMENT` junk at top (lines 10-13) and bottom (lines 153-155). The slug is also very long.
- Missing frontmatter fields — the existing readings have `date`, `readingTime`, `sourceDetail`, and sometimes `subtitle` and `authorUrl`. The new captures only have `title`, `slug`, `author`, `source`, `sourceUrl`, and `requiresAuth`.
- Hyperdimensional (Clawed) — `requiresAuth: true`, but this is a free Substack post; should probably be `false`.
Want me to clean these up? Specifically:
- Strip the NYT ad junk from the data centers article
- Shorten that slug (e.g., `data-centers-ai-regulation-nytimes-2c47`)
- Set `requiresAuth: false` on the Hyperdimensional piece
- Add missing frontmatter fields (`date`, `readingTime`, etc.) to all three
Yes.
Now rename the NYT file to match the new slug.
Now verify everything builds cleanly.
All clean. Here's what I fixed across the three captures:
- NYT data centers: Stripped `Advertisement`/`SKIP ADVERTISEMENT` junk from top and bottom, removed the newsletter intro line, dropped "Opinion |" from title, shortened slug to `data-centers-ai-regulation-nytimes-2c47`, renamed file to match
- Hyperdimensional (Clawed): Set `requiresAuth: false` since it's a free Substack
- All three: Added `date`, `sourceDetail`, and `readingTime` frontmatter to match existing reading conventions
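Taken together, the frontmatter described above could be typed roughly like this. The field names come from the discussion; the types, optional markers, and example values are guesses, not the content pipeline's actual schema.

```typescript
// Hypothetical frontmatter shape for a reading (field names from the
// discussion above; types and example values are assumptions).
interface ReadingFrontmatter {
  title: string;
  slug: string;
  author: string;
  source: string;
  sourceUrl: string;
  requiresAuth: boolean;
  date: string; // assumed to be an ISO date string
  readingTime: number; // assumed to be minutes
  sourceDetail: string;
  subtitle?: string; // only some readings have these
  authorUrl?: string;
}

// Illustrative values only, not the actual capture's frontmatter.
const example: ReadingFrontmatter = {
  title: "Clawed",
  slug: "clawed-hyperdimensional", // placeholder slug
  author: "Ball",
  source: "Hyperdimensional",
  sourceUrl: "https://example.com/placeholder",
  requiresAuth: false, // free Substack post
  date: "2026-01-01", // placeholder
  readingTime: 10, // placeholder
  sourceDetail: "Substack",
};
```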
Your task
First, run `git status`, `git diff`, and `git log --oneline -10` to understand the current state.
Then add and commit all modified files with a descriptive message. Break the changes into multiple commits if appropriate. If there are untracked files, ask if they should be added or added to the appropriate `.gitignore` file. The user may also want to take over at this point, so provide an abort option as well. Always check changes for unnecessary debugging statements or logging that might have been accidentally included. Prompt the user if you find any, and offer to remove each one by one. Do not use interactive git staging (`git add -p`) as it does not work.
No debug logging. The 3 new reading files should be added. Two logical commits — one for the API changes, one for the readings.
Two commits:
- `408caec` — API/userscript: junk filtering + deduplication
- `c1bfd83` — Three new readings (Cassidy/New Yorker, Ball/Hyperdimensional, Wallace-Wells/NYT)
Your task
First, run `git status`, `git diff`, and `git log --oneline -10` to understand the current state.
Then add and commit all modified files with a descriptive message. Break the changes into multiple commits if appropriate. If there are untracked files, ask if they should be added or added to the appropriate `.gitignore` file. The user may also want to take over at this point, so provide an abort option as well. Always check changes for unnecessary debugging statements or logging that might have been accidentally included. Prompt the user if you find any, and offer to remove each one by one. Do not use interactive git staging (`git add -p`) as it does not work.
Clean version bump across all packages, no debug logging. One commit.
Done — `69e293b` bumps all packages to 2026.3.9.