Overview
The Shopping Agent is a real-time menu intelligence system that combines web scraping, database matching, and AI personalization into a single coherent pipeline. When a user taps Shop Now from a dispensary page, the system extracts the dispensary’s live menu, matches every product against the 16,000+ strain database, scores results against the user’s personal preference profile, and delivers ranked recommendations, typically within 15–60 seconds.
This page documents the complete technical architecture for developers working on the Shopping Agent pipeline.
System Architecture
Mobile App
│
├─ POST /api/v1/shopping/scan-token (menu scan flow)
│ ↓
│ Hono API (Edge)
│ ├─ Cache check (Supabase)
│ ├─ Dedup check (active Trigger.dev runs)
│ └─ Trigger.dev task trigger → returns publicAccessToken + runId
│
├─ WebSocket subscription (useRealtimeTaskTrigger)
│ ↓
│ Trigger.dev: shopping-menu-scan task
│ │
│ Stage 1: cache-check ──────────── 5%
│ Stage 2: scrape (Firecrawl) ───── 10–40%
│ Stage 3: match (pg_trgm) ──────── 40–60%
│ Stage 4: personalize (Claude) ─── 60–85%
│ Stage 5: cache-save (Supabase) ── 85–95%
│ Stage 6: complete ──────────────── 100%
│ │
│ └─ Task output → WebSocket → Mobile App renders results
│
└─ POST /api/v1/research/strains/queue-batch (discovery queue flow)
Authorization: Bearer <Clerk token>
body: { strainNames, source: "shopping_discovery" }
↓
Hono API (Edge)
↓
Trigger.dev: strain research pipeline
↓
Supabase strains_v2 (full profile added within hours)
Stage 1: Cache Check
Before any scraping begins, the task queries the menu_scans Supabase table for an unexpired entry matching the dispensary domain.
const cached = await supabase
.from('menu_scans')
.select('*')
.eq('website_domain', dispensaryDomain)
.gt('expires_at', new Date().toISOString())
.order('scanned_at', { ascending: false })
.limit(1)
.maybeSingle(); // unlike .single(), returns data: null without an error when no row matches
if (cached.data) {
// Return cached result immediately — skip all subsequent stages
return buildOutputFromCache(cached.data);
}
Cache TTL: 4 hours, shared across all users of the same dispensary. This means one user’s scan benefits every other High IQ user who visits that dispensary page within the same window.
Unique constraint: The table enforces UNIQUE (website_domain, categories_hash). If the menu composition has changed (new categories), a fresh scan overwrites the old entry.
Stage 2: Menu Extraction
Menu extraction uses a tool-based architecture built on Firecrawl. Instead of a single extraction method, the system provides five independent tools that the pipeline orchestrator chains with cascading fallback. Each tool is independently callable and testable.
| Tool | Function | Best For | Credits |
|---|---|---|---|
| Extract | extractProducts(urls) | Multi-page menus with wildcard patterns | 1 |
| Scrape+JSON | scrapeAndExtract(url) | Single-page menus with direct URL | 1 |
| Scrape | scrapeMenuPage(url) | Raw markdown for AI processing | 1 |
| Scrape+AI | extractProductsWithAI(markdown) | Fallback when structured extraction fails | ~0.05 (Claude) |
| Sitemap | discoverSitemapUrls(domain) | URL discovery for multi-page menus | 1 |
All tools are exported from @tiwih/trigger and live in packages/trigger/src/lib/firecrawl-agent.ts.
Tool 1: Extract (Multi-Page Wildcard)
The Extract tool uses Firecrawl’s extract() API with wildcard URL patterns to crawl an entire menu section and extract structured product data across multiple pages. This is the primary extraction method because most dispensary menus span multiple URLs (e.g., /shop/flower, /shop/edibles, /shop/vapes).
import { extractProducts, DISPENSARY_EXTRACT_SCHEMA } from '@tiwih/trigger';
const result = await extractProducts(
['https://dispensary.com/shop/*'], // Wildcard pattern
{
schema: DISPENSARY_EXTRACT_SCHEMA, // Plain JSON schema (not Zod!)
prompt: 'Extract all cannabis products...',
showSources: true,
timeout: 180,
}
);
// result.products: DispensaryProduct[]
// result.sources: string[] (URLs that were crawled)
The Firecrawl SDK’s isZodSchema() detection triggers a broken tryZodV4Conversion path with Zod v4. The DISPENSARY_EXTRACT_SCHEMA is a plain JSON schema object that bypasses this entirely. Never pass Zod schemas to the Extract API.
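To make the distinction concrete, here is an illustrative plain JSON-schema object of the kind that works with Extract. This is a sketch only; the real DISPENSARY_EXTRACT_SCHEMA in packages/trigger has more fields, but the key property is that it is a plain object with no Zod involved.

```typescript
// Illustrative shape only; the real DISPENSARY_EXTRACT_SCHEMA has more
// fields. A plain JSON-schema object bypasses the SDK's Zod detection.
const EXAMPLE_EXTRACT_SCHEMA = {
  type: 'object',
  properties: {
    products: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          name: { type: 'string' },
          category: { type: 'string' },
          price: { type: 'number' },
        },
        required: ['name'],
      },
    },
  },
  required: ['products'],
};
```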
Tool 2: Scrape+JSON (Single-Page)
For menus on a single URL, the Scrape+JSON tool combines scraping and extraction in one call — Firecrawl renders the page, then uses its built-in LLM to fill the schema.
import { scrapeAndExtract } from '@tiwih/trigger';
const result = await scrapeAndExtract(
'https://dispensary.com/shop/flower',
{ waitFor: 3000 } // Wait for JS rendering
);
// result.products: DispensaryProduct[]
Tools 3 & 4: Scrape + AI Fallback
When structured extraction returns 0 products (e.g., heavily JavaScript-rendered SPAs, age-gated sites), the pipeline falls back to scraping raw markdown and having Claude extract products from the text.
import { scrapeMenuPage, extractProductsWithAI } from '@tiwih/trigger';
const scrapeResult = await scrapeMenuPage(url, { waitFor: 3000 });
const products = await extractProductsWithAI(scrapeResult.markdown, {
maxProducts: 200,
});
// products: DispensaryProduct[]
Tool 5: Sitemap Discovery
The Sitemap tool uses Firecrawl’s map() to discover all URLs on a domain, then filters for menu-related paths. This is used as a fallback when the initial scan returns 0 products — it discovers additional menu URLs to retry extraction.
import { discoverSitemapUrls } from '@tiwih/trigger';
const sitemap = await discoverSitemapUrls('dispensary.com');
// sitemap.menuUrls: string[] — URLs matching /menu, /shop, /products, etc.
// sitemap.productUrls: string[] — Individual product page URLs
// sitemap.totalDiscovered: number — Total URLs found
Orchestrator: Cascading Fallback
The scanDispensaryMenu() orchestrator chains these tools with automatic fallback:
Extract (wildcard) ─── products > 0? ──→ Return
│ no
▼
Scrape+JSON ────────── products > 0? ──→ Return
│ no
▼
Scrape → AI ────────── products > 0? ──→ Return
│ no
▼
Sitemap Discovery ──── find menu URLs → Retry Extract/Scrape
Each result includes a method field ("extract", "scrape-json", or "scrape-ai") indicating which tool succeeded.
Benchmark Results
| Tool | Nuera Cannabis | Ascend Cannabis | Notes |
|---|---|---|---|
| Extract (wildcard /shop/*) | 392 products | — | Crawls all menu pages |
| Scrape+JSON (single URL) | 89 products | — | Single page only |
| Extract (specific URL) | 39 products | — | Single URL, no wildcard |
| Scrape+AI (Claude fallback) | 72 products | — | From raw markdown |
Why Plain JSON Schema, Not Zod
The Firecrawl JS SDK detects Zod schemas via isZodSchema() and attempts conversion through tryZodV4Conversion. With Zod v4 (used in this project), this conversion silently fails, causing Extract to return 0 products. Using a plain JSON schema object (DISPENSARY_EXTRACT_SCHEMA) bypasses the SDK’s Zod detection entirely.
Progress Metadata
The extraction stage emits progress metadata so the mobile UI can show a live status like “Scanning menu… found 47 products so far.”
task.updateMetadata({
stage: 'scrape',
progress: 10 + (extractedSoFar / estimatedTotal) * 30,
message: `Scanning menu... found ${extractedSoFar} products`,
productCount: extractedSoFar,
});
Test Scripts
Two test scripts are available in packages/trigger/:
# Test the orchestrator (cascading fallback)
pnpm test:shopping # Default: Nuera Cannabis
pnpm test:shopping --url <url> # Custom URL
pnpm test:shopping --sitemap # Test sitemap discovery
# Test all 3 Firecrawl extraction methods side by side
pnpm test:extract # Default: Nuera Cannabis /shop/flower
pnpm test:extract <url> scrape # Test only scrape+JSON mode
pnpm test:extract <url> wildcard # Test only wildcard extract mode
pnpm test:extract <url> specific # Test only specific-URL extract mode
Stage 3: Strain Matching
Every extracted product name is run through a three-tier matching algorithm against the strains_v2 Supabase table. Matches are attempted in order; the first successful match wins.
Tier 1: Exact Match
SELECT id, slug, name_display, high_family
FROM strains_v2
WHERE name_canonical = lower(trim($1))
LIMIT 1;
name_canonical is a pre-computed lowercase, stripped version of the strain name stored at ingestion time. This handles the most common case: the dispensary uses the standard strain name.
Confidence: high
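A minimal sketch of the canonicalization that could produce name_canonical at ingestion time. The exact normalization rules are an assumption (lowercase, trim, collapse internal whitespace); the real pipeline may strip additional characters.

```typescript
// Sketch only: assumed normalization for name_canonical.
// Lowercase, trim, and collapse runs of whitespace to a single space.
function canonicalizeName(name: string): string {
  return name.toLowerCase().trim().replace(/\s+/g, ' ');
}
// canonicalizeName('  OG  Kush ') → 'og kush'
```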
Tier 2: Slug Match
SELECT id, slug, name_display, high_family
FROM strains_v2
WHERE slug = slugify($1)
LIMIT 1;
slugify() converts a string to URL-safe format (lowercase, hyphens, no special characters). Catches cases like “OG Kush” → og-kush matching a database entry with slug og-kush.
Confidence: high
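A minimal slugify sketch showing the assumed behavior (the project may use a shared helper or a library with slightly different rules):

```typescript
// Sketch of slugify(): lowercase, drop special characters, hyphenate spaces.
function slugify(input: string): string {
  return input
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9\s-]/g, '') // drop special characters
    .replace(/\s+/g, '-') // spaces to hyphens
    .replace(/-+/g, '-'); // collapse repeated hyphens
}
// slugify('OG Kush') → 'og-kush'
```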
Tier 3: Trigram Similarity (pg_trgm)
SELECT id, slug, name_display, high_family,
similarity(name_canonical, lower(trim($1))) AS sim
FROM strains_v2
WHERE similarity(name_canonical, lower(trim($1))) > 0.4
ORDER BY sim DESC
LIMIT 1;
PostgreSQL pg_trgm breaks strings into trigrams (3-character substrings) and computes a similarity score from 0.0 to 1.0. A threshold of 0.4 is permissive enough to catch common dispensary name variations while filtering out genuinely unrelated strings.
Confidence: medium if similarity > 0.6, low if 0.4–0.6.
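The threshold logic above can be sketched as a small mapping function; the cutoffs mirror the TRIGRAM_THRESHOLD (0.4) and TRIGRAM_MEDIUM_CONFIDENCE (0.6) constants listed under Adjustable Limits:

```typescript
type MatchConfidence = 'high' | 'medium' | 'low';

// Maps a pg_trgm similarity score to the confidence levels described above.
function trigramConfidence(sim: number): MatchConfidence | null {
  if (sim <= 0.4) return null; // at or below threshold: no match
  return sim > 0.6 ? 'medium' : 'low';
}
```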
Unmatched Products
Products that pass through all three tiers without a match are added to the discoveries array. These represent real strains available locally that are not yet in the High IQ database.
Parallelism
Matching runs concurrently for all extracted products using Promise.all() batched in groups of 20 to avoid overwhelming the Supabase connection pool.
const BATCH_SIZE = 20;
const batches = chunk(extractedProducts, BATCH_SIZE);
const matchResults: Product[] = [];
for (const batch of batches) {
const batchResults = await Promise.all(
batch.map(product => matchProduct(product, supabase))
);
matchResults.push(...batchResults);
}
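The chunk() helper assumed by the batching loop above is a standard array-splitting utility; a minimal version:

```typescript
// Minimal chunk() sketch: split an array into fixed-size batches.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}
// chunk([1, 2, 3, 4, 5], 2) → [[1, 2], [3, 4], [5]]
```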
Stage 4: AI Personalization
Once all products are matched, Claude Sonnet generates personalized recommendations. The personalization step has two parts: deterministic tag assignment and AI recommendation generation.
Deterministic Tag Assignment
Tags are assigned by comparing the matched strain IDs against the user’s profile data passed in the request. This is pure logic — no AI involved.
function assignTags(product: Product, userContext: UserContext): PersonalizationTag[] {
const tags: PersonalizationTag[] = [];
if (userContext.favoriteStrainIds.includes(product.strainId)) {
tags.push('favorite_in_stock');
}
if (userContext.lowStashStrainIds.includes(product.strainId)) {
tags.push('running_low');
}
if (userContext.recentStrainIds.includes(product.strainId)) {
tags.push('bought_before');
}
if (isSimilarToFavorite(product, userContext)) {
tags.push('similar_to_favorite');
}
if (matchesPreferences(product, userContext)) {
tags.push('matches_preferences');
}
if (!product.strainId) {
tags.push('new_discovery');
}
if (isOnSale(product)) {
tags.push('great_deal');
}
return tags;
}
isSimilarToFavorite() uses the High IQ strain similarity scores (pre-computed and stored in Supabase) to find products with similar terpene and effect profiles to the user’s favorites.
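A hedged sketch of the similarity check, with the signature simplified for illustration: it assumes the pre-computed scores are available as a lookup keyed by a strain-ID pair, and the 0.8 threshold is an assumption, not the production value.

```typescript
// Sketch only: assumes pre-computed similarity scores keyed by `${idA}:${idB}`.
// The real implementation reads these from Supabase and may use a different
// threshold and key format.
function isSimilarToFavorite(
  product: { strainId: string | null },
  favoriteStrainIds: string[],
  similarity: Map<string, number>,
  threshold = 0.8,
): boolean {
  const id = product.strainId;
  if (!id) return false; // unmatched products cannot be similar to a favorite
  return favoriteStrainIds.some(fav => {
    // try both orderings of the unordered pair
    const score = similarity.get(`${fav}:${id}`) ?? similarity.get(`${id}:${fav}`) ?? 0;
    return score >= threshold;
  });
}
```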
AI Recommendation Generation
After tags are assigned, the full product list (with tags and strain data) is passed to Claude Sonnet. The AI selects the top 3–5 picks and writes a plain-English reason for each.
const recommendations = await generateObject({
model: anthropic('claude-sonnet-4-6'),
schema: RecommendationsSchema,
prompt: `
You are a knowledgeable cannabis advisor.
The user has the following preferences:
- Favorite strains: ${favoriteNames.join(', ')}
- Preferred types: ${userContext.preferredTypes.join(', ')}
- Recently purchased: ${recentNames.join(', ')}
Here are the available products at this dispensary:
${JSON.stringify(taggedProducts, null, 2)}
Select the 3-5 best products for this user and explain in 1-2 sentences why each
is a great match. Focus on the specific combination of tags and strain properties
that make it right for them. Be specific — mention actual terpene names, effects,
or price value when relevant.
`,
});
The generateObject() call uses AI SDK 6 with schema for structured output, ensuring the recommendations are always valid JSON.
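The structured-output contract can be illustrated with a hypothetical shape; the field names here are assumptions, and the real RecommendationsSchema lives alongside the task in the trigger package.

```typescript
// Hypothetical output shape; the real schema may differ.
interface Recommendation {
  productName: string;
  reason: string; // 1-2 sentence plain-English explanation
}

// Because generateObject() returns schema-validated JSON, invariants like
// the pick count (3-5) can be relied on without manual parsing.
function isValidPickCount(picks: Recommendation[]): boolean {
  return picks.length >= 3 && picks.length <= 5;
}
```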
Stage 5: Cache Save
Results are upserted into the menu_scans table with a 4-hour expiry.
await supabase.from('menu_scans').upsert({
website_domain: dispensaryDomain,
categories_hash: computeCategoriesHash(products),
scanned_at: new Date().toISOString(),
expires_at: addHours(new Date(), 4).toISOString(),
total_products: products.length,
matched_count: products.filter(p => p.matched).length,
unmatched_count: products.filter(p => !p.matched).length,
products_json: products,
recommendations_json: recommendations,
discoveries_json: discoveries,
}, {
onConflict: 'website_domain,categories_hash',
});
The categories_hash is an MD5 of the sorted category list. If the dispensary adds a new product category (e.g., starts selling topicals), the hash changes, triggering a fresh scan on the next request.
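A sketch of computeCategoriesHash under the description above: MD5 over the sorted, de-duplicated category list. The join delimiter and the product shape are assumptions.

```typescript
import { createHash } from 'node:crypto';

// Sketch of computeCategoriesHash: MD5 of the sorted, de-duplicated
// category list. The '|' join delimiter is an assumption.
function computeCategoriesHash(products: { category: string }[]): string {
  const categories = [...new Set(products.map(p => p.category))].sort();
  return createHash('md5').update(categories.join('|')).digest('hex');
}
```

Sorting before hashing makes the result independent of product order, so only a genuine change in the category set (e.g., a new topicals section) produces a new hash and invalidates the cache entry.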
Stage 6: Complete
The task returns the full output payload, which Trigger.dev delivers to the mobile app via WebSocket. The useRealtimeTaskTrigger hook in the app receives the completed run and triggers a state update to show the results screen.
De-duplication
The Hono API endpoint checks for active Trigger.dev runs before triggering a new one.
const activeRuns = await trigger.runs.list({
taskIdentifier: 'shopping-menu-scan',
status: ['EXECUTING', 'WAITING'],
metadata: { dispensaryDomain },
});
if (activeRuns.data.length > 0) {
// Return a token for the already-running scan
const token = await trigger.auth.createPublicToken({
scopes: { read: { runs: [activeRuns.data[0].id] } },
});
return { source: 'dedup', publicAccessToken: token, runId: activeRuns.data[0].id };
}
This prevents two users opening the same dispensary page at the same time from triggering two concurrent scans.
Mobile App Integration
Screens
| Screen | File | Description |
|---|---|---|
| ScanScreen | _screens/shopping/ScanScreen.tsx | Animated progress during scanning |
| ResultsScreen | _screens/shopping/ResultsScreen.tsx | Product list with category tabs and recommendations |
| DiscoveryScreen | _screens/shopping/DiscoveryScreen.tsx | Batch research queue for unmatched strains — calls Hono API directly |
Real-Time Hook
The mobile app uses useRealtimeTaskTrigger from @trigger.dev/react-hooks to subscribe to run progress and output without polling.
import { useRealtimeTaskTrigger } from '@trigger.dev/react-hooks';
import type { ShoppingMenuScanTask } from '@tiwih/trigger';
function ShoppingAgentScreen({ dispensary }) {
const { submit, runs } = useRealtimeTaskTrigger<typeof ShoppingMenuScanTask>(
'shopping-menu-scan'
);
const handleShopNow = async () => {
const { publicAccessToken, cachedResult, runId } = await api.shopping.scanToken({
dispensaryDomain: dispensary.domain,
menuUrl: dispensary.menuUrl,
userId: currentUser.id,
userContext: buildUserContext(favorites, stash, orders),
});
if (cachedResult) {
// Skip scan — go directly to results
setResults(cachedResult);
return;
}
// Subscribe to the live run
submit(
{ dispensaryDomain: dispensary.domain, menuUrl: dispensary.menuUrl },
{ publicAccessToken }
);
};
const activeRun = runs[0];
const metadata = activeRun?.metadata;
if (activeRun?.status === 'COMPLETED') {
return <ResultsScreen data={activeRun.output} />;
}
return (
<ScanScreen
stage={metadata?.stage}
progress={metadata?.progress ?? 0}
message={metadata?.message ?? 'Starting...'}
/>
);
}
Discovery Queue: Queueing Unmatched Strains for Research
When a user taps “Add to Research Queue” in the DiscoveryScreen, the app submits the unmatched strain names directly to the Hono API — no Convex middleman involved.
Why Direct API, Not Convex
Convex is the source of truth for user-owned data: orders, stash, favorites, and dispensaries. Strain research is a platform-level concern — the data ends up in Supabase (strains_v2) and benefits all users, not just the submitter. Routing it through Convex would violate the data layer boundary and add unnecessary latency.
Data Flow
DiscoveryScreen
│
├─ useAuth() from Clerk — obtain Clerk session token
│
└─ POST /api/v1/research/strains/queue-batch
{
strainNames: ["Strain A", "Strain B", ...],
source: "shopping_discovery"
}
Authorization: Bearer <clerk_token>
│
↓
Hono API (Edge) — validates auth, enqueues strains
│
↓
Trigger.dev strain research pipeline
│
↓
Supabase strains_v2 (full strain profile, typically within hours)
Source Type
The /queue-batch endpoint accepts a source field that identifies how the strain was discovered. The shopping_discovery value was added alongside the existing order_upload and manual values specifically for this flow.
// Source enum values for /api/v1/research/strains/queue-batch
type QueueBatchSource = 'order_upload' | 'manual' | 'shopping_discovery';
Implementation Details
// DiscoveryScreen.tsx (simplified)
import { useAuth } from '@clerk/clerk-expo';
const { getToken } = useAuth();
const submittingRef = useRef(false); // guard against double-submission
const handleQueueForResearch = async (strainNames: string[]) => {
if (submittingRef.current) return;
submittingRef.current = true;
try {
const token = await getToken();
const response = await fetch(
`${Config.API_BASE_URL}/api/v1/research/strains/queue-batch`,
{
method: 'POST',
headers: {
'Content-Type': 'application/json',
Authorization: `Bearer ${token}`,
},
body: JSON.stringify({
strainNames,
source: 'shopping_discovery',
}),
}
);
if (!response.ok) throw new Error('Queue request failed');
Alert.alert('Queued', `${strainNames.length} strain(s) queued for research.`);
} catch {
Alert.alert('Error', 'Could not queue strains. Please try again.');
} finally {
submittingRef.current = false;
}
};
The useRef guard prevents the user from double-submitting if they tap the button quickly while the request is in flight. Unlike useState, a ref update does not trigger a re-render, so the button can remain visually enabled for the next valid submission without a flash of disabled state.
The Trigger.dev strain research pipeline processes queued strains asynchronously. Full strain profiles (genetics, terpenes, effects, images) are typically available within a few hours of submission.
Configuration
Environment Variables
The following environment variables are required in apps/api/.env and must be set in the Vercel project settings for production.
| Variable | Required | Description |
|---|---|---|
| TRIGGER_SECRET_KEY | Yes | Trigger.dev secret key for triggering tasks and creating public tokens |
| FIRECRAWL_API_KEY | Yes | Firecrawl API key for the menu scanning agent |
| ANTHROPIC_API_KEY | Yes | Anthropic API key for Claude Sonnet personalization (or set AI_GATEWAY_API_KEY if routing through Vercel AI Gateway) |
| SUPABASE_URL | Yes | Supabase project URL |
| SUPABASE_SERVICE_ROLE_KEY | Yes | Supabase service role key for server-side reads and writes |
Trigger.dev Package Setup
The shopping-menu-scan task lives in the @tiwih/trigger package at packages/trigger/src/tasks/shopping-menu-scan.ts. It is deployed to the Trigger.dev cloud alongside the other pipeline tasks.
# Deploy trigger tasks (from monorepo root)
cd packages/trigger && npx trigger deploy
Adjustable Limits
| Constant | Default | Description |
|---|---|---|
| CACHE_TTL_HOURS | 4 | Menu scan cache duration |
| TRIGRAM_THRESHOLD | 0.4 | Minimum similarity score for a trigram match |
| TRIGRAM_MEDIUM_CONFIDENCE | 0.6 | Threshold above which trigram matches are rated medium confidence |
| MAX_RECOMMENDATIONS | 5 | Maximum AI recommendations returned |
| MATCH_BATCH_SIZE | 20 | Products matched concurrently per database round-trip |
Performance
| Operation | Typical Duration | Notes |
|---|---|---|
| Cache hit response | < 100ms | Full results from Supabase, no task triggered |
| Extract (wildcard) | 15–60 seconds | Crawls multiple pages; larger menus take longer |
| Scrape+JSON (single page) | 5–15 seconds | Single page render + LLM extraction |
| Scrape+AI (fallback) | 8–20 seconds | Scrape (3–8s) + Claude extraction (5–12s) |
| Sitemap discovery | 3–8 seconds | Firecrawl map() — used only as fallback |
| Strain matching (100 products) | 1–3 seconds | pg_trgm with GIN index is fast |
| Claude Sonnet personalization | 2–5 seconds | generateObject with schema guarantees structured output |
| Total (cold scan) | 15–60 seconds | P50 ~25s, P95 ~60s (Extract crawls more pages) |
The Supabase strains_v2 table has a GIN index on name_canonical for trigram searches: CREATE INDEX idx_strains_name_trgm ON strains_v2 USING GIN (name_canonical gin_trgm_ops). Without this index, trigram matching on 16,000 rows would be too slow for the pipeline.
Observability
All pipeline stages emit structured logs via @tiwih/logger with the shopping category. Trigger.dev’s dashboard shows per-run stage durations, metadata snapshots, and task output — making it straightforward to identify where time is spent on any given scan.