Data Sources - High IQ Docs

Overview

High IQ’s strain database contains over 16,000 cannabis strains with detailed profiles covering genetics, effects, terpenes, cannabinoids, grow information, and more. This data does not come from a single source — it is aggregated, normalized, and enriched through multiple pipelines that run continuously. This page explains where our data originates, how it flows through the platform, and what quality controls ensure accuracy.

Primary Data Sources

Supabase Strain Database

The canonical strain database lives in Supabase (PostgreSQL) and serves as the single source of truth for all strain information across the platform. Every API response, website page, and mobile app screen pulls from this centralized store. The database includes:

Strain profiles — Name, aliases, genetics (parent strains), breeder information, strain type (indica/sativa/hybrid percentages)
Chemical data — THC, CBD, CBG, CBN, THCV, CBC percentages with lab verification flags
Terpene profiles — Individual terpene concentrations in mg/g where available, plus qualitative aroma descriptors
Effects — Reported effects with frequency data (e.g., “relaxed” reported by 78% of users)
Growing information — Flowering time, yield, difficulty, indoor/outdoor suitability
Media — Strain images, descriptions, and educational content

Competitor Monitoring

High IQ runs automated weekly monitoring of major cannabis platforms to identify new strains and track market trends:

Source	Strain Count	Sync Frequency	Method
Leafly	~4,500 strains	Weekly (Sunday 3 AM CT)	Public sitemap extraction
AllBud	~15,000 strains	Weekly (Sunday 3 AM CT)	Public sitemap extraction

The competitor monitoring pipeline works as follows:

Sitemap Extraction

Public sitemaps are parsed to discover strain URLs. Only strain page URLs are extracted — no scraping of copyrighted content occurs.
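
As a rough illustration of this step, the sketch below parses a sitemap in the standard sitemaps.org format and keeps only URLs that look like strain pages. The function names, the `/strains/` path marker, and the example.com URLs are assumptions made for the example; they are not High IQ's actual code or any real source's URL layout.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_strain_urls(sitemap_xml: str, path_marker: str = "/strains/") -> list[str]:
    """Return every <loc> URL containing path_marker (a hypothetical default)."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if path_marker in url:
            urls.append(url)
    return urls


def strain_name_from_url(url: str) -> str:
    """Turn the last path segment (the slug) into a display-style name."""
    slug = url.rstrip("/").rsplit("/", 1)[-1]
    return slug.replace("-", " ").title()


# Illustrative sitemap; example.com stands in for a real source.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/strains/blue-dream</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/strains/og-kush</loc></url>
</urlset>"""
```

Calling `extract_strain_urls(SAMPLE_SITEMAP)` keeps the two strain URLs and drops the `/about` page. Only the URL is read; page content is never fetched.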

Deduplication

Discovered strain names are normalized and compared against the existing database to identify genuinely new strains versus name variations of existing entries.

Status Tracking

Each competitor strain receives a status: pending (new discovery), queued (ready for processing), processed (enriched and added), skipped (duplicate or insufficient data), or duplicate (exact match found).

Queue Prioritization

New strain candidates enter the unprocessed strain queue, sorted by popularity so high-demand strains are enriched first.
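
The status values and popularity ordering described above could be modeled along these lines. This is a minimal sketch: the `Candidate` class, the integer `popularity` score, and `next_batch` are hypothetical names, and the source does not say how popularity is measured.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"      # new discovery
    QUEUED = "queued"        # ready for processing
    PROCESSED = "processed"  # enriched and added
    SKIPPED = "skipped"      # duplicate or insufficient data
    DUPLICATE = "duplicate"  # exact match found


@dataclass
class Candidate:
    name: str
    popularity: int  # hypothetical score; higher means more demand
    status: Status = Status.PENDING


def next_batch(candidates: list[Candidate], size: int) -> list[Candidate]:
    """Pick the most popular queued candidates for enrichment."""
    queued = [c for c in candidates if c.status is Status.QUEUED]
    queued.sort(key=lambda c: c.popularity, reverse=True)
    return queued[:size]
```

Candidates that are still pending, or already processed, skipped, or marked duplicate, are ignored by `next_batch` regardless of their popularity.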

Competitor monitoring extracts only strain names from public sitemaps. We do not copy descriptions, images, or other copyrighted content from competitor platforms. All strain profile content in High IQ is independently sourced and written.

Research Pipelines

Three automated research pipelines enrich the database with scientific and media content:

Strain Research Pipeline

The strain research pipeline is a multi-stage process powered by Trigger.dev that runs in the cloud with no timeout limits:

Data collection — Gathers strain genetics, chemical profiles, and effect data from multiple public sources
AI enrichment — Generates comprehensive descriptions, effect summaries, and educational content
Quality validation — Automated quality gates ensure data meets minimum completeness thresholds
Database insertion — Validated data is written to the production database

Each pipeline run can process strains for over 2 hours without interruption, with per-stage visibility and granular retry capabilities.
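
One way to picture the quality validation stage is a completeness score checked against a threshold. The sketch below is an assumption-laden example: the required field names and the 0.8 threshold are invented for illustration, since the source does not publish the actual gate criteria.

```python
# Hypothetical set of fields a strain record is expected to carry.
REQUIRED_FIELDS = ("name", "genetics", "thc_percent", "effects", "description")

# Values treated as "missing". A numeric 0 is NOT missing (e.g. 0% THC).
EMPTY_VALUES = (None, "", [], {})


def completeness(record: dict) -> float:
    """Fraction of required fields that are present and non-empty."""
    filled = sum(1 for field in REQUIRED_FIELDS if record.get(field) not in EMPTY_VALUES)
    return filled / len(REQUIRED_FIELDS)


def passes_quality_gate(record: dict, threshold: float = 0.8) -> bool:
    """True when the record is complete enough to insert (threshold is assumed)."""
    return completeness(record) >= threshold
```

A record that fails the gate would stay out of the production database until a later run fills in the missing fields.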

Paper Research Pipeline

A daily automated pipeline aggregates cannabis research papers:

Search — Queries PubMed for recent cannabis research publications
Deduplication — Filters out papers already in the database
Summarization — AI generates plain-language summaries of each paper
Quality Gate — Ensures summaries are accurate and educational
Imaging — Generates thumbnail images for each paper
Save — Validated papers are stored for display in the research hub

This pipeline runs daily at 7 AM Central Time, keeping the research hub current with the latest published science.
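
The deduplication step can be sketched as a filter over paper identifiers. This example assumes papers are keyed by their PubMed ID in a `pmid` field; the source does not state which key the pipeline actually uses, and `filter_new_papers` is a hypothetical name.

```python
def filter_new_papers(candidates: list[dict], known_pmids: set[str]) -> list[dict]:
    """Drop papers already stored, plus repeats within the same search result."""
    seen = set(known_pmids)  # copy so the caller's set is not mutated
    fresh = []
    for paper in candidates:
        pmid = paper["pmid"]
        if pmid in seen:
            continue
        seen.add(pmid)
        fresh.append(paper)
    return fresh
```

Only the papers returned here would continue on to summarization, which keeps repeat daily runs from summarizing the same paper twice.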

Strain Music Pipeline

A background pipeline generates a unique AI-composed song for each strain using Google Lyria 3 Pro:

Analyze — Fetches strain data, chemical profile, and images
Build Prompt — Gemini 3.1 Flash Lite generates a music generation prompt based on the strain’s character
Generate — Lyria 3 Pro creates a unique audio track ($0.08/song)
Upload — Track is stored in Supabase Storage (strain-music bucket) and linked to the strain record

The pipeline auto-triggers after the strain research pipeline completes. Tracks are available in three places: strain detail pages in the mobile app; the website's Music Hub at /music, which can be browsed by genre, mood, or curated playlist; and an embedded audio player on each strain's detail page.
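
For a sense of scale, the per-song price above can be turned into a back-of-the-envelope estimate. The sketch assumes exactly one generation per strain and ignores regenerations, prompt-building costs, and storage; `generation_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Per-track price quoted in the Generate step above.
COST_PER_SONG = Decimal("0.08")


def generation_cost(track_count: int) -> Decimal:
    """Total generation cost in dollars, assuming one track per strain."""
    return COST_PER_SONG * track_count
```

Under those assumptions, covering 16,000 strains works out to 16,000 × $0.08 = $1,280.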

User Contributions

High IQ users contribute data in several ways:

Label Scanner

Users scan dispensary labels with the AI-powered label scanner, which extracts terpene profiles, THC/CBD percentages, and strain names. This real-world lab data enriches existing strain profiles.

Stash Data

When users add strains to their stash with dispensary information and pricing, this aggregated data helps track strain availability and market pricing trends.

Ratings & Reviews

User ratings and effect reports contribute to the engagement component of strain scoring and help validate effect profiles.

Strain Submissions

Users can submit strains not found in the database. Submissions enter the processing queue for verification and enrichment.

Every label scan contributes to the collective knowledge base. Even if a strain is already in the database, your scan may add terpene data from a different grower or batch, improving the overall profile.

Data Flow Architecture

The following shows how data moves through the High IQ platform:

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Competitor      │     │  Research         │     │  User            │
│  Monitoring      │     │  Pipelines        │     │  Contributions   │
│  (Weekly)        │     │  (Daily/On-demand)│     │  (Real-time)     │
└────────┬────────┘     └────────┬─────────┘     └────────┬────────┘
         │                       │                         │
         ▼                       ▼                         ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Supabase Database                              │
│                    (Single Source of Truth)                       │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │    Hono API      │
                    │    (Edge Cache)  │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
        ┌──────────┐  ┌──────────┐  ┌──────────┐
        │ Website  │  │ Mobile   │  │ Third    │
        │          │  │ App      │  │ Party    │
        └──────────┘  └──────────┘  └──────────┘

Data Quality Controls

Normalization

Raw strain data arrives in inconsistent formats. The @tiwih/api-normalizers package standardizes:

Strain names — Consistent capitalization, removal of special characters, alias resolution (e.g., “GSC” maps to “Girl Scout Cookies”)
Terpene names — Standardized to canonical names (e.g., “b-caryophyllene” becomes “caryophyllene”)
Effect names — Mapped to a controlled vocabulary to prevent duplicates like “relaxed” vs. “relaxing”
Cannabinoid values — Converted to consistent percentage format with validation ranges
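
The four rules above can be illustrated with a small Python sketch. It does not reproduce the @tiwih/api-normalizers API; every function name and lookup table here is a hypothetical stand-in, seeded only with the examples given in the list.

```python
import re
import string

# Tiny illustrative lookup tables; a real vocabulary would be far larger.
ALIASES = {"gsc": "Girl Scout Cookies"}
TERPENES = {"b-caryophyllene": "caryophyllene"}
EFFECTS = {"relaxing": "relaxed"}


def normalize_strain_name(raw: str) -> str:
    """Strip special characters, collapse spaces, resolve aliases, capitalize."""
    cleaned = re.sub(r"[^A-Za-z0-9# ]+", " ", raw)
    cleaned = " ".join(cleaned.split())
    alias = ALIASES.get(cleaned.lower())
    # capwords lowercases the rest of each word, so acronyms such as "OG"
    # would need an exception list in a real implementation.
    return alias if alias else string.capwords(cleaned)


def normalize_terpene(raw: str) -> str:
    key = raw.strip().lower()
    return TERPENES.get(key, key)


def normalize_effect(raw: str) -> str:
    key = raw.strip().lower()
    return EFFECTS.get(key, key)


def normalize_percent(raw) -> float:
    """Accept values like 24.5 or "24.5%" and reject anything outside 0-100."""
    value = float(str(raw).strip().rstrip("%"))
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"percentage out of range: {raw!r}")
    return round(value, 2)
```

With these tables, "GSC" resolves to "Girl Scout Cookies", "b-caryophyllene" to "caryophyllene", and "relaxing" to "relaxed", matching the examples in the list.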

Deduplication

Multiple layers of deduplication prevent duplicate strain records from being created for the same cultivar:

Name matching — Fuzzy matching catches common variations (e.g., “Blue Dream” vs. “BlueDream” vs. “Blue Dream #1”)
Genetic matching — Strains with identical parent lineage are flagged for manual review
Competitor cross-referencing — The competitor monitoring pipeline tracks which strains have already been imported
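
The name and genetic matching layers might look something like the sketch below, which uses Python's standard `difflib` for similarity scoring. The matching algorithm High IQ actually uses is not documented here, so the key-building rules, the 0.85 threshold, and all function names are assumptions.

```python
import re
from difflib import SequenceMatcher


def match_key(name: str) -> str:
    """Reduce a name to a comparison key: lowercase, no '#1'-style suffix, no punctuation."""
    lowered = name.lower()
    without_pheno = re.sub(r"#\s*\d+", "", lowered)
    return re.sub(r"[^a-z0-9]", "", without_pheno)


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the comparison keys of two names."""
    return SequenceMatcher(None, match_key(a), match_key(b)).ratio()


def is_probable_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Flag two names as likely variants of one strain (threshold is assumed)."""
    return similarity(a, b) >= threshold


def same_lineage(parents_a: list[str], parents_b: list[str]) -> bool:
    """True when both strains list the same parents, in any order."""
    key_a = frozenset(match_key(p) for p in parents_a)
    key_b = frozenset(match_key(p) for p in parents_b)
    return bool(key_a) and key_a == key_b
```

Here "Blue Dream", "BlueDream", and "Blue Dream #1" all reduce to the same key, so they score as identical. Dropping the numeric suffix is a deliberate simplification: it would also merge phenotypes that a breeder considers distinct, which is one reason a lineage match is only flagged for manual review rather than merged automatically.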

Freshness

Data freshness varies by source:

Data Type	Update Frequency	Staleness Threshold
Competitor discoveries	Weekly	7 days
Research papers	Daily	24 hours
User label scans	Real-time	Immediate
Strain enrichment	On-demand via pipeline	Varies by queue position
YouTube videos (by slug)	On-demand; falls back to YouTube search URL when quota is exhausted	30 days
YouTube videos (by ID)	Short-lived cache for mobile strain pages	15 minutes
AI-generated music tracks	Generated once per strain via Lyria 3 Pro; versioned on regeneration	Permanent (updated on-demand)

Database Statistics

The strain database currently contains:

16,000+ total strain profiles
11,500+ candidate strains from competitor monitoring awaiting enrichment
6 High Families classifying strains by terpene profile
20+ terpenes tracked with concentration data
6+ cannabinoids tracked per strain
Daily research paper ingestion from PubMed

Strain data is provided for informational purposes only. Effects and potency can vary based on growing conditions, individual tolerance, and consumption method. Always start low and go slow.

Frequently Asked Questions

Do you scrape competitor websites?

No. We extract only strain names from publicly available sitemaps (the same data search engines use to index pages). We do not copy descriptions, images, reviews, or any copyrighted content. All High IQ strain profiles are independently written and sourced.

How can I report incorrect strain data?

Contact us at support@thisiswhyimhigh.com with the strain name and the specific data you believe is incorrect. We investigate all reports and update our database accordingly.

Is lab data verified?

Where possible, we include data sourced from Certificates of Analysis (COAs) submitted through the label scanner. However, we cannot independently verify all lab results. Look for the lab-verified badge on strain profiles to identify data backed by COA submissions.

Can dispensaries submit their strain data?

We are exploring partnerships with dispensaries for direct data feeds. Contact support@thisiswhyimhigh.com if your dispensary is interested.
