💡 TL;DR: We test your brand against 20–50 category-relevant prompts daily, across 4 AI platforms via official APIs, normalize for response variability, and report rolling 7-day averages. No browser sessions. No "influenced by your own searches." Here are the details.
Why Methodology Transparency Matters
The AI visibility monitoring market has a trust problem. Teams try two or three tools and get different numbers from each. No one explains why the numbers differ. No one publishes their methodology. The result: buyers reasonably conclude that the numbers are meaningless, and stop buying.
The criticism you see in forums is legitimate: "You just test a few prompts and the AI answers differently every time anyway." That's a real technical challenge — LLM responses are non-deterministic, and there's no standard for what "AI visibility" means yet. Different tools make different tradeoffs in how they address it. Without knowing what those tradeoffs are, you can't evaluate whether a tool's data is useful for your situation.
We're publishing our methodology because we think you deserve to evaluate it. If our approach doesn't fit your use case, you should know that before signing up. And if it does fit, you'll understand exactly what you're measuring — which makes the data more useful, not less.
Which Platforms We Monitor
We test brand visibility across all four major AI platforms that drive significant buyer discovery in 2026:
ChatGPT (OpenAI)
We use the OpenAI API (gpt-4o as primary model) with temperature=0 for consistency. We do not use the ChatGPT web interface — API calls are isolated, not influenced by browsing history or account context.
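To make "isolated" concrete, here's a simplified sketch of that call pattern using the openai Python SDK. The prompt string is a placeholder, and production code adds retries, rate limiting, and response logging:

```python
# Simplified sketch of an isolated, deterministic ChatGPT query.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_chatgpt(prompt: str) -> str:
    """One fresh-context call: no system prompt, no session history."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # minimizes sampling variance across repeat runs
        messages=[{"role": "user", "content": prompt}],  # the prompt is the entire input
    )
    return response.choices[0].message.content
```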
Perplexity
We use Perplexity's API with the sonar-pro model, which reflects the same retrieval-augmented responses users see in the Perplexity web interface. Perplexity's live web retrieval makes it the most dynamic of the four platforms.
Google AI Overviews
We test AI Overview appearance via Google Search API queries in a clean, logged-out context. Results reflect organic AI Overview generation, not personalized search. Coverage depends on Google's AI Overview availability for each query type.
Claude (Anthropic)
We use the Anthropic API (claude-3-5-sonnet) with temperature=0. Claude's training data has a knowledge cutoff, which means it reflects brand awareness established before that cutoff rather than real-time web state.
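Claude follows the same isolation pattern through the anthropic Python SDK. A simplified sketch; the model string is illustrative, and the Messages API requires an explicit max_tokens cap:

```python
# Simplified sketch of the equivalent isolated Claude query.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def query_claude(prompt: str) -> str:
    """Same fresh-context pattern as the ChatGPT call above."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model string
        max_tokens=1024,  # the Messages API requires an explicit cap
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```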
⚠️ Important context: Each platform has different architecture. ChatGPT and Claude draw primarily from training data. Perplexity retrieves live web content. Google AI Overviews blend search index data with generative responses. Your visibility may differ significantly across platforms — this is expected, not a data error.
How We Build and Test Prompts
This is where most tools are vague. We're not.
Prompt Categories
For each monitored brand and product category, we run three types of prompts:
- Category discovery prompts — "What are the best [category] tools in 2026?" / "What tools do you recommend for [use case]?" These simulate buyer intent: someone who has a problem and is asking AI what to use.
- Explicit brand prompts — "What do you know about [brand name]?" / "What is [brand] and who is it for?" These test knowledge depth — whether the AI has enough information about your brand to describe it accurately.
- Comparison prompts — "[Brand A] vs [Brand B]" / "What are the differences between [your brand] and [competitor]?" These test competitive positioning in AI-generated comparisons.
How Many Prompts?
For each monitored brand, we run 20–50 prompts per platform per day, depending on the category size and number of competitors configured. Prompt sets are reviewed and updated monthly to reflect how buyer language evolves.
Prompt Isolation
Every API call uses a fresh context — no session history, no prior messages, no system prompt that could bias the response. Each prompt is the complete, isolated input. This eliminates the concern about "AI being influenced by your own searches" — we're not using browser sessions or logged-in accounts where browsing history could affect results.
Prompt Templates
We use templated prompt structures rather than ad-hoc generation, ensuring consistency across daily runs. Prompts within a category are identical day-to-day unless we push a deliberate update (which we version and log). This means day-over-day comparisons are valid — you're comparing results from the same prompts, not different phrasings.
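As a simplified illustration of what "templated and versioned" means in practice, here's roughly how a prompt set expands from templates. The template strings and version tag below are illustrative, not our production set:

```python
# Illustrative sketch of versioned prompt templates.
PROMPT_SET_VERSION = "2026-02"  # bumped and logged on each deliberate update

TEMPLATES = {
    "category_discovery": [
        "What are the best {category} tools in 2026?",
        "What tools do you recommend for {use_case}?",
    ],
    "explicit_brand": [
        "What do you know about {brand}?",
        "What is {brand} and who is it for?",
    ],
    "comparison": [
        "{brand} vs {competitor}",
        "What are the differences between {brand} and {competitor}?",
    ],
}

def build_prompts(category, use_case, brand, competitor):
    """Expand every template with the brand's configured values.

    Identical inputs yield identical prompts day over day,
    which is what keeps daily comparisons valid."""
    values = dict(category=category, use_case=use_case,
                  brand=brand, competitor=competitor)
    return {
        kind: [t.format(**values) for t in templates]
        for kind, templates in TEMPLATES.items()
    }
```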
How We Normalize Results
LLM responses are non-deterministic: the same prompt can produce different outputs on different API calls. This is the core challenge every AI monitoring tool faces, and most don't explain how they address it. Here's our five-step approach, followed by a code sketch of the scoring math:
1. Run each prompt 3x per platform per day. We make 3 independent API calls for every prompt, with fresh context each time. This gives us a sample of the response distribution rather than a single data point that may be an outlier.
2. Score each response for brand presence. For each response, our NLP pipeline scores: (a) whether the brand was mentioned, (b) the position of the mention relative to other brands, (c) the sentiment context (recommended, criticized, neutral, compared), and (d) whether the mention was substantive vs. incidental.
3. Average across the 3 runs. The daily score for each prompt is the average of the 3 runs. If a brand appeared in 2 of 3 runs, its appearance score for that prompt is 0.67 (67%), not 1.0. This per-prompt averaging smooths out one-off non-appearances without overstating consistent presence.
4. Roll up to a category-level visibility score. The overall AI visibility score is the weighted average across all prompts in the category set, with category discovery prompts weighted 2x relative to brand-specific prompts (because discovery prompts better represent unsolicited buyer behavior).
5. Apply a 7-day rolling average for trend reporting. The score displayed in your dashboard is the 7-day rolling average, not the single-day score. This smooths day-to-day variance from LLM non-determinism and makes trend lines meaningful. You can view raw daily scores in the detail panel.
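Here's that scoring math as a simplified sketch covering steps 3 through 5. Function names are illustrative, and we show comparison prompts at the same 1x weight as explicit brand prompts (the 2x weighting described above applies only to discovery prompts); the production pipeline also carries the position and sentiment scores from step 2:

```python
# Simplified sketch of the normalization math (steps 3-5).
from statistics import mean

def prompt_appearance_score(runs: list[bool]) -> float:
    """Step 3: average brand presence over the 3 daily runs.
    Appearing in 2 of 3 runs scores 0.67, not 1.0."""
    return sum(runs) / len(runs)

def daily_visibility_score(prompt_scores: dict[str, list[float]]) -> float:
    """Step 4: weighted average across the category prompt set.
    Discovery prompts count 2x relative to brand-specific prompts."""
    weights = {"category_discovery": 2.0, "explicit_brand": 1.0, "comparison": 1.0}
    weighted_sum = total_weight = 0.0
    for kind, scores in prompt_scores.items():
        for s in scores:
            weighted_sum += weights[kind] * s
            total_weight += weights[kind]
    return weighted_sum / total_weight

def rolling_7_day(daily_scores: list[float]) -> float:
    """Step 5: the dashboard number is the mean of the last 7 daily scores."""
    return mean(daily_scores[-7:])

# Example: a brand that appeared in 2 of 3 runs on one discovery prompt
score = prompt_appearance_score([True, True, False])  # -> 0.67
```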
What Your AI Visibility Score Means
Your AI visibility score is a percentage: the share of AI responses — across all tested prompts and platforms — that include a mention of your brand. A score of 45% means that in 45% of the simulated buyer discovery queries we ran, your brand appeared in the AI's response.
Score interpretation guide
- 0–10% — Minimal presence. AI has little information about your brand or doesn't associate it with your category. You're invisible in most AI-driven discovery for your product area.
- 10–30% — Emerging presence. AI mentions you in some category contexts, likely in secondary or "also consider" positions. Competitor brands probably score higher for the same queries.
- 30–60% — Competitive presence. You're appearing regularly in AI discovery responses. Whether you're appearing as a primary recommendation or in the supporting cast varies by query type.
- 60–80% — Strong presence. AI consistently includes you in category recommendations. Your brand has enough established context that AI treats you as a standard answer to buyer queries.
- 80%+ — Category leadership. AI routinely leads with your brand or includes you in every response. You've built the kind of multi-source presence that AI models treat as definitionally representative of your category.
Scores are relative to your query set and category. A 70% score in a niche B2B category may represent more competitive dominance than a 40% score in a broad consumer category. Use competitor benchmarking — not absolute score levels — as your primary performance indicator.
Hallucination Detection
Beyond tracking whether your brand appears, we flag AI hallucinations — cases where an LLM makes factually incorrect claims about your brand. This matters more than it might seem: if ChatGPT tells a potential buyer that your pricing is $500/month when it's $19/month, or claims you don't support a feature you do support, that inaccuracy is actively hurting conversions.
Our hallucination detection pipeline (sketched in simplified code after this list) works by:
- Extracting factual claims from AI responses about your brand (pricing, features, founding year, integrations, team size, etc.)
- Comparing extracted claims against your brand profile data (which you provide during setup)
- Flagging discrepancies that exceed a confidence threshold as potential hallucinations
- Categorizing severity: pricing errors and feature claims are flagged as High; positioning and description errors as Medium; stylistic inaccuracies as Low
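Here's the comparison step as a simplified sketch. We assume upstream NLP has already turned each response into structured (field, value, confidence) claims; the threshold value and the severity map below are illustrative:

```python
# Simplified sketch of claim comparison and severity flagging.
from dataclasses import dataclass

SEVERITY = {  # illustrative mapping, mirroring the categories above
    "pricing": "High", "features": "High",
    "positioning": "Medium", "description": "Medium",
}
CONFIDENCE_THRESHOLD = 0.8  # illustrative; flags only confident extractions

@dataclass
class Claim:
    field: str         # e.g. "pricing"
    value: str         # what the AI said, e.g. "$500/month"
    confidence: float  # extraction confidence from the NLP step

def flag_hallucinations(claims: list[Claim], brand_profile: dict[str, str]) -> list[dict]:
    """Compare extracted claims against the brand profile supplied at setup."""
    flags = []
    for claim in claims:
        truth = brand_profile.get(claim.field)
        if truth is None or claim.confidence < CONFIDENCE_THRESHOLD:
            continue  # nothing to compare against, or extraction too uncertain
        if claim.value != truth:  # production uses fuzzier matching than equality
            flags.append({
                "field": claim.field,
                "ai_said": claim.value,
                "actual": truth,
                "severity": SEVERITY.get(claim.field, "Low"),
            })
    return flags
```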
Hallucination alerts appear in your dashboard as a separate feed, distinct from visibility score tracking. We recommend reviewing High-severity flags promptly — they represent cases where AI is actively misinforming potential buyers about your product.
Honest Limitations
No AI monitoring tool is perfect. Here's what ours doesn't do well — and why.
⚠️ We can't monitor private or personalized AI responses
When a user has a long chat history with ChatGPT, the responses they see may be personalized based on prior context. We monitor the baseline API behavior: the starting point for users with no prior context. Personalized responses are impossible to monitor at scale, and this limitation affects every AI monitoring tool equally.
⚠️ Training data cutoffs affect ChatGPT and Claude scores
ChatGPT and Claude draw from training data with a knowledge cutoff. If your brand built significant presence after that cutoff, it may not be reflected in these platforms' responses yet — even if you've done everything right. We display each platform's approximate knowledge cutoff in your dashboard so you can calibrate expectations.
⚠️ Our prompt set may not cover every query buyers ask
We cover the 20–50 most representative queries for your category, not every possible way a buyer might ask about your space. If your buyers use highly specific or unusual query patterns, contact us — we can add custom prompts to your monitoring set.
⚠️ Scores measure AI presence, not AI-driven traffic
We measure whether you appear in AI responses, not whether those AI responses are driving users to your site. Connecting AI visibility to actual traffic outcomes requires pairing our data with your web analytics — we surface the visibility layer; attribution to business outcomes is your responsibility.
Update Frequency
Here's the monitoring schedule by platform:
- ChatGPT (API): Daily. Results available in dashboard by 06:00 UTC each morning.
- Perplexity (API): Daily. Results available by 06:00 UTC. Perplexity's live retrieval means scores can shift more day-to-day than other platforms.
- Google AI Overviews: Daily during market hours. Availability varies by query — not all queries trigger AI Overviews in all regions.
- Claude (API): Daily. Results available by 06:00 UTC. Claude's training cutoff means scores change primarily after model updates, not daily.
Slack alerts fire within 60 minutes of a significant visibility change (≥10 percentage point shift in 7-day average). You can configure alert thresholds in your dashboard settings.
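The alert condition itself is simple. Here's a sketch, assuming we compare the current 7-day rolling average against the previous one:

```python
# Simplified sketch of the visibility-change alert condition.
DEFAULT_THRESHOLD_PP = 10.0  # percentage points; configurable per dashboard

def should_alert(prev_rolling_avg: float, curr_rolling_avg: float,
                 threshold_pp: float = DEFAULT_THRESHOLD_PP) -> bool:
    """Fire a Slack alert when the 7-day rolling average moves
    by at least the configured number of percentage points."""
    return abs(curr_rolling_avg - prev_rolling_avg) >= threshold_pp

# Example: a drop from 48% to 35% (13 pp) triggers an alert
assert should_alert(48.0, 35.0)
```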
See Your AI Visibility Score
Now that you know exactly how it's calculated, run the check and see where you stand across ChatGPT, Perplexity, Google AI, and Claude.
Run Free AI Visibility Check →