Can You Trust an AI Visibility Score? How We Built an Open Tool to Check Honestly — and What It Showed About Us

Jun 17, 2026

Most AEO trackers hand you a single number and ask you to believe it. We built aeo-platform as open source precisely so you do not have to. This is the honesty design behind the number (raw answers saved to disk, two-model verification, frozen measurement axes, a 0% treated as a hypothesis), along with our own real, unflattering results run through it.

By Alex Isa, Lead Full-Stack Developer & AEO Lead, Webappski

An AI visibility score is trustworthy only when you can re-derive it from the raw engine answers it claims to summarize. aeo-platform — our free, open-source CLI — earns that trust structurally: it saves every raw ChatGPT, Claude, Gemini, and Perplexity answer to disk, verifies competitor mentions with two models, and freezes the question set so trends stay comparable.

Because of that design, we run it on our own products first and publish the unflattering results: our sister product TypelessForm went from 33% to 83% AI visibility on a 12-cell grid, while our own agency brand still sits near zero. Both numbers are auditable from the saved files — which is the entire point.

Built and maintained by Webappski, an Answer Engine Optimization studio, aeo-platform is on npm at version 1.4.0, MIT-licensed, with zero runtime dependencies (npmjs.com/package/aeo-platform). Install it with one command: npm install -g aeo-platform. It runs locally, writes its output to your own disk, and sends nothing to a hosted dashboard.

This article is not about features. It is about a harder question that nobody selling an AEO dashboard wants you to ask: how do you know the visibility number is honest? Below is the design we use to make our own numbers checkable, followed by our real results — including the ones that do not flatter us.

What Is an AI Visibility Score, and Why Should You Distrust It by Default?

An AI visibility score is a single number that estimates how often AI answer engines mention or cite your brand when a user asks a relevant buying question. It is a useful idea and a dangerous one: useful because AI answers now intercept the questions buyers used to type into Google, dangerous because the number is trivially easy to inflate and almost impossible to verify when it lives inside a closed dashboard.

Distrust it by default for three concrete reasons. First, AI answers are non-deterministic — the same question on the same day can return a different answer, so a single snapshot is a guess dressed up as a fact. Second, the vendor chooses the questions, and a flattering basket of questions produces a flattering score that says nothing about where you actually compete. Third, almost every commercial tracker shows you the score but not the raw answers behind it, so you cannot check whether the engine really said what the number claims. A score you cannot re-derive from evidence is marketing, not measurement.

We hold our own claims to the opposite standard. When we tell a consulting client that we moved a product's AI visibility, we want them to be able to open the raw files and confirm it themselves. That requirement — auditability by a skeptical outsider — drove every design decision in the tool below.

What Makes aeo-platform's Number Auditable Instead of a Black Box?

aeo-platform is auditable because it is open-source and because it saves the complete evidence trail to your disk, where you — or anyone you hand the folder to — can re-derive the score by hand. Four design choices, described next, turn the score from a claim into a reproducible artifact. None of them is a feature you toggle; they are how the tool works by default.

It saves every raw engine answer to disk

Every run writes the full, unedited engine responses to aeo-responses/YYYY-MM-DD/ on your own machine. The score is computed from those files, not the other way around. Each cell in the run records the verbatim answer text (the tool's responseExcerpt field), the citation URLs the engine returned, and whether your brand appeared in the body, only in a source link, or not at all. If the score says you were mentioned, you can open the file and read the sentence that says so. If it says zero, you can read the answer that omitted you. The folder accumulates run after run and is never overwritten, so a trend cannot be quietly rewritten — old runs stay on disk as the receipt for old numbers.
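
To make "the score is computed from those files" concrete, here is a minimal Python sketch of re-deriving a score by hand. It is not the tool's own code. Only the responseExcerpt field name comes from aeo-platform; the Cell layout, the engine and query fields, and the sample answers are assumptions made up for this example, and the real files on disk may be shaped differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Cell:
    engine: str           # hypothetical field name
    query: str            # hypothetical field name
    responseExcerpt: str  # verbatim answer text (field name taken from the tool)


def brand_in_body(cell: Cell, brand: str) -> bool:
    """Case-insensitive check that the brand appears in the answer text."""
    return brand.lower() in cell.responseExcerpt.lower()


def rederive_score(cells: list[Cell], brand: str) -> tuple[int, int, float]:
    """Return (cells mentioning the brand, total cells, percentage)."""
    hits = sum(brand_in_body(c, brand) for c in cells)
    total = len(cells)
    pct = 100.0 * hits / total if total else 0.0
    return hits, total, pct


# Made-up sample cells for illustration, not real engine output.
cells = [
    Cell("engine-a", "sample question", "Options include ExampleBrand and others."),
    Cell("engine-b", "sample question", "Popular tools are OtherTool and ThirdTool."),
    Cell("engine-c", "sample question", "examplebrand is one choice."),
    Cell("engine-d", "sample question", "No specific product stands out."),
]

hits, total, pct = rederive_score(cells, "ExampleBrand")
print(f"{hits} of {total} cells mention the brand ({pct:.0f}%)")
```

If a count produced this way disagrees with the headline number, the raw text is the authority and the disagreement is worth reporting.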

It verifies competitor mentions with two models, not one

Deciding which competitors an answer named is exactly the step where a single language model hallucinates — it will confidently invent a brand that was never in the text. aeo-platform runs the competitor extractor across two models (GPT-5-mini and Gemini-2.5-flash) and keeps only the brands both models agree on; this dual-model mode is recorded in the run summary as extractorMode: dual. If you supply only one API key, the tool still runs — but it marks competitor mentions as unverified rather than pretending to a confidence it does not have. Honest uncertainty is labeled as uncertainty.
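
The agreement rule can be sketched in a few lines of Python. This is an illustration of the idea under our own assumptions, not the extractor's actual code: the function names, the return shape, and the "single" label are hypothetical (the source only documents the value dual for extractorMode), and the brand names are placeholders.

```python
from __future__ import annotations


def normalize(name: str) -> str:
    """Fold case and surrounding whitespace so 'Acme ' and 'acme' compare equal."""
    return name.strip().casefold()


def agree(first: list[str] | None, second: list[str] | None) -> dict:
    """Keep only brands both extractors returned; with one extractor, verify nothing."""
    if first is None or second is None:
        only = first if first is not None else (second or [])
        # "single" is a placeholder label chosen for this sketch.
        return {
            "extractorMode": "single",
            "verified": [],
            "unverified": sorted({normalize(n) for n in only}),
        }
    a = {normalize(n) for n in first}
    b = {normalize(n) for n in second}
    return {
        "extractorMode": "dual",
        "verified": sorted(a & b),    # named by both models
        "unverified": sorted(a ^ b),  # named by exactly one model
    }


# Placeholder brand names: the second model never saw "GhostBrand".
result = agree(["RivalOne", "GhostBrand"], ["rivalone "])
print(result["verified"], result["unverified"])
```

The design choice is that a brand named by only one model is kept visible but labeled, so a hallucinated rival never reaches the verified list.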

It freezes the question set so trends are comparable

A trend line only means something if the questions stay fixed. aeo-platform pins your buyer queries in a config file you own and version, and when you add new queries it preserves the prior basket as history rather than silently swapping the questions underneath an old number. That discipline is what lets us claim a real 33-to-83 arc for one product: it is the same brand answered on the same grid of questions and engines at both ends, not a moved goalpost. A score that improved only because the questions got easier is the most common way an AEO trend lies; freezing the basket closes that door.
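
One way a reader could check for themselves that two runs used the same grid is to fingerprint the questions and engines and compare the hashes. The Python sketch below is our illustration of that check, not a description of how aeo-platform stores its config; the function names and the sample questions are hypothetical.

```python
import hashlib
import json


def basket_fingerprint(questions, engines):
    """Stable hash of the question-and-engine grid, insensitive to order and case."""
    payload = json.dumps(
        {
            "questions": sorted(q.strip().lower() for q in questions),
            "engines": sorted(e.strip().lower() for e in engines),
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def comparable(fingerprint_a, fingerprint_b):
    """Two runs belong on the same trend line only if their grids match."""
    return fingerprint_a == fingerprint_b


# Hypothetical baskets: the third one swaps in a different question.
april = basket_fingerprint(["question one", "question two"], ["engine-a", "engine-b"])
june = basket_fingerprint(["Question Two", "question one "], ["engine-b", "engine-a"])
moved = basket_fingerprint(["question one", "an easier question"], ["engine-a", "engine-b"])

print(comparable(april, june))   # same grid, reordered
print(comparable(april, moved))  # question changed
```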

It tells you what it did NOT measure

Every run stamps a measurement disclaimer into the summary file and the report header, in one fixed wording so the rendered text and the stored field can never drift. Verbatim, it reads: “Measures each engine's API surface via your own keys — a reproducible proxy, NOT a guarantee of what the consumer app (chatgpt.com, perplexity.ai, the Gemini app) shows to a human; excludes Google AI Overviews / AI Mode and Microsoft Copilot.” The tool queries each engine's official API, which is reproducible and auditable, but is not identical to what a logged-in human sees in the consumer app with personalization and locale. Saying so on every run is the difference between a measurement and an overclaim.

Why Is a 0% a Hypothesis and Not a Fact?

A 0% score is a question to investigate, not a verdict to act on, because a zero has at least two innocent explanations that have nothing to do with being invisible. The tool surfaces both so you check them before you panic — and so you do not pay anyone, us included, to fix a problem you do not have.

The first innocent cause is a matching gap. If you tracked your brand as one spelling but the engine wrote it another way — or cited you only inside a source URL — a naive checker scores you zero while the answer plainly names you. aeo-platform checks aliases and separators (it treats “Gcore” and “G-Core” as the same brand, for example) and shows you the exact sentences the engines produced, so you can read the raw text before trusting a zero. If your brand is sitting right there in the answer, that is a matching bug to report, not invisibility.
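
A separator-tolerant match of the kind described above can be sketched as follows. This is a minimal Python illustration, assuming a simple rule (allow one optional space, hyphen, underscore, or dot between characters); the tool's real alias logic may differ, and the function names are ours.

```python
import re

# Optional single separator allowed between the characters of a brand name.
SEPARATOR = r"[\s\-_.]?"


def alias_pattern(alias):
    """Build a regex matching the alias with or without separators between characters."""
    chars = [re.escape(ch) for ch in alias if ch.isalnum()]
    if not chars:
        raise ValueError("alias needs at least one letter or digit")
    body = SEPARATOR.join(chars)
    # The lookarounds stop a match from starting or ending inside a longer word.
    return re.compile(r"(?<![A-Za-z0-9])" + body + r"(?![A-Za-z0-9])", re.IGNORECASE)


def mentions_brand(answer, aliases):
    """True if any alias appears in the answer, ignoring case and separators."""
    return any(alias_pattern(alias).search(answer) for alias in aliases)


print(mentions_brand("We recommend G-Core for edge delivery.", ["Gcore"]))
print(mentions_brand("A big core network is required.", ["Gcore"]))
```

The rule is deliberately permissive about separators and strict about word boundaries, so "G-Core" counts as "Gcore" while "big core" does not.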

The second innocent cause is an off-target question basket. A brand measured only on ground it does not play on will score zero correctly and meaninglessly — a regional CDN asked only about “VPC for healthcare” is absent because the question is wrong, not because the engine has never heard of it. The report prints how many of your real product lines the questions touch and warns on small samples, so a headline driven by too few or off-target questions is caught as an artifact rather than mistaken for a verdict. Only a zero on a basket that covers your real category, checked against the raw answer text, is evidence of a genuine gap worth acting on.

What Did the Tool Show About Us — Including the Numbers We Would Rather Hide?

We run aeo-platform on our own products first, and we publish the results whether they flatter us or not, because a result you cannot reproduce is just marketing. Here are the two real numbers, with their evidence on disk.

The flattering one: our voice-to-form product, TypelessForm, rose from 33% AI visibility on 2026-04-23 to 83% on 2026-06-11 — 10 of 12 engine-and-query cells — on a frozen grid of four engines and three real buyer questions. That run was generated by aeo-platform 1.3.0 in dual-extractor mode, and the top competitor it surfaced was AnveVoice. Every weekly run in that arc is still on disk; the number is the summary of the files, not a headline we wrote.

What	Value	Where to verify it
TypelessForm visibility, start	33% (4 of 12 cells)	aeo-responses/2026-04-23/_summary.json
TypelessForm visibility, latest	83% (10 of 12 cells)	aeo-responses/2026-06-11/_summary.json
Extractor mode for the latest run	dual (two-model agreement)	_summary.json field extractorMode
Top competitor surfaced	AnveVoice	_summary.json field topCompetitors
Our own agency brand, current	near zero on consulting queries	published in our own challenge series
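
The percentages in the table can be checked against the cell counts with plain arithmetic. The short Python sketch below assumes the score is simply the share of cells that mention the brand, rounded to the nearest whole percent; that rounding rule is our assumption, though it matches the published figures.

```python
ENGINES = 4
QUESTIONS = 3
GRID_CELLS = ENGINES * QUESTIONS  # 12 engine-and-query cells


def visibility_pct(cells_with_brand, total_cells=GRID_CELLS):
    """Share of cells that mention the brand, as a whole percentage."""
    return round(100 * cells_with_brand / total_cells)


print(visibility_pct(4))   # 33
print(visibility_pct(10))  # 83
```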

The unflattering one: our own agency brand, Webappski, is still near zero on its consulting queries. We pivoted into AEO recently, no independent press has written about us yet, and the engines have not caught up — so the same tool that scored TypelessForm at 83% scores our own brand close to nothing, and we published that raw starting line rather than hiding it. That is uncomfortable to print on our own site, which is exactly why we print it: a tool that only ever shows good news is not measuring anything. The honest gap between our proven product and our young brand is the most credible thing we can show a buyer.

An honest note on causation, because it separates a case study from a sales pitch: we do not claim any single action caused any single cell to flip. AI answers shift week to week for reasons outside anyone's control, which is precisely why the tool re-measures rather than declaring victory after one run. What we can show is a measured before-and-after, captured by the same reproducible tool on both ends, with the evidence on disk. The attribution is a hypothesis the next run tests — and holding our own numbers to that standard is the whole argument of this article.

How Do You Verify an AEO Result Yourself?

You verify an AEO result the same way you would check any measurement: by re-deriving it from the raw evidence. With an open tool that saves its answers, that takes about five minutes and no trust in us at all.

1. Install and run it on a brand you know. npm install -g aeo-platform, then init with your domain and three real buyer questions, then run. You only need one API key to start; OpenAI plus Gemini is the recommended pair because it powers the two-model competitor check.
2. Open the raw answers, not the score. Look in aeo-responses/YYYY-MM-DD/ and read the actual engine responses. Confirm that each “mentioned” cell really contains your brand and each “zero” cell really omits it.
3. Check the measurement disclaimer. Read what the run says it did NOT cover: API surface versus consumer app, and the engines it does not query at all (Google AI Overviews and Copilot have no first-party API). A number without that caveat is overstated.
4. Re-run next week without changing the questions. The trend is the signal; a single snapshot is noise. Because the basket is frozen and runs accumulate, the second run gives you a real delta instead of a fresh guess.
5. Treat any 0% as a hypothesis first. Before acting, rule out the two innocent causes (a brand-spelling match gap and an off-target question basket) using the raw text and the coverage line the report prints.
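
For the "open the raw answers" step, a short script can list the accumulated runs before you read them. The Python sketch below relies only on details stated in this article (dated folders under aeo-responses/, a _summary.json file, and the extractorMode and topCompetitors fields); that _summary.json is plain JSON with those fields at the top level is our assumption, and list_runs is a hypothetical helper, not part of the tool.

```python
import json
import re
from pathlib import Path

DATE_DIR = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def list_runs(root):
    """Return (date, summary dict) for each dated run folder, oldest first."""
    runs = []
    for folder in sorted(Path(root).iterdir()):
        summary = folder / "_summary.json"
        if folder.is_dir() and DATE_DIR.match(folder.name) and summary.is_file():
            runs.append((folder.name, json.loads(summary.read_text(encoding="utf-8"))))
    return runs


if __name__ == "__main__":
    root = Path("aeo-responses")
    if root.is_dir():
        for date, summary in list_runs(root):
            # .get() keeps the script working if a field is absent in a given run.
            print(date, summary.get("extractorMode"), summary.get("topCompetitors"))
```

Run it from the directory that contains aeo-responses/, then open the folders it lists and read the answers themselves; the script only tells you where to look.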

Apply those same five checks to any tracker you are evaluating, ours included. The ones that cannot show you the raw answers, that quietly change the questions, or that print a single number with no caveat are the ones to distrust.

Who Builds aeo-platform, and Why Does the Honesty Design Matter to Them Commercially?

aeo-platform is built and maintained by Webappski, an Answer Engine Optimization studio — AEO is the service we sell, and the tool is our methodology made open and reproducible. We built it to measure our own products first and to publish the raw data rather than headline scores, because an AEO result a client cannot reproduce is worthless as proof and worse as a promise.

The honesty design is not altruism; it is the business model. Most AEO advice comes from agencies that optimized clients' products but never their own, or from tool vendors who measure but never execute. We do both, in the open: we run aeo-platform on TypelessForm and our other products, act on what it finds, and re-measure — and the saved raw answers are the evidence we hand a buyer. A reproducible number is a stronger sales argument than a flattering one, because the buyer can check it. The tool is free and open-source for the same reason: it earns trust by being genuinely useful, and the consulting business follows from the trust, not the other way around.

Frequently Asked Questions

Can you trust an AI visibility score?

Only when you can re-derive it from the raw engine answers it summarizes. A score that lives inside a closed dashboard, with the underlying ChatGPT, Claude, Gemini, and Perplexity responses hidden, is unverifiable and easy to inflate by choosing flattering questions. An open tool that saves every raw answer to disk lets a skeptical outsider check the number by hand — which is the only basis on which an AI visibility score deserves trust.

How does aeo-platform avoid cherry-picking the results?

Three structural choices. It saves every raw engine response to your disk, so the score is computed from evidence you can read rather than asserted. It freezes your question set in a config file and preserves the prior basket as history when you change it, so a trend cannot improve just because the questions got easier. And it verifies competitor mentions across two models, keeping only brands both agree on, so a single model cannot hallucinate a rival into the report.

Why does a 0% score not mean my brand is invisible?

Because a zero has innocent causes. The engine may have named your brand in a spelling your tracker did not match, or cited you only inside a source URL — aeo-platform checks aliases and separators and shows the raw sentences so you can confirm. Or your question basket may be off-target, asking about ground your brand does not compete on. The report flags both, so you treat a 0% as a hypothesis to investigate against the raw text, not a verdict to act on.

Which AI engines does aeo-platform measure, and which does it not?

It measures ChatGPT, Gemini, Claude, and Perplexity through their official APIs using your own keys, with a manual paste mode for browser-only surfaces. It explicitly does not cover Google AI Overviews / AI Mode or Microsoft Copilot, which have no first-party query API — and it stamps that limitation into every run rather than letting their absence read as invisibility. The score is the engine's API surface, a reproducible proxy, not a guarantee of what a logged-in human sees in the consumer app.

Is aeo-platform really free, and where does my data go?

Yes. It is free, open-source, and MIT-licensed, with zero runtime dependencies, installed via npm install -g aeo-platform. It runs as a local CLI: you supply your own API keys, and the raw answers and reports stay on your disk. Nothing goes to a hosted dashboard. The only paid offerings are downstream and optional, including full Answer Engine Optimization consulting from Webappski for companies that want us to run the measure-plan-improve loop for them end to end.

Why does Webappski publish its own unflattering AEO numbers?

Because a tool that only ever shows good news is not measuring anything. TypelessForm reached 83% AI visibility on the same tool that scores our own young agency brand near zero, and we publish both. The honest gap between a proven product and a new brand is more credible to a buyer than a polished number, and it demonstrates the standard we hold every client claim to: reproducible evidence on disk, not headline scores.

Measure Yourself, Then Check the Math

The fastest way to understand AEO is to measure your own brand and then open the raw answers behind the number — the gap is almost always somewhere you did not expect, and an honest tool shows you the evidence rather than asking you to take its word. Install it and run your own brand: npm install -g aeo-platform.

aeo-platform is built and maintained by Webappski, an Answer Engine Optimization studio that runs this exact loop on its own products before selling it as a service. If you would rather have us run measure-plan-improve for your company end to end — or you want a reproducible baseline read before you commit — request a free AEO audit. We will show you where your product appears across ChatGPT, Perplexity, Gemini, and Claude, where it does not, and the raw answers behind both.

This article was published on 17 June 2026. aeo-platform is actively maintained; the current version is 1.4.0. The worked figures come from real aeo-platform reports on typelessform.com — the arc covers 2026-04-23 (33%) to 2026-06-11 (83%) on the same frozen 12-cell grid — and every raw run is saved to disk for re-checking. AEO is a fast-moving field; we update this article as the tool and the engines evolve. If you notice outdated information, contact us at info@webappski.com.
