Why Confidence Scores Matter When Extracting a Design System

Why "extracted" should always come with a caveat

In the world of AI tooling, most engineers have learned to look at generated code with a healthy dose of skepticism. When a tool claims it can just extract a design system from a live URL, you should probably raise an eyebrow.

The truth is, modern web architecture is actively hostile to simple parsing. A lot of sites bury their tokens deep inside obfuscated CSS-in-JS frameworks. With dynamic theming, the DOM might serve up light mode variables while the stylesheet is actually pointing to dark mode defaults. Add in class hashing that scrambles the structural logic, and it becomes almost impossible for a basic script to figure out if a color belongs to a primary button or an error state.

Because of all these roadblocks, an extracted file without any reliability metric is basically useless. If you drop an unverified token file into your repo to ground your AI agent, you're just begging for hallucinations in your codebase. Curious how these files are actually structured? Take a look at the DESIGN.md specification.

The designmd.run confidence score, decoded

To address this trust problem, designmd.run attaches a confidence score to every extraction. The score tells you how much of the file the pipeline was able to verify, and you see it before you download anything.

The score runs from 0 to 100% and breaks down into four clear bands, shown as color badges in the UI. A green badge means High (≥ 80%). Blue means Good (60–79%). Yellow is Fair (40–59%), and red means Low (< 40%). These metrics let you know right away if the extracted DESIGN.md file is actually ready for your AI workflow. You can read more about how we arrived at these exact thresholds.
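If you want to apply the same bands in your own tooling, the mapping is easy to sketch. The thresholds and badge colors below come straight from the list above; the function name and the handling of fractional scores (79.5 counts as Good here) are our own assumptions for illustration, not a description of the platform's code.

```python
# Badge colors per band, as described in the article.
BADGE_COLORS = {"High": "green", "Good": "blue", "Fair": "yellow", "Low": "red"}


def confidence_band(score: float) -> str:
    """Map a 0-100 confidence score to its band name.

    Assumption: fractional scores fall into the band below the next
    threshold (79.5 is still "Good").
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 80:
        return "High"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Fair"
    return "Low"


for s in (92, 79, 40, 12):
    band = confidence_band(s)
    print(f"{s}% -> {band} ({BADGE_COLORS[band]} badge)")
```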

How the score is computed

The confidence score isn't a vanity metric. It comes out of a reconciliation process that compares several independent sources against each other.

The extraction pipeline runs on three parallel tracks. The first track looks at native CSS variables and computed styles. This is the most reliable source because it reflects the values the browser actually resolved. The second track runs vision AI over a screenshot of the page to infer spatial relationships and hex codes visually. The third track digs into page metadata and framework-specific structures.

We calculate the final score by weighing these sources against each other. If the CSS parser finds a hex code, and the vision AI confirms that exact hex code is sitting on a primary button, the confidence score shoots up. But if the vision AI spots a drop shadow and the CSS parser can't find the variable because it's obfuscated, the score drops. This reconciliation step makes sure the final output is graded objectively.
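To make the idea of weighing sources concrete, here is a toy sketch of a reconciliation step. It is not the actual designmd.run algorithm: the weights (50/30/20), the token names, and both function names are hypothetical, chosen only to show how agreement between tracks can raise a score and a lone sighting can lower it.

```python
from typing import Mapping, Optional

# Hypothetical weights. The article only says CSS is the most reliable track;
# the exact numbers here are invented for illustration.
SOURCE_WEIGHTS = {"css": 50, "vision": 30, "metadata": 20}


def token_confidence(observations: Mapping[str, Optional[str]]) -> int:
    """Score one token from 0 to 100.

    `observations` maps a source name to the value that source reported,
    or None if the source found nothing. The token gets the combined
    weight of the sources that agree on its best-supported value.
    """
    support: dict[str, int] = {}
    for source, value in observations.items():
        if value is None:
            continue
        key = value.strip().lower()  # "#1A73E8" and "#1a73e8" count as agreement
        support[key] = support.get(key, 0) + SOURCE_WEIGHTS[source]
    return max(support.values(), default=0)


def overall_confidence(tokens: Mapping[str, Mapping[str, Optional[str]]]) -> float:
    """Average the per-token scores into a single file-level score."""
    if not tokens:
        return 0.0
    return sum(token_confidence(obs) for obs in tokens.values()) / len(tokens)


tokens = {
    # CSS and vision agree on the button color: 50 + 30 = 80.
    "button-primary-bg": {"css": "#1A73E8", "vision": "#1a73e8", "metadata": None},
    # Only vision sees the shadow; the CSS variable is obfuscated: 30.
    "card-shadow": {"css": None, "vision": "0 2px 8px rgba(0,0,0,0.2)", "metadata": None},
}
print(overall_confidence(tokens))  # 55.0
```

Integer weights keep the arithmetic exact, which makes a toy like this easier to reason about than floating-point fractions would.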

What to do at each confidence band

The confidence band tells you exactly what your next step should be.

If you get a High score, the data is solid. You can safely download the file, drop it in your repo, and tell Cursor or Claude Code to start building UI right away.

If the score is Good, the extraction is reliable but might have a few minor gaps. Go ahead and ship it to your repo, but do a quick visual check just to make sure it didn't miss a secondary hover state or a tertiary font family.

If you get a Fair score, you'll need to roll up your sleeves. The pipeline probably ran into heavy JS rendering issues. You'll want to review the tokens carefully and fill in any missing spacing scales yourself.

And if the score is Low? The pipeline hit some serious anti-scraping walls or a totally blank DOM. Don't rely on a Low-scoring file without a heavy manual review. If you run into a site that just won't parse, check out our guide on how to extract a design system from a website.
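If you script your workflow, the guidance above can be encoded as a simple lookup so a pipeline step can decide whether a human needs to look at the file first. The names below are hypothetical and the action text is only a summary of this section, not an API the platform exposes.

```python
# Recommended next step per band, summarized from the guidance in this section.
NEXT_STEP = {
    "High": "Download the file, commit it, and let the agent start building UI.",
    "Good": "Commit it, then do a quick visual check for missed hover states or fonts.",
    "Fair": "Review the tokens carefully and fill in missing spacing scales by hand.",
    "Low": "Do not rely on the file without a heavy manual review.",
}

# Bands where a person should inspect the file before an agent uses it.
MANUAL_REVIEW_BANDS = {"Fair", "Low"}


def next_step(band: str) -> str:
    """Return the recommended action for a band name."""
    try:
        return NEXT_STEP[band]
    except KeyError:
        raise ValueError(f"unknown confidence band: {band!r}") from None


def needs_manual_review(band: str) -> bool:
    """True when the file should be reviewed by hand before use."""
    next_step(band)  # raises ValueError for an unknown band
    return band in MANUAL_REVIEW_BANDS


for band in ("High", "Good", "Fair", "Low"):
    flag = "review first" if needs_manual_review(band) else "ok to ship"
    print(f"{band}: {flag}. {next_step(band)}")
```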

Lint status, separately

Right next to the confidence score, you'll see a lint status. This measures structural compliance, not extraction accuracy.

Think of it this way: the confidence score tells you if the hex code is actually correct. The lint status tells you if the file itself perfectly matches the specification schema.

A Pass status means zero errors or warnings, so the file is ready for AI to read. A Partial status means there are warnings, like a missing optional Dark Mode block or a low-confidence shadow token, but the file is still usable. A Fail status means you have a major structural error, like a component referencing a color token that doesn't exist in the YAML frontmatter. If you run into errors, see our explanation of the design.md file format so you know how to fix them.
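The Fail example, a component pointing at a color token that was never defined, is the kind of check you can reason about in a few lines. The sketch below works on data that has already been parsed out of the frontmatter. The shape of that data (a `colors` mapping plus components that reference tokens as `colors.<name>`) is our assumption for illustration and may not match the real specification.

```python
def undefined_color_refs(colors: dict, components: dict) -> list:
    """List component properties that reference a color token not in `colors`.

    Assumed (hypothetical) shape:
      colors     = {"primary": "#1A73E8", ...}
      components = {"button": {"background": "colors.primary", ...}, ...}
    """
    prefix = "colors."
    missing = []
    for component, props in components.items():
        for prop, ref in props.items():
            if ref.startswith(prefix) and ref[len(prefix):] not in colors:
                missing.append(f"{component}.{prop} -> {ref}")
    return missing


def lint_status(errors: list, warnings: list) -> str:
    """Pass: nothing reported. Partial: warnings only. Fail: any error."""
    if errors:
        return "Fail"
    if warnings:
        return "Partial"
    return "Pass"


colors = {"primary": "#1A73E8"}
components = {"button": {"background": "colors.primary", "border": "colors.accent"}}

errors = undefined_color_refs(colors, components)
print(errors)                   # ['button.border -> colors.accent']
print(lint_status(errors, []))  # Fail
```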

How to improve a low score

If you end up with a low confidence score, don't panic. You have options.

First, just try running the extraction again. Sites with heavy dynamic content sometimes need a CDN warmup, or they might fail to run their client-side rendering fast enough during the first headless browser pass. A second try often grabs the full DOM.

If you're using our API, you have a lot more control. You can tweak caching and timeout limits to give those heavy sites enough time to fully render. And finally, if a normal, public site keeps giving you a low score, file a quality report in the platform. That helps us train the parsing engine to handle that specific edge case in the future.
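The retry advice can be automated with a small loop that gives the site a longer render window on each attempt. This sketch deliberately does not call the real API: `extract` is a placeholder you would supply yourself, and the `"confidence"` key, the timeout values, and the 60% target are all assumptions for illustration.

```python
from typing import Callable, Sequence


def extract_with_retries(
    extract: Callable[[str, float], dict],
    url: str,
    min_score: float = 60,
    timeouts: Sequence[float] = (15.0, 30.0, 60.0),
) -> dict:
    """Retry an extraction with progressively longer timeouts.

    `extract(url, timeout_seconds)` is a caller-supplied function that
    returns a dict with a "confidence" value (hypothetical shape).
    Stops as soon as a result reaches `min_score`, and otherwise
    returns the best result seen across all attempts.
    """
    if not timeouts:
        raise ValueError("at least one timeout is required")
    best = None
    for timeout in timeouts:
        result = extract(url, timeout)
        if best is None or result["confidence"] > best["confidence"]:
            best = result
        if best["confidence"] >= min_score:
            break
    return best


# Stand-in extractor for demonstration: longer timeouts yield fuller renders.
calls = []


def fake_extract(url: str, timeout: float) -> dict:
    calls.append(timeout)
    return {"url": url, "confidence": {15.0: 35, 30.0: 72, 60.0: 90}[timeout]}


result = extract_with_retries(fake_extract, "https://example.com")
print(result["confidence"], calls)  # 72 [15.0, 30.0]
```

Stopping at the first acceptable score keeps you from spending extra requests on a site that already rendered fully.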

Frequently asked questions

Does a low confidence score mean the file is broken?

Not necessarily broken, but it definitely means the pipeline couldn't verify a big chunk of the tokens. You'll need to review it manually.

What is the difference between confidence score and lint status?

The confidence score checks if the extracted data matches the live website. The lint status checks if the file structure matches our specification.

Can an AI coding agent use a file with a Fair confidence score?

Yes, but the agent might hallucinate the missing values. We recommend manually patching any gaps in a Fair-scoring file before you use it in an AI session.

Get Started

Every result is rated, so you can see your extraction's confidence score before you download it. Visit the designmd.run homepage today.
