How SiteGuard Works

Technical documentation of our 14 scan modules

No Lighthouse — Custom Analysis Engine

SiteGuard deliberately does NOT use Google Lighthouse. Instead, we use a custom fetch-based analysis engine that runs on Vercel Serverless. This means: consistent results, no browser dependency, and significantly faster scans (2-3 seconds per page instead of 15-30 seconds with Lighthouse).

Important Legal & Accessibility Notice

Automated legal and accessibility scans provide technical signals and prioritization. They do not replace legal advice, a legal review, or a manual WCAG/EAA/BFSG certification by qualified experts.

How a Scan Works

1. Website URL is fetched: HTML, response headers, and cookies are captured
2. Internal links + sitemap are analyzed: up to 5 pages are scanned (root + 4 subpages)
3. Depending on the plan, up to 12 core modules run, plus Cookie Audit and Discoverability as companion scans. Depending on the module, checks use HTML, headers, cookies, DNS/RDAP, and targeted HEAD requests
4. Results are aggregated: scores 0-100 per module, issues sorted by severity
5. AI report (optional): Claude AI creates a management summary with the top 5 action items
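
The aggregation step can be pictured with a small TypeScript sketch. The type names, the four severity levels, and the aggregate function are assumptions made for illustration. The documentation only states that each module gets a score from 0 to 100 and that issues are sorted by severity.

```typescript
// Hypothetical shapes; SiteGuard's real result types are not documented here.
type Severity = "critical" | "high" | "medium" | "low";

interface Issue {
  module: string;
  severity: Severity;
  message: string;
}

interface ModuleResult {
  module: string;
  score: number;
  issues: Issue[];
}

// Lower rank means more severe, so critical issues sort first.
const SEVERITY_RANK: Record<Severity, number> = { critical: 0, high: 1, medium: 2, low: 3 };

// Clamp every module score to 0-100 and merge all issues into one sorted list.
function aggregate(results: ModuleResult[]): { scores: Record<string, number>; issues: Issue[] } {
  const scores: Record<string, number> = {};
  const issues: Issue[] = [];
  for (const result of results) {
    scores[result.module] = Math.max(0, Math.min(100, Math.round(result.score)));
    issues.push(...result.issues);
  }
  issues.sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity]);
  return { scores, issues };
}
```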

Methodology, Limits, and Confidence

SiteGuard intentionally separates measured technical signals, derived prioritization, and areas that require manual or legal review. The overview below shows, for each result category, what is measured, what is not automatically assessed, and how reliable the result is.

Accessibility

Measured

HTML signals such as lang attribute, title, alt text, H1, labels, link text, and selected WCAG mapping.

Not Automatically Assessed

Keyboard navigation, screen reader behavior, focus order, rendered UI contrast, and full WCAG/BFSG/EAA conformance.

Confidence

High for technical presence signals; medium for the derived priority.

Performance

Measured

Response time, page size, resource count, broken links, redirects, SSL status, and HEAD request results.

Not Automatically Assessed

Real Core Web Vitals such as LCP, CLS, and INP without browser or CrUX data.

Confidence

High for fetch, header, and link signals; no statement on real Core Web Vitals.

Privacy & Legal

Measured

Cookie and tracker patterns, CMP detection, imprint/privacy links, contact details, and HTTPS signals.

Not Automatically Assessed

Legal completeness, individual legal bases, contractual context, data flows, and sector-specific obligations.

Confidence

Medium; technical evidence is reliable, legal assessment remains case-specific.

Cookie Audit

Measured

Cookies, storage, third-party requests, and CMP declarations before consent, after rejection, and after acceptance.

Not Automatically Assessed

Legal approval of consent design, banner wording, and complete review of all privacy texts.

Confidence

Medium to high for observed browser states; no legal clearance.

Security

Measured

Publicly visible headers, SSL/TLS, CSP, HSTS, mixed content, CORS signals, and known frontend library patterns.

Not Automatically Assessed

Penetration testing, auth/business-logic vulnerabilities, server internals, and non-public infrastructure.

Confidence

High for observable web signals; no statement on hidden vulnerabilities.

SEO & Discoverability

Measured

Meta tags, robots.txt, sitemap, canonical, hreflang, structured data, crawl coverage, and noindex conflicts.

Not Automatically Assessed

Actual Google ranking, guaranteed indexing, search demand, backlink quality, and competitive analysis.

Confidence

High for technical discoverability signals; no ranking or indexing guarantee.

The 14 Modules and Companion Scans in Detail

Privacy Scanner

HTML pattern matching + cookie headers

Cookie detection from Set-Cookie headers
13 third-party tracker patterns (Google, Meta, TikTok, LinkedIn, etc.)
10 consent management platforms (Cookiebot, Usercentrics, OneTrust, etc.)
Cookie classification (necessary/analytics/advertising)
GDPR/ePrivacy assessment

Scoring: Start 100. No consent banner: -50. Trackers without consent: -10 each (max -30). Non-essential cookies: -5 each (max -20).
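
As a worked example of the deduction rules above, here is a minimal TypeScript sketch. The interface and function names are illustrative assumptions, and the floor at 0 follows from the 0-100 score range stated earlier.

```typescript
// Hypothetical input shape for the Privacy Scanner findings.
interface PrivacyFindings {
  hasConsentBanner: boolean;
  trackersWithoutConsent: number;
  nonEssentialCookies: number;
}

function privacyScore(findings: PrivacyFindings): number {
  let score = 100;
  if (!findings.hasConsentBanner) score -= 50;
  // -10 per tracker, capped at -30
  score -= Math.min(findings.trackersWithoutConsent * 10, 30);
  // -5 per non-essential cookie, capped at -20
  score -= Math.min(findings.nonEssentialCookies * 5, 20);
  return Math.max(0, score);
}
```

A site with no consent banner, four trackers, and two non-essential cookies would score 100 - 50 - 30 - 10 = 10 under these rules.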

Accessibility Audit

HTML regex analysis (no axe-core browser needed)

Images without alt text (WCAG 1.1.1)
Missing HTML lang attribute (WCAG 3.1.1)
Missing page title (WCAG 2.4.2)
H1 presence and heading hierarchy (WCAG 1.3.1)
Inputs without labels (WCAG 4.1.2)
Empty links (WCAG 2.4.4)
EAA prioritization signal

Scoring: Start 100. Missing lang: -15. Missing title: -10. Images without alt: -3 each (max -20). Inputs without label: -5 each (max -15). No H1: -10.
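
To illustrate what regex-based checks of this kind can look like, the sketch below covers four of the listed deductions (lang, title, alt, H1). The patterns are simplified assumptions and not SiteGuard's actual expressions. The label check is left out because pairing inputs with labels needs more than a single pattern. An empty alt attribute counts as present here, since it is valid for decorative images.

```typescript
function accessibilityScore(html: string): number {
  let score = 100;

  // Missing lang attribute on the <html> element (WCAG 3.1.1)
  if (!/<html\b[^>]*\blang\s*=\s*["']?[a-zA-Z]/i.test(html)) score -= 15;

  // Missing or empty <title> (WCAG 2.4.2)
  if (!/<title[^>]*>\s*\S[\s\S]*?<\/title>/i.test(html)) score -= 10;

  // <img> tags without any alt attribute (WCAG 1.1.1): -3 each, capped at -20
  const imgTags = html.match(/<img\b[^>]*>/gi) || [];
  const missingAlt = imgTags.filter((tag) => !/\balt\s*=/i.test(tag)).length;
  score -= Math.min(missingAlt * 3, 20);

  // No <h1> on the page (WCAG 1.3.1)
  if (!/<h1\b/i.test(html)) score -= 10;

  return Math.max(0, score);
}
```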

SEO + GEO Audit

HTML analysis + HEAD requests + JSON-LD parsing

Title, meta description, viewport, canonical
Open Graph (8 tags) + Twitter Card (4 tags)
Structured data: 13 Schema.org types with required field validation
GEO score: content structure, entity signals, AI discoverability, citation readiness
Sitemap.xml, robots.txt, hreflang tags
Favicon completeness, social preview quality
Image optimization: dimensions, lazy loading, WebP/AVIF, file size
RSS/Atom feed detection, web manifest, resource hints

Scoring: 28+ individual checks. Missing title: -15. Missing meta description: -15. No OG tags: -10. No structured data: -10. The GEO score (0-100) is reported separately.

Security Scanner

fetch() + node:https for SSL inspection

10 HTTP security headers (HSTS, CSP, X-Frame-Options, etc.)
SSL/TLS certificate validation + expiry
CSP deep analysis (unsafe-inline, unsafe-eval, wildcards, frame-ancestors)
HTTPS redirect check
Mixed content detection
Subresource Integrity (SRI)
CORS configuration
Server information leakage (version disclosure)
Outdated JS libraries (jQuery <3.5, Bootstrap <5, etc.)
Grading A+ to F (like SecurityHeaders.com)

Scoring: Start 100. Missing HSTS: -15. Missing/weak CSP: -15. SSL issues: up to -30. CORS wildcard: -10. Mixed content: -3 each. Server leak: -3 each.
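
The CSP deep analysis can be sketched as a small header parser. This is an illustrative simplification with hypothetical function names: it only inspects script sources, and it does not account for nonces or hashes, which change how browsers treat 'unsafe-inline'.

```typescript
// Parse a Content-Security-Policy header value into directive name -> source list.
function parseCsp(header: string): Map<string, string[]> {
  const directives = new Map<string, string[]>();
  for (const part of header.split(";")) {
    const tokens = part.trim().split(/\s+/).filter((t) => t.length > 0);
    if (tokens.length === 0) continue;
    const name = tokens[0].toLowerCase();
    // Per the CSP specification, a repeated directive is ignored; the first one wins.
    if (!directives.has(name)) directives.set(name, tokens.slice(1));
  }
  return directives;
}

function cspWarnings(header: string): string[] {
  const directives = parseCsp(header);
  const warnings: string[] = [];
  // script-src falls back to default-src when it is absent.
  const scriptSources = directives.get("script-src") || directives.get("default-src") || [];
  if (scriptSources.indexOf("'unsafe-inline'") !== -1) warnings.push("unsafe-inline");
  if (scriptSources.indexOf("'unsafe-eval'") !== -1) warnings.push("unsafe-eval");
  if (scriptSources.indexOf("*") !== -1) warnings.push("wildcard");
  // frame-ancestors has no default-src fallback, so it must be set explicitly.
  if (!directives.has("frame-ancestors")) warnings.push("missing frame-ancestors");
  return warnings;
}
```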

Performance Check

fetch() with timing + HEAD requests for links

Response time (via fetch timing)
Page size (Content-Length)
Broken links: all resource URLs (a, img, script, link, video, iframe)
Redirect chains (manual following, hop counting)
SSL validation
Resource count (scripts, stylesheets, images)
Broken images (HEAD request check)
Oversized images (>500KB)

Scoring: Start 100. Response time >3s: -20, >5s: -30. Broken internal links: -5 each. Broken external links: -2 each. Broken images: -3 each. Redirect chains: -2 each.
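
The sketch below applies these deductions in TypeScript. It assumes the two response-time tiers are alternatives (a 6-second response costs 30 points, not 50), which the scoring line does not state explicitly. All names are hypothetical.

```typescript
// Hypothetical input shape for the Performance Check findings.
interface PerformanceFindings {
  responseTimeMs: number;
  brokenInternalLinks: number;
  brokenExternalLinks: number;
  brokenImages: number;
  redirectChains: number;
}

function performanceScore(findings: PerformanceFindings): number {
  let score = 100;
  // Assumption: only the highest matching response-time tier applies.
  if (findings.responseTimeMs > 5000) score -= 30;
  else if (findings.responseTimeMs > 3000) score -= 20;
  score -= findings.brokenInternalLinks * 5;
  score -= findings.brokenExternalLinks * 2;
  score -= findings.brokenImages * 3;
  score -= findings.redirectChains * 2;
  return Math.max(0, score);
}
```

For example, a 3.5-second response with two broken internal links, one broken external link, one broken image, and one redirect chain gives 100 - 20 - 10 - 2 - 3 - 2 = 63.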

Tag Validator

HTML pattern matching across all scanned pages

12 tag types: GA4, GTM, Meta Pixel, LinkedIn, TikTok, Hotjar, Matomo, etc.
Tag ID extraction (G-XXXXX, GTM-XXXXX, pixel IDs)
DataLayer detection
Cross-page consistency check (tag on homepage but missing on subpages?)
Duplicate detection

Scoring: Start 100. No analytics: -20. GA without GTM: -10. No dataLayer with GTM: -15. Inconsistent tags: -3 each (max -9).
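
Tag ID extraction and the cross-page consistency check might look roughly like this. The ID patterns are assumptions about typical GA4 and GTM identifiers, not SiteGuard's actual expressions, and the function names are illustrative.

```typescript
interface TagIds {
  ga4: string[];
  gtm: string[];
}

// Collect unique GA4 measurement IDs and GTM container IDs from raw HTML.
function extractTagIds(html: string): TagIds {
  const unique = (found: string[] | null): string[] => Array.from(new Set(found || []));
  return {
    ga4: unique(html.match(/\bG-[A-Z0-9]{6,12}\b/g)),
    gtm: unique(html.match(/\bGTM-[A-Z0-9]{4,9}\b/g)),
  };
}

// IDs present on the homepage but absent from a subpage.
function missingOnSubpage(homepageIds: string[], subpageIds: string[]): string[] {
  return homepageIds.filter((id) => subpageIds.indexOf(id) === -1);
}
```

Running extractTagIds on every scanned page and comparing the results with missingOnSubpage yields the inconsistent tags that feed the -3 deduction.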

Legal Compliance

HTML pattern matching for DACH law

Imprint/Impressum link present
Privacy policy/Datenschutz link present
Cookie banner detection (20+ CMPs)
Terms/AGB link present
Contact information (email, phone)
HTTPS active

Scoring: Start 100. No imprint: -25. No privacy policy: -25. No cookie banner: -15. No terms: -10. No contact information: -10.

Content Changes

Text fingerprinting + comparison

Text extraction (HTML tags stripped)
Word, link, and image counts
Content hash (fingerprint)
Comparison with previous scan
Change detection: none/minor/significant/major

Scoring: 100 = no change. 80 = minor change (<10%). 50 = significant. 30 = major.
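
The following sketch shows one way text extraction and change classification could fit together. Several parts are assumptions: the change ratio is computed as the share of differing words (Jaccard distance over word sets), exact text equality stands in for the content-hash comparison, and the 50% boundary between significant and major is a hypothetical default, since only the 10% boundary for minor changes is documented.

```typescript
type ChangeLevel = "none" | "minor" | "significant" | "major";

const CHANGE_SCORES: Record<ChangeLevel, number> = { none: 100, minor: 80, significant: 50, major: 30 };

// Strip script/style blocks and tags, then collapse whitespace.
function extractText(html: string): string {
  return html
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Assumed metric: 1 - (shared words / all distinct words).
function changeRatio(previousText: string, currentText: string): number {
  const before = new Set(previousText.split(" ").filter((w) => w.length > 0));
  const after = new Set(currentText.split(" ").filter((w) => w.length > 0));
  const all = new Set(Array.from(before).concat(Array.from(after)));
  if (all.size === 0) return 0;
  let shared = 0;
  before.forEach((word) => {
    if (after.has(word)) shared += 1;
  });
  return 1 - shared / all.size;
}

function classifyChange(previousText: string, currentText: string, majorThreshold = 0.5): ChangeLevel {
  if (previousText === currentText) return "none"; // stands in for equal content hashes
  const ratio = changeRatio(previousText, currentText);
  if (ratio < 0.1) return "minor";
  return ratio < majorThreshold ? "significant" : "major";
}
```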

SSL & Domain

node:https + node:dns + RDAP API

SSL certificate: validity, issuer, expiry, protocol
DNS records: A, AAAA, MX, NS, TXT
DMARC record
SPF record
Domain WHOIS via RDAP (expiry date, registrar)

Scoring: Start 100. SSL certificate expired: -40. SSL certificate expires in <7 days: -25. No DMARC: -10. No SPF: -10. Domain expires in <30 days: -15.
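
A minimal sketch of this scoring, including how SPF and DMARC records can be recognized from TXT records. SPF records begin with v=spf1 on the domain itself, and DMARC records begin with v=DMARC1 on the _dmarc subdomain. The interface and function names are illustrative assumptions, and the DNS lookups themselves are left out.

```typescript
// Hypothetical input shape; the TXT records would come from DNS lookups.
interface DomainFindings {
  sslDaysRemaining: number; // negative once the certificate has expired
  domainDaysRemaining: number | null; // null when RDAP returned no expiry date
  txtRecords: string[]; // TXT records of the domain itself
  dmarcTxtRecords: string[]; // TXT records of _dmarc.<domain>
}

const hasSpf = (records: string[]): boolean =>
  records.some((r) => r.trim().toLowerCase().startsWith("v=spf1"));

const hasDmarc = (records: string[]): boolean =>
  records.some((r) => r.trim().toLowerCase().startsWith("v=dmarc1"));

function sslDomainScore(findings: DomainFindings): number {
  let score = 100;
  if (findings.sslDaysRemaining < 0) score -= 40;
  else if (findings.sslDaysRemaining < 7) score -= 25;
  if (!hasDmarc(findings.dmarcTxtRecords)) score -= 10;
  if (!hasSpf(findings.txtRecords)) score -= 10;
  if (findings.domainDaysRemaining !== null && findings.domainDaysRemaining < 30) score -= 15;
  return Math.max(0, score);
}
```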

CO₂ Footprint

fetch() + page size measurement

Transfer size (KB)
Resource count (scripts, styles, images, fonts)
CO₂ estimate: 0.2g per MB transferred
Rating: A+ (<0.5g), A (<1g), B (<1.5g), C (<2g), D (≥2g)
Comparison with global average (1.76g per page view)

Scoring: A+ = 100. A = 85. B = 70. C = 50. D = 30.
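
The estimate and rating can be expressed in a few lines of TypeScript. The function names are hypothetical, and the sketch assumes 1 MB = 1,000,000 bytes, which the documentation does not specify.

```typescript
type Co2Rating = "A+" | "A" | "B" | "C" | "D";

const GRAMS_PER_MB = 0.2;
const GLOBAL_AVERAGE_GRAMS = 1.76;
const CO2_SCORES: Record<Co2Rating, number> = { "A+": 100, A: 85, B: 70, C: 50, D: 30 };

// Assumption: 1 MB is counted as 1,000,000 bytes.
function co2Grams(transferBytes: number): number {
  return (transferBytes / 1e6) * GRAMS_PER_MB;
}

function co2Rating(grams: number): Co2Rating {
  if (grams < 0.5) return "A+";
  if (grams < 1) return "A";
  if (grams < 1.5) return "B";
  if (grams < 2) return "C";
  return "D"; // 2 g and above
}

// Share of the global average per page view, e.g. 0.5 means half the average.
function shareOfGlobalAverage(grams: number): number {
  return grams / GLOBAL_AVERAGE_GRAMS;
}
```

A 3 MB page comes to roughly 0.6 g, which is rated A and scores 85. With this formula a page would need to transfer 10 MB to reach the D band.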

Tech Stack

HTML + response header analysis

17 CMS with version detection (WordPress, TYPO3, Drupal, Shopify, etc.)
15 frontend frameworks with version (React, Next.js, Vue, Angular, jQuery, etc.)
10 JS libraries (Lodash, GSAP, Three.js, D3.js, etc.)
8 CDN providers (Cloudflare, Vercel, AWS CloudFront, etc.)
14 analytics tools
9 CSS frameworks
Font providers (Google Fonts, Adobe Fonts)
Server + hosting detection
Programming language hints (X-Powered-By)

Scoring: Informational — always score 100. No deductions.

Third-Party Risk

HTML src/href extraction + domain classification

External domains from all resource URLs
Categorization: analytics, advertising, social, CDN, fonts, maps, video, payment
Risk assessment: known tracker (high), CDN (low), unknown (medium)
Count: total third parties, high-risk percentage

Scoring: Start 100. >10 third parties: -5. >20: -15. High risk: -10 each (max -30). Unknown: -3 each (max -15).
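
Classification and scoring can be sketched as follows. The two lookup lists are tiny illustrative stand-ins for the real classification data, the function names are hypothetical, and the sketch assumes the count tiers are alternatives (more than 20 third parties costs 15 points, not 20).

```typescript
type Risk = "high" | "medium" | "low";

// Illustrative entries only; the real lists are not part of this documentation.
const KNOWN_TRACKERS = ["google-analytics.com", "doubleclick.net"];
const KNOWN_CDNS = ["cdnjs.cloudflare.com", "jsdelivr.net"];

// True when host equals a listed domain or is a subdomain of it.
const matchesDomain = (host: string, domains: string[]): boolean =>
  domains.some((d) => host === d || host.endsWith("." + d));

function classifyHost(host: string): Risk {
  if (matchesDomain(host, KNOWN_TRACKERS)) return "high"; // known tracker
  if (matchesDomain(host, KNOWN_CDNS)) return "low"; // CDN
  return "medium"; // unknown
}

function thirdPartyScore(hosts: string[]): number {
  const risks = hosts.map(classifyHost);
  let score = 100;
  // Assumption: only the highest matching count tier applies.
  if (risks.length > 20) score -= 15;
  else if (risks.length > 10) score -= 5;
  const high = risks.filter((r) => r === "high").length;
  const unknown = risks.filter((r) => r === "medium").length;
  score -= Math.min(high * 10, 30);
  score -= Math.min(unknown * 3, 15);
  return Math.max(0, score);
}
```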

Cookie Audit

Browser/fetch-based companion scan across three consent states

Consent banner detection and provider
Cookies, Local Storage, and Session Storage before consent, after rejection, and after acceptance
Third-party requests by consent state
CMP declarations
Findings for non-essential cookies and tracking before consent

Scoring: Dedicated score from consent findings. Cookie Audit runs as a website-scoped companion scan and is not stored as a regular scan_result module.

Discoverability

robots.txt + sitemap fetching + crawl comparison

robots.txt status and sitemap directives
Sitemap URLs and lastmod validation
Crawl coverage
Noindex conflicts
Orphan URLs and missing sitemap entries
IndexNow key and submission

Scoring: Dedicated score from discoverability findings. Discoverability runs as a website-scoped companion scan and is stored separately.
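
The crawl comparison boils down to set differences between sitemap URLs and crawled URLs. The sketch below makes two assumptions about terminology: an orphan URL is listed in the sitemap but was not reached by the crawl, and a missing sitemap entry is a crawled URL that the sitemap does not list. The function and field names are hypothetical, and URLs are compared as exact strings.

```typescript
interface CrawlComparison {
  orphanUrls: string[]; // in the sitemap, not reached by the crawl (assumed definition)
  missingFromSitemap: string[]; // crawled, but not listed in the sitemap
  coverage: number; // share of sitemap URLs the crawl reached, 0 to 1
}

function compareCrawl(sitemapUrls: string[], crawledUrls: string[]): CrawlComparison {
  const inSitemap = new Set(sitemapUrls);
  const crawled = new Set(crawledUrls);
  const orphanUrls = Array.from(inSitemap).filter((url) => !crawled.has(url));
  const missingFromSitemap = Array.from(crawled).filter((url) => !inSitemap.has(url));
  const coverage = inSitemap.size === 0 ? 0 : (inSitemap.size - orphanUrls.length) / inSitemap.size;
  return { orphanUrls, missingFromSitemap, coverage };
}
```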

Additional Features

Multi-Page Scanning

Each scan analyzes up to 5 pages (root + 4 subpages from sitemap and internal links).
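
Page selection could work roughly as sketched here. The priority of sitemap URLs over crawled links, the prefix-based same-site check, and the function name are assumptions. The documentation only fixes the limit of root plus four subpages.

```typescript
// Pick the root URL plus up to (maxPages - 1) distinct same-site subpages.
function selectPages(
  rootUrl: string,
  sitemapUrls: string[],
  internalLinks: string[],
  maxPages = 5
): string[] {
  const stripSlash = (url: string): string => (url.endsWith("/") ? url.slice(0, -1) : url);
  const root = stripSlash(rootUrl);
  const selected: string[] = [root];
  // Assumption: sitemap entries are considered before links found in the HTML.
  for (const candidate of [...sitemapUrls, ...internalLinks]) {
    if (selected.length >= maxPages) break;
    const url = stripSlash(candidate.split("#")[0]); // ignore fragments
    const sameSite = url === root || url.startsWith(root + "/");
    if (sameSite && selected.indexOf(url) === -1) selected.push(url);
  }
  return selected;
}
```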

Uptime Monitoring

Ping check every 5 minutes with status history, response time, and downtime alerts.

AI Reports

Claude AI generates management summaries with top 5 action items.

PDF Export

Branded PDF report with scores, charts, and issues for download.

CSV/JSON Export

Scan results as CSV or JSON for further analysis.

Score Trends

Trend chart showing score development across all scans.

Scheduled Scans

Automatic scans daily, weekly, or monthly via Inngest cron.

Why Not Lighthouse?

Google Lighthouse requires a full Chrome browser, which makes it unreliable in serverless environments, and its lab results vary between runs. SiteGuard uses a custom fetch-based engine instead: results are consistent, a page takes 2-3 seconds (vs 15-30 seconds with Lighthouse), and the engine runs on Vercel Serverless without Chromium.

Note: Real Core Web Vitals (LCP, CLS, INP) cannot be measured without a browser or real-user data. They can be added later via the CrUX API (Chrome real-user field data), which reflects actual visitor experience rather than Lighthouse lab conditions.