XML sitemap validator: why most sitemaps are quietly broken in 2026
By Nikhil Kumar, Founder of LandKit. Last updated May 2026.
Most sitemaps that "validate" in 2026 are still costing the site indexed pages. The XML is well-formed. Google fetches it. And half the URLs sit in "Discovered, currently not indexed" forever.
An XML sitemap validator's real job is not to confirm the file parses. It's to flag the patterns that drag pages out of the index: bloated changefreq tags Google ignores, lastmod dates Google has stopped trusting, mixed canonical signals, and sitemap-index structures that bury the URLs you actually want crawled. Google confirmed in its 2025 build-and-submit docs that it ignores <priority> and <changefreq> entirely, which means a lot of "best-practice" sitemaps from 2018 are pure overhead in 2026.
Why most sitemaps in 2026 are technically valid and strategically broken
A sitemap can pass every XML schema check and still cost you indexed pages. The breakage isn't syntax. It's the gap between what the sitemap claims and what Google actually does with each tag. Google explicitly states in its sitemap docs that it "ignores <priority> and <changefreq> values," and uses <lastmod> only when "consistently and verifiably accurate." A clean parse is the floor, not the bar.
Three things separate a passing sitemap from a working one.
The first is canonical hygiene. If your sitemap lists ?utm_source=foo parameter URLs, paginated archives, or non-canonical duplicates, you're spending crawl budget on pages Google has already chosen not to index.
The second is lastmod truth. Gary Illyes told Search Engine Journal in June 2024 that Google's trust in lastmod is "binary: we either trust it or we don't." Lie once at scale and Google ignores the field on your domain.
The third is index hierarchy. A 47,000-URL flat sitemap and a 47,000-URL sitemap-index split into 10 themed files send Googlebot completely different signals about which content tiers matter.
You can pressure-test all three with our robots.txt validator running side by side, since a sitemap line in robots.txt is still the cleanest discovery path now that the ping endpoint is dead.
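As a rough illustration of that discovery path, the sketch below reads the `Sitemap:` lines out of a robots.txt body using Python's standard `urllib.robotparser`. The helper names `declared_sitemaps` and `is_referenced` are hypothetical, and the robots.txt text is assumed to have been fetched already.

```python
from urllib.robotparser import RobotFileParser

def declared_sitemaps(robots_txt: str) -> list:
    """Return every URL listed in a Sitemap: line of the given robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() (Python 3.8+) returns None when no Sitemap: line exists.
    return parser.site_maps() or []

def is_referenced(robots_txt: str, sitemap_url: str) -> bool:
    """True when robots.txt points at exactly this sitemap URL."""
    return sitemap_url in declared_sitemaps(robots_txt)
```

An exact string match is deliberately strict here: a robots.txt that references the HTTP version of an HTTPS sitemap is worth flagging rather than forgiving.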
Do changefreq and priority still matter for Google in 2026?
No. Google has publicly ignored both fields for years, and as of the 2025 sitemap documentation refresh, the language is no longer hedged: "Google ignores <priority> and <changefreq> values." John Mueller has repeated this in office hours since 2017, and the SEO Roundtable archive tracks the public statements going back nearly a decade. If your sitemap ships with these tags, they are inert.
The bigger problem is what they hide.
When a CMS auto-stamps every URL with <priority>1.0</priority> and <changefreq>daily</changefreq>, your sitemap has zero internal differentiation. Every page screams "important." Google reads none of it.
Bing has historically said it considers <changefreq> as one signal among many, and Yandex still parses both. So if international visibility matters, you can keep them. But for Google, treat them the way you'd treat a deprecated meta tag: harmless if accurate, dishonest if you forgot what you set.
The Drupal Simple Sitemap maintainers removed both tags by default in 2024 explicitly because they were misleading site owners into thinking the values affected Google. WordPress followed a similar logic: its core sitemaps leave out priority and changefreq, and version 6.5 added native lastmod support.
What you should do instead is invest the engineering hour you would have spent on priority logic into accurate lastmod stamping. That's the only field Google still reads as a scheduling hint.
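If you decide to drop the two inert tags from an existing file instead of waiting for a CMS update, a minimal sketch with the standard library might look like this. The function name `strip_inert_tags` is hypothetical, and the sketch assumes a plain `<urlset>` with no extension namespaces.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def strip_inert_tags(sitemap_xml: str) -> str:
    """Return the sitemap with <priority> and <changefreq> removed from every <url>."""
    ET.register_namespace("", NS)  # serialize with the default namespace, no ns0: prefixes
    root = ET.fromstring(sitemap_xml)
    for url in root.findall(f"{{{NS}}}url"):
        for tag in ("priority", "changefreq"):
            for node in url.findall(f"{{{NS}}}{tag}"):
                url.remove(node)
    return ET.tostring(root, encoding="unicode")
```

`<loc>` and `<lastmod>` pass through untouched, so the output stays valid against the same protocol. Skip this step if Bing or Yandex visibility matters to you.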
What lastmod actually needs to do (the binary trust signal)
Lastmod is the one optional sitemap tag Google still uses, and the bar for using it is higher in 2026 than most sites realize. Google treats lastmod as binary: trusted or ignored, no in-between. To stay trusted, the timestamp must reflect a "significant update to the page" (Google's own language), match the actual page modification, and use W3C Datetime format. Ship a lastmod that says "2026-05-07" but serve a page that hasn't changed since 2023 and Google quietly stops using the field across your domain.
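The format half of that requirement is easy to check mechanically. The sketch below accepts the W3C Datetime shapes (year, year-month, full date, or date plus time with a mandatory timezone designator) and rejects everything else. `is_valid_lastmod` is a hypothetical helper name; it says nothing about whether the date is truthful, only whether it is well-formed.

```python
import re
from datetime import date

W3C_DATETIME = re.compile(
    r"^(?P<year>\d{4})"
    r"(?:-(?P<month>\d{2})"
    r"(?:-(?P<day>\d{2})"
    r"(?:T(?P<hour>\d{2}):(?P<minute>\d{2})"
    r"(?::(?P<second>\d{2})(?:\.\d+)?)?"
    r"(?P<tz>Z|[+-]\d{2}:\d{2})"   # a time component requires a timezone
    r")?)?)?$"
)

def is_valid_lastmod(value: str) -> bool:
    """True when value is a syntactically valid W3C Datetime with a real calendar date."""
    m = W3C_DATETIME.match(value.strip())
    if not m:
        return False
    try:
        # Missing month/day default to 1 so partial dates still get a calendar check.
        date(int(m["year"]), int(m["month"] or 1), int(m["day"] or 1))
    except ValueError:
        return False
    if m["hour"] is not None:
        if int(m["hour"]) > 23 or int(m["minute"]) > 59:
            return False
        if m["second"] is not None and int(m["second"]) > 59:
            return False
    return True
```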
Gary Illyes was unusually direct about this on Bluesky in early 2025, when sites started bulk-updating lastmod just because they swapped a copyright year from 2024 to 2025. Per Search Engine Roundtable's coverage, Illyes flat-out told operators not to do it, because a footer date change is not a content change. He also noted Google "purposefully" doesn't define "significant" so site owners have to use judgment.
The practical rule I use:
- Update lastmod when the main content, structured data, or in-content links change. Don't update it for footer years, sidebar widgets, or related-posts shuffles.
- If you can't tell whether a change is significant, ask whether you'd want Google to re-rank the page based on it. If yes, update. If no, leave it.
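One way to enforce that rule in a publishing pipeline is to fingerprint only the parts of a page that count and bump lastmod when the fingerprint changes. This is a sketch under assumptions: `PageSnapshot`, `fingerprint`, and `next_lastmod` are hypothetical names, and your CMS would need to expose the article body, JSON-LD, and in-content links separately from the template chrome.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class PageSnapshot:
    main_content: str          # article body only, no header, footer, or sidebar
    structured_data: str       # serialized JSON-LD
    content_links: tuple = ()  # hrefs that appear inside the article body

def fingerprint(page: PageSnapshot) -> str:
    """Hash only the fields the rule treats as significant."""
    h = hashlib.sha256()
    for part in (page.main_content, page.structured_data, "\n".join(page.content_links)):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so adjacent fields cannot blur together
    return h.hexdigest()

def next_lastmod(old: PageSnapshot, new: PageSnapshot, old_lastmod: str, today: str) -> str:
    """Keep the old lastmod unless a significant field changed."""
    return today if fingerprint(old) != fingerprint(new) else old_lastmod
```

Footer years and sidebar widgets never enter the snapshot, so they cannot move the date. Note that this treats any body edit, even a one-character typo fix, as significant; the judgment call about what deserves a re-rank still sits with you.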
The Bing Webmaster team made the same point in its February 2023 lastmod guidance, framing the field as the single most useful signal a site can give a crawler. The WordPress 6.5 release in April 2024 finally made this default behavior on the largest CMS in the world, which is why lastmod adoption jumped sharply across non-enterprise sites in 2024 and 2025.
When should I use a sitemap index versus a flat sitemap?
Use a flat sitemap if your site has fewer than ~50,000 URLs and the content tiers are similar in priority. Use a sitemap index when you cross the 50,000-URL or 50MB-uncompressed limit Google enforces, when you publish in clearly distinct content categories, or when you want segmented coverage data inside the Google Search Console sitemap report. The official sitemaps.org protocol caps any single file at 50,000 URLs and 52,428,800 bytes uncompressed. Past that, an index is mandatory.
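The limits translate into a simple generation rule: emit one file if everything fits, otherwise split and add an index. The sketch below is one possible approach, not a reference implementation. The function names, the `sitemap-N.xml` naming scheme, and the `https://example.com/` base URL are all assumptions, and it splits by halving, not by theme.

```python
from xml.sax.saxutils import escape

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000        # per-file URL cap from the protocol
MAX_BYTES = 52_428_800   # per-file uncompressed size cap from the protocol

def build_urlset(urls):
    entries = "".join(f"<url><loc>{escape(u)}</loc></url>" for u in urls)
    return f'<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="{NS}">{entries}</urlset>'

def fits(urls, max_urls=MAX_URLS, max_bytes=MAX_BYTES):
    return len(urls) <= max_urls and len(build_urlset(urls).encode("utf-8")) <= max_bytes

def split(urls, max_urls=MAX_URLS, max_bytes=MAX_BYTES):
    """Halve the list recursively until every part is inside both limits."""
    if fits(urls, max_urls, max_bytes) or len(urls) <= 1:
        return [urls]
    mid = len(urls) // 2
    return split(urls[:mid], max_urls, max_bytes) + split(urls[mid:], max_urls, max_bytes)

def build_sitemaps(urls, base="https://example.com/", max_urls=MAX_URLS, max_bytes=MAX_BYTES):
    """Return {filename: xml}: one flat file if it fits, else children plus an index."""
    parts = split(urls, max_urls, max_bytes)
    if len(parts) == 1:
        return {"sitemap.xml": build_urlset(parts[0])}
    files = {f"sitemap-{n}.xml": build_urlset(p) for n, p in enumerate(parts, start=1)}
    refs = "".join(f"<sitemap><loc>{escape(base + name)}</loc></sitemap>" for name in files)
    files["sitemap-index.xml"] = (
        f'<?xml version="1.0" encoding="UTF-8"?><sitemapindex xmlns="{NS}">{refs}</sitemapindex>'
    )
    return files
```

The `max_urls` and `max_bytes` parameters exist so you can leave headroom under the hard caps, or try the splitting logic on a handful of URLs.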
The decision tree below is the one I run for every audit.
| Site profile | URLs | Recommended structure | Why |
|---|---|---|---|
| Small SaaS, blog, portfolio | Under 5,000 | Single flat sitemap.xml | Easier to debug; no index overhead |
| Mid-size content site | 5,000 to 50,000 | Single flat sitemap, segmented internally | Stays inside protocol limits; one file to validate |
| Multi-section site (blog + product + docs) | 10,000 to 200,000 | Sitemap index with one child per section | Lets you read indexed/discovered counts per section in GSC |
| Ecommerce or marketplace | 200,000+ | Sitemap index, child sitemaps under 45k URLs each | Margin under 50k limit; isolates problem segments |
| News or fast-publishing | Any | Separate news sitemap + main index | News sitemap has its own 1,000-URL/48-hour rules |
The hidden upside of a sitemap index is diagnostic. If GSC shows 92% indexed for your /blog/ child sitemap but 41% for /products/, you know exactly which content tier is failing. With one flat sitemap, you get an aggregate that hides the failure.
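That per-section read can be turned into a small triage helper. The sketch assumes you copy submitted and indexed counts out of Search Console by hand; `section_coverage` is a hypothetical name, and the default 0.80 threshold is an assumption, not a Google figure.

```python
def section_coverage(report, threshold=0.80):
    """report maps child sitemap name -> (submitted, indexed).

    Returns (name, ratio, flagged) tuples, worst ratio first.
    """
    rows = []
    for name, (submitted, indexed) in report.items():
        ratio = indexed / submitted if submitted else 0.0
        rows.append((name, round(ratio, 2), ratio < threshold))
    return sorted(rows, key=lambda row: row[1])
```

Fed the two example sections above, the `/products/` child sorts to the top and is the only one flagged.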
The downside is fragmentation risk. If your child sitemaps drift out of sync (one fresh, one stale, one 404ing), the index becomes a liability. Run our canonical tag checker on the URLs in each child sitemap to catch the most common drift, which is non-canonical URLs creeping back into auto-generated children.
For sites publishing structured data heavily, validate child sitemaps' top URLs through the schema validator before submission. A sitemap-index that points to URLs with broken JSON-LD ships the same problem at 10x the surface area.
Why "Discovered, currently not indexed" is a sitemap-quality problem
"Discovered, currently not indexed" means Google saw the URL (often through your sitemap) and decided not to fetch it yet. Per Google's official Page Indexing documentation, the engine has the URL but is rationing crawl. Ahrefs' public guidance on the status attributes the bulk of cases to crawl-budget pressure, content-quality signals, and weak internal linking, not to a broken sitemap per se. But the sitemap is where you can fix the upstream cause fastest.
Three sitemap-side moves that actually move pages out of "Discovered, currently not indexed":
- Strip non-canonical URLs ruthlessly. If your sitemap lists /blog/post/, /blog/post/?ref=email, and /blog/post?utm_source=newsletter, you're forcing Google to choose. Pick the canonical, drop the rest.
- Remove anything noindex'd or robots-blocked. Listing a noindex URL in your sitemap is a contradiction Google calls out specifically in the indexing-troubleshooting docs. The fix is binary: either index it or delete it from the sitemap.
- Tighten lastmod accuracy. Google's crawl budget documentation, last updated late 2025, explicitly says crawl budget rises only with server resources or content quality. An inflated lastmod reads as a content-quality problem, and a bad one.
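The first move can be sketched in a few lines. The helper names are hypothetical, and the code makes two loud assumptions: canonical URLs on your site carry no query string, and they end in a trailing slash unless the last path segment looks like a file.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_form(url: str, trailing_slash: bool = True) -> str:
    """Drop query string and fragment, then normalize the trailing slash."""
    parts = urlsplit(url)
    path = parts.path or "/"
    last_segment = path.rsplit("/", 1)[-1]
    if trailing_slash and not path.endswith("/") and "." not in last_segment:
        path += "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))

def dedupe_for_sitemap(urls, trailing_slash: bool = True) -> list:
    """Collapse parameter and slash variants to one entry each, keeping first-seen order."""
    seen, kept = set(), []
    for url in urls:
        clean = canonical_form(url, trailing_slash)
        if clean not in seen:
            seen.add(clean)
            kept.append(clean)
    return kept
```

The three `/blog/post` variants from the first bullet collapse to a single entry. If your site relies on query parameters for real content (for example `?id=`), this rule is too aggressive and needs an allowlist.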
The single biggest mistake I see in audits is the Yoast/Rank Math default of including every URL the CMS knows about. A SaaS site with 2,400 real pages can ship a sitemap with 18,000 URLs once you count tag pages, author archives, paginated category pages, and attachment pages. Google reads that as a low-signal, high-noise feed and rations crawl accordingly.
The fix takes 30 minutes in any modern SEO plugin. Exclude tags, authors, attachments, and paginated archives. Submit a focused 2,400-URL sitemap. Watch the indexed-coverage curve recover over the next four to six weeks.
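If you generate the sitemap yourself instead of through a plugin, the same exclusions can be expressed as path patterns. The patterns below assume WordPress-style URL structures (`/tag/`, `/author/`, `/page/N/`, `/attachment/`); they are illustrative, not a complete list, and `keep_in_sitemap` is a hypothetical name.

```python
import re
from urllib.parse import urlsplit

# Assumed WordPress-style archive paths; adjust to your own permalink structure.
EXCLUDE_PATTERNS = [
    re.compile(r"^/tag/"),          # tag archives
    re.compile(r"^/author/"),       # author archives
    re.compile(r"/page/\d+/?$"),    # paginated archive pages
    re.compile(r"/attachment/"),    # media attachment pages
]

def keep_in_sitemap(url: str) -> bool:
    """True when the URL path matches none of the exclusion patterns."""
    path = urlsplit(url).path
    return not any(pattern.search(path) for pattern in EXCLUDE_PATTERNS)
```

Filtering with `[u for u in all_urls if keep_in_sitemap(u)]` before writing the file is what turns the 18,000-URL feed back into the 2,400 pages you actually want indexed.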
How does Google now schedule crawls if priority is dead?
Google now schedules crawls primarily off three signals: lastmod accuracy, server-response health, and inferred page importance from internal links and external citations. The crawl budget documentation refreshed in December 2025 is explicit that the only ways to grow crawl capacity are adding server resources or improving content quality. The 2023 deprecation of the sitemaps ping endpoint made lastmod the de facto scheduling input for previously-discovered URLs.
That's a meaningful shift. Pre-2023, you could ping Google when your sitemap changed and trigger a recrawl pass.
Now you can't.
The only proactive levers a site has are: keep the sitemap fresh and accurate, keep robots.txt pointing to it, and submit it inside Search Console. Everything else, Google decides on its own schedule.
This is why lastmod accuracy compounds. Sites with trusted lastmod get re-fetched faster after edits, which means new content reaches the index faster, which feeds the AI engines (ChatGPT, Perplexity, Gemini) that pull from Google's freshness signals through Bing-shared infrastructure. ChatGPT's live-search results overlap with Bing's top-10 about 87% of the time according to multiple SearchGPT-Bing alignment studies, so an under-indexed page is a missing AI citation too.
Sites running our free SEO audit tool regularly find that their indexed-pages-to-sitemap-pages ratio is the strongest single predictor of organic traffic growth over a 90-day window, ahead of backlinks or schema coverage.
What an XML sitemap validator should actually check in 2026
A modern XML sitemap validator should test ten things, in this order. The list below is what we instrument inside the LandKit validator and what I recommend any operator run before submitting a sitemap to Search Console. Anything less is just XML schema validation, which a browser gives you for free.
1. The file is reachable, returns HTTP 200, and is served as `application/xml` or `text/xml`.
2. The XML is well-formed and uses the sitemaps.org 0.9 namespace.
3. URL count is below 50,000 per file. URLs over the limit should split into a sitemap index.
4. Uncompressed file size is under 50MB (52,428,800 bytes), per protocol.
5. Every `<loc>` is absolute, fully-qualified, and on the same host as the sitemap itself.
6. `<lastmod>` values are in W3C Datetime format and are not bulk-stamped to today's date.
7. The sitemap is a sitemap-index when it should be (50k+ URLs, multi-section site).
8. URLs in the sitemap match canonical URLs (no parameters, no `noindex`, no robots-blocked).
9. URLs in the sitemap return 200, not 404 or 301, when fetched.
10. The sitemap is referenced in `robots.txt` via a `Sitemap:` directive.
Most free validators stop at step 2 or 3. The interesting failures live in steps 6 through 10.
A sitemap can pass an XML schema check, fail step 8, and quietly leak crawl budget for years.
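To make the difference concrete, here is a sketch of the checks that need no network access: steps 2 through 6 of the list, run against bytes you have already downloaded. It is not the LandKit implementation. `check_sitemap` is a hypothetical name, and the bulk-stamp heuristic (90% or more of lastmod values equal to today's date) is an assumption chosen for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date
from urllib.parse import urlsplit

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS, MAX_BYTES = 50_000, 52_428_800

def check_sitemap(raw: bytes, sitemap_url: str, today: str = "", bulk_share: float = 0.9) -> list:
    """Return a list of human-readable issues; an empty list means these checks passed."""
    issues = []
    today = today or date.today().isoformat()
    if len(raw) > MAX_BYTES:
        issues.append("check 4: file exceeds 52,428,800 bytes uncompressed")
    try:
        root = ET.fromstring(raw)
    except ET.ParseError as exc:
        return issues + [f"check 2: not well-formed XML ({exc})"]
    if root.tag not in (f"{{{NS}}}urlset", f"{{{NS}}}sitemapindex"):
        return issues + ["check 2: root element is not in the sitemaps.org 0.9 namespace"]

    # <url> and <sitemap> entries both carry <loc> and an optional <lastmod>.
    locs = [el.text.strip() for el in root.iter(f"{{{NS}}}loc") if el.text]
    if len(locs) > MAX_URLS:
        issues.append(f"check 3: {len(locs)} entries, limit is {MAX_URLS}")

    host = urlsplit(sitemap_url).netloc.lower()
    for loc in locs:
        parts = urlsplit(loc)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            issues.append(f"check 5: <loc> is not absolute: {loc}")
        elif parts.netloc.lower() != host:
            issues.append(f"check 5: <loc> is on a different host: {loc}")

    # Compare only the date portion so timestamps from the same day count together.
    stamped = [el.text.strip()[:10] for el in root.iter(f"{{{NS}}}lastmod") if el.text]
    if len(stamped) >= 2 and stamped.count(today) / len(stamped) >= bulk_share:
        issues.append(f"check 6: {stamped.count(today)} of {len(stamped)} lastmod values are {today}")
    return issues
```

Steps 1, 8, 9, and 10 need live fetches of the sitemap, the listed pages, and robots.txt, so they are left out of this offline pass.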
Frequently asked questions
How do I fix "Discovered, currently not indexed" in Google Search Console?
Strip non-canonical URLs from your sitemap, remove any noindex or robots-blocked pages, fix lastmod accuracy, and improve internal linking to the affected URLs. Per Google's page-indexing documentation, this status means Google saw the URL but didn't fetch it. The fastest fix is sitemap hygiene plus a content-quality pass, then a 4-to-6-week patience window for crawl rationing to ease.
Do priority and changefreq tags do anything in 2026?
No, not for Google. The Google sitemap documentation states clearly that "Google ignores <priority> and <changefreq> values." John Mueller has confirmed this in office hours since 2017. Bing still uses changefreq as one signal among many, and Yandex parses both, so leave them in if you have international SEO needs. For Google-only sites, they're decorative.
What's the maximum number of URLs in a sitemap?
50,000 URLs per file, with a maximum uncompressed size of 50MB (52,428,800 bytes). This is set by the official sitemaps.org protocol and enforced by Google. For larger sites, use a sitemap index file, which can reference up to 50,000 individual sitemaps, giving a theoretical ceiling of 2.5 billion URLs (50,000 × 50,000).
Should the lastmod date be the date Google last crawled, or the date I last edited?
The date you last meaningfully edited the page. Per Gary Illyes' June 2024 statement on lastmod trust, Google treats the field as binary, trusted or untrusted, at the domain level. Update lastmod when the main content, structured data, or in-content links change. Don't update it for footer copyright years, sidebar widgets, or auto-shuffled related-posts blocks.
Why does Google Search Console say "Couldn't fetch" for my sitemap?
The most common causes are an incorrect URL, a robots.txt that blocks the sitemap path, a property mismatch (submitting an HTTPS sitemap to an HTTP-verified property), or a server returning 5xx errors when Googlebot fetches. Per the Google Search Console help docs, the fix is to load the sitemap URL directly in a browser, confirm it returns valid XML, and verify the sitemap path isn't disallowed in robots.txt.
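Two of those causes can be ruled out locally before you open Search Console. The sketch below uses Python's standard `urllib.robotparser`, which applies simple prefix matching and does not replicate Google's wildcard handling, so treat a "not blocked" result as a hint, not as proof. Both function names are hypothetical, and the mismatch check assumes a URL-prefix property.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def sitemap_blocked_by_robots(robots_txt: str, sitemap_url: str, agent: str = "Googlebot") -> bool:
    """True when the given robots.txt body disallows the sitemap path for this agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, sitemap_url)

def property_mismatch(property_url: str, sitemap_url: str) -> bool:
    """True when scheme or host differ between the verified property and the sitemap."""
    prop, sitemap = urlsplit(property_url), urlsplit(sitemap_url)
    return (prop.scheme, prop.netloc.lower()) != (sitemap.scheme, sitemap.netloc.lower())
```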
Do I need to ping Google when my sitemap updates?
No. Google deprecated the sitemaps ping endpoint in June 2023 and shut it down completely by the end of that year. Sending pings to the old endpoint now returns a 404 with no effect. The replacement is automatic: keep lastmod accurate, keep the sitemap referenced in robots.txt, submit it once in Search Console, and Google handles re-fetching on its own schedule.
Run the validator, then fix the three things it surfaces
Validating the XML is the warm-up. The actual work is the three patterns above: stripping non-canonical URLs, getting lastmod into the trusted-binary state, and choosing flat-vs-index based on content tiers, not file size. If your sitemap passes a schema check but your indexed-to-submitted ratio is below 80% in Search Console, one of those three is the cause every single time.
Fix them in that order. Watch coverage in GSC at the 14-day mark, then again at the 6-week mark. Sitemap hygiene compounds slowly, then suddenly.
Nikhil Kumar is the founder of LandKit, the SEO and AI visibility growth OS that tracks brand mentions across ChatGPT, Claude, Perplexity, and Gemini. He builds free SEO tools for solo founders and lean teams. Connect with him on LinkedIn.