Technical SEO · May 12, 2026 · 14 min read

Robots.txt Best Practices: 11 Mistakes That Kill Indexing

Across 200+ technical SEO audits we found the same robots.txt mistakes over and over. Each one looks tiny in isolation, but together they explain why some sites lose 30-90% of crawl coverage and slowly disappear from the index. Here are all 11 with copy-paste fixes and an audit checklist at the end.

Before you start — fetch your live file at yourdomain.com/robots.txt and keep it open in a tab. Paste it into our free Robots.txt Tester as you read so you can verify each mistake against your own configuration.

Why Robots.txt Mistakes Are So Expensive

Unlike a meta tag mistake on a single page, a robots.txt mistake applies to everything on the host. A single misplaced Disallow: can wipe out millions of URLs from Google's crawl queue overnight. We have seen e-commerce sites lose 40% of organic traffic in two weeks because somebody pushed a staging robots.txt to production with one extra slash in it.

Worse, the mistakes are silent. There is no error message. The site renders normally for humans. The damage shows up weeks later in Google Search Console as a climbing "Blocked by robots.txt" count and a slow slide in indexed pages.

The 11 Mistakes

#1 Blocking CSS and JavaScript

The classic. A developer adds Disallow: /assets/ or Disallow: /static/ to "save crawl budget". Result: Googlebot cannot fetch the CSS that styles your page or the JS that hydrates your React components. Google's rendering engine sees a broken, unstyled page, evaluates your layout and mobile experience on that broken render, and structured data parsing breaks for any markup that depends on JS execution.

Fix: Add explicit Allow rules for the file types Googlebot needs:
User-agent: *
Disallow: /assets/
Allow: /assets/*.css
Allow: /assets/*.js
Allow: /assets/*.woff2
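
A quick local sanity check, assuming you have saved the live file as robots.txt (the directory names in the pattern are just common examples):

# Flag rules likely to block render-critical assets
grep -nEi '^Disallow:.*(assets|static|dist|wp-includes|\.js|\.css)' robots.txt

Any hit here deserves a matching Allow rule or a rethink.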

#2 Using the noindex directive (removed Sep 2019)

Google officially stopped honoring Noindex: in robots.txt on September 1, 2019. Yet we still find it in production files weekly. The lines are silently ignored, but worse, they create a false sense of safety — people think their /admin/ pages are deindexed when they are actually fully indexable.

Fix: Delete Noindex: lines and move noindex to HTML meta or HTTP header:
<!-- In HTML head -->
<meta name="robots" content="noindex,nofollow">

# Or HTTP header
X-Robots-Tag: noindex, nofollow
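
If you go the header route, the header has to be set by your web server or application. A minimal nginx sketch, useful for files like PDFs that have no <head> to carry a meta tag (the location block is just an example; Apache users would use mod_headers instead):

# nginx: send noindex for PDFs via HTTP header
location ~* \.pdf$ {
    add_header X-Robots-Tag "noindex, nofollow";
}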

#3 Forgetting the trailing slash

This catches even experienced devs. Disallow: /admin (no slash) blocks everything that starts with /admin, including /admin-panel, /admin-login, and even /administrators.html. If you have a public "admin"-named blog category, it disappears too.

Fix: Be explicit about where the path ends:
# BAD - blocks /admin-anything
Disallow: /admin

# GOOD - blocks /admin/ folder only
Disallow: /admin/

# GOOD - blocks /admin exactly, plus everything under /admin/
Disallow: /admin$
Disallow: /admin/

#4 Case-sensitive path mismatches

Robots.txt paths are case-sensitive. Disallow: /Private/ does not block /private/. If your CMS has inconsistent URL casing (looking at you, older WordPress installs), you may think /Private is blocked when half your traffic actually hits /private.

Fix: Audit your canonical URL casing and write rules in the exact case your URLs use. If both /Private and /private exist, write two rules or fix the casing inconsistency at the source.
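
If you cannot normalize the casing right away, cover both variants explicitly (a stopgap, not a cure):

# Temporary: block both casings until URLs are normalized
Disallow: /Private/
Disallow: /private/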

#5 Missing Sitemap directive

Every robots.txt should list the absolute URL of your sitemap. This is the only standard mechanism for telling all major search engines (not just Google) where your sitemap lives. Without it, engines like Bing and Yandex only find your sitemap if you submit it manually through their webmaster tools.

Fix: Add at the bottom, after all groups:
Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-blog.xml

#6 Returning HTML 404 instead of plain text

If /robots.txt returns your custom 404 page (HTML), Google's parser sees garbage and falls back to treating the entire site as "allow all". That alone can be survivable, but the same misconfiguration usually means your sitemap.xml is broken too, which cripples URL discovery.

Fix: Curl your file and verify:
curl -I https://yourdomain.com/robots.txt

# Expect:
# HTTP/2 200
# content-type: text/plain
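
A 200 status alone does not prove the body is sane, so also peek at the first bytes:

# If this prints <!DOCTYPE html>, the route is falling through to your error page
curl -s https://yourdomain.com/robots.txt | head -c 200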

#7 Blocking faceted URLs without canonical fix

E-commerce sites love to Disallow: /*? to stop Googlebot crawling infinite filter combinations. Problem: this also blocks legit parameter URLs that should be indexed (paginated archives, language switches with ?lang=, UTM-tagged campaign landing pages).

Fix: Be surgical — block specific parameters, not all of them:
# BAD - blocks all query strings
Disallow: /*?

# GOOD - blocks only known facet params
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?sort=

Better still: set the canonical to the unfaceted URL on the page itself, and let Google ignore the duplicates naturally.
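
As a sketch, the tag lives in the <head> of the faceted page and points at the clean URL (the paths below are placeholders):

<!-- On /shoes?color=red&sort=price -->
<link rel="canonical" href="https://yourdomain.com/shoes">

Remember that Google can only see this tag on URLs it is allowed to crawl, so do not combine it with a Disallow on the same parameters.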

#8 Conflicting User-agent groups

Many devs assume the wildcard group User-agent: * applies to all crawlers including Googlebot. It does not. Once Googlebot finds User-agent: Googlebot, it ignores the wildcard group entirely. So rules written under * that you assumed Googlebot inherits are silently dropped.

Fix: Duplicate critical rules into specific UA groups:
User-agent: *
Disallow: /private/
Disallow: /admin/

User-agent: Googlebot
Disallow: /private/
Disallow: /admin/
Allow: /

User-agent: Bingbot
Disallow: /private/
Disallow: /admin/

#9 Using regex syntax that does not exist

Robots.txt only supports two special characters: * (any sequence) and $ (end of URL). Anything else — parentheses, brackets, plus signs, lookaheads — is treated as a literal character. We have seen Disallow: /(secret|hidden)/ in production where the dev assumed regex worked. That rule blocks the literal string /(secret|hidden)/ and nothing else.

Fix: Write one Disallow per pattern:
# WRONG - this is not regex
Disallow: /(secret|hidden)/

# CORRECT
Disallow: /secret/
Disallow: /hidden/

#10 Treating robots.txt as access control

Listing your /admin/, /private-api/, or /backup/ directories in a public robots.txt is the equivalent of writing a treasure map for attackers. The file is publicly readable by anyone, including bad actors who specifically scan it looking for juicy disallowed paths to attack.

Fix: Use real authentication (HTTP Basic Auth, server-side checks, IP whitelisting) for sensitive paths. Robots.txt is a politeness convention, not a security boundary. Many bad bots ignore it entirely.
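
As a sketch, HTTP Basic Auth in nginx looks like this (paths and realm are placeholders; create the password file with htpasswd from apache2-utils):

# Real access control for the admin area; no robots.txt entry needed
location /admin/ {
    auth_basic           "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;
}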

#11 Forgetting to update after site migrations

The single most common production accident: a developer copies the staging robots.txt (which contains Disallow: / to block staging from Google) to production during a deploy. The site goes live with everything blocked. Recovery takes 4-8 weeks even after fixing because Google has to recrawl and re-evaluate.

Fix: Three defenses, in order:
  1. Never ship the same robots.txt to staging and production — generate it dynamically based on environment variables.
  2. Add a deployment check that fails the build if the production robots.txt contains Disallow: / with no overriding Allow (see the sketch after this list).
  3. Set up uptime monitoring on yourdomain.com/robots.txt with a content-match alert on the string "Disallow: /" (with no Allow follow-up).
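
A minimal sketch of defense #2, assuming the build outputs a static file at public/robots.txt (adapt the path to your pipeline):

#!/usr/bin/env bash
set -euo pipefail

# Fail the build if the robots.txt we are about to ship blanket-blocks the site
if grep -qE '^Disallow: /[[:space:]]*$' public/robots.txt \
   && ! grep -qE '^Allow: /' public/robots.txt; then
  echo "FATAL: robots.txt contains a bare 'Disallow: /' with no overriding Allow" >&2
  exit 1
fi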

The Fix Table (TL;DR)

Mistake | Symptom | One-Line Fix
Blocking CSS/JS | Broken rendered page, failing rich results | Allow: /assets/*.css
Noindex directive | Pages still in the index | Move to <meta robots> or X-Robots-Tag
Missing trailing slash | Adjacent paths blocked | Disallow: /admin/
Case mismatch | Paths not blocked as expected | Match URL casing exactly
No Sitemap directive | Bing/Yandex not finding sitemap | Sitemap: https://yourdomain.com/sitemap.xml
HTML 404 on /robots.txt | Parser falls back to allow-all | Return 200 text/plain
Over-broad query block | Legitimate ?param= URLs deindexed | Block specific params only
Wildcard inheritance assumption | Googlebot ignores * group rules | Duplicate rules in named groups
Regex syntax | Literal string match, blocks nothing | Multiple Disallow lines
Disclosed sensitive paths | Attackers scan robots.txt for targets | Use HTTP auth + omit from robots.txt
Staging file in production | Whole site deindexed | Dynamic robots.txt per environment

The 8-Minute Audit Checklist

Run this after every site change. It catches 90% of robots.txt problems before they hit production. A shell sketch covering steps 1, 2, and 6 follows the list.

  1. Fetch the live file: Open yourdomain.com/robots.txt in an incognito tab. Confirm it returns 200 and shows plain text in the browser, not an HTML 404.
  2. Grep for the killer pattern: Search the file for Disallow: / with no overriding Allow:. If present, your whole site is blocked.
  3. Validate each User-agent group: Make sure every named UA group (Googlebot, Bingbot, etc.) has the same critical Disallows you put in the wildcard group.
  4. Test 10 critical URLs: Paste your robots.txt and your top 10 money pages into the Robots.txt Tester. Every one should return ALLOWED for Googlebot.
  5. Test CSS & JS specifically: Add /assets/main.css and /assets/app.js to the test list. Both must be ALLOWED.
  6. Verify Sitemap directive: Confirm the absolute URL works in your browser and returns valid XML.
  7. Check character encoding: The file must be UTF-8. BOMs or Windows-1252 can break parsing.
  8. Submit to Search Console: Use the robots.txt report under Settings to force a Google re-fetch and confirm parse status.
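
A rough shell sketch of steps 1, 2, and 6 (DOMAIN is a placeholder; adapt to your setup):

#!/usr/bin/env bash
DOMAIN="https://yourdomain.com"

# Step 1: the live file must answer 200 with a text/plain content type
curl -fsSI "$DOMAIN/robots.txt" | grep -qi '^content-type: text/plain' \
  || { echo "robots.txt is missing or not served as text/plain" >&2; exit 1; }

body=$(curl -fsS "$DOMAIN/robots.txt" | tr -d '\r')

# Step 2: the killer pattern, a bare "Disallow: /" with no overriding Allow
if grep -qE '^Disallow: /[[:space:]]*$' <<< "$body" && ! grep -qE '^Allow: /' <<< "$body"; then
  echo "WARNING: the entire site is blocked" >&2
fi

# Step 6: every Sitemap directive should resolve with HTTP 200
grep -iE '^Sitemap:' <<< "$body" | awk '{print $2}' | while read -r sm; do
  printf '%s -> HTTP %s\n' "$sm" "$(curl -s -o /dev/null -w '%{http_code}' "$sm")"
done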

Run This Audit in Your Browser

Paste your robots.txt and URLs into the free Robots.txt Tester. Get per-URL allow/block verdicts with the exact rule that triggered each decision. Works for Googlebot, Bingbot, and 9 other crawlers.

Open Robots.txt Tester →

What Robots.txt Cannot Do

Three persistent misconceptions worth nailing down:

1. It does not remove pages from the index. Robots.txt only blocks crawling. URLs blocked here can still appear in search results if they have external backlinks — Google just shows them with no snippet and the note "No information is available for this page". To truly deindex, use a noindex meta tag or X-Robots-Tag header, or submit a removal request in Search Console, and make sure the URL is not also blocked in robots.txt, or Google will never crawl it and see the noindex.

2. It does not stop bad bots. Spam scrapers, vulnerability scanners, and AI training crawlers from unknown operators routinely ignore robots.txt. If you need to block them, use server-side rules (Cloudflare WAF, .htaccess, nginx geo blocks).

3. It does not consolidate signal between duplicates. If two URLs serve the same content and you block one with robots.txt, you do not pass link equity to the other. For duplicate consolidation, use rel="canonical" or 301 redirects.
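
On the second point, if you need to hard-block an abusive crawler, do it server-side. A minimal nginx sketch with placeholder user-agent names:

# Refuse requests from crawlers that ignore robots.txt (names are illustrative)
if ($http_user_agent ~* "BadBot|EvilScraper") {
    return 403;
}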

Frequently Asked Questions

Does Google still honor the noindex directive in robots.txt?

No. Google officially stopped supporting Noindex: in robots.txt on September 1, 2019. Use the noindex meta tag in HTML or the X-Robots-Tag HTTP header instead.

Can a URL be indexed even if robots.txt blocks it?

Yes. Robots.txt only blocks crawling, not indexing. If a blocked URL has external backlinks, Google may still index it — shown in SERPs with no snippet and the note "No information is available for this page".

Should I block crawlers like AhrefsBot or SemrushBot?

Only if you specifically want to keep those crawlers off your site. Note that blocking them means those tools stop crawling your pages, so your own site's reports in them go stale, and it does not hide backlinks pointing at you, since those are discovered on the linking sites.

What is the maximum size of a robots.txt file?

Google parses up to 500 KiB of robots.txt content. Anything beyond that limit is ignored. For most sites, the file should be well under 5 KB.

Should the Allow directive come before or after Disallow?

Order does not matter for Google or Bing — they use the longest-match rule per RFC 9309. The most specific path wins regardless of line position in the file.
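
A quick illustration: with these two rules in either order, /assets/css/site.css stays crawlable because the Allow path is longer, and therefore more specific, than the Disallow path.

User-agent: *
Disallow: /assets/
Allow: /assets/css/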


Published May 12, 2026 by PositiveBacklink. We help SaaS, e-commerce, and content sites build editorial backlinks through automated ABC link exchange. Learn more →