
Page blocked by robots.txt: why and what to do

robots.txt looks simple, but a single Disallow line in the wrong group can keep entire sections of your site out of Google's crawl.

What this usually means

Google reports a URL as "Blocked by robots.txt." That means a Disallow rule in your robots.txt file matched the URL for the Googlebot user-agent, so Google won't fetch the page and can't read its content. The URL can still appear in search results, just without a snippet, which is often worse than not appearing at all.

Why it matters

robots.txt controls crawling, not indexing. A blocked URL with strong external signals can still be indexed as a bare URL with no title or description. Long-running blocks also stop Google from re-evaluating noindex tags, canonicals, or content updates on the affected pages.

Common causes
  • A broad Disallow prefix (e.g. Disallow: /api/ or Disallow: /private/) accidentally matches a public URL.
  • A staging robots.txt containing Disallow: / was deployed to production.
  • A user-agent-specific group for Googlebot is more restrictive than the catch-all * group. Googlebot obeys only the most specific group that names it; rules from * are not inherited.
  • An Allow rule that looks more specific is actually shorter than the matching Disallow. Google applies the longest matching rule, so length, not intent, decides.
  • URL parameters or trailing slashes cause unexpected matches.
  • A CDN or framework serves a different robots.txt than the one in your repo.
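
The two precedence causes above (group selection and rule length) can be sketched as a toy matcher. This is a simplified sketch of Google's documented longest-match behavior, not a production parser: it supports plain prefix matching only (no * or $ wildcards), and effective_rule and the /api/ paths are hypothetical names for illustration.

```python
def effective_rule(robots_txt, user_agent, path):
    """Return (allowed, matched_rule) under longest-match precedence.

    Simplified: plain prefix matching only (no * or $ wildcards), and
    the user-agent must equal a group name exactly or fall back to '*'.
    """
    groups = {}          # group name -> list of (directive, path_prefix)
    current_agents = []  # agents named by the group being parsed
    in_agent_run = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not in_agent_run:        # a new group starts here
                current_agents = []
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
            in_agent_run = True
        elif field in ("allow", "disallow"):
            for agent in current_agents:
                groups[agent].append((field, value))
            in_agent_run = False

    # A crawler obeys only the most specific group that names it;
    # rules from '*' are NOT merged in.
    agent = user_agent.lower()
    rules = groups[agent] if agent in groups else groups.get("*", [])

    best = None
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            # Longer rule wins; Allow beats Disallow on a length tie.
            key = (len(prefix), directive == "allow")
            if best is None or key > best[0]:
                best = (key, directive, prefix)
    if best is None:
        return True, None               # no rule matched: crawl allowed
    _, directive, prefix = best
    return directive == "allow", f"{directive.capitalize()}: {prefix}"


robots = """\
User-agent: *
Disallow: /api/
Allow: /api/public/
"""

print(effective_rule(robots, "Googlebot", "/api/public/status"))
# → (True, 'Allow: /api/public/')
print(effective_rule(robots, "Googlebot", "/api/internal/keys"))
# → (False, 'Disallow: /api/')
```

Note how the Allow wins for /api/public/status only because its 12-character prefix is longer than the 5-character Disallow. Real parsers also handle wildcards and percent-encoding; for authoritative answers use Google's open-source robotstxt parser or a tester tool.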

How to diagnose it
  1. Open Robots Tester, paste the URL Google is blocking, and select Googlebot.
  2. Confirm robots.txt actually loads with status 200.
  3. Read which group matched (Googlebot or *) and which exact rule applied.
  4. If the matched rule is a Disallow, that's the line to change.
  5. Also test with Bingbot and other crawlers if those matter to your traffic.
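
For a quick local approximation of these checks (assuming Python is available), the standard library's urllib.robotparser can replay the match. One caveat: its rule matching is first-rule-in-file order rather than Google's longest-match precedence, so results can differ on files that mix Allow and Disallow. Group selection, which is what this example demonstrates, behaves the same way.

```python
from urllib import robotparser

# Inline copy of the file to test; to check the live file you would
# instead call rp.set_url("https://example.com/robots.txt") and rp.read()
# (example.com stands in for your own domain).
ROBOTS = """\
User-agent: *
Allow: /

User-agent: Googlebot
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Googlebot obeys only the group that names it, so the Disallow applies:
print(rp.can_fetch("Googlebot", "https://example.com/private/page"))  # → False
# Other crawlers fall back to the permissive catch-all group:
print(rp.can_fetch("Bingbot", "https://example.com/private/page"))    # → True
```

Swap in the user-agents that matter to your traffic (step 5) and the URLs Google reports as blocked (step 1).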

How to fix it
  1. Identify the exact line

    Robots Tester shows the exact group, line number, and rule that matched. Edit that line in your robots.txt source, not somewhere upstream that overrides it.

  2. Loosen the Disallow or add a targeted Allow

    If the Disallow is too broad, narrow it (e.g. Disallow: /api/internal/ instead of Disallow: /api/). Otherwise add a more specific Allow line for the URLs you want crawled.

  3. Sync staging vs. production robots.txt

    A common foot-gun: a one-line Disallow: / staging file gets promoted to production. Add an explicit production robots.txt to your deploy pipeline.

  4. Confirm what your CDN actually serves

    Re-run Robots Tester after deploy. The file in your repo doesn't matter; only the file Googlebot fetches at /robots.txt does.

  5. Use noindex instead of robots.txt to remove pages

    If your real goal is "don't index this URL," robots.txt is the wrong tool. Allow crawling and add a noindex meta tag or X-Robots-Tag header so Google can read the directive.
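
In robots.txt terms, the two options from step 2 look like this (the /api/ paths are hypothetical stand-ins for your own URL structure):

```
# Before: too broad; it also blocks /api/docs/, which should be crawlable
User-agent: *
Disallow: /api/

# After, option A: narrow the Disallow to the truly private section
User-agent: *
Disallow: /api/internal/

# After, option B: keep the broad Disallow, carve out a targeted Allow
# (the Allow wins only because it is the longer match)
User-agent: *
Disallow: /api/
Allow: /api/docs/
```

Prefer option A when the private section has a clean prefix; option B depends on longest-match precedence, so re-test it after any edit.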

FAQ
Is robots.txt the same as noindex?

No. robots.txt controls whether Googlebot is allowed to crawl a URL. noindex tells Google not to include the URL in the search index. To use noindex effectively, the page must be crawlable.

Can Google index a URL blocked by robots.txt?

Yes, sometimes. If the URL has external links pointing to it, Google can index a bare URL with no title or description. To remove the URL completely, allow crawl and add noindex.
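
Concretely, the "allow crawl, then noindex" pattern means serving a directive Google can actually read, for example:

```
<!-- In the <head> of the page you want removed from the index.
     The URL must NOT be disallowed in robots.txt, or Googlebot
     never fetches the page and never sees this tag. -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, send the equivalent HTTP response header instead: X-Robots-Tag: noindex.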

Should I block private pages with robots.txt?

Treat robots.txt as a crawling hint, not a security boundary. For anything truly private, require authentication. robots.txt is publicly readable, so it can even reveal what you're trying to hide.

Ready to diagnose your URL?

Robots Tester runs the exact checks discussed above.

Run Robots Tester