Robots.txt

Robots.txt is a publicly accessible text file located in a website's root directory (/robots.txt). It implements the Robots Exclusion Protocol, a standard for telling search engine crawlers which URLs they may access on the site.

Why It Matters

Search engines allocate each website a crawl budget, which caps how many pages they fetch in a given period. A properly configured robots.txt keeps crawlers away from unnecessary paths, such as admin pages, API endpoints, and duplicate content, so that crawl budget is spent on core content. For large sites with thousands of pages or more, this configuration directly affects indexing speed and overall SEO performance.

Since 2023, the emergence of AI crawlers and control tokens such as GPTBot, CCBot, PerplexityBot, and Google-Extended has further expanded the role of robots.txt. The safest default for public marketing content is to allow crawler access and restrict only paths that waste crawl budget or expose non-public surfaces. Block AI training crawlers only when that matches your content licensing and AI visibility strategy.

Key Directives

| Directive | Description | Example |
| --- | --- | --- |
| User-agent | Specifies which crawler the rules apply to. * means all crawlers. | User-agent: Googlebot |
| Disallow | Specifies paths to block from crawling. | Disallow: /admin/ |
| Allow | Permits specific sub-paths within a Disallow-blocked parent path. | Allow: /admin/public/ |
| Sitemap | Specifies the URL of the XML sitemap. Conventionally placed at the bottom of the file. | Sitemap: https://example.com/sitemap.xml |
| Crawl-delay | Sets the wait time in seconds between crawler requests. Googlebot ignores this directive. | Crawl-delay: 10 |
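
To see how these directives are read in practice, here is a minimal sketch using Python's standard-library urllib.robotparser. The ExampleBot user agent and the example.com URLs are placeholders, not taken from a real site. One caveat: this parser applies the first matching rule in file order rather than Google's longest-match rule, so the Allow line is deliberately placed before the broader Disallow line.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt that uses each directive from the table.
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Disallow blocks the parent path.
print(parser.can_fetch("ExampleBot", "https://example.com/admin/settings"))     # False
# Allow re-opens a sub-path inside the blocked parent.
print(parser.can_fetch("ExampleBot", "https://example.com/admin/public/help"))  # True
# Crawl-delay is exposed in seconds (crawlers are free to ignore it).
print(parser.crawl_delay("ExampleBot"))                                          # 10
# Sitemap lines are collected separately from the per-crawler rules (Python 3.8+).
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```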

Configuration Guide

For a public blog, the baseline configuration should be simple:

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

Add Disallow rules only for areas that should not be crawled, such as internal search, admin routes, duplicate filter URLs, or API endpoints. If you need to block specific AI training crawlers while keeping search crawlers open, isolate those user agents:

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /*?*utm_

# Block AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Sitemap
Sitemap: https://example.com/sitemap.xml
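
As a rough check of how the user-agent groups above isolate AI crawlers, the sketch below runs a similar file through Python's urllib.robotparser. Two things are changed for this parser and are assumptions of the example, not recommendations: the Disallow lines come before "Allow: /" because urllib.robotparser applies the first matching rule, and the wildcard rule is left out because this parser does not expand "*" in paths.

```python
from urllib.robotparser import RobotFileParser

# Adapted copy of the configuration: reordered, wildcard rule omitted.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /

# Block AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("Googlebot", "GPTBot", "CCBot"):
    print(
        agent,
        allowed(agent, "https://example.com/blog/post"),
        allowed(agent, "https://example.com/admin/"),
    )
```

In this sketch a search crawler such as Googlebot falls through to the "*" group and is blocked only from /admin/ and /api/, while GPTBot and CCBot match their own groups and are blocked everywhere.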

Important considerations:

  1. File location: The file must be located at the root of the host it applies to (https://example.com/robots.txt). Crawlers do not look for it in subdirectories, so a file placed anywhere else is ignored.
  2. Case sensitivity: URL paths are case-sensitive. Disallow: /Private/ does not block /private/.
  3. Rule matching: Google uses the most specific matching rule. If Allow and Disallow rules both match a URL, the rule with the longer matching path wins. If the two paths are the same length, the less restrictive rule (Allow) wins.
  4. HTTP status handling: A 404 or 410 robots.txt is treated as if no restrictions exist. A 5xx response can temporarily stop crawling because Google cannot tell whether the rules are unavailable or intentionally restrictive.
  5. Testing is mandatory: Use Search Console's robots.txt report and URL Inspection tool to verify that Googlebot can fetch the file and that important URLs are not blocked.
  6. Sitemap integration: While directly submitting your sitemap to Google Search Console and Bing Webmaster Tools is recommended, it is also good practice to specify it in robots.txt.
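
The case-sensitivity and rule-matching points above can be sketched in a few lines of Python. This is a simplified illustration, not Google's actual matcher: the function names are hypothetical, pattern length is measured in characters as a stand-in for specificity, and only the "*" wildcard and a trailing "$" anchor are handled.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the path start."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # "*" matches any run of characters; everything else is literal.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))


def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules: (directive, pattern) pairs from the single group that applies to the crawler."""
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, pattern in rules:
        if not pattern:  # an empty pattern matches nothing
            continue
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            # Longer pattern wins; on a tie, Allow wins.
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len = len(pattern)
                allowed = is_allow
    return allowed


RULES = [
    ("Allow", "/"),
    ("Disallow", "/admin/"),
    ("Disallow", "/api/"),
    ("Disallow", "/*?*utm_"),
]

print(is_allowed("/blog/post", RULES))               # True: only "Allow: /" matches
print(is_allowed("/admin/users", RULES))             # False: "/admin/" is longer than "/"
print(is_allowed("/Admin/users", RULES))             # True: paths are case-sensitive
print(is_allowed("/blog/post?utm_source=x", RULES))  # False: wildcard rule matches
```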

Common Mistakes

  • Treating it as a security tool: Robots.txt is merely a request to crawlers — it does not physically block access. Sensitive pages require separate security measures such as server authentication or IP blocking.
  • Confusing Disallow with noindex: Disallow only blocks crawling, not indexing. Pages with external links can still appear in search results even without being crawled. To remove a page from search results completely, use a noindex meta tag or an X-Robots-Tag: noindex HTTP header on a page that crawlers are allowed to fetch.
  • Blocking a page before Google can see noindex: If you add Disallow and noindex together, Google may never crawl the page and therefore never see the noindex directive.
  • Accidentally blocking the entire site: Setting Disallow: / under User-agent: * blocks all crawlers from accessing the entire site. A frequent mistake is using this setting during a site redesign or on a staging environment and forgetting to revert it for production deployment.
  • Blocking CSS and JS files: Googlebot renders pages to evaluate content. Blocking crawling of CSS or JavaScript files results in incomplete rendering, which can hurt how the page is indexed and ranked.
  • Exposing sensitive paths in robots.txt: Robots.txt is a publicly accessible file that anyone can view. Listing a private path like /secret-admin-panel/ in Disallow actually reveals the existence of that path to the outside world.
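
One way to catch the "accidentally blocking the entire site" mistake before it reaches production is a small pre-deployment check. The sketch below is an assumption about how such a guard might look, built on Python's urllib.robotparser; the function name and the sample file contents are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text forbids user_agent from fetching "/"."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, "/")


# Hypothetical file contents for a staging and a production environment.
STAGING = "User-agent: *\nDisallow: /\n"
PRODUCTION = "User-agent: *\nAllow: /\n\nSitemap: https://example.com/sitemap.xml\n"

print(blocks_entire_site(STAGING))     # True
print(blocks_entire_site(PRODUCTION))  # False
```

Running a check like this in a deployment pipeline and failing the build when it returns True for the production file is one option; it does not replace verifying the live file in Search Console.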

How inblog Helps

inblog allows search engine crawlers by default and provides AI crawler (GPTBot, etc.) management through the dashboard.