SEO

Robots.txt

Robots.txt is a publicly accessible text file located in a website's root directory (`/robots.txt`) that serves as a standard protocol (Robots Exclusion Protocol) for guiding search engine crawlers on which URLs they may access on the site.

Why It Matters

Search engines are limited in the number of pages they visit per day based on the crawl budget allocated to each website. A properly configured robots.txt blocks unnecessary paths — such as admin pages, API endpoints, and duplicate content — from being crawled, allowing crawl budget to be focused on core content. For large-scale sites with thousands of pages or more, this configuration directly impacts indexing speed and overall SEO performance.

Since 2025, the emergence of AI crawlers such as GPTBot, CCBot, and PerplexityBot has further expanded the role of robots.txt. Granular management is now required — allowing search engine crawlers while separately blocking AI training crawlers.

Key Directives

| Directive | Description | Example |
| --- | --- | --- |
| `User-agent` | Specifies which crawler the rules apply to. `*` means all crawlers. | `User-agent: Googlebot` |
| `Disallow` | Specifies paths to block from crawling. | `Disallow: /admin/` |
| `Allow` | Permits specific sub-paths within a Disallow-blocked parent path. | `Allow: /admin/public/` |
| `Sitemap` | Specifies the URL of the XML sitemap. Conventionally placed at the bottom of the file. | `Sitemap: https://example.com/sitemap.xml` |
| `Crawl-delay` | Sets the wait time in seconds between crawler requests. Googlebot ignores this directive. | `Crawl-delay: 10` |
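How these directives resolve for a given URL can be sketched with Python's standard-library `urllib.robotparser`. Note that Python's parser applies rules in file order with simple prefix matching and ignores Google-style `*` wildcards, so treat this as an approximation, not a substitute for Google's own behavior; the rules below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules. Allow is listed before Disallow because Python's
# parser uses first-match-wins ordering, unlike Google's longest-match rule.
rules = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))        # True: no rule matches
print(rp.can_fetch("Googlebot", "https://example.com/admin/login"))      # False: Disallow /admin/
print(rp.can_fetch("Googlebot", "https://example.com/admin/public/faq")) # True: Allow permits the sub-path
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))           # False: GPTBot's own group blocks everything
```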

Configuration Guide

Below is a recommended robots.txt configuration as of 2026:

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /*?*utm_

# Block AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Sitemap
Sitemap: https://example.com/sitemap.xml
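As a sanity check, the recommended file above can be fed to Python's `urllib.robotparser`. Python's parser differs from Google's matching for the `Allow: /` and wildcard lines (first-match prefix rules, no `*` support), so this sketch only spot-checks the per-crawler groups and the Sitemap line:

```python
from urllib.robotparser import RobotFileParser

config = """\
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /*?*utm_

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(config.splitlines())

# Each AI-training crawler matches its own group and is blocked site-wide.
for bot in ("GPTBot", "CCBot", "anthropic-ai"):
    print(bot, rp.can_fetch(bot, "https://example.com/blog/post"))  # all False

# Search crawlers fall back to the * group and remain allowed on content pages.
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))   # True

# The Sitemap directive is exposed as well.
print(rp.site_maps())  # ['https://example.com/sitemap.xml']
```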

Important considerations:

  1. File location: The file must be located at the domain root (https://example.com/robots.txt). Crawlers will not find it in a subdirectory.
  2. Case sensitivity: URL paths are case-sensitive. Disallow: /Private/ does not block /private/.
  3. Testing is mandatory: Use the robots.txt report in Google Search Console (which replaced the old robots.txt Tester) to verify that Googlebot can fetch and correctly interpret the file.
  4. Sitemap integration: While directly submitting your sitemap to Google Search Console and Bing Webmaster Tools is recommended, it is also good practice to specify it in robots.txt.
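The case-sensitivity rule can be demonstrated with `urllib.robotparser`, which, like real crawlers, compares paths case-sensitively (the `/Private/` path here is hypothetical):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /Private/
""".splitlines())

# Only the exact-case prefix is blocked; the lowercase variant stays crawlable.
print(rp.can_fetch("*", "https://example.com/Private/report"))  # False
print(rp.can_fetch("*", "https://example.com/private/report"))  # True
```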

Common Mistakes

  • Treating it as a security tool: Robots.txt is merely a request to crawlers — it does not physically block access. Sensitive pages require separate security measures such as server authentication or IP blocking.
  • Confusing Disallow with noindex: Disallow blocks crawling, not indexing. A disallowed page that other sites link to can still appear in search results (typically as a bare URL without a snippet) even though it was never crawled. To keep a page out of search results entirely, use a noindex meta tag, and leave the page crawlable so the tag can actually be seen.
  • Accidentally blocking the entire site: Setting Disallow: / under User-agent: * blocks all crawlers from accessing the entire site. A frequent mistake is using this setting during a site redesign or on a staging environment and forgetting to revert it for production deployment.
  • Blocking CSS and JS files: Googlebot renders pages to evaluate content. Blocking CSS or JavaScript file crawling results in incomplete rendering and can lower SEO scores.
  • Exposing sensitive paths in robots.txt: Robots.txt is a publicly accessible file that anyone can view. Listing a private path like /secret-admin-panel/ in Disallow actually reveals the existence of that path to the outside world.
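For completeness, a page is kept out of the index with a `noindex` directive rather than a Disallow rule. It can be delivered either as a meta tag or as an equivalent HTTP response header; in both cases the page must remain crawlable so the crawler can see the directive:

```html
<!-- Option 1: meta tag in the page's <head> -->
<meta name="robots" content="noindex">

<!-- Option 2: equivalent HTTP response header
     (useful for non-HTML resources such as PDFs):
     X-Robots-Tag: noindex -->
```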

How inblog Helps

inblog allows search engine crawlers by default and provides AI crawler (GPTBot, etc.) management through the dashboard.