What is Robots.txt? Why Robots.txt Matters for SEO?

Learn what robots.txt is and how to edit your robots.txt file.

Mar 08, 2024

Contents

What is a Robots.txt file?
Why Robots.txt Matters for SEO?
The Inblog Default Directives
How to Edit a Robots.txt file in Inblog?
Connected as a Subdomain
Connected as a Sub-directory

What is a Robots.txt file?

Warning: Any mistakes you make in your robots.txt can seriously harm your site, so read and understand this article before diving in.

A robots.txt file is a plain text document located in a website’s root directory, serving as a set of instructions to search engine bots.

Robots.txt specifies which pages or sections crawlers may visit and which they should skip. This file helps website owners control the behavior of search engine crawlers by managing access, limiting crawling to specific areas, and, for crawlers that support it, regulating the crawl rate. The file is public and compliance with its directives is voluntary, but it remains a powerful tool for guiding well-behaved search engine bots and influencing what they fetch.

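A basic robots.txt file contains only a handful of directives, such as User-agent, Disallow, Allow, and Sitemap. The section "The Inblog Default Directives" below walks through a complete example.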

Even though you can use the robots.txt file to tell a crawler where it can’t go on your site, you can’t use it to tell a search engine which URLs not to show in the search results. In other words, blocking a URL won’t stop it from being indexed. If the search engine finds enough links to that URL, it will include it; it just won’t know what’s on that page.

Why Robots.txt Matters for SEO?

Managing crawl budget

It’s generally understood that a search spider arrives at a website with a pre-determined “allowance” for how many pages it will crawl (or how much time and resources it will spend), based on the site’s authority, size, and reputation, and on how efficiently the server responds. SEOs call this the crawl budget.

If you think your website has problems with crawl budget, blocking search engines from ‘wasting’ energy on unimportant parts of your site can let them focus instead on the sections that matter.
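
As a rough illustration of blocking low-value sections, the sketch below builds robots.txt text from a list of path prefixes. The helper name build_robots_txt and the paths /search/ and /tag/ are hypothetical, chosen only for this example; which sections count as unimportant depends on your own site.

```python
def build_robots_txt(disallowed_paths, sitemap_url=None):
    """Return robots.txt text that blocks the given path prefixes for all crawlers."""
    lines = ["User-agent: *"]
    for path in disallowed_paths:
        lines.append(f"Disallow: {path}")
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


# Hypothetical low-value sections; replace them with paths from your own site.
robots_txt = build_robots_txt(["/search/", "/tag/"])
print(robots_txt)
```

Keep the list short and review it carefully: as the warning at the top of this article says, an overly broad Disallow rule can hide pages you want crawled.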

Providing the location of the sitemap

Listing the XML sitemap's location in robots.txt helps crawlers discover a website's content more effectively. Although you can also submit the sitemap to Google via Google Search Console, including the sitemap URL in robots.txt lets any crawler that reads the file find the sitemap without a separate submission.
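
To show how a crawler might pick up the sitemap location, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt text and the www.example.com URL are placeholders for illustration, and real search engine crawlers use their own parsers, so treat this only as an approximation of the idea.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content; a real crawler would download this file.
robots_txt = """\
User-agent: *
Allow: /
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() requires Python 3.8+ and returns None when no Sitemap line exists.
sitemaps = parser.site_maps()
print(sitemaps)
```

The Sitemap line is independent of the User-agent groups, so it is found regardless of where it appears in the file.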

The Inblog Default Directives

By default, Inblog generates a robots.txt file with the following contents:

User-agent: *
Allow: /
Sitemap: https://inblog.ai/blog/sitemap.xml

There are four main components to configuring a robots.txt file. While it's not necessary to include all of them, "User-agent" must be included:

User-agent: Specifies the web robots or user agents the directives apply to. For example, you can specify all robots using '*' or target specific ones like Googlebot, Bingbot, etc.
Disallow: Specifies the directories or files that robots are not allowed to crawl. You can list specific URLs or directories here to block access to certain parts of your website.
Allow: Specifies the exceptions to the Disallow directive. It allows specific robots to access certain directories or files that are otherwise blocked by the Disallow directive.
Sitemap: Specifies where the sitemap is located, written as a complete absolute URL (from https:// through /sitemap.xml).

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Sitemap: https://www.example.com/sitemap.xml

In this example:

All robots (*) are disallowed from crawling URLs under /admin/ and /private/.
However, they are allowed to crawl URLs under /public/.
The sitemap is specified at https://www.example.com/sitemap.xml.
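
The behavior described above can be checked with a short sketch based on Python's standard urllib.robotparser module. The helper is_allowed and the sample paths are made up for this illustration. Note that Python's parser applies the first matching rule, while some search engines pick the most specific rule; the two approaches agree here because none of the rules overlap.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from this section.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())


def is_allowed(path, user_agent="Googlebot"):
    """Return True if the parsed rules let user_agent crawl the given path."""
    return parser.can_fetch(user_agent, "https://www.example.com" + path)


# Hypothetical paths chosen to exercise each rule.
sample_paths = ["/admin/settings", "/private/notes.html", "/public/index.html", "/blog/post-1"]
results = {path: is_allowed(path) for path in sample_paths}

for path, allowed in results.items():
    print(path, "allowed" if allowed else "blocked")
```

The two paths under /admin/ and /private/ are reported as blocked, while /public/index.html and /blog/post-1 are allowed; the last one matches no rule at all, and unmatched URLs are crawlable by default.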

Let's review the Inblog default directives again:

User-agent: *
Allow: /
Sitemap: https://inblog.ai/blog/sitemap.xml

These directives allow all search engines to crawl your blog. In addition, we add a link to your sitemap so that search engines can find it and crawl your blog and posts more efficiently.

How to Edit a Robots.txt file in Inblog?

You can modify the robots.txt file in Inblog using the robots.txt editor, which you can find under the "Blog settings > Robots.txt" section.

When you embed a blog on your website using a custom domain, how you edit the robots.txt file differs slightly depending on how the domain is connected: as a subdomain or as a sub-directory.

Connected as a Subdomain

When connecting a custom domain as a Subdomain like 'blog.example.com', Inblog provides default directives for the robots.txt file as follows:

User-agent: *
Allow: /
Sitemap: https://inblog.ai/blog/sitemap.xml

You can directly edit the file using the robots.txt editor in the blog settings.

Connected as a Sub-directory

When connecting a custom domain as a Sub-directory (or sub-folder) like 'example.com/blog', Inblog rewrites the robots.txt file served from your root domain (for example, 'example.com/robots.txt') to add your blog sitemap.

The modified robots.txt file provided by Inblog is as follows:

User-agent: *
Allow: /
Sitemap: https://inblog.ai/sitemap.xml
Sitemap: https://inblog.ai/blog/sitemap.xml

To edit the robots.txt file, follow the instructions below. This setup routes requests for the robots.txt file to Inblog, allowing you to manage it directly within Inblog.

NextJS

Add these rules to the rewrites function in your next.config.js file as shown below. Make sure these rules come first, ahead of any other rewrites.

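const nextConfig = {
  async rewrites() {
    return {
      // beforeFiles rewrites are checked before pages and public files,
      // so they take priority over a local robots.txt.
      beforeFiles: [
        // Blog index page
        {
          source: "/blog",
          destination: "https://inblog.ai/grayzipblog",
        },
        // Individual posts and other blog pages
        {
          source: "/blog/:path*",
          destination: "https://inblog.ai/grayzipblog/:path*",
        },
        // Serve the robots.txt file managed in Inblog
        {
          source: "/robots.txt",
          destination: "https://inblog.ai/grayzipblog/robots.txt",
        },
        // Inblog's own paths under /_inblog
        {
          source: "/_inblog/:path*",
          destination: "https://inblog.ai/grayzipblog/_inblog/:path*",
        },
      ],
    };
  },
};

module.exports = nextConfig;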

AWS Amplify

Go to App Settings -> Rewrites and redirects. Click "Open text editor" and add the following rules. Ensure these rules are positioned at the top of the list.

[
  {
    "source": "/blog",
    "target": "https://inblog.ai/grayzipblog",
    "status": "200",
    "condition": null
  },
  {
    "source": "/blog/<*>",
    "target": "https://inblog.ai/grayzipblog/<*>",
    "status": "200",
    "condition": null
  },
  {
    "source": "/robots.txt",
    "target": "https://inblog.ai/grayzipblog/robots.txt",
    "status": "200",
    "condition": null
  },
  {
    "source": "/_inblog/<*>",
    "target": "https://inblog.ai/grayzipblog/_inblog/<*>",
    "status": "200",
    "condition": null
  }
]
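
The rules above use grayzipblog in every target URL. Assuming that this segment is the blog's own address on inblog.ai and that you would substitute yours, the following sketch generates the same four rules for any blog. The function name amplify_rewrite_rules and its parameters are hypothetical helpers for this illustration, not part of Inblog or AWS Amplify.

```python
import json


def amplify_rewrite_rules(blog_slug, mount_path="/blog"):
    """Build the four rewrite rules shown above for the given Inblog blog slug."""
    base = f"https://inblog.ai/{blog_slug}"

    def rule(source, target):
        # Status "200" marks the rule as a rewrite rather than a redirect.
        return {"source": source, "target": target, "status": "200", "condition": None}

    return [
        rule(mount_path, base),
        rule(f"{mount_path}/<*>", f"{base}/<*>"),
        rule("/robots.txt", f"{base}/robots.txt"),
        rule("/_inblog/<*>", f"{base}/_inblog/<*>"),
    ]


# "grayzipblog" is the example slug used in this article; swap in your own.
rules = amplify_rewrite_rules("grayzipblog")
print(json.dumps(rules, indent=2))
```

The printed JSON can be pasted into the Amplify text editor; Python's None is serialized as null, matching the "condition": null entries above.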
