Thursday, June 3, 2021

Robots.txt: The Deceptively Important File All Websites Need

The robots.txt file helps major search engines understand where they're allowed to go on your website.

But, while the major search engines do support the robots.txt file, they may not all adhere to the rules the same way.

Below, let's break down what a robots.txt file is, and how you can use it.


What is a robots.txt file?

Every day, there are visits to your website from bots — also known as robots or spiders. Search engines like Google, Yahoo, and Bing send these bots to your site so your content can be crawled and indexed and appear in search results.

Bots are a good thing, but there are some cases where you don't want the bot running around your website crawling and indexing everything. That's where the robots.txt file comes in.

By adding certain directives to a robots.txt file, you're directing the bots to crawl only the pages you want crawled.

However, it's important to understand that not every bot will adhere to the rules you write in your robots.txt file. Google, for instance, won't listen to any directives that you place in the file about crawling frequency.

Do you need a robots.txt file?

No, a robots.txt file is not required for a website.

If a bot comes to your website and finds no robots.txt file, it will just crawl your website and index pages as it normally would.

A robots.txt file is only needed if you want more control over what is being crawled.

Some benefits to having one include:

  • Help manage server overloads
  • Prevent crawl waste by bots visiting pages you do not want them to crawl
  • Keep crawlers out of certain folders or subdomains (each subdomain needs its own robots.txt file)

Can a robots.txt file prevent indexing of content?

No, you cannot stop content from being indexed and shown in search results with a robots.txt file.

Not all robots will follow the instructions the same way, so some may index the content you set to not be crawled or indexed.

In addition, if the content you are trying to keep out of the search results has external links pointing to it, search engines may still index it.

The most reliable way to ensure your content is not indexed is to add a noindex meta tag to the page. The line of code looks like this and goes in the <head> section of your page's HTML.

<meta name="robots" content="noindex">

It's important to note that if you want the search engines to not index a page, you will need to allow the page to be crawled in robots.txt. A bot that is blocked from crawling the page never fetches it, so it never sees the noindex tag.
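
As a rough illustration of what a crawler looks for, here is a minimal Python sketch that checks a page's HTML for a robots noindex meta tag. The names RobotsMetaParser and has_noindex are made up for this example; only the standard library html.parser module is used, and real crawlers may apply additional rules.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Hypothetical helper: remembers whether a robots noindex meta tag was seen."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = (attributes.get("content") or "").lower()
        # The content attribute can hold several comma-separated values.
        values = [value.strip() for value in content.split(",")]
        if name == "robots" and "noindex" in values:
            self.noindex = True


def has_noindex(html):
    """Return True if the HTML contains <meta name="robots" content="...noindex...">."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(has_noindex(page))
```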

Where is the robots.txt file located?

The robots.txt file will always sit at the root of a website's domain. As an example, our own file can be found at https://www.hubspot.com/robots.txt.

On most websites you should be able to access the actual file so you can edit it with an FTP client or through the File Manager in your host's cPanel.
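
Because the location is fixed, the robots.txt address for any page can be worked out from the page's own URL. Below is a small Python sketch of that idea; the function name robots_url and the sample page path are illustrative assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Build the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path here is a made-up example.
print(robots_url("https://www.hubspot.com/marketing/seo?x=1"))
```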

In some CMS platforms you can find the file right in your administrative area. HubSpot, for instance, makes it easy to customize your robots.txt file from your account.

If you are on WordPress, the robots.txt file can be accessed in the public_html folder of your website.


WordPress does include a robots.txt file by default with a new installation that will include the following:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

The above is telling all bots to crawl all parts of the website except anything under the /wp-admin/ or /wp-includes/ directories.
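
If you want to see how a parser reads those rules, Python's standard library ships a basic robots.txt parser, urllib.robotparser. The sketch below feeds it the WordPress rules and asks whether two URLs may be fetched. The example.com URLs are placeholders, and this parser is only one implementation, so a real search engine bot may behave differently.

```python
from urllib.robotparser import RobotFileParser

wordpress_rules = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
"""

parser = RobotFileParser()
parser.parse(wordpress_rules.splitlines())

# Placeholder URLs on a hypothetical site.
admin_allowed = parser.can_fetch("*", "https://example.com/wp-admin/options.php")
blog_allowed = parser.can_fetch("*", "https://example.com/blog/my-post")

print("wp-admin page allowed:", admin_allowed)
print("blog post allowed:", blog_allowed)
```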

But you may want to create a more robust file. Let's show you how, below.

Uses for a Robots.txt File

There could be many reasons you want to customize your robots.txt file — from controlling crawl budget, to blocking sections of a website from being crawled and indexed. Let's explore a few reasons for using a robots.txt file now.

1. Block All Crawlers

Blocking all crawlers from accessing your site is not something you would want to do on an active website, but is a great option for a development website. When you block the crawlers it will help prevent your pages from being shown on search engines, which is good if your pages aren't ready for viewing yet.

2. Disallow Certain Pages From Being Crawled

One of the most common and useful ways to use your robots.txt file is to limit search engine bot access to parts of your website. This can help maximize your crawl budget and prevent unwanted pages from winding up in the search results.

It is important to note that just because you have told a bot to not crawl a page, that doesn't mean it will not get indexed. If you don't want a page to show up in the search results, you need to add a noindex meta tag to the page.

Sample Robots.txt File Directives

The robots.txt file is made up of blocks of directives. Each block begins with a user-agent line, and the rules for that user-agent are placed below it.

When a specific search engine bot lands on your website, it will look for the user-agent that applies to it and read the block that refers to it.

There are several directives you can use in your file. Let's break those down, now.

1. User-Agent

The user-agent directive lets you target rules at specific bots or spiders. For instance, if you only want to target Bing or Google, this is the directive you'd use.

While there are hundreds of user-agents, below are examples of some of the most common user-agent options.

User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Googlebot-Mobile
User-agent: Googlebot-News
User-agent: Bingbot
User-agent: Baiduspider
User-agent: msnbot
User-agent: slurp (Yahoo)
User-agent: yandex

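It's important to note that the paths in your rules are case-sensitive, so /Portfolio and /portfolio are treated as different URLs. User-agent names, on the other hand, are matched without regard to case by Google and by the Robots Exclusion Protocol standard, but it is still good practice to enter them exactly as each search engine documents them.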

Wildcard User-agent

The wildcard user-agent is written as an asterisk (*) and lets you easily apply a directive to all user-agents that exist. So if you want a specific rule to apply to every bot, you can use this user-agent.

User-agent: *

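A bot will only follow the block of rules that most closely applies to it. If your file has both a wildcard block and a block that names a specific bot, that bot follows its own block and ignores the wildcard block.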
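
The sketch below shows this block selection using Python's standard urllib.robotparser module. The rules, the /private/ folder and the example.com URLs are invented for illustration, and other parsers may differ in the details.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/

User-agent: Bingbot
Disallow: /portfolio
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Bingbot has its own block, so the wildcard block is not applied to it.
bing_private = parser.can_fetch("Bingbot", "https://example.com/private/report")
bing_portfolio = parser.can_fetch("Bingbot", "https://example.com/portfolio/item")

# Googlebot has no block of its own, so it falls back to the wildcard block.
google_private = parser.can_fetch("Googlebot", "https://example.com/private/report")
google_portfolio = parser.can_fetch("Googlebot", "https://example.com/portfolio/item")

print(bing_private, bing_portfolio, google_private, google_portfolio)
```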

2. Disallow

The disallow directive tells search engines to not crawl or access certain pages or directories on a website.

Below are several examples of how you might use the disallow directive.

Block Access to a Specific Folder

In this example we are telling all bots to not crawl anything in the /portfolio directory on our website.

User-agent: *
Disallow: /portfolio

If we only want Bing to not crawl that directory, we would add it like this, instead:

User-agent: Bingbot
Disallow: /portfolio
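
One detail worth knowing is that a disallow path is matched as a prefix of the URL path. The Python sketch below, using the standard urllib.robotparser module, suggests that /portfolio also catches a URL such as /portfolio-archive, while /portfolio/ (with a trailing slash) only catches URLs inside the folder. The helper name allowed and the example.com URLs are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser


def allowed(disallow_path, url):
    """Parse a one-rule file for all bots and report whether url may be crawled."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", "Disallow: " + disallow_path])
    return parser.can_fetch("*", url)


for path in ("/portfolio", "/portfolio/"):
    for url in (
        "https://example.com/portfolio/item-1",
        "https://example.com/portfolio-archive",
    ):
        print(path, url, allowed(path, url))
```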

Block PDF or Other File Types

If you don't want your PDF or other file types crawled, then the below directive should help. We are telling all bots that we do not want any PDF files crawled. The $ at the end is telling the search engine that it is the end of the URL.

So if I have a pdf file at mywebsite.com/site/myimportantinfo.pdf, the search engines won't access it.

User-agent: *
Disallow: /*.pdf$

For PowerPoint files, you could use:

User-agent: *
Disallow: /*.ppt$

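A better option might be to put your PDF or other files in one folder and disallow crawling of that folder. Keep in mind that a file such as a PDF cannot carry a meta tag. To noindex those files, the noindex instruction has to be sent in an X-Robots-Tag HTTP response header instead, and the files must stay crawlable for the bots to see it.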

Block Access to the Whole Website

Particularly useful if you have a development website or test folders, this directive is telling all bots to not crawl your site at all. It's important to remember to remove this when you set your site live, or you will have indexation issues.

User-agent: *
Disallow: /

The * (asterisk) you see above is what we call a "wildcard" expression. When we use an asterisk, we are implying that the rules below should apply to all user-agents.
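
A single character makes a big difference here. In the Python sketch below, which uses the standard urllib.robotparser module, a disallow rule with a slash blocks every URL, while a disallow rule with nothing after it blocks nothing. The helper name site_is_crawlable and the example.com URL are illustrative only.

```python
from urllib.robotparser import RobotFileParser


def site_is_crawlable(robots_lines, url="https://example.com/any/page"):
    """Return whether a generic bot may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch("*", url)


block_everything = ["User-agent: *", "Disallow: /"]
block_nothing = ["User-agent: *", "Disallow:"]

print("Disallow: /  ->", site_is_crawlable(block_everything))
print("Disallow:    ->", site_is_crawlable(block_nothing))
```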

3. Allow

The allow directive can help you specify certain pages or directories that you do want bots to access and crawl. This can be an override rule to the disallow option, seen above.

In the example below we are telling Googlebot that we do not want the portfolio directory crawled, but we do want one specific portfolio item to be accessed and crawled:

User-agent: Googlebot
Disallow: /portfolio
Allow: /portfolio/crawlableportfolio
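
Google and the Robots Exclusion Protocol standard resolve a conflict like this by using the most specific rule, meaning the one with the longest matching path, and by preferring allow when two rules tie. The Python sketch below is a simplified model of that precedence, not Google's actual code. It ignores wildcards, and the function name is_allowed is an assumption for this example. Parsers that apply the first matching rule in file order, such as Python's built-in urllib.robotparser, can give a different answer for the same rules.

```python
def is_allowed(rules, path):
    """Apply longest-match precedence to a list of (directive, path_prefix) rules."""
    best = None
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = best is None or len(prefix) > len(best[1])
        tie_prefers_allow = (
            best is not None
            and len(prefix) == len(best[1])
            and directive == "allow"
        )
        if longer or tie_prefers_allow:
            best = (directive, prefix)
    # A path that matches no rule is allowed by default.
    return best is None or best[0] == "allow"


googlebot_rules = [
    ("disallow", "/portfolio"),
    ("allow", "/portfolio/crawlableportfolio"),
]

for path in ("/portfolio/crawlableportfolio", "/portfolio/other-item", "/about"):
    print(path, is_allowed(googlebot_rules, path))
```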

4. Sitemap

Including the location of your sitemap in your file can make it easier for search engine crawlers to find your sitemap.

If you submit your sitemaps directly to each search engine's webmaster tools, then it is not necessary to add it to your robots.txt file.

Sitemap: https://yourwebsite.com/sitemap.xml
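
A crawler reads this line as a plain URL. As a small sketch, Python's standard urllib.robotparser module (version 3.8 or later) exposes the sitemap entries it finds through site_maps(). The extra disallow rule in the sample is only there to make the file look realistic.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /wp-admin/

Sitemap: https://yourwebsite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file lists none.
sitemaps = parser.site_maps()
print(sitemaps)
```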

5. Crawl Delay

Crawl delay can tell a bot to slow down when crawling your website so your server does not become overwhelmed. The directive example below is asking Yandex to wait 10 seconds after each crawl action it takes on the website.

User-agent: yandex
Crawl-delay: 10

This is a directive you should be careful with. On a very large website it can greatly reduce the number of URLs crawled each day, which would be counterproductive. This can be useful on smaller websites, however, where the bots are visiting a bit too much.

Note: Crawl-delay is not supported by Google or Baidu. If you want to ask their crawlers to slow their crawling of your website, you will need to do it through their tools.
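
For bots that do honor the directive, the value is simply read per user-agent. The Python sketch below uses the standard urllib.robotparser module to read the delay back from the Yandex example. A bot with no matching block gets no delay at all.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: yandex
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# crawl_delay() returns the number of seconds, or None when nothing applies.
yandex_delay = parser.crawl_delay("yandex")
other_delay = parser.crawl_delay("Googlebot")

print("yandex:", yandex_delay)
print("Googlebot:", other_delay)
```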

What are regular expressions and wildcards?

Pattern matching is a more advanced way of controlling how a bot crawls your website, using special characters in your rules. Robots.txt does not support full regular expressions, only a small set of pattern-matching characters.

There are two such characters that are common and are supported by both Bing and Google. They can be especially useful on ecommerce websites.

Asterisk: * is treated as a wildcard and can represent any sequence of characters

Dollar sign: $ is used to designate the end of a URL

A good example of using the * wildcard is the scenario where you want to prevent the search engines from crawling pages that have a question mark in their URL. The code below tells all bots to skip any URL that contains a question mark.

User-agent: *
Disallow: /*?
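
To see how the two characters behave, here is a minimal Python sketch that turns a robots.txt path pattern into a regular expression and tests URL paths against it. It is a simplified model written for this example (the names pattern_to_regex and rule_matches are made up), not the code any search engine runs. Python's built-in urllib.robotparser does not handle these wildcards, which is why the matching is done by hand here.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern using * and a trailing $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * stand for any characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def rule_matches(pattern, path):
    """Return True if the URL path (including any query string) matches the pattern."""
    return pattern_to_regex(pattern).match(path) is not None


print(rule_matches("/*?", "/shop?color=red"))
print(rule_matches("/*?", "/shop"))
print(rule_matches("/*.pdf$", "/site/myimportantinfo.pdf"))
print(rule_matches("/*.pdf$", "/site/file.pdf?download=1"))
```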

How to Create or Edit a Robots.txt File

If you do not have an existing robots.txt file on your server, you can easily add one with the steps below.

  1. Open your preferred text editor to start a new document. Common editors that may exist on your computer are Notepad and TextEdit. If you use a word processor such as Microsoft Word, be sure to save the file as plain text.
  2. Add the directives you would like to include to the document.
  3. Save the file with the name “robots.txt”.
  4. Test your file as shown in the next section.
  5. Upload your .txt file to your server with an FTP client or in your cPanel. How you upload it will depend on the type of website you have.
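
If you prefer to script the first three steps, the Python sketch below writes a plain-text robots.txt file. It saves into a temporary folder purely for demonstration; the function name write_robots_txt and the sample directives are assumptions, and uploading the file to your server is still a separate step.

```python
import tempfile
from pathlib import Path


def write_robots_txt(directory, directives):
    """Save the directives, one per line, as a plain-text robots.txt file."""
    path = Path(directory) / "robots.txt"
    path.write_text("\n".join(directives) + "\n", encoding="utf-8")
    return path


# Sample directives, written to a throwaway folder for this demonstration.
with tempfile.TemporaryDirectory() as folder:
    saved = write_robots_txt(folder, ["User-agent: *", "Disallow: /wp-admin/"])
    content = saved.read_text(encoding="utf-8")

print(saved.name)
print(content)
```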

In WordPress you can use plugins like Yoast, All In One SEO, or Rank Math to generate and edit your file.

You can also use a robots.txt generator tool to help you prepare one, which might help minimize errors.

How to Test a Robots.txt File

Before you go live with the robots.txt file code you created, you will want to run it through a tester to ensure it's valid. This will help prevent issues with incorrect directives that may have been added.

The robots.txt testing tool is only available on the old version of Google Search Console. If your website is not connected to Google Search Console, you will need to do that first.

Visit the Google Support page, then click the "open robots.txt tester" button. Select the property you would like to test, and you will be taken to the tester screen.

To test your new robots.txt code, just delete what is currently in the box and replace with your new code and click "Test". If the response to your test is "allowed", then your code is valid and you can revise your actual file with your new code.
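
You can also run a quick sanity check on your own computer before using an online tester. The Python sketch below lists a few URLs with the result you expect for each and reports any mismatch. It relies on the standard urllib.robotparser module, which does not understand the * and $ pattern characters, so treat it as a rough first pass rather than a replacement for a search engine's own tool. The draft rules, the example.com URLs and the name failed_checks are all illustrative.

```python
from urllib.robotparser import RobotFileParser


def failed_checks(robots_text, checks):
    """Return the (agent, url, expected, actual) tuples where the result differs."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    failures = []
    for agent, url, expected in checks:
        actual = parser.can_fetch(agent, url)
        if actual != expected:
            failures.append((agent, url, expected, actual))
    return failures


draft = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /portfolio
"""

checks = [
    ("Googlebot", "https://example.com/", True),
    ("Googlebot", "https://example.com/wp-admin/options.php", False),
    ("Bingbot", "https://example.com/portfolio/item-1", False),
]

failures = failed_checks(draft, checks)
print("all checks passed" if not failures else failures)
```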


Hopefully this post has made you feel less scared of digging into your robots.txt file — because doing so is one way to improve your rankings and boost your SEO efforts.



