
Robots.txt – What’s the deal?

There has been a lot of recent activity in the search engine optimisation (SEO) industry regarding robots.txt. We thought we’d delve a little deeper into what robots.txt is, how it works, what changes are currently taking place and how best to use this file to improve your SEO.

 

A very brief history of robots.txt

Robots.txt was originally proposed by Martijn Koster in 1994 on the www-talk mailing list, the main communication channel for World Wide Web (WWW) related activities at the time. While it is used by most search engines, the standard was never formalised.

 

What is robots.txt?

Robots.txt, also known as the robots exclusion standard or robots exclusion protocol, is a text file used to instruct web crawlers how to crawl pages on a website.

 

How robots.txt works

When a website owner wants to give a search engine specific instructions, a text file called robots.txt is placed in the website’s root folder, containing the relevant instructions in a specific format.
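For example, a very simple robots.txt file that blocks all crawlers from a single folder might look like the sketch below (the folder name is just a placeholder):

  # Applies to all crawlers
  User-agent: *
  # Do not crawl anything under /private/
  Disallow: /private/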

When a web crawler arrives at a website, it will first check for a robots.txt file. If one is found, it will follow the directives in that file; if not, it will simply proceed to crawl the site.

While robots.txt can disallow crawling of entire folders, files and pages, content that is linked to from other pages can still be discovered by Google and indexed, even if Google doesn’t know what that content actually contains.

 

Why use robots.txt

Using robots.txt can:

  • prevent duplicate content from appearing in search engine results pages (SERPs)
  • keep entire sections of a website private, such as account login pages
  • keep internal search results pages from showing up in public search results
  • specify the location of a sitemap
  • prevent search engines from indexing certain files on your website, such as PDFs
  • specify a crawl delay to prevent server overload
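As an illustration, a robots.txt file covering several of these uses might look something like this sketch (all of the paths are placeholders and should be adapted to your own site structure):

  # Keep account and login pages out of the crawl
  User-agent: *
  Disallow: /account/
  Disallow: /login/

  # Keep internal search results pages from being crawled
  Disallow: /search/

  # Block PDF files (the * and $ wildcards are supported by Google and Bing)
  Disallow: /*.pdf$

  # Ask crawlers to wait between requests (note: Googlebot ignores Crawl-delay)
  Crawl-delay: 10

  # Point crawlers at the sitemap
  Sitemap: https://website.com/sitemap.xml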

 

Recent changes to robots.txt

 

Formalising Robots.txt

Google aims to formalise robots.txt and has submitted a Request for Comments to the Internet Engineering Task Force (IETF) to formalise the Robots Exclusion Protocol specification after 25 years of it being an informal internet standard.

Google announced on its blog that, together with the protocol’s original author, webmasters, and other search engines, it has documented how the REP is used on the modern web and submitted the result to the IETF. The proposed REP draft reflects over 20 years of real-world experience with robots.txt rules, relied on by Googlebot and other major crawlers, as well as by roughly half a billion websites that use the REP.

 

What does this mean for you?

While nothing specific is changing for now, a formal standard will provide a definitive reference for keeping robots.txt files up to date and for making sure the correct syntax is followed.

 

Googlebot will no longer obey the noindex directive

Google announced that Googlebot will no longer obey robots.txt directives related to indexing. Users of the noindex directive in robots.txt have until September 2019 to remove it and switch to an alternative.

 

What does this mean for you?

If any of your pages currently rely on a noindex directive in robots.txt, it needs to be replaced. So how do you control indexing instead? Luckily, there are a few alternatives (an example of the meta tag approach follows this list):

 

  • Place a noindex directive in your robots meta tag

  • Return a 404 (or 410) status code for pages you want removed

  • Password protect your pages

  • You can still disallow pages in robots.txt

  • Remove the URL with Search Console’s Remove URLs tool
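For reference, the meta tag option is a single line in the <head> of each page you want kept out of the index:

  <meta name="robots" content="noindex">

For non-HTML files such as PDFs, the same noindex value can instead be sent in an X-Robots-Tag HTTP response header.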

 

Open-sourcing robots.txt parser

Google announced this month that it is open-sourcing Google’s production robots.txt parser.

 

What does this mean for you?

With the parser open-sourced and the syntax being formalised, developers will be able to build tools and crawlers that parse robots.txt files in the same way Googlebot does, and crawler developers can even define and support their own custom directives.

 

What does this have to do with my SEO?

From an SEO perspective, these changes mean that there will be a proper standard around robots.txt, which has previously been fairly open, with each search engine free to read and interpret the rules however it wants.

Googlebot, and other crawlers, have crawl budgets. Seeing as the majority of the web makes use of Google for searching, we are going to deep dive into Googlebot’s crawl budget.

Googlebot has a crawl rate limit, which caps the maximum fetching rate for a given site. The crawl rate can go up or down based on factors such as crawl health: if the site responds quickly, the limit goes up and more connections can be used to crawl; if the site slows down or returns server errors, the limit goes down.

Because of this crawl rate limit, you want to make sure Googlebot spends its time on your most important content, so that the pages you are trying to rank for get crawled and indexed.

Website owners can also set a limit in Search Console to reduce Googlebot’s crawling of a website, but setting a higher limit does not automatically increase crawling.

Crawl demand is also a factor to take into consideration. Crawl demand is based on popularity; popular URLs tend to be crawled more often, while out-of-date or stale URLs are less likely to be crawled. Even if the crawl rate limit isn’t reached, Googlebot’s activity will be low when there is no demand from indexing.

Taken together, crawl rate limit and crawl demand define the crawl budget: the number of URLs Googlebot can and wants to crawl.

Ultimately, you want to help Googlebot crawl your site in the best way possible and not waste the crawl budget on unimportant pages on your website.

 

Finding your robots.txt file

If you want to check your robots.txt file, type your website’s root URL into your browser’s address bar and add /robots.txt to the end. For example: website.com/robots.txt

One of three results will turn up:

 

  1. A view of a robots.txt file

  2. An empty page

  3. A 404 error

 

What is the best way to use robots.txt on my website?

Pages that generate dynamic content, such as an account page on an eCommerce website, are pointless for a search engine to crawl because they don’t relate to anything someone would search for.

Additional pages that you can disallow from crawling include thank you pages and back-end login pages such as wp-admin, as in the sketch below.
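Putting that together, a small eCommerce or WordPress site might use something along these lines (the paths are illustrative and should be adapted to your own URL structure):

  User-agent: *
  # Dynamic account pages are of no use in search results
  Disallow: /account/
  # Post-purchase thank you pages
  Disallow: /checkout/thank-you/
  # WordPress back-end login and admin area
  Disallow: /wp-admin/
  # Commonly left crawlable so front-end features that rely on it keep working
  Allow: /wp-admin/admin-ajax.php

  Sitemap: https://website.com/sitemap.xml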

Read more on how to create the perfect robots.txt file by leading online marketer Neil Patel.

 

In conclusion

While there are some changes coming to robots.txt, they are not going to have a significant impact on your website.

However, by correctly using robots.txt you can take maximum advantage of web crawlers and improve your search visibility, so our advice is to accommodate where you can and reap the benefits!

Need assistance with your robots.txt? Get in touch!
