
Verifying You Have the Proper Robots.txt File
If you manage a website or are involved in online marketing, you’ve likely heard about the robots.txt file. When implemented incorrectly, this file can have very negative and unintended consequences, such as blocked pages and resources. Imagine trying to rank for a keyword on a page that Google can’t access. In this article, we will cover why a robots.txt file is important, how to access yours, and everything else you need to know about this file!
What is a Robots.txt File?
First things first, what is a robots.txt file? It’s a file that tells search engine crawlers which parts of your website you do and do not want crawled. It is the very first location on your site a search engine will visit.
Why is it important?
- You have more control over what you want search engines to see or not see
- Prevents duplicate content
- Helps keep specific pages, images, and files from being crawled and eating up crawl budget
- A sitemap can be added directly to the robots.txt file to help search engines better understand your website
Accessing your Robots.txt file
If you’re unsure whether your site has a robots.txt file, or you’ve never gone ahead and checked, it’s an easy process! Simply add /robots.txt to the end of your domain (for example, https://www.example.com/robots.txt). If you don’t see anything or are taken to a 404 error page, then you don’t have one. If this is the case, it is recommended that you create a robots.txt file for your site.
What should a robots.txt file look like?
At the very minimum, a robots.txt file contains three main portions you need to understand:
1) User-agent
This directive specifies which crawler the rules that follow apply to. Websites most commonly use * for the user-agent because it signifies “all user agents.” With new search engine user agents constantly entering the market, the list can get long. Here’s a list of some of the main user agents:
- Googlebot for Google
- Bingbot for Bing
- Yahoo! Slurp for Yahoo
- Baiduspider for Baidu
- Yandex for Yandex
When you want to give a specific command to a specific crawler, you place that crawler’s user-agent ID on the user-agent line. Every crawler you reference needs to be followed by its own set of disallow rules. For example, if you list Bingbot as the user-agent, the disallow lines beneath it tell that crawler which pages not to crawl. The most reputable crawlers, like Google, Bing, and Yahoo, will follow the directives in the robots.txt file. Spam crawlers (which usually show up as traffic to your website) are less likely to follow the commands. Most of the time, using * and giving the same commands to all crawlers is the best route.
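For instance, a file that gives Bingbot its own rule while leaving everything open to every other crawler might look like this (the /private/ path here is purely illustrative):
User-agent: Bingbot
Disallow: /private/

User-agent: *
Disallow: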
2) Disallow
This is the directive that lets crawlers know which files or pages on your site you don’t want crawled. Typically, disallowed URLs are pages that contain sensitive customer information, such as checkout pages or backend office pages.
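For example, a common and safe disallow rule for a WordPress site looks like this:
User-agent: *
Disallow: /wp-admin/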
Most of the problems with this type of file happen in the disallow section, and they tend to arise when you try to block too much information. The example above shows an appropriate disallow rule: any URLs that begin with /wp-admin/ will not be crawled. The following example shows what you do not want to include in the disallow section:
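User-agent: *
Disallow: /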
What is this disallow command telling the crawlers? In this situation, the crawlers are told not to crawl any page on your website. If you want your website visible in the search engines, then a single / in the disallow section is detrimental to your search visibility. If you notice a sudden drop in traffic, check your robots.txt file first to see if this issue is present.
Google even sends out Search Console messages letting websites know if the robots.txt file blocks resources it needs to crawl, like CSS and JavaScript files.
If you’d rather stay on the safe side of things, it’s recommended to let all crawlers crawl every page on your site. You do this simply by not disallowing anything.
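In other words, a minimal robots.txt file that allows everything looks like this:
User-agent: *
Disallow: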
3) Sitemap
A robots.txt file can also include the location of your website’s sitemap, which is highly recommended. The sitemap is the second place a crawler will visit after your robots.txt file. It helps search engines better understand the structure and hierarchy of your website. Make sure the sitemap lists your webpages, especially the ones you are trying to market or that are most valuable. Side note: if you have multiple sitemaps for your site, add them all to the file.
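The sitemap is declared on its own line with its full URL; for example (example.com is a placeholder):
Sitemap: https://www.example.com/sitemap.xml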
Checking if your Robots.txt file is working or not
It’s good practice to routinely check whether your robots.txt file is working on your site, using your Search Console account. You can use Google’s robots.txt tester tool to do this. All you need to do is input the specific URL you want to test, and the tool will tell you whether it’s accepted or blocked.
If there are any problems or errors with the robots.txt file for your website, Search Console will let you know. Remember that a search engine’s role is to a) crawl, b) index, and c) provide results.
Keep in mind that the robots.txt file controls crawling, not necessarily indexing. For example, if another page links to a URL you’ve blocked, Google can still discover that link and index the page it points to, even without crawling it. Any time that Google indexes a page, it could show up in a search result.
If you don’t want a webpage to show up in a search result, include that information on the page itself. Add the code <meta name="robots" content="noindex"> in the <head> tags of the specific page you don’t want search engines to index. Note that Google does not support noindex rules inside the robots.txt file itself, so the meta tag (or an X-Robots-Tag HTTP header) is the reliable way to keep a page out of search results.
Creating your own Robots.txt file that is simple and SEO friendly
If you don’t already have a robots.txt file there’s no need to worry because it’s easy to make one. We’ll show you how to make an SEO friendly file in just a few steps.
1) Use a plain text editor
On Windows, use Notepad; on a Mac, use TextEdit. Avoid Google Docs or Microsoft Word because they can insert formatting and hidden characters that you don’t intend to have in the file.
2) Assign a user-agent
As we mentioned above, most sites allow all search engines to access their website. If you choose to do this, simply type in:
User-agent: *
If you want to specify rules for different user-agents, you will need to separate the rules into multiple user-agent groups. For example, SEMrush does this in its robots.txt file (which you can view at semrush.com/robots.txt).
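A simplified sketch of that kind of structure (the paths below are illustrative, not SEMrush’s actual rules) looks like this:
User-agent: Googlebot
Disallow: /example-section-a/

User-agent: Bingbot
Disallow: /example-section-b/

User-agent: *
Disallow: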
SEMrush lists out specific rules for different user-agents; Google’s user-agent gets a different set of rules than the pages it doesn’t want Bing’s user-agent to crawl. If you find yourself in a similar situation, follow this structure and give each user-agent its own group of rules in your plain text editor.
3) Specify your disallow rules
To keep it as simple as possible for this scenario, we will not add anything to the disallow line. Alternatively, you can leave out the disallow section entirely and keep just the user-agent rule. Either way, search engines will crawl everything on the website.
To make your robots.txt file even more SEO-friendly, it’s good practice to add pages that site visitors don’t typically engage with to the disallow section, because this helps free up crawl budget. An example for a WordPress site would be:
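(The /thank-you/ path below is illustrative; swap in the actual URLs from your own site.)
User-agent: *
Disallow: /wp-admin/
Disallow: /thank-you/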
The example above tells every user-agent not to crawl the WordPress admin pages (the backend pages) or the Thank You page (which ensures that only qualified leads are counted, not accidental visitors who reach the page through a SERP). By filtering these kinds of pages out of the crawl budget, you can put more attention on the valuable pages you want search engines to crawl and people to visit.
4) Add your sitemap
Last but not least, don’t forget to add your sitemap(s) as you finish creating your robots.txt file. List it out at the bottom after the Disallow section.
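Putting it all together, the finished file from this walkthrough might look like this (the domain and paths are placeholders):
User-agent: *
Disallow: /wp-admin/
Disallow: /thank-you/

Sitemap: https://www.example.com/sitemap.xml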
5) Submit to the root directory
Once you’re finished creating your robots.txt file, the last step is to upload it to the root directory of your website. Once it’s uploaded, navigate to yourdomain.com/robots.txt and confirm the file loads in your browser. Then test it out using Google’s robots.txt tester tool.
Make sure your Robots.txt file is SEO friendly
The robots.txt file is certainly a more technical aspect of SEO, and it can get confusing. While this file can be tricky, simply understanding how a robots.txt file works and how to create one will help you verify that your website is as visible as possible. It’s a powerful tool that can be used to take your SEO strategy even further.
But if you need help with your robots.txt file or any other part of your SEO campaign, we’re here to help! Advent can help you increase your online visibility and stay relevant against your competitors. Let us help you take your SEO strategy to the next level!