Robots.txt Checklist

Before moving on to the robots.txt checklist itself, we first need to understand the basic elements of a robots.txt file (a sample file combining these elements follows the list below).


  • 1. User-agent lets you specify which robot a group of rules applies to – Googlebot, Bingbot, Yandex, or any other crawler.
  • 2. Disallow tells the robots not to crawl a given path.
  • 3. Allow tells the robots that a given path may be crawled, even inside a disallowed folder.
  • 4. Crawl-delay tells the robots to wait a certain amount of time between requests while crawling (note that Google ignores this directive).
  • 5. Noindex was used to ask a search engine to drop a page from its index; Google no longer supports it in robots.txt, so use the robots meta tag instead.
  • 6. Using # you can add a comment. This helps developers and others better understand the file.
  • 7. Wildcard characters such as * and $ are supported for pattern matching in URLs (* matches any sequence of characters, $ marks the end of a URL).
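
Here is a minimal sketch of a robots.txt file that combines these elements (the bot name, paths, and delay value are purely illustrative, not recommendations):

# Rules for Google's crawler only
User-agent: Googlebot
# Do not crawl anything under /private/ ...
Disallow: /private/
# ... except this one file
Allow: /private/public-page.html
# $ marks the end of the URL, * matches any characters
Disallow: /*.pdf$
# Wait 10 seconds between requests (ignored by Google, honored by some other bots)
Crawl-delay: 10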


The most important points relating to robots.txt:

  • 1. Always keep the robots.txt file in the root of the domain. For example, if I have a website named alljobshub.com, then the path of the robots.txt file will be alljobshub.com/robots.txt.
  • 2. Include all the files and directories which you don’t want search engines to crawl.
  • 3. You can also specify the files or directories which you do want crawled. So, based on your requirements, you can allow individual files.
  • 4. There is no need to block JavaScript and CSS; search engines need them to render your pages properly, so don’t disallow them.
  • 5. You can manage the crawl rate by using Crawl-delay, but Google ignores this directive; the best way to manage Google’s crawl rate is through Google Search Console (a sketch of points 4 and 5 follows this list).
  • 6. To check your robots.txt file you can use the robots.txt tester in Google Search Console, which validates it for you.
  • 7. Make sure that the size of the robots.txt file is not more than 500 KB (Google’s documented limit is 500 KiB).
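
As a sketch of points 4 and 5 above (the patterns and the delay value are illustrative; remember that Google ignores Crawl-delay):

# Explicitly allow stylesheets and scripts for all bots
User-agent: *
Allow: /*.css$
Allow: /*.js$

# Ask Bing's crawler to wait 5 seconds between requests
User-agent: Bingbot
Crawl-delay: 5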


Syntax overview –

Allowing every search engine crawler to crawl everything on the site:

User-agent: *

Disallow:


Blocking every search engine crawler from crawling anything on the site:

User-agent: *

Disallow: /
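
The two patterns above can also be combined to shut out a single crawler while leaving everyone else unrestricted (the bot name here is purely illustrative):

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: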


For a detailed overview of robots.txt, please explore our blog post – Meta Robots and Linking Attribute.






Types of Attributes – Meta Robots and Robot Linking Attributes

We are going to discuss in detail all the possible combinations of the meta robots tag. This blog will help you understand how pages are indexed, or kept out of the index, by search engines.

What will you learn from this blog?

  • 1. How to get content indexed by search engines?
  • 2. How to index web pages without transferring link juice?
  • 3. How to block pages from being indexed by search engines?
  • 4. How to block a page from being indexed while still having its links followed?


Situation 1: I have a website, www.alljobshub.com, and I want to block a specific page, e.g. the testimonial page, from being indexed by search engines.

In such a case I will place the below code in the <head> section of that page.

<meta name="robots" content="noindex, nofollow">
This meta tag will keep your page out of every search engine’s index.
To block the page in Google only, use the below code:
<meta name="googlebot" content="noindex, nofollow">

Situation 2: I want my tech blog page, i.e. http://www.alljobshub.com/tech-blog/, to be indexed. In such a case, use the below code:

<meta name="robots" content="index, follow">

Here, my page will be indexed and link juice will be passed through its links.
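
For the fourth case from the list above – blocking a page from being indexed while search engines still follow its links and pass link juice – the two directives combine like this:

<meta name="robots" content="noindex, follow">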


Now, what is the difference between the “nofollow” and “dofollow” attributes?

Nofollow – The nofollow attribute tells search engine spiders not to follow the link, which restricts the crawler from passing link juice. In such a case only traffic is diverted, not link juice.

Dofollow – A dofollow link allows the search engine crawler to pass link juice, so the weight of the link is passed to the linked URL.

If you don’t specify a nofollow attribute, the link is treated as follow by default.

Linking attribute:

<a href="http://www.abc.com" rel="nofollow">Nofollow link</a>
<a href="http://www.abc.com">Dofollow link</a>

(Note that rel="follow" is not an official value; a link without any rel attribute is followed by default.)

Note: We have covered the robots meta tag in detail in the earlier part of this post.




Get a Detailed Overview of Robots.txt

What is Robots.txt?

Robots.txt is a file that tells search engine crawlers (spiders) which parts of a website they should not crawl. Webmasters use this file to keep unwanted pages from being crawled and showing up in search engine results.

To implement a robots.txt file, create a simple text file, add your directives to it, and save it with the name “robots.txt”. Upload this file to the root folder of your server, so it is reachable at yourdomain.com/robots.txt.


Detailed overview of the robots.txt file –

Case 1: If you want to block all the content of your website from the crawlers/spiders of every search engine, use the below-mentioned code.

User-agent: *
Disallow: /

Case 2: If you want everything crawled and indexed.

User-agent: *
Allow: /

Case 3: To disallow crawling of a specific folder.

User-agent: *
Disallow: /folder/

Case 4: If you want to block a specific folder while still allowing a specific file inside it to be crawled.
In the below example you can see that we are disallowing Googlebot from crawling “folder1” but allowing it to crawl myfile.html inside that folder.

User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html


Based on your requirements, you can include any number of folders or pages in the robots.txt file.

Let’s have a look at the robots.txt file which we have created for our website.

User-agent: *
Allow: /wp-content/uploads/
Disallow: /wp-admin
Disallow: /demo
Disallow: /wp-includes
Disallow: /wp-content/plugins
Disallow: /wp-content/cache
Disallow: /wp-content/themes
Disallow: /wp-includes/js
Disallow: /trackback
Disallow: */trackback
Disallow: /*?*
Disallow: /*?
Disallow: /*~*
Disallow: /*~
Disallow: /backup
Disallow: /ax
Disallow: /cgi-bin


If you want to block a specific page from being indexed, you can use the robots meta tag. This tag needs to be integrated into the <head> part of the page.

Below is the syntax of the meta robots tag.

<meta name="robots" content="noindex">

Description: This code will block the page from being indexed by search engines, but the links on it will still be followed.

<meta name="robots" content="noindex,nofollow">

Description: This code will block the page from being indexed, and the links on it will not be followed.
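
As a usage sketch, here is where such a tag sits within a page (the surrounding markup is illustrative):

<!DOCTYPE html>
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<title>Page kept out of search results</title>
</head>
<body>
<!-- page content -->
</body>
</html>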