Today, I’ll discuss what robots.txt is in SEO and how it can improve search engine indexing. When we submit our website’s sitemap to Google Search Console (formerly Google Webmaster Tools), crawlers (generally called bots) visit the website and crawl the URLs that we added to the sitemap.
But submitting a sitemap this way only informs Google’s bots. There are many other search engines that are also popular and widely used. So, if we just add a sitemap to Google Search Console, only Google is told where that sitemap is.
What is Robots.txt in SEO?
The bots of other search engines also visit your website, and a robots.txt file lets you give instructions to all of them in one place. It is just a plain text file, but it’s very important for every website. With robots.txt you can let crawlers crawl particular parts or pages of your website and ignore the rest.
So, if you want crawlers to crawl only the top landing pages of your website and ignore the rest, you have to create a robots.txt file that says so. Don’t worry, I’ll describe all the important things that help your website get better indexing in all major search engines.
Why do we need robots.txt?
As I said, it is very important for every website because it tells web crawlers how to crawl your website. If you use WordPress to create your website, you don’t have to worry about creating robots.txt yourself, because WordPress serves a virtual robots.txt file.
You can check it, and if you haven’t made any changes, you will see the same rules as on every other WordPress website.
robots.txt file for a WordPress site:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
These are the default rules provided by WordPress.
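To see why the Allow line still works inside a disallowed folder, here is a small Python sketch of the "most specific rule wins" logic described in RFC 9309. It is a simplified illustration, not how any particular search engine is implemented: the function name is_allowed is my own, it only does plain prefix matching, and it ignores the * and $ wildcards.

```python
def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) pairs for one user-agent group.
    Simplified sketch: plain prefix matching only, no `*` or `$` wildcards.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value or a non-matching prefix is skipped
        is_allow = directive.lower() == "allow"
        # The longest matching prefix wins; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The default WordPress rules quoted above.
wordpress_rules = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed("/wp-admin/options.php", wordpress_rules))     # False
print(is_allowed("/wp-admin/admin-ajax.php", wordpress_rules))  # True
print(is_allowed("/blog/hello-world/", wordpress_rules))        # True
```

The paths /wp-admin/options.php and /blog/hello-world/ are just sample URLs for the demonstration.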
If you don’t want to block any part or URL of your website, you may not need a custom robots.txt. But if you own a social networking, community, or eCommerce website, you probably don’t want crawlers on the admin access page, community pages, backend URLs, payment pages, etc. In that case, you definitely need a robots.txt file.
Here are some advantages of having robots.txt:
- You can reduce the chance of duplicate content appearing in the SERP.
- You can keep crawlers away from internal pages like archive, meta, blogroll, and internal search pages, and from internal files like PDFs and images.
- You can specify the location of your sitemaps.
This is the basic structure of robots.txt
User-agent: [name of user-agent]
Disallow: [URL string not to be crawled]
You can see the robots.txt file of any website by entering the domain and adding “/robots.txt” after the domain name. If you find nothing or get a 404 error page, that website doesn’t have one.
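If you want to build that address in a script, a minimal Python sketch could look like this. The helper name robots_txt_url and the sample blog URL are my own assumptions for illustration; the only fact it relies on is that robots.txt sits at the root of the domain.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt address for the site that serves `page_url`."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt always sits at the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The post path below is a made-up example.
print(robots_txt_url("https://www.lokesharyan.com/blog/some-post/?ref=home"))
# https://www.lokesharyan.com/robots.txt
```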
When you open this URL for my website, you will see:
User-agent: *
Disallow:
Disallow: /?s=*

# Google Image Crawler Setup
User-agent: Googlebot-image
Disallow:

#Website Sitemap
Sitemap: https://www.lokesharyan.com/post-sitemap.xml
Sitemap: https://www.lokesharyan.com/page-sitemap.xml
Sitemap: https://www.lokesharyan.com/product-sitemap.xml
Sitemap: https://www.lokesharyan.com/category-sitemap.xml
Sitemap: https://www.lokesharyan.com/product_cat-sitemap.xml
Before understanding the structure of robots.txt, you should understand its commands/syntax first.
Here is the fundamental syntax of robots.txt:
User-agent: User-agent means crawler. If you want your rules to apply to specific crawlers only, you have to specify the names of those crawlers. There are hundreds of web crawlers, each with its own name (for example Googlebot or Bingbot). If you want the rules to apply to all search engine crawlers, simply use “*”.
Disallow: It instructs web crawlers not to crawl a particular URL. You have to list every URL that you don’t want crawled; anything not listed will be crawled. In my case, I entered an empty Disallow:, which means web crawlers may crawl each and every page of my website.
Allow: This command tells crawlers that a specific URL may be crawled even though it sits inside a disallowed directory. It is often associated with Google bots, but other major crawlers such as Bingbot follow it too. If you want every web crawler to crawl your whole website, you don’t need this command; an empty Disallow: is enough.
Crawl-delay: This command asks a crawler to wait between requests so that crawling doesn’t overload your server. The value is a number of seconds, not milliseconds. Google bots do not acknowledge this command, but some other crawlers, such as Bingbot, do.
Sitemap: This is also an important command because it specifies the location of your XML sitemaps. But please be careful: this command is acknowledged only by Google, Bing, Yahoo and Ask.
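To check how a parser reads these commands, you can feed a robots.txt to Python’s built-in urllib.robotparser module. This is only a sketch: the file contents, example.com, and the 10-second delay are placeholders I made up, and site_maps() needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; example.com and the 10-second delay are placeholders.
SAMPLE = """\
User-agent: *
Disallow: /wp-admin/
Crawl-delay: 10

Sitemap: https://example.com/post-sitemap.xml
Sitemap: https://example.com/page-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

print(parser.crawl_delay("*"))  # 10 (seconds)
print(parser.site_maps())       # the two Sitemap URLs, in file order
print(parser.can_fetch("*", "https://example.com/wp-admin/"))  # False
```

Keep in mind that this shows how Python’s parser reads the file; each search engine decides for itself which commands it honors.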
Examples of robots.txt:
Block all web crawlers from all URLs of the website:
User-agent: *
Disallow: /

Allow all web crawlers to crawl all URLs:
User-agent: *
Disallow:

Block all web crawlers from a specific webpage:
User-agent: *
Disallow: /terms-and-conditions/

Block all URLs that start with /?s= (internal search results):
User-agent: *
Disallow: /?s=*
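You can try the first three examples yourself with Python’s built-in urllib.robotparser. This is a rough sketch: the helper name allowed and the example.com URLs are my own placeholders. I left out the /?s=* rule because, as far as I can tell, the standard library parser only does prefix matching and does not treat * inside a path as a wildcard the way Google does.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, url, agent="*"):
    """Parse `robots_txt` text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


block_all = "User-agent: *\nDisallow: /"
allow_all = "User-agent: *\nDisallow:"
block_page = "User-agent: *\nDisallow: /terms-and-conditions/"

page = "https://example.com/terms-and-conditions/"
print(allowed(block_all, page))   # False
print(allowed(allow_all, page))   # True
print(allowed(block_page, page))  # False
print(allowed(block_page, "https://example.com/about/"))  # True
```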
How will robots.txt improve search engine indexing?
If you run an eCommerce business, or provide SEO services or guest blogging services, you have some pages that describe your business or services well, and crawlers should spend their time on those landing pages: services, products, shop, about us, etc.
Block unnecessary pages instead. Now, list all the pages that you don’t want search engines to crawl and add each one after Disallow:
You should also block internal search pages by using the following syntax:
Disallow: /?s=*
This syntax will block all URLs that start with /?s=, which is the form WordPress uses for its internal search URLs.
Now, if you are an eCommerce business, you should also let search engines index all your product images, because images rank in the SERP. For that, insert the following syntax:
# Google Image Crawler Setup
User-agent: Googlebot-image
Disallow:
# marks a comment, so any text after # is ignored by crawlers. These rules let the Googlebot-image crawler crawl all the images on your website so that they can be indexed.
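Here is a small Python sketch of how a separate group for Googlebot-image overrides the general rules. The /private/ folder, the image URL, and the bot name SomeOtherBot are hypothetical and only there to make the difference visible; the check uses Python’s built-in urllib.robotparser, not Google’s own parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /private/ is an illustrative path, not from my site.
RULES = """\
User-agent: *
Disallow: /private/

# Google Image Crawler Setup
User-agent: Googlebot-image
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

image = "https://example.com/private/product.jpg"
print(parser.can_fetch("Googlebot-image", image))  # True
print(parser.can_fetch("SomeOtherBot", image))     # False
```

The image crawler follows its own group with the empty Disallow:, while every other bot falls back to the * group.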
You should also consider specifying the location of your sitemaps. An eCommerce website, for example, has lots of URLs, and for better indexing you should create separate sitemaps for products, categories, etc.
If you specify your sitemaps’ locations, it becomes easy for crawlers to find them. For that, use the following syntax and make sure you add every sitemap you create.
#Website Sitemap
Sitemap: https://www.lokesharyan.com/post-sitemap.xml
Sitemap: https://www.lokesharyan.com/page-sitemap.xml
Sitemap: https://www.lokesharyan.com/product-sitemap.xml
Sitemap: https://www.lokesharyan.com/category-sitemap.xml
Sitemap: https://www.lokesharyan.com/product_cat-sitemap.xml
Now all unnecessary URLs are blocked by robots.txt, and web crawlers will focus only on the top landing pages and images of your website. This will help you get better indexing by search engines.
One thing you should note: if your sitemap contains URLs that you block with robots.txt, you may get an error in Google Search Console. So, remove those URLs from the sitemap first, and then block them in the robots.txt file.
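Before you upload anything, you can look for such conflicts with a few lines of Python. This is only a sketch under my own assumptions: the function name blocked_sitemap_urls is made up, the URLs are placeholders, and in practice you would read the URL list from your XML sitemap instead of typing it in.

```python
from urllib.robotparser import RobotFileParser


def blocked_sitemap_urls(robots_txt, sitemap_urls, agent="*"):
    """Return the sitemap URLs that `robots_txt` forbids `agent` to crawl."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls if not parser.can_fetch(agent, url)]


# Placeholder data: in practice the URLs would come from the XML sitemap.
robots_txt = "User-agent: *\nDisallow: /terms-and-conditions/"
sitemap_urls = [
    "https://example.com/services/",
    "https://example.com/terms-and-conditions/",
]

print(blocked_sitemap_urls(robots_txt, sitemap_urls))
# ['https://example.com/terms-and-conditions/']
```

Any URL in the printed list should be removed from the sitemap before you block it.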
I hope you liked this post. If you did, please like and share it, and leave your comments below.
See you in the next one.
Also published on Medium.