Robots.txt pitfalls: what I learned the hard way

3 points by pyeri 10 hours ago

This applies to sites indexed on Google that hope to gain organic traffic. As an indie blogger and SEO enthusiast, I foolishly updated my robots.txt file to block crawling of certain unwanted parts of my site, which led to subtle repercussions I couldn't have foreseen.

A few days ago, while reading about SEO, I came across the concept of a "crawl budget." Apparently, Google allocates a limited crawl budget to each site it indexes, and the more low-value content it has to crawl and store on its servers, the slower everything else becomes: new content takes longer to be indexed, and favicon updates and robots.txt changes take longer to be picked up.

Being a minimalist and utilitarian, I decided to block crawling of the `/uploads/` directory on my site, since it mostly contained images used in my articles. I figured that cutting off this "useless content" would free up more crawl budget for my primary content, the articles themselves. So, I added the directory to my site's robots.txt:

  # Group 1
  User-agent: *
  Disallow: /public/
  Disallow: /drafts/
  Disallow: /theme/
  Disallow: /page*
  Disallow: /uploads/

  Sitemap: https://prahladyeri.github.io/sitemap.xml

The way search engines work, there's typically a 5-7 day gap between updating the robots.txt file and crawlers acting on it. After about a week, I noticed that my site's favicon had disappeared from SERPs on mobile browsers! In its place was a blank placeholder icon. That's when I realized that my favicons also resided in the `/uploads/` directory. I had recently optimized the favicon by switching it from WebP to PNG, and with the directory now blocked, Google was unable to crawl and index the new file at all!
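For context, the favicon is referenced from that now-blocked directory with markup roughly like this (the exact file name here is illustrative, not the real one):

  <link rel="icon" type="image/png" href="/uploads/favicon.png">

With `/uploads/` disallowed, Googlebot can no longer fetch the file that tag points to.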

Once I realized this mistake, I removed the `/uploads/` rule from robots.txt and requested a recrawl. But who knows how long it will take for Google's systems to pick up the change and start showing the site's favicon in SERPs again!
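For reference, the corrected robots.txt is simply the same file minus that one Disallow line:

  # Group 1
  User-agent: *
  Disallow: /public/
  Disallow: /drafts/
  Disallow: /theme/
  Disallow: /page*

  Sitemap: https://prahladyeri.github.io/sitemap.xml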

Two lessons learned:

1. The robots.txt file is highly sensitive; avoid modifying it unless you really have to.
2. Doing SEO is like steering a very large ship: you pull a lever now, and the ship only turns several days later!

dazc 9 hours ago

Use an X-Robots-Tag: noindex header to prevent files from being indexed, and let Google work out how to crawl your site on its own.

Otherwise you can end up in a nightmare scenario where content is indexed but Googlebot isn't allowed to crawl it. This does not end well.
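For example, on an Apache server with mod_headers enabled, something along these lines (a sketch; adjust the file pattern and server config to your own setup) keeps image files crawlable but out of the index:

  <FilesMatch "\.(png|webp|jpe?g|gif|ico)$">
    Header set X-Robots-Tag "noindex"
  </FilesMatch>

The same header can be set from nginx or any app server; it does require control over response headers, so purely static hosts that don't allow custom headers can't use this approach.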

https://developers.google.com/search/docs/crawling-indexing/...