Why Google Indexes Blocked Web Pages

Vernon September 6, 2024

0 2 minutes read

Google’s John Mueller answered a question about why Google indexes pages that are disallowed from crawling by robots.txt and why the it’s safe to ignore the related Search Console reports about those crawls.

Bot Traffic To Query Parameter URLs

The person asking the question documented that bots were creating links to non-existent query parameter URLs (?q=xyz) to pages with noindex meta tags that are also blocked in robots.txt. What prompted the question is that Google is crawling the links to those pages, getting blocked by robots.txt (without seeing a noindex robots meta tag) then getting reported in Google Search Console as “Indexed, though blocked by robots.txt.”

The person asked the following question:

“But here’s the big question: why would Google index pages when they can’t even see the content? What’s the advantage in that?”

Google’s John Mueller confirmed that if they can’t crawl the page they can’t see the noindex meta tag. He also makes an interesting mention of the site:search operator, advising to ignore the results because the “average” users won’t see those results.

He wrote:

“Yes, you’re correct: if we can’t crawl the page, we can’t see the noindex. That said, if we can’t crawl the pages, then there’s not a lot for us to index. So while you might see some of those pages with a targeted site:-query, the average user won’t see them, so I wouldn’t fuss over it. Noindex is also fine (without robots.txt disallow), it just means the URLs will end up being crawled (and end up in the Search Console report for crawled/not indexed — neither of these statuses cause issues to the rest of the site). The important part is that you don’t make them crawlable + indexable.”

Takeaways:

1. Mueller’s answer confirms the limitations in using the Site:search advanced search operator for diagnostic reasons. One of those reasons is because it’s not connected to the regular search index, it’s a separate thing altogether.

Google’s John Mueller commented on the site search operator in 2021:

“The short answer is that a site: query is not meant to be complete, nor used for diagnostics purposes.

A site query is a specific kind of search that limits the results to a certain website. It’s basically just the word site, a colon, and then the website’s domain.

This query limits the results to a specific website. It’s not meant to be a comprehensive collection of all the pages from that website.”

2. Noindex tag without using a robots.txt is fine for these kinds of situations where a bot is linking to non-existent pages that are getting discovered by Googlebot.

3. URLs with the noindex tag will generate a “crawled/not indexed” entry in Search Console and that those won’t have a negative effect on the rest of the website.

Read the question and answer on LinkedIn:

Why would Google index pages when they can’t even see the content?

Featured Image by Shutterstock/Krakenimages.com

Source link : Searchenginejournal.com