Google Removed The Robots FAQs Document
Earlier this week, Google removed its Robots.txt FAQ help document from its search developer documentation. When asked, John Mueller from Google replied to Alexis Rylko saying, “We update the documentation from time to time. Feel free to submit feedback if you feel something’s missing. Robots.txt is definitely still a thing.”
The Robots FAQ document lived over here: developers.google.com/search/docs/crawling-indexing/robots/robots-faq
That now redirects to the main Google robots.txt help page.
What did the Robots FAQ page say? The Wayback Machine has a copy, so I will archive it here:
(Q) Does my website need a robots.txt file?
(A) No. When Googlebot visits a website, we first ask for permission to crawl by attempting to retrieve the robots.txt file. A website without a robots.txt file, robots meta tag, or X-Robots-Tag HTTP headers will generally be crawled and indexed normally.
(Q) Which method should I use to block crawlers?
(A) It depends. In short, there are good reasons to use each of these methods:
- robots.txt: Use it if crawling of your content is causing issues on your server. For example, you may want to disallow crawling of infinite calendar scripts. Don’t use the robots.txt to block private content (use server-side authentication instead), or handle canonicalization. To make sure that a URL is not indexed, use the robots meta tag or X-Robots-Tag HTTP header instead.
- robots meta tag: Use it if you need to control how an individual HTML page is shown in search results or to make sure that it’s not shown.
- X-Robots-Tag HTTP header: Use it if you need to control how content is shown in search results or to make sure that it’s not shown.
(Q) Can I use robots.txt, robots meta tag, or the X-Robots-Tag HTTP header to remove someone else’s site from search results?
(A) No. These methods are only applicable to sites where you can modify the code or add files. Learn more about how to remove information from Google.
(Q) How can I slow down Google’s crawling of my website?
(A) You can generally adjust the crawl rate setting in your Google Search Console account.
(Q) I use the same robots.txt for multiple websites. Can I use a full URL instead of a relative path?
(A) No. The rules in the robots.txt file (with the exception of sitemap:) are only valid for relative paths.
(Q) Can I place the robots.txt file in a subdirectory?
(A) No. The file must be placed in the topmost directory of the website.
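To make the "topmost directory" rule concrete, here is a minimal Python sketch that derives the robots.txt location from any page URL. The function name robots_txt_url and the example.com URLs are hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/shop/item?id=3#top"))
```

A file at a deeper path, such as /shop/robots.txt, is not the location this sketch produces, which matches the answer above.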
(Q) I want to block a private folder. Can I prevent other people from reading my robots.txt file?
(A) No. The robots.txt file may be read by various users. If folders or filenames of content aren’t meant for the public, don’t list them in the robots.txt file. It is not recommended to serve different robots.txt files based on the user agent or other attributes.
(Q) Do I have to include an allow rule to allow crawling?
(A) No, you do not need to include an allow rule. All URLs are implicitly allowed and the allow rule is used to override disallow rules in the same robots.txt file.
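The following is a rough sketch of that behavior using Python's standard urllib.robotparser module, not Google's own parser. The robots.txt content, the ExampleBot user agent, and the example.com paths are made up for illustration. Note that Python's parser applies rules in file order, so the allow line is placed before the disallow line it overrides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/public-report.html
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path: str) -> bool:
    """Check a path on a hypothetical site against the rules above."""
    return parser.can_fetch("ExampleBot", "https://example.com" + path)


print(allowed("/index.html"))                  # no rule matches: implicitly allowed
print(allowed("/private/secret.html"))         # matched by the disallow rule
print(allowed("/private/public-report.html"))  # allow rule overrides the disallow
```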
(Q) What happens if I have a mistake in my robots.txt file or use an unsupported rule?
(A) Web crawlers are generally very flexible and typically will not be swayed by minor mistakes in the robots.txt file. In general, the worst that can happen is that incorrect or unsupported rules will be ignored. Bear in mind though that Google can’t read minds when interpreting a robots.txt file; we have to interpret the robots.txt file we fetched. That said, if you are aware of problems in your robots.txt file, they’re usually easy to fix.
(Q) What program should I use to create a robots.txt file?
(A) You can use anything that creates a valid text file. Common programs used to create robots.txt files are Notepad, TextEdit, vi, or emacs. Read more about creating robots.txt files. After creating your file, validate it using the robots.txt Tester.
(Q) If I block Google from crawling a page using a robots.txt disallow rule, will it disappear from search results?
(A) Blocking Google from crawling a page is likely to remove the page from Google’s index.
However, robots.txt disallow does not guarantee that a page will not appear in results: Google may still decide, based on external information such as incoming links, that it is relevant and show the URL in the results. If you wish to explicitly block a page from being indexed, use the noindex robots meta tag or X-Robots-Tag HTTP header. In this case, don’t disallow the page in robots.txt, because the page must be crawled in order for the tag to be seen and obeyed. Learn how to control what you share with Google
(Q) How long will it take for changes in my robots.txt file to affect my search results?
(A) First, the cache of the robots.txt file must be refreshed (we generally cache the contents for up to one day). You can speed up this process by submitting your updated robots.txt to Google. Even after finding the change, crawling and indexing is a complicated process that can sometimes take quite some time for individual URLs, so it’s impossible to give an exact timeline. Also, keep in mind that even if your robots.txt file is disallowing access to a URL, that URL may remain visible in search results despite the fact that we can’t crawl it. If you wish to expedite removal of the pages you’ve blocked from Google, submit a removal request.
(Q) How can I temporarily suspend all crawling of my website?
(A) You can temporarily suspend all crawling by returning a 503 (service unavailable) HTTP status code for all URLs, including the robots.txt file. The robots.txt file will be retried periodically until it can be accessed again. We do not recommend changing your robots.txt file to disallow crawling.
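As one possible way to do this, here is a minimal sketch of a WSGI application in Python that answers every request, including /robots.txt, with a 503. The names maintenance_app and serve, the port, and the Retry-After value are assumptions for illustration and are not part of Google's guidance.

```python
from wsgiref.simple_server import make_server


def maintenance_app(environ, start_response):
    """Answer every path, including /robots.txt, with 503 Service Unavailable."""
    body = b"Service temporarily unavailable\n"
    start_response(
        "503 Service Unavailable",
        [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
            # Assumption: a one-hour retry hint; the FAQ does not mention this header.
            ("Retry-After", "3600"),
        ],
    )
    return [body]


def serve(port: int = 8000) -> None:
    """Run the maintenance responder on a local port until interrupted."""
    with make_server("", port, maintenance_app) as httpd:
        httpd.serve_forever()
```

Calling serve() starts the responder locally. In practice the same effect is more commonly achieved in the web server or load balancer configuration.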
(Q) My server is not case-sensitive. How can I disallow crawling of some folders completely?
(A) Rules in the robots.txt file are case-sensitive. In this case, it is recommended to make sure that only one version of the URL is indexed using canonicalization methods. Doing this allows you to have fewer lines in your robots.txt file, so it’s easier for you to manage it. If this isn’t possible, we recommend that you list the common combinations of the folder name, or shorten it as much as possible, using only the first few characters instead of the full name. For instance, instead of listing all upper and lower-case permutations of /MyPrivateFolder, you could list the permutations of /MyP (if you are certain that no other, crawlable URLs exist with those first characters). Alternately, it may make sense to use a robots meta tag or X-Robots-Tag HTTP header instead, if crawling is not an issue.
(Q) I return 403 Forbidden for all URLs, including the robots.txt file. Why is the site still being crawled?
(A) The 403 Forbidden HTTP status code, as well as other 4xx HTTP status codes, is interpreted as the robots.txt file doesn’t exist. This means that crawlers will generally assume that they can crawl all URLs of the website. In order to block crawling of the website, the robots.txt must be returned with a 200 OK HTTP status code, and must contain an appropriate disallow rule.
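The status code handling described in this FAQ can be summarized in a small Python sketch. The function name and the returned labels are hypothetical, and the sketch covers only the cases the FAQ itself states (200, 4xx, and 503); it is not a description of Googlebot's implementation.

```python
def robots_fetch_outcome(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to the behavior the FAQ describes."""
    if status == 200:
        # The file is read and its rules (for example, disallow) are applied.
        return "parse-rules"
    if 400 <= status <= 499:
        # Treated as if no robots.txt exists, so all URLs may be crawled.
        return "treat-as-missing"
    if status == 503:
        # Crawling is suspended and robots.txt is retried periodically.
        return "suspend-crawling"
    # Other status codes are not discussed in this FAQ.
    return "not-covered-by-faq"


for code in (200, 403, 404, 503):
    print(code, robots_fetch_outcome(code))
```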
(Q) Is the robots meta tag a replacement for the robots.txt file?
(A) No. The robots.txt file controls which pages are accessed. The robots meta tag controls whether a page is indexed, but to see this tag the page needs to be crawled. If crawling a page is problematic (for example, if the page causes a high load on the server), use the robots.txt file. If it is only a matter of whether or not a page is shown in search results, you can use the robots meta tag.
(Q) Can the robots meta tag be used to block a part of a page from being indexed?
(A) No, the robots meta tag is a page-level setting.
(Q) Can I use the robots meta tag outside of a <head> section?
(A) No, the robots meta tag needs to be in the <head> section of a page.
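To check where a robots meta tag sits in a page, a sketch along these lines using Python's standard html.parser module could help. The class RobotsMetaFinder, the function find_robots_meta, and the sample markup are hypothetical and for illustration only.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect robots meta tags and record whether each sits inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = []  # list of (content value, was inside <head>)

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta":
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "robots":
                self.found.append((attributes.get("content") or "", self.in_head))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_robots_meta(html: str) -> list:
    """Return (content, inside_head) pairs for every robots meta tag in html."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.found


sample = (
    '<html><head><meta name="robots" content="noindex"></head>'
    '<body><meta name="robots" content="nofollow"></body></html>'
)
print(find_robots_meta(sample))
```

Any entry whose second value is False marks a tag placed outside the head section, which per the answer above is not the required placement.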
(Q) Does the robots meta tag disallow crawling?
(A) No. Even if the robots meta tag currently says noindex, we’ll need to recrawl that URL occasionally to check if the meta tag has changed.
(Q) How does the nofollow robots meta tag compare to the rel="nofollow" link attribute?
(A) The nofollow robots meta tag applies to all links on a page. The rel="nofollow" link attribute only applies to specific links on a page. For more information on the rel="nofollow" link attribute, see our documentation on user-generated spam and the rel="nofollow".
(Q) How can I check the X-Robots-Tag for a URL?
(A) A simple way to view the server headers is to use the URL Inspection Tool feature in Google Search Console. To check the response headers of any URL, try searching for “server header checker”.
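As an alternative to those tools, the response headers can also be read with a short script. The following is a minimal Python sketch using only the standard library; the function names are hypothetical, and the sketch assumes the server answers HEAD requests with the same headers it would send for a GET.

```python
import urllib.request


def x_robots_tag_values(headers) -> list:
    """Return all X-Robots-Tag values from a headers object (case-insensitive)."""
    return [value.strip() for value in (headers.get_all("X-Robots-Tag") or [])]


def fetch_x_robots_tag(url: str, timeout: float = 10.0) -> list:
    """Send a HEAD request to url and return its X-Robots-Tag header values."""
    request = urllib.request.Request(url, method="HEAD")
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return x_robots_tag_values(response.headers)
```

An empty list means the response carried no X-Robots-Tag header. A call such as fetch_x_robots_tag("https://example.com/") needs network access, so it is left out of the sketch itself.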
Google has discontinued the English version of the Robots FAQs (https://t.co/AJoGFD7if8), now redirecting to “Intro”. Versions in other languages remain unaffected at this time. Personally, I found the original page to be useful.
Are we on the path to leaving robots.txt? #SEO pic.twitter.com/ThgBl9K3Zi
— Alexis Rylko 🇺🇦 (@alekseo) November 22, 2023
We update the documentation from time to time. Feel free to submit feedback if you feel something’s missing. Robots.txt is definitely still a thing.
— I am John – ⭐ ⭐ SRS BZNS ONLY ⭐ ⭐ (@JohnMu) November 22, 2023
My guess is that Google considers it redundant with what is already published on its other pages.
Forum discussion at X.
Source link : Seroundtable.com