Crawl Budget: What Is It and Does It Affect SEO?
What Is Crawl Budget?
Crawl budget is the amount of time and resources search engine bots allocate to crawling your website and indexing its pages.
Think of it this way: you visit the Louvre, which has hundreds of exhibits across several floors.
Without a map, you could miss out on the art you really wanted to see.
To make the map you have in your hands, someone would have to walk through each room and note which works were displayed there.
In this metaphor, the Louvre mapmaker is like a Googlebot crawling a website.
A team of mapmakers constantly running around the Louvre would slow down the experience for visitors, much as bots constantly crawling your website would slow it down for users.
So the mapmakers need a set time to update the map, just as the Googlebot needs to know how often it should crawl your website.
Do Crawl Budgets Affect SEO?
Crawl budget isn’t a ranking factor in itself.
But if Googlebot experiences crawl errors that prevent it from reading and indexing your content, the chances of your pages showing up on search engine results pages (SERPs) are slim.
You don’t want Googlebot crawling your site all the time, either.
Exceeding the allocated budget for crawling a website can cause slowdowns or errors in the system.
That can lead to pages being indexed late or not at all, resulting in lower search rankings.
Google uses hundreds of signals to decide where to place your page. Crawling determines whether your page can show up at all, not where it ranks, and it’s independent of your content’s quality.
How Does Google Determine the Crawl Budget?
Every website has a unique crawl budget controlled by two main elements: crawl demand and crawl limit.
It’s important to understand how they work and what goes on behind the scenes.
Crawl Demand
Crawl demand refers to how much desire Google has to crawl your website.
Two factors affect this demand—popularity and staleness.
Popularity
Google prioritizes pages with more backlinks or those that attract higher traffic. So if people are visiting your website or linking to it, Google’s algorithm gets signals that your website is worth more frequent crawls.
Backlinks especially help Google figure out which pages are worth crawling. If Google notices people are talking about your website, it wants to crawl it more to see what the hype is about.
Note that the number of backlinks alone doesn’t matter; backlinks should be relevant and come from authoritative sources.
You can use Semrush’s Backlink Analytics tool to see which pages draw the most backlinks and may attract Google’s attention. Enter your domain and open the “Indexed Pages” tab to see the pages with the most backlinks.
Staleness
Googlebot crawls a page less often if it hasn’t been updated in a while.
Google hasn’t revealed how frequently it recrawls a site. However, if the algorithm notices a site-wide update, bots temporarily increase the crawl budget.
For example, Googlebot crawls news websites frequently because they publish new content several times a day.
In this case, the website has high crawl demand.
Compare this to a website about the history of famous art pieces that isn’t updated as frequently.
Other actions that may signal to Google there are changes to crawl include:
- Domain name change: When you change your website’s domain name, Google’s algorithm needs to update its index to reflect the new URL. It will crawl the website to understand the change and pass the ranking signals to the new domain.
- URL structure change: If you modify your website’s URL structure by changing the directory hierarchy or removing or adding subdomains, Google’s bots need to recrawl the pages to index the new URLs correctly
- Content updates: Significant updates to your website’s content, like rewriting a large portion of your pages, adding new pages, or removing outdated content can attract the algorithm’s attention and prompt it to recrawl your website
- XML sitemap submission: Updating your XML sitemap and resubmitting it to Google Search Console can inform Google there are changes to crawl. That is particularly useful when you want to ensure Google indexes new or updated pages promptly.
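For reference, here’s a minimal sketch of what such a sitemap file might contain (the URLs and dates are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per page; <lastmod> tells crawlers when it last changed -->
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/blog/new-article</loc>
    <lastmod>2024-01-20</lastmod>
  </url>
</urlset>

Updating the <lastmod> values when pages change and resubmitting the file in Search Console signals to Google that there’s fresh content to crawl.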
Crawl Rate Limit
The crawl rate limit determines how fast the bot can access and download web pages from your site to prepare the content for serving on search results.
This is Google’s way of ensuring its crawl doesn’t overload your servers.
The crawl limit prevents the bot from bogging down your website with too many requests, which can cause performance issues.
If your site responds quickly, Google gets a green light to increase the limit and it can then use more resources to crawl it.
Similarly, if Google encounters server errors or your site slows down, the limit will fall and Googlebot will crawl the website less.
You can also change the crawl limit manually, although it’s important to do this with caution. Google suggests not limiting the crawl rate unless your server is slowing down.
To change your crawl rate, go to Site Settings on Search Console and adjust the bar. The adjustment typically takes up to two days to apply.
How Does the Crawling Process Work?
The crawling process uses bots to discover, crawl, analyze, and index web pages to provide users with the most relevant and high-quality search results.
It begins with a list of web addresses from past crawls and XML sitemaps site owners provide. Then Google uses web crawlers to visit these addresses, read the information, and follow links on those pages.
The crawlers revisit pages Google already has in the list to check if they’ve changed and also crawl new pages.
During this process, the bot prioritizes when and what to crawl based on crawl demand while ensuring the website can handle the server requests it makes.
It then processes the pages it crawls successfully and passes them on to Google’s index so they can appear in search results.
Most of the time, Google crawls smaller websites efficiently.
When it comes to large websites with millions of URLs, Google needs to prioritize when and what to crawl and how many resources it should dedicate.
How to Check Your Crawl Activity
Google Search Console provides complete information on your crawl activity, including crawl errors and crawl rate, in the Crawl Stats Report.
Your Crawl Stats Report helps you double-check whether Google can access and index your content. It can also help you identify and fix issues before your website’s visibility falls.
To access your Crawl Stats Report, log in to Search Console and click “Settings.”
The Summary page gives you a lot of information. The main elements are:
Over-Time Charts
The over-time charts highlight your crawl data in the past 90 days.
Here’s what all this means:
- Total crawl requests: The number of crawl requests Google made in the last 90 days.
- Total download size: The total amount of data Google’s crawlers downloaded when accessing your website over a specific period.
- Average response time: The average time it took your server to respond to a crawl request during that period.
Grouped Crawl Data
Grouped crawl data provides information on crawl requests.
The data divides these requests according to the URL file type, response, Googlebot type, and purpose of the request (discovery for a new page or refresh for an existing page).
Host Status
Host status shows your site’s general availability and whether Google can access it without any problems.
Here’s a breakdown:
- Errors in the robots.txt section mean Google couldn’t fetch your robots.txt file for a technical reason. The robots.txt fetch chart shows the failure rate bots experienced when requesting your robots.txt file. While your site doesn’t need a robots.txt file, it must return a 200 or 404 response when the file is requested. If Googlebot runs into a connection issue, such as a 503 HTTP status, it will stop crawling your site.
- DNS resolution tells you when your DNS server didn’t recognize your hostname or didn’t respond during crawling. If you encounter an issue here, contact your registrar to verify that your site is set up correctly and your server has an active internet connection.
- Server connectivity tells you when your server was unresponsive or didn’t provide the full response. If you’re seeing spikes or consistent connectivity issues here, you might need to talk to your provider about increasing capacity or fixing availability issues.
How to Check Your Website’s Crawlability with Site Audit
Use the Site Audit tool to have SemrushBot crawl your website and check its health.
The tool flags over 140 issues, including duplicate content and broken links, that you can identify and fix.
Here’s what you need to do:
Finalize Basic Settings
Go to the Site Audit tool and create a new project if this is your first audit.
Next select your site audit settings.
Crawl scope lets you narrow down your search.
You can also adjust the number of URLs to check per audit based on your subscription level and how frequently you’ll inspect the website.
The more pages you crawl, the clearer the picture you get of your site’s health.
You also have the option to choose your crawl source.
Note the “Website” option will make the bots crawl your site as Google does, while “Sitemaps on site” crawls the URLs from the sitemap listed in your robots.txt file.
Alternatively, click “URLs from file” to select which pages you want the bots to crawl.
Adjust Crawler Settings
Crawler Settings allow you to choose the kind of bot to crawl your website.
You can choose between Googlebot and SemrushBot and mobile and desktop versions.
Then select your Crawl-Delay settings. Bots crawl at normal speed with the “minimum delay” setting and prioritize the user experience with the “1 URL per 2 seconds” setting.
Lastly, select “Respect robots.txt” if you have the relevant file and need a specific crawl delay.
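For reference, a specific crawl delay is declared in robots.txt with the non-standard Crawl-delay directive, roughly like this (the two-second value is only an illustration; Googlebot ignores this directive, though some other crawlers honor it):

User-agent: *
# Ask compliant crawlers to wait two seconds between requests
Crawl-delay: 2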
Allow/Disallow URLs
Customize your site audit with allow/disallow URL settings by entering the URLs in their corresponding boxes.
Remove URL Parameters
Removing URL parameters ensures the bots don’t waste the crawl budget by crawling the same page twice.
Bypass Website Restrictions
If your website is still under development, use this setting to run an audit.
Schedule Your Audits
You can schedule how often you want the bot to audit your website.
Schedule it frequently to ensure the tool checks your website’s health regularly.
Analyze the Results
The site audit overview summarizes all data the bots collected during the crawl.
“Crawled Pages” tells you how many pages were “Healthy,” “Broken,” and “Have issues.”
You also get more in-depth insights in the “Crawlability” section.
Here, you’ll find how much of your site’s crawl budget was wasted and which pages ran into crawl budget issues.
8 Tips for Crawl Budget Optimization
Once you know your site’s issues, you can fix them and maximize its crawl efficiency. Here’s how to optimize your crawl budget:
Check Your Site Speed
Google’s crawling is limited by bandwidth, time, and availability of Googlebot resources. If a website’s server responds to requests quicker, Google might crawl more pages.
Increase your site speed to improve user experience (UX) and help Google crawl your site faster.
Head over to the Site Performance results from your site audit to check your page load times and any performance issues.
There are many ways to improve page speed, including:
- Optimizing your images: Use online tools like Optimizilla to reduce file size without pixelating them. Ensure proper sizing of images and choose the appropriate file type (like JPEG, PNG, or GIF).
- Minimizing your code and scripts: Scan your website’s code to remove anything unnecessary or redundant. Also, consolidate multiple files into one and use browser caching to store data on the user’s computer so it doesn’t need to reload every time (see the caching sketch after this list).
- Using a content delivery network (CDN): A CDN is a network of servers in different parts of the world. When someone tries to access your website, the CDN automatically sends them the content from the server closest to them, which means it loads faster. You’ll need to sign up for a CDN service like Amazon CloudFront, upload your website’s content to their servers, and configure your website to use the CDN’s URLs instead of your own.
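As an illustration of the browser-caching tip above, here’s a minimal sketch of an .htaccess snippet, assuming an Apache server with mod_expires enabled (other servers and CDNs configure caching differently):

<IfModule mod_expires.c>
  ExpiresActive On
  # Let browsers cache images for a month and CSS/JavaScript for a week
  ExpiresByType image/jpeg "access plus 1 month"
  ExpiresByType image/png "access plus 1 month"
  ExpiresByType text/css "access plus 1 week"
  ExpiresByType application/javascript "access plus 1 week"
</IfModule>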
Increase Internal Linking and Page Sitemaps
Adding a logical internal link structure and sitemap gives bots direction on which pages to index.
Internal links provide a natural path for search engine crawlers to discover and rank new pages, while sitemaps help them understand the structure and hierarchy of your content.
Imagine your website as a tree, with the homepage at the top and the branches representing the different categories. Each branch should lead to smaller leaves (pages) related to it, creating a clear and logical structure for your website that’s easy for users to navigate.
Check the Internal Linking section of your Site Audit to get an idea of how well you’ve done this.
When you have this data, here’s how you can improve your internal linking structure:
- The bars in Internal Linking Structure tell you which of your pages are strong, mediocre, and require improvement. Add more internal links to the pages in the last two categories.
- Create topic clusters by grouping related pages under one topic and linking them to each other. Let’s say you run a blog about healthy living. Your main pillar page could be “Healthy Living” and your topic clusters could be “Nutrition,” “Fitness,” and “Mental Health.”
- Google will most likely crawl pages with plenty of internal links. It’s the ones that are new and don’t have enough internal links pointing to them that may face issues. For example, if you’ve recently added an article on “Marketing Predictions for 2024,” link back to it from your most important pages to attract Google’s attention and traffic.
Hide URLs You Don’t Want in Search Results
Mark checkout pages, ad landing pages, login pages, and thank you pages as “noindex” so they don’t appear in search engine results. Like other meta tags, the “noindex” tag goes into a web page’s <head>.
This tag is especially useful for duplicate pages or other pages you want to keep private.
While the bots will still crawl these pages to see the tag, they’ll take note and limit crawling them in the future.
To tell search engines not to index a specific page, use the robots meta tags with a “noindex” attribute.
Place the following tag into the <head> of your page:
<meta name="robots" content="noindex">
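In context, the tag sits alongside the other elements in the page’s <head>. Here’s a minimal sketch (the page title is just an example):

<head>
  <title>Thank You for Your Order</title>
  <!-- Tells search engine bots not to index this page -->
  <meta name="robots" content="noindex">
</head>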
Reduce Redirects
Redirects automatically send bots from one web page to another. If the bot visits a page that has been moved or deleted, the redirect takes it to the new location of the page or to a page that explains why the original page is no longer available.
Redirects require more resources from the bot since they’re visiting more pages than necessary.
Go to the Issues tab on your Site Audit.
The Warnings section will tell you how many URLs have a temporary redirect.
Crawlers follow only a limited number of redirect hops in a chain before giving up, so long chains can keep pages from being crawled.
If you’re using redirects, make sure they’re logical.
Let’s say a sports website has an article about a football game from last year with the URL “example.com/football-game-2022.”
However, this year’s game has a different URL: “example.com/football-game-2023.”
To avoid confusion, the website may set up a redirect so anyone trying to access “example.com/football-game-2022” is automatically redirected to “example.com/football-game-2023.”
This redirect would be logical.
On the other hand, the website might have multiple unnecessary redirects, like redirecting from “example.com/football-game-2022” to “example.com/football-game-recaps” and then to “example.com/football-game-2023.”
That would create a redirect chain that wastes resources and may harm the website’s crawl efficiency.
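In practice, the clean version of that redirect might look like this in an .htaccess file, assuming an Apache server (the paths come from the example above):

# Send the old article straight to the new one with a permanent (301) redirect
Redirect 301 /football-game-2022 https://example.com/football-game-2023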
Fix Broken Links
Broken links are links that point to pages the bots can’t access. They harm your crawlability and user experience.
When bots can’t access a page, they can’t crawl and index it. Too many of these errors slow the crawling process and eat into your crawl budget.
You can identify broken links as 4xx errors in the Issues tab of your Site Audit.
Click the number of pages to see which ones are broken. Then update every link pointing to a broken page so it leads to a live page instead.
Eliminate Duplicate Content
Bots consider webpages duplicate if the content is 85% identical. Reduce duplicate content to prevent bots from getting confused and crawling the same page multiple times, wasting time and resources.
Duplicate content affects indexing too since search engines may consider it a manipulation tactic.
The Issues tab highlights any duplicate content problems.
Click the page number to check the list of pages with these issues.
Some ways to fix duplicate content issues include:
- Make the content unique by organizing it into a topic cluster or assigning a focus keyword
- Add a canonical tag (“rel=canonical”) to the HTML of the duplicate pages to tell Google which page you want to appear in search results (see the example after this list)
- Add a 301 redirect from the duplicate page to the original (keeping in mind this adds to your redirect count)
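For the canonical option, here’s a minimal sketch of the tag, placed in the <head> of each duplicate page (the URL is a placeholder):

<!-- Points search engines to the preferred version of this content -->
<link rel="canonical" href="https://www.example.com/original-page/">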
Use Robots.txt
A robots.txt file is a text file you place in the root directory of your website to tell search engine bots which pages or sections of the site they should not crawl.
Robots.txt files help you block unimportant or private pages, like login pages, so bots don’t waste crawl budget on them.
Here’s a sketch of what a simple robots.txt file might look like (the paths are placeholders):
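User-agent: *
# Block crawlers from private or low-value sections
Disallow: /login/
Disallow: /checkout/
Disallow: /thank-you/

# Point crawlers to the XML sitemap
Sitemap: https://www.example.com/sitemap.xml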
Each path listed after a Disallow directive is a page or section you don’t want crawled.
To create a robots.txt file, use a robots.txt generator tool. You can also make one yourself.
First, open a blank document in any plain text editor and name it robots.txt.
Then add directives, which are the lines of instruction (such as User-agent and Disallow) shown in the example above.
When you have a robots.txt file, save it and upload it to your site. The uploading process depends on your web hosting and the site’s file structure.
Reach out to your hosting provider or search online for help on how to do this. For example, search for “upload robots.txt file to Shopify” to get specific instructions.
Avoid Orphan Pages
Orphan pages don’t have any internal links pointing to them. These pages can be hard for crawlers to find, so it’s important to link to them from other areas of your website.
Once you’ve identified your orphan pages, link to them from high-authority pages on your website.
You can also link related blog posts, product pages, and category landing pages to make it easier for the crawlers to find them.
Maximize Your Crawl Budget with Semrush
While Google can typically crawl small sites easily, larger and more complex websites should adopt a proactive approach to manage their search performance.
This means regularly monitoring and optimizing various technical aspects of a site to ensure it remains accessible, well-structured, and highly visible.
No matter your business’s size, use Semrush’s Site Audit tool to gauge website health, optimize crawlability, and correct errors before they cause damage. That way, your site will remain a strong player in the SERPs.
Source: Semrush.com