SEO News

How to Create a Proper Sitemap and Why You Need One


Wandering down cobblestone streets may feel amazing, but when you need to get to a certain place, it’s better to find it on a map and follow navigation tips. The same applies to search robots—most of the time they explore your site by naturally following links. The problem is, crawlers may never reach some pages this way: either because your website is too big or because pages have no links pointing to them. That is why sitemaps exist. 

In this post, we’ll explore the benefits of an XML sitemap for a website, discuss different types of sitemaps, and learn how to create a proper sitemap. We have also included a bonus infographic at the end of this post, summarizing all sitemapping best practices covered.

What is a sitemap?

A sitemap is a file that contains a list of all the URLs on a website that crawlers and users should be aware of. 

Search crawlers rely on sitemaps to ensure they don’t miss any important content. A sitemap serves as a roadmap for search engine crawlers, guiding them to areas of the website they may have never visited before. It essentially helps search engines navigate and index the website’s content more efficiently. 

A sitemap also tells search engines like Google which pages and files you consider important on your site, and it provides additional metadata about each file, such as the last modified date and the frequency of changes. Let’s examine sitemap features in the next section.

Sitemap types

There are two main types of sitemaps: HTML and XML sitemaps. Each serves a different purpose: XML sitemaps are crucial for SEO and indexing, while HTML sitemaps focus on improving user navigation and accessibility. Let’s look at these types in more detail.

HTML sitemap

An HTML sitemap is a web page that contains a list of links, usually to the most important sections and pages of a website. Here are some nice examples of well-designed HTML sitemaps: DHL and Lufthansa

The HTML sitemap is designed mainly to help people (not robots) navigate the website quickly by providing an overview of its main sections.

XML sitemap

An XML sitemap is an XML file, typically named sitemap.xml, located in the root folder of a website. It specifies links, page modification dates, and other parameters relevant to search engines. Since all the parameters are marked with special tags, XML files resemble a website’s HTML code:

XML sitemap code

While XML sitemaps may not look visually appealing, they hold significant SEO value. Creating an XML sitemap allows crawlers to gain a comprehensive understanding of your website’s structure, easily discover new content and much more. 

In addition to the standard XML sitemap that lists website URLs, Google allows the creation of custom sitemaps for image and video content, as well as news sitemaps. Let’s take a closer look at these subtypes.

Image sitemaps

An image sitemap is specifically designed to provide search engines with information about the images on a website.

Image sitemaps are useful for informing Google about images on your site that the search engine might not otherwise discover, including images loaded through JavaScript code. They are also highly beneficial for websites that rely on media content, such as stock photo sites and image libraries, and they increase the website’s chances of being featured in image search results.

Image sitemap code
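
As a rough illustration of what such a file contains, the Python sketch below builds a small image sitemap with the standard library. The example.com URLs and the build_image_sitemap helper are made up for this example; the image namespace and the image:image and image:loc tags are the ones Google documents for image sitemaps.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def build_image_sitemap(pages):
    """pages: dict mapping a page URL to the list of image URLs shown on it."""
    # Register prefixes so the output uses xmlns="..." and xmlns:image="...".
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        # Each image on the page gets its own <image:image> block.
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical page with two images.
example = build_image_sitemap({
    "https://www.example.com/gallery/": [
        "https://www.example.com/img/a.jpg",
        "https://www.example.com/img/b.jpg",
    ]
})
```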

News sitemaps

News sitemaps are tailored for websites that publish news articles. They provide search engines with information about new articles, including URLs, publication dates, titles, and other relevant metadata.

News sitemaps help search engines identify and index time-sensitive news content more effectively, potentially improving its visibility in news-related search results.

It’s crucial for Google to discover news articles promptly. To make sure users get up-to-date information, Google only allows fresh articles published within the last two days to be included in a news sitemap. A single news sitemap can include no more than 1,000 articles, so older content should be removed. The last condition, and probably the most important one, is that your website should be registered with Google News.

News sitemap code
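
The freshness and size limits described above can be enforced before the file is written. The sketch below is one possible way to do it, assuming each article is a plain dict with a timezone-aware published datetime; the select_news_articles helper and the field names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_NEWS_URLS = 1000           # per-file limit mentioned above
MAX_AGE = timedelta(days=2)    # only articles from the last two days


def select_news_articles(articles, now=None):
    """Return the articles that may go into a news sitemap, newest first.

    articles: list of dicts with 'url' and 'published' (timezone-aware datetime).
    """
    now = now or datetime.now(timezone.utc)
    # Keep articles published at most two days ago and not in the future.
    fresh = [a for a in articles if timedelta(0) <= now - a["published"] <= MAX_AGE]
    fresh.sort(key=lambda a: a["published"], reverse=True)
    return fresh[:MAX_NEWS_URLS]
```

Running this filter each time the sitemap is regenerated keeps older articles out of the file without manual cleanup.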

Video sitemaps

Creating a video sitemap helps Google find and understand the video content on your site, especially newly added content or videos that may not be easily discovered by crawlers. 

In a video sitemap, you can specify details such as video length, rating, family-friendliness, and more. 

Video sitemap code

According to Google, there is no need to list videos that are unrelated to the content of the host page. For example, videos that serve as minor additions to the main textual content, or have nothing to do with it, should not be included. Also, none of the files referenced in the video sitemap should be disallowed for crawling by robots.txt rules or blocked by firewalls or similar mechanisms. These files must also be accessible without metafiles or login credentials.
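
To make the structure more concrete, here is a minimal Python sketch that writes one video entry. The build_video_sitemap helper, the dict keys, and the example.com URLs are assumptions for illustration; the video namespace and tag names follow Google's video sitemap format, and only a handful of the available tags are shown.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"


def build_video_sitemap(page_url, video):
    """video: dict with thumbnail, title, description, content_url,
    duration_seconds and family_friendly keys (hypothetical field names)."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("video", VIDEO_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url

    entry = ET.SubElement(url, f"{{{VIDEO_NS}}}video")
    fields = [
        ("thumbnail_loc", video["thumbnail"]),
        ("title", video["title"]),
        ("description", video["description"]),
        ("content_loc", video["content_url"]),
        ("duration", str(video["duration_seconds"])),  # length in seconds
        ("family_friendly", "yes" if video["family_friendly"] else "no"),
    ]
    for tag, value in fields:
        ET.SubElement(entry, f"{{{VIDEO_NS}}}{tag}").text = value
    return ET.tostring(urlset, encoding="unicode")
```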

Sitemap XML tags and their settings

I’ve already mentioned that along with website URLs, an XML sitemap features various tags that specify page parameters. Here’s an excerpt from our blog’s XML sitemap for you to see how the tags are organized.

What XML sitemap code looks like

Now, let’s go through every one of them for you to understand which purpose the tags serve and how to use them properly. 

  • <urlset> is a mandatory element that encapsulates a sitemap and indicates which version of the XML Sitemap protocol standard is used (via the xmlns attribute). The value of this attribute is also called the namespace
  • <url> is another mandatory element that marks a block with all information pertaining to a single URL
  • <loc> is the last mandatory element that indicates the page URL itself. All listed URLs should be fully-qualified—contain site protocol (HTTP or HTTPS) as well as the www prefix if it is included in your domain name
  • <changefreq> defines how often a page can change. You can set this parameter as always, hourly, daily, weekly, monthly, yearly or never
  • <priority> sets the crawling priority of a page (valid values range from 0.0 to 1.0)
  • <lastmod> indicates the last time the page content was updated

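Make sure that you use consistent syntax when specifying URLs: stick to one protocol, one www or non-www variation, and one trailing-slash convention throughout the file. Also, sitemap files should be UTF-8 encoded.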
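
The tags above can be put together in a few lines of Python. This is only a minimal sketch using the standard library: the build_sitemap helper and the example.com URLs are invented for illustration, and only the mandatory tags plus <lastmod> are emitted.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """pages: iterable of (url, lastmod) pairs; lastmod is a date string or None."""
    # Make the sitemap namespace the default one (xmlns="...") on <urlset>.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        if lastmod:
            ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    # Returns UTF-8 encoded bytes, including the XML declaration.
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


sitemap_bytes = build_sitemap([
    ("https://www.example.com/", "2024-01-15"),
    ("https://www.example.com/blog/", None),
])
```

The result can be written to sitemap.xml as is, since the bytes are already UTF-8 encoded.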

Now, in the past, Google consulted <changefreq> and <priority> tags to define which pages should be prioritized during website crawling. Therefore, webmasters had to carefully set these parameters for each page. This hasn’t been the case for years now. John Mueller and Gary Illyes have confirmed that Google now ignores <priority> tags with Gary Illyes calling the tag “a bag of noise”. 

Gary Illyes on priority and lastmod setting in Twitter

The same goes for the <changefreq> tag: John Mueller has said it is not taken into account either.

The only optional tag that still counts is <lastmod>—Google may consult it if the tag is precise enough. By precise Gary Illyes probably means that the tag should be only updated when some significant changes were made to the content. Updating the tag to fool Google into thinking that your content is fresh won’t cut it. 

When used properly, <lastmod> helps Google understand when the content was last updated and whether it needs to be recrawled. Besides, the tag helps the search engine figure out who was the original publisher.

You can find more complex sitemap examples and full documentation about them at sitemaps.org.

All tags required for use in your image/news/video sitemaps are described in Google’s documentation for each sitemap type.

Websites that need a sitemap

According to Google, a sitemap may be necessary in the following situations:

  • Your website is large: Managing interlinking on a large site can be challenging. There is also a greater chance that Googlebot fails to discover some of your new pages. An XML sitemap can remedy this issue by helping Google discover deeply rooted content. 
  • Your website is new and lacks backlinks: Web crawlers rely on links from other websites to crawl and discover new pages. Without external links to your pages, it becomes more difficult for Googlebot to find them. But thanks to sitemaps, you won’t be waiting for ages for Google to learn that your content exists.
  • Your website has tons of fresh media content: Google can consider additional information from sitemaps for search purposes, particularly if your site contains rich media or is featured in Google News.

But for smaller websites with approximately 500 pages or fewer, thorough internal linking, and minimal media files or news pages that are crucial for rankings, a sitemap may not be necessary. In cases like these, the decision to create a sitemap is ultimately up to you.

What are the benefits of having an XML sitemap?

As mentioned above, it is recommended to create a sitemap if you have a large website with numerous media files, or if you’re starting a new project. But what should you do if your website doesn’t fall into any of these categories?

According to Google, adding a sitemap to your website can always be beneficial, and you will not be penalized for having one. It’s also true that creating a sitemap can bring you plenty of other tangible benefits. 

Let’s look at those benefits in greater detail:

Faster indexation

XML sitemaps help search engines understand which pages you want to be indexed. By including a URL in the sitemap, you signal to Google that the page is of high quality. This will help search engines discover new pages more quickly, resulting in a faster indexation process and improved website visibility in search results. 

Mind that Google can choose to ignore your request. To get indexed, the page needs to comply with Google’s quality standards. 

Better indexation of deep pages

Sitemaps play a crucial role in facilitating the indexation of deep pages, which are several clicks away from the homepage and may not be easily discoverable through regular website navigation. 

By including deep pages in the sitemap, you ensure that search engines are aware of their existence and can find them more easily. 

This is especially beneficial if your website has complex structures, such as pages behind form submissions or those accessible only through search functionality.

Monitoring of indexed pages

While a sitemap itself does not directly help in monitoring indexed pages, it can indirectly assist in the process by providing a reference point for comparing the indexed pages with those listed in the sitemap. 

When combined with Google Search Console, you can determine which indexed URLs are covered in the sitemap.

By periodically comparing your sitemap with the actual indexed pages, you can identify any discrepancies or missing pages. If you notice that certain pages listed in the sitemap are not indexed, it may indicate issues with crawling or indexing, such as technical errors, blocking directives, or content quality problems.
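
The comparison itself is simple set arithmetic. The sketch below assumes you already have two lists of URLs, one taken from your sitemap and one exported from a report of indexed pages; the compare_with_index helper is hypothetical.

```python
def compare_with_index(sitemap_urls, indexed_urls):
    """Compare URLs listed in the sitemap with URLs known to be indexed."""
    in_sitemap = set(sitemap_urls)
    indexed = set(indexed_urls)
    return {
        # Listed in the sitemap but not indexed: worth investigating.
        "not_indexed": sorted(in_sitemap - indexed),
        # Indexed but missing from the sitemap: possibly forgotten pages.
        "indexed_not_in_sitemap": sorted(indexed - in_sitemap),
    }
```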

Including regional versions of your pages

To inform Google about the regional versions of your pages, you can list them in your sitemap along with special hreflang attributes. 

This is not the only method to properly organize a multilingual website, but some webmasters find it is the easiest way.

Including regional versions of your pages in a sitemap
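
In sitemap form, every regional version gets its own <url> entry that lists all of the alternates, itself included, as xhtml:link elements. The Python sketch below shows that shape; the build_hreflang_sitemap helper and the example.com URLs are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_hreflang_sitemap(page_versions):
    """page_versions: list of dicts, each mapping an hreflang code to that version's URL."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for versions in page_versions:
        for page_url in versions.values():
            url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
            # Every version lists all alternates, including itself.
            for lang, href in versions.items():
                ET.SubElement(
                    url,
                    f"{{{XHTML_NS}}}link",
                    {"rel": "alternate", "hreflang": lang, "href": href},
                )
    return ET.tostring(urlset, encoding="unicode")


hreflang_xml = build_hreflang_sitemap([
    {"en": "https://www.example.com/en/", "de": "https://www.example.com/de/"},
])
```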

Which pages to include in a sitemap

This section of our site-mapping crash course is extremely important, so read carefully!

There is a common misconception that to assist Google in crawling and indexing your site, you need to include all your website pages in an XML sitemap. However, it’s actually the opposite. Google doesn’t need to see all the irrelevant pages that may exist on your site. Instead, you only need to inform Google about the high-quality pages that you believe deserve a high ranking. By including specific pages in your XML sitemap, you are asking Google to prioritize them. Whether Google follows your advice is a different story entirely. Below, we will list the pages that should (and should not) be included in your sitemap.

Pages returning 200 OK code

As a rule of thumb, all the pages you add to a sitemap must return a 200 OK status code and be filled with high-quality, user-focused content. This means you should exclude all pages that do not meet these criteria.

Check this easy-to-follow guide on HTTP status codes if you need assistance understanding the different status codes.

In addition, it’s important to highlight the main types of 200 OK pages that should definitely be included in a sitemap:

  • Home Page
  • Main menu pages
  • Category pages
  • Subcategories
  • Blog pages
  • Internal pages
  • Contact pages
  • Pages with brand information

These pages are crucial for your rankings and overall business success. This is why you must ensure that Google is aware of them.

Pages that should not be included in the sitemap

3XX pages

A 3XX response code indicates that a page has been redirected to a new address. Pages like these shouldn’t be in your XML sitemap.

Instead, focus on including the final destination URLs in your sitemap, as they serve the content after the redirect. This helps search engines understand your website’s structure and properly index and rank your content.

4XX pages

4XX response codes mean that the requested page does not exist or has restricted access. Exclude these pages from your sitemap. 

404 pages are deleted pages, so if they were removed on purpose, keep them away from your XML sitemap. 

Soft 404 pages are also bad for SEO. These are pages that return a 200-level success status code but behave like missing pages: they have little or no content, redirect to the homepage, or are 404 pages blocked by robots.txt. It’s highly recommended to take the time to fix this issue.

Remember to create a custom 404 page to ensure a smooth user experience. 

Another popular 4XX status code is 401, which indicates that Google is “not authorized” to access the page. These kinds of pages are normally intended for logged-in users and should not be indexed by Google. For this reason, it isn’t worth including these pages in your sitemap. 

5XX pages

5XX status codes indicate an issue with your web server. The most common is the 503 Service Unavailable error, which says that the server is temporarily down. This can occur if the web server is under maintenance or overloaded. 

If the error occurs only once, there is no cause for concern as it is likely due to scheduled web server maintenance. But if the problem persists, you’ll have to figure out what’s causing the issue and resolve it immediately. Otherwise, Google may conclude that your website is poorly maintained. 

For 5XX pages, it is not a matter of adding them to your sitemap or not, but rather fixing the issue to ensure they have a response code of 200 OK.

To easily identify all such pages, you can check the website audit report in SE Ranking. Once you launch the check, navigate to the HTTP Status Code section of the Issues Report.

HTTP Status Code section of the Issues Report in SE Ranking

Non-indexable pages

Every website has a number of utility pages that are important for users but not for search engines: login pages, pages available upon logging in, sorting and filtering pages, etc. A common practice is to block such pages with the robots.txt file so that Google can’t access them. Alternatively, you can let Google crawl the page but restrict its indexing with robots meta directives (noindex or none).

Naturally, all these pages shouldn’t be on your XML sitemap. If a page cannot be indexed, but is featured on your sitemap, it not only confuses Google, but also wastes your crawl budget. The same goes for pages blocked by robots.txt—Google won’t crawl them.

It is also a good idea to audit your site to make sure the noindex directive and robots.txt file are used properly, and you have not accidentally blocked the wrong pages.

To find all such website pages, go to the Crawling section of SE Ranking’s Issues Report.

Blocked by noindex pages detected by SE Ranking's website audit

By clicking on the number of respective pages, you’ll see a full list of URLs marked as noindex or blocked by robots.txt along with other important parameters such as the number of internal links pointing to the page, all the robots meta tags and x-robots tags, and page title. You’ll also immediately see whether the page is currently included in your XML sitemap.

If you’re not yet an SE Ranking user, you can test the platform out for free under the 14-day trial. 

Non-canonical pages

A canonical tag is used to mark which of two or more similar pages is the main one. Google is supposed to index and rank the preferred page and ignore duplicate or near-duplicate ones. Similar pages marked with canonical tags can have totally different URLs or similar URLs (e.g. URLs generated in the process of sorting and filtering, URLs with UTM tags and tracking parameters). 

Sometimes, canonical tags are also used when a page is accessible through both HTTP and HTTPS, as well as both with and without the www prefix. In this case, the main version of the page (for instance the HTTPS non-www variation) is marked with rel=canonical. However, most websites prefer server-side redirects in this case, as they are more straightforward and guarantee that only the preferred website variation will get indexed.

Canonical tags work differently than redirects: they are more like recommendations than directives, and Google can follow or ignore them. This is why having non-canonical URLs in a sitemap is really confusing from the search engines’ viewpoint. If the page is not marked as canonical, it tells Google you don’t want to have it indexed. At the same time, by adding the page to your XML sitemap, you encourage Google to index it. As a result, Google can index all the page versions, and you’ll have to deal with keyword cannibalization.

Check our guide on keyword cannibalization to learn how it can hinder your site performance and how to detect and prevent it.

Alternatively, the search engine may choose to index the non-canonical URL instead of the preferred one, which is something you don’t want either.
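
The inclusion rules from this section can be summed up as a single check. The sketch below is one way to express it, assuming you have crawl data for each page; the Page record and its field names are hypothetical, not the output format of any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    url: str
    status: int                       # HTTP status code from the latest crawl
    noindex: bool = False             # page carries a noindex directive
    blocked_by_robots: bool = False   # page is disallowed in robots.txt
    canonical: Optional[str] = None   # canonical URL, or None if no tag is set


def belongs_in_sitemap(page):
    """Apply the rules above: 200 OK, indexable, and canonical."""
    if page.status != 200:
        return False  # 3XX, 4XX and 5XX pages stay out
    if page.noindex or page.blocked_by_robots:
        return False  # non-indexable pages stay out
    if page.canonical is not None and page.canonical != page.url:
        return False  # the page points to another URL as canonical
    return True
```

A sitemap generator could run this check on every crawled page and list only the URLs that pass.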

How to create an XML sitemap file

We’ve finally reached the most practical section of our post; now let’s look at some of the most common approaches to creating a sitemap.

CMS plugins

The easiest way to generate an XML sitemap is by utilizing your CMS. Since a CMS stores information about all the website pages and any adjustments you make, it can automatically feed all the data into a dynamic sitemap. 

Some CMS platforms have built-in sitemap-generating capabilities. This is the case for Magento, Shopify, Wix, and Squarespace. WordPress has also shipped a basic built-in sitemap since version 5.5, although dedicated plugins give you more control over it. For other popular CMS platforms like Joomla, Drupal, or OpenCart, you will need to use special plugins.

In the table above, see a list of the most popular sitemap plugins for various CMS platforms. When picking the right one for your site, pay special attention to the plugin’s range of features. SEO-friendly solutions should allow you to exclude inappropriate pages, such as 404, redirected, noindex, and canonicalized pages, from your XML sitemap. Also, pay attention to customization capabilities, as you want to be able to easily modify the list of pages included in the file.

Sitemap generators

If your site is not built on a CMS or the CMS options are not suitable, you can use dedicated sitemap tools to generate an XML file with all the necessary pages. Doing this is typically easier than installing an additional plugin.

There are both free and paid options available on the market, so carefully study each tool’s capabilities. While you should be able to generate a customizable dynamic sitemap with a paid generator, most free solutions are too basic and lack important features. Free generators often create a static sitemap that includes all site pages, including canonicalized, noindex, and redirect URLs.

If using a paid solution or CMS features is not an option, consider creating a sitemap using SE Ranking’s Website Audit tool. This tool can generate a static sitemap for your website based on the latest crawl in just a few minutes. By default, the tool only includes pages returning a 200 status code in an XML sitemap while omitting 4xx, 3xx, noindex, and other pages that you typically want to exclude.

How to generate XML map in SE Ranking

Keep your crawl limits in mind when generating your XML sitemap with SE Ranking’s Website Audit tool. Make sure that the number of pages on your website does not exceed your set crawl limits; otherwise, important pages may be missing from your sitemap.

If none of the ready-made solutions work for your website, you can create a custom XML sitemap. Since this requires coding skills, you’ll need a developer to craft a proper dynamic sitemap for your website.

Static vs dynamic sitemaps

By now you may be wondering how much time you’ll have to spend updating your sitemap every time you publish a new page or revamp an existing one. Keeping your XML sitemap up-to-date can really be a daunting task if you choose to create a static sitemap. The good news is that you can easily avoid all the hassle by creating a dynamic sitemap instead. 

This kind of sitemap updates automatically the moment you make any changes to your website. For example, whenever you delete a page, and it becomes 404, the page would be removed from the sitemap. If you mark a page as noindex or add it to the robots.txt file, it would also be deleted from the sitemap. On the other hand, whenever you create a new page and mark it as canonical—it will immediately be added to your sitemap. It really is that easy—all you need is to properly set up the tool that will be generating your dynamic sitemaps.

Other basic rules for sitemap creation

Before submitting your sitemap to GSC, there are a few additional fundamental steps to cover. Let’s explore them.

Combining image, video and news sitemaps

In some cases, your site content may fall into multiple categories, such as news articles with embedded images and videos. To handle this, you can combine image, video and news sitemaps into a single larger file. 

To do this:

1️⃣ For each sitemap you want to use in a common sitemap, specify the respective namespace that declares the supported tags. This is done with the xmlns attribute of the urlset tag.

2️⃣ To combine sitemaps, add the tags from the desired sitemaps, one after the other, to the appropriate <url> tag, as described in each sitemap documentation.

Combining image, video and news sitemaps

Refer to the Google Documentation for more details.
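
The two steps above can be sketched in Python as follows: both extension namespaces are declared on <urlset>, and the news and image tags sit side by side inside one <url> block. The build_combined_entry helper, its parameters, and the example values are assumptions for illustration; the namespaces and tag names follow Google's news and image sitemap formats.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"


def build_combined_entry(page_url, title, publication, language, published, image_url):
    """Build a sitemap with one news article that also has an embedded image."""
    # Step 1: declare a namespace for every sitemap type that is combined.
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)
    ET.register_namespace("news", NEWS_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url

    # Step 2: add the tags of each sitemap type to the same <url> block.
    news = ET.SubElement(url, f"{{{NEWS_NS}}}news")
    pub = ET.SubElement(news, f"{{{NEWS_NS}}}publication")
    ET.SubElement(pub, f"{{{NEWS_NS}}}name").text = publication
    ET.SubElement(pub, f"{{{NEWS_NS}}}language").text = language
    ET.SubElement(news, f"{{{NEWS_NS}}}publication_date").text = published
    ET.SubElement(news, f"{{{NEWS_NS}}}title").text = title

    image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
    ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")
```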

Managing your sitemaps with a sitemap index file

Search engines have limitations on crawling XML sitemaps. The maximum file size when uncompressed is 50 MB, and a single file can contain no more than 50,000 URLs. Google has imposed these limitations to ensure that your web server does not get overloaded when serving very large files. If you have a huge website and a huge sitemap file, you’ll need to split it into several smaller ones. Then, you can create a sitemap index file that lists these smaller sitemaps and submit the index file to Google. 

To accomplish this:

  1. Create several separate sitemap files and, if needed, compress them using a tool like gzip to save bandwidth. Compressed files will have the .gz extension, for example, sitemap.xml.gz.
  2. Add all the separate sitemaps to the sitemap index file, as shown in the example below:

Sitemap index file

The example shows a sitemap index file in XML format listing two sitemaps.

For more information, consult Google’s Documentation.
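
A minimal Python sketch of the splitting step is shown below. The helper names are hypothetical; the <sitemapindex> and <sitemap> tags come from the sitemaps.org protocol, and gzip compression is done with the standard library.

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a long URL list into pieces that fit into one sitemap each."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls):
    """Build the index file that points to the individual sitemap files."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for sitemap_url in sitemap_urls:
        sitemap = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap, f"{{{SITEMAP_NS}}}loc").text = sitemap_url
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


def compress_sitemap(xml_bytes):
    """Gzip a sitemap so it can be served as sitemap.xml.gz."""
    return gzip.compress(xml_bytes)
```

Each chunk would be written out as its own sitemap file, and the URLs of those files are what goes into the index.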

Adding a sitemap to the robots.txt file

The safest and easiest way to inform search engines about your sitemap is by adding a reference to it in your robots.txt file. The robots.txt file is typically located in the root directory of your web server. Simply include the Sitemap directive, as shown in the screenshot below. 

How to add a sitemap to the robots.txt file

This practice enables search engines to discover your sitemap and understand your website more efficiently, leading to better indexing and visibility of your web pages.

Keep in mind that by making your sitemaps public in the robots.txt file, they can be easily found by anyone.
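
To check how a crawler would read the reference, you can parse the file with Python's built-in robots.txt parser. The robots.txt content below is a made-up example, and the sitemaps_from_robots helper is hypothetical; the site_maps() method is available in Python 3.8 and later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a Sitemap line at the end.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
"""


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs referenced in a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return parser.site_maps() or []


sitemap_urls = sitemaps_from_robots(ROBOTS_TXT)
```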

Refer to our comprehensive guide on how to create a robots.txt file for all the necessary information about it. 

How to submit your sitemap to Google

Depending on the tool you used to generate your sitemap, you may need to manually add it to your site, or it may already be there—this would be the case for dynamic lists generated by a CMS. Normally, whenever an XML sitemap is automatically added to a website, it is located at yoursite.com/sitemap.xml. 

If you’ve used one of the tools that generated an XML sitemap file, you’ll have to manually upload it to your website’s root folder. You can do this using your cPanel or via an FTP client, for example, Total Commander or FileZilla. It’s a good idea to check the sitemap’s validity before uploading, especially if the file was created manually: use one of the free XML sitemap validators to make sure your XML sitemap works as you intend.
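
If you prefer a quick local check before uploading, something along these lines can catch the most basic problems. This is only a sketch with a hypothetical check_sitemap helper: it tests well-formedness, the root element, the size limits, and whether URLs are fully qualified, and it is not a substitute for a full schema validator.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_BYTES = 50 * 1024 * 1024   # 50 MB uncompressed
MAX_URLS = 50_000


def check_sitemap(xml_bytes):
    """Return a list of problems found in a sitemap; an empty list means none."""
    problems = []
    if len(xml_bytes) > MAX_BYTES:
        problems.append("file is larger than 50 MB")
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        problems.append(f"invalid XML: {exc}")
        return problems
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        problems.append("root element is not <urlset> in the sitemap namespace")
    locs = [(el.text or "").strip() for el in root.iter(f"{{{SITEMAP_NS}}}loc")]
    if len(locs) > MAX_URLS:
        problems.append("more than 50,000 URLs")
    for loc in locs:
        if not loc.startswith(("http://", "https://")):
            problems.append(f"not a fully qualified URL: {loc}")
    return problems
```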

Once a valid sitemap is on your website, you can submit it to Google Search Console (GSC). To do so, go to the Sitemaps report, click Add a new sitemap, enter your sitemap URL, and hit the Submit button.

How to submit your sitemap to Google

Soon, you’ll see in the Status column whether Google was able to properly process your sitemap. If everything went well, the status will be Success. If the sitemap was parsed but Google detected some errors, the status will be Has errors. Finally, if Google fails to crawl your sitemap, you’ll end up with the Couldn’t fetch status. 

Make sure to fix all the sitemap errors so that your sitemap status is Success.

Sitemap status Success in GSC

In the same table of your Sitemap report, you’ll see the number of discovered URLs—ideally, it should equal the total number of URLs added to your sitemap.

Finally, by clicking the three dots button, you’ll get to the Page indexing report that will help you better understand how Google crawls and indexes your site. Studying the report will help you remove some low-quality pages from your XML sitemap and add pages you might have missed.

Sitemap errors

There are some important sitemap errors to consider. Here’s how to address them should they occur:

  1. Invalid XML format: Sitemaps must adhere to the XML format rules. Make sure your sitemap is properly structured with the correct XML syntax. Use XML validators, as mentioned earlier, to validate your sitemap.
  2. URLs blocked by robots.txt, not accessible, or not followed: Depending on whether you want to index these URLs, you’ll have to either lift the block or remove them from your sitemap. 
  3. Missing or incorrect URLs: Verify that all URLs included in your sitemap are correct and accessible. Check for any missing or broken links and update them accordingly. Also, include each URL’s canonical version to prevent duplicate content issues.
  4. Large sitemap size: If your sitemap exceeds the maximum allowed size (usually 50 MB or 50,000 URLs), divide it into multiple sitemaps and create a sitemap index file to reference them.
  5. Path mismatch: Ensure consistency in your sitemap regarding the presence of “www” in the URLs. If your sitemap is located at https://example.com/sitemap.xml, none of the URLs it features should include www. If your sitemap is located at https://www.example.com/sitemap.xml, all of the URLs it lists should include www.
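
The path mismatch check from the last item is easy to automate. The sketch below compares the host of every listed URL with the host of the sitemap itself; the find_path_mismatches helper is hypothetical and uses only the standard library.

```python
from urllib.parse import urlsplit


def find_path_mismatches(sitemap_url, listed_urls):
    """Return listed URLs whose host differs from the sitemap's own host."""
    expected_host = urlsplit(sitemap_url).netloc.lower()
    return [
        url for url in listed_urls
        if urlsplit(url).netloc.lower() != expected_host
    ]
```

An empty result means that the www and non-www variations are used consistently.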

We have a dedicated guide that covers common sitemap errors, which can help you further in resolving these issues.

Regularly monitoring your sitemap, validating its structure, and addressing any errors or issues promptly will help search engines effectively crawl and index your website.

To conclude

Congrats! You’ve finished our crash course on SEO site-mapping. We’ve additionally put all the sitemap best practices in the infographic which you can find below. Follow these best practices, and you’ll have no problems with Google finding and crawling all the quality pages of your site.

Bonus: Infographic on creating a proper sitemap

Infographic on creating a proper sitemap




