Challenge:
In Sitecore 10.2 with SXA, enhance the Sitemap with the following:
- Limit Sitemap content size to 50MB (uncompressed) and URL count to 50,000. If either limit is exceeded, split the Sitemap into multiple Sitemaps and reference them in a Sitemap Index. If you are on Sitecore 10.3, see the post "Split SXA Sitemap into Multiple Sitemaps if the Size Limit is Exceeded" for this enhancement.
- Exclude the pages from the Sitemap that have the NOINDEX robots meta tag.
- Exclude the pages from the Sitemap that are disallowed via robots.txt.
Solution:
The enhancement discussed in this post is done in Sitecore 10.2 but should work in Sitecore 10.2 Update 1 as well.
SXA Sitemap settings can be found as part of /sitecore/content/<Tenant>/<site>/Settings.
Please read the Sitecore documentation "Configure a sitemap" to ensure the Sitemap is working properly and to understand the purpose of each field in the Search Engines Sitemap section of the Settings item. Validate that the Sitemap is working properly at /sitemap.xml of your site.
Let’s enhance the Sitemap in Sitecore 10.2.
Sitemap Limits
Create a new template named “Sitemap Limits” at a suitable location. Add the two fields – “Sitemap max URLs count” and “Sitemap max size” as shown below.
Add standard values to the template as shown below.
Add an item of template “Sitemap Limits” under /sitecore/content/<Tenant>/<Site>/Settings, with the name “Sitemap Limits”. Hence, we have /sitecore/content/<Tenant>/<Site>/Settings/Sitemap Limits.
Update the field values of “Sitemap max URLs count” and “Sitemap max size” to match your project's Sitemap limits. Following the Sitemap size limits best practices, the standard values are set to a 50000 URL limit and a 50MB content size limit.
The field “Sitemap max size” accepts values in KB, MB, or bytes (for example, 2KB or 50MB). For bytes, provide only the numeric value.
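To make the accepted formats concrete, here is a minimal sketch of how such a value could be turned into a byte count. The post's code uses Sitecore's StringUtil.ParseSizeString for this. The helper below, SizeStringSketch.ParseSize, is a hypothetical stand-in that assumes 1 KB = 1024 bytes, so verify the real method's behaviour in your instance.

```csharp
using System;
using System.Globalization;

public static class SizeStringSketch
{
    // Hypothetical helper (not a Sitecore API): converts "50MB", "2KB",
    // or a plain number of bytes such as "1024" into a byte count.
    // Assumption: 1 KB = 1024 bytes and 1 MB = 1024 * 1024 bytes.
    public static long ParseSize(string value)
    {
        if (string.IsNullOrWhiteSpace(value))
            throw new ArgumentException("Size value is empty.", nameof(value));

        string text = value.Trim().ToUpperInvariant();
        long multiplier = 1;

        if (text.EndsWith("KB", StringComparison.Ordinal))
        {
            multiplier = 1024L;
            text = text.Substring(0, text.Length - 2);
        }
        else if (text.EndsWith("MB", StringComparison.Ordinal))
        {
            multiplier = 1024L * 1024L;
            text = text.Substring(0, text.Length - 2);
        }

        // Only digits are accepted for the numeric part.
        long number = long.Parse(text.Trim(), NumberStyles.None, CultureInfo.InvariantCulture);
        return checked(number * multiplier);
    }
}
```

Under these assumptions, ParseSize("2KB") yields 2048 and ParseSize("50MB") yields 52428800.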
Meta NOINDEX
Create a new template named “_Meta NOINDEX” at a suitable location. Add a field named “Meta Robots NOINDEX” as shown below.
Inherit the “_Meta NOINDEX” template in page templates like /sitecore/templates/Project/<Tenant>/Page.
<meta name="robots" content="noindex"> is a meta tag used in the HTML head of a webpage to instruct search engines not to index that page. It prevents the page from appearing in search engine results. For example, if you have a temporary or duplicate page that you don’t want indexed, you can use this meta tag to communicate that directive to search engine crawlers.
A custom rendering can read the “Meta Robots NOINDEX” field to decide whether to output <meta name="robots" content="noindex"> on the page. Here, the custom Sitemap implementation uses the same field to exclude the page from the Sitemap when it is checked.
Out of the box, a page is also excluded from the Sitemap if the “do not include” value is set on the “Change frequency” field under “Sitemap Settings” on the page item itself. Depending on project preferences, a custom rendering could also use this “Change frequency” field to add <meta name="robots" content="noindex"> to the page when “do not include” is set.
Overridden classes, new classes, and a config.
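As a rough illustration of the rendering decision described above, the sketch below maps the raw checkbox value to the tag. RobotsMetaSketch and BuildRobotsMeta are hypothetical names. The only detail taken from the post's code is that a checked checkbox field has the raw value "1".

```csharp
public static class RobotsMetaSketch
{
    // Hypothetical helper: returns the robots meta tag when the raw value of
    // the "Meta Robots NOINDEX" checkbox field is "1" (checked), else nothing.
    public static string BuildRobotsMeta(string noIndexFieldValue)
    {
        return noIndexFieldValue == "1"
            ? "<meta name=\"robots\" content=\"noindex\">"
            : string.Empty;
    }
}
```

A rendering could pass in item["Meta Robots NOINDEX"] and write the result into the page head.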
We override the following classes.
- Sitecore.XA.Feature.SiteMetadata.Sitemap.SitemapGenerator
- Sitecore.XA.Feature.SiteMetadata.Pipelines.HttpRequestBegin.SitemapHandler
- Sitecore.XA.Foundation.SitecoreExtensions.Pipelines.PreprocessRequest.FilterUrlFilesAndExtensions
We also add two new classes, SiteMapValidator and RobotTxtChecker, which are used by the overridden versions of the above classes.
Add the following overridden classes and the new classes code to a suitable project in your solution. Please check the comments for more details. Links on the file names lead to the GitHub code. Feel free to update the namespace as per your project.
CustomSitemapGenerator.cs
using System;
using System.Collections;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
using Sitecore.Data;
using Sitecore.Data.Items;
using Sitecore.Globalization;
using Sitecore.Links;
using Sitecore.Links.UrlBuilders;
using Sitecore.Security.Accounts;
using Sitecore.XA.Feature.SiteMetadata.Enums;
using Sitecore.XA.Feature.SiteMetadata.Sitemap;
using Sitecore.XA.Foundation.SitecoreExtensions.Extensions;

namespace SitecoreThinker.Feature.SEO.Sitemap
{
    public class CustomSitemapGenerator : SitemapGenerator
    {
        // Builds the nested sitemaps plus a "sitemap.xml" index that references them.
        public Hashtable GenerateSitemapIndex(Item homeItem, NameValueCollection externalSitemaps, SitemapLinkOptions sitemapLinkOptions)
        {
            Hashtable hashtable = this.BuildMultilanguageNestedSitemap(
                this.ChildrenSearch(homeItem).Where(i => i.Security.CanRead((Account)this.Context.User)),
                sitemapLinkOptions);

            NameValueCollection siteMapsURLs = new NameValueCollection();
            siteMapsURLs.Merge(externalSitemaps);

            // Prepares the Sitemap index using the URLs of the sitemaps
            foreach (var key in hashtable.Keys.Cast<string>().OrderBy(key => key).ToList())
            {
                var defaultUrlBuilderOptions = LinkManager.GetDefaultUrlBuilderOptions();
                defaultUrlBuilderOptions.AlwaysIncludeServerUrl = true;
                siteMapsURLs.Add(key, $"{LinkManager.GetItemUrl(Sitecore.Context.Database.GetItem(this.Context.Site.StartPath), defaultUrlBuilderOptions)}{key}");
            }

            hashtable["sitemap.xml"] = BuildSitemapIndex(siteMapsURLs);
            return hashtable;
        }

        // Same logic as BuildMultilanguageSitemap, but splits the result by URL count and size.
        protected Hashtable BuildMultilanguageNestedSitemap(IEnumerable<Item> childrenTree, SitemapLinkOptions options)
        {
            Hashtable hashtable = new Hashtable();
            ItemUrlBuilderOptions urlOptions1 = this.GetUrlOptions();
            SitemapLinkOptions options1 = new SitemapLinkOptions(options.Scheme, urlOptions1, options.TargetHostname);
            ItemUrlBuilderOptions urlOptions2 = (ItemUrlBuilderOptions)urlOptions1.Clone();
            urlOptions2.LanguageEmbedding = LanguageEmbedding.Always;
            SitemapLinkOptions options2 = new SitemapLinkOptions(options.Scheme, urlOptions2, options.TargetHostname);
            ItemUrlBuilderOptions urlOptions3 = (ItemUrlBuilderOptions)options2.UrlOptions.Clone();
            urlOptions3.LanguageEmbedding = LanguageEmbedding.Never;
            SitemapLinkOptions options3 = new SitemapLinkOptions(options.Scheme, urlOptions3, options.TargetHostname);
            List<XElement> pages = new List<XElement>();
            HashSet<ID> idSet = new HashSet<ID>();

            foreach (Item obj1 in childrenTree)
            {
                // Skip pages marked with the "Meta Robots NOINDEX" checkbox
                if (IsItemNoIndexedMarked(obj1))
                    continue;

                SitemapChangeFrequency sitemapChangeFrequency = obj1.Fields[Sitecore.XA.Feature.SiteMetadata.Templates.Sitemap._Sitemap.Fields.ChangeFrequency].ToEnum<SitemapChangeFrequency>();
                if (sitemapChangeFrequency != SitemapChangeFrequency.DoNotInclude)
                {
                    List<XElement> alternateUrls = new List<XElement>();
                    foreach (Language language in obj1.Languages)
                    {
                        Item obj2 = obj1.Database.GetItem(obj1.ID, language);
                        if (obj2 != null && obj2.Versions.Count > 0)
                        {
                            options2.UrlOptions.Language = language;
                            string fullLink = this.GetFullLink(obj2, options2);
                            if (!IsDisallowedInRobotstxt(fullLink))
                            {
                                XElement xelement = this.BuildAlternateLinkElement(fullLink, language.CultureInfo.Name);
                                alternateUrls.Add(xelement);
                            }
                        }
                    }

                    if (alternateUrls.Count == 1)
                    {
                        if (this.Context.Site.Language == obj1.Language.Name)
                            options1.UrlOptions.LanguageEmbedding = LanguageEmbedding.Never;
                        else
                            options1.UrlOptions.LanguageEmbedding = LanguageEmbedding.Always;
                        alternateUrls.Clear();
                    }
                    else if (alternateUrls.Count >= 2)
                    {
                        options1.UrlOptions.LanguageEmbedding = LanguageEmbedding.Always;
                        string fullLink = this.GetFullLink(obj1, options3);
                        if (!IsDisallowedInRobotstxt(fullLink))
                        {
                            XElement xelement = this.BuildAlternateLinkElement(fullLink, "x-default");
                            alternateUrls.Insert(0, xelement);
                        }
                    }

                    options1.UrlOptions.Language = obj1.Language;
                    string fullLink1 = this.GetFullLink(obj1, options1);
                    string updatedDate = this.GetUpdatedDate(obj1);
                    string lowerInvariant = sitemapChangeFrequency.ToString().ToLowerInvariant();
                    string priority = this.GetPriority(obj1);

                    if (alternateUrls.Count >= 2 && !idSet.Contains(obj1.ID))
                    {
                        options1.UrlOptions.LanguageEmbedding = LanguageEmbedding.Never;
                        string fullLink2 = this.GetFullLink(obj1, options1);
                        if (!IsDisallowedInRobotstxt(fullLink2))
                            pages.Add(this.BuildPageElement(fullLink2, updatedDate, lowerInvariant, priority, alternateUrls));
                        idSet.Add(obj1.ID);
                    }

                    if (!IsDisallowedInRobotstxt(fullLink1))
                    {
                        XElement xelement1 = this.BuildPageElement(fullLink1, updatedDate, lowerInvariant, priority, alternateUrls);
                        pages.Add(xelement1);
                    }
                }
            }

            // Split by URL count first: sitemap1.xml, sitemap2.xml, ...
            int sitemapCount = (int)Math.Ceiling((double)pages.Count / SiteMapValidator.MaxURLsPerSiteMap);
            for (int i = 0; i < sitemapCount; i++)
            {
                List<XElement> sitemapUrls = pages.Skip(i * SiteMapValidator.MaxURLsPerSiteMap).Take(SiteMapValidator.MaxURLsPerSiteMap).ToList();
                string sitemapPath = $"sitemap{i + 1}.xml";
                PrepareSiteMap(hashtable, sitemapUrls, sitemapPath);
            }

            return hashtable;
        }

        // Serializes one chunk and, if it is still too large, splits it by size: sitemap1_1.xml, sitemap1_2.xml, ...
        private void PrepareSiteMap(Hashtable hashtable, List<XElement> sitemapUrls, string sitemapPath)
        {
            XDocument xdocument = this.BuildXmlDocument(sitemapUrls);
            StringBuilder stringBuilder = new StringBuilder();
            using (TextWriter textWriter = new StringWriter(stringBuilder))
                xdocument.Save(textWriter);
            this.FixDeclaration(stringBuilder);
            string nestedSiteMap = this.FixEncoding(stringBuilder);

            if (SiteMapValidator.IsSiteMapSizeValid(nestedSiteMap))
            {
                hashtable[sitemapPath] = nestedSiteMap;
            }
            else
            {
                string[] nestedSiteMaps = SplitSitemap(nestedSiteMap).ToArray();
                for (int i = 0; i < nestedSiteMaps.Length; i++)
                {
                    hashtable[sitemapPath.Replace(".xml", $"_{i + 1}.xml")] = nestedSiteMaps[i];
                }
            }
        }

        public List<string> SplitSitemap(string originalSitemap)
        {
            // Return the original sitemap as-is if its size is within the given limit
            List<string> sitemapSegments = new List<string>();
            if (Encoding.UTF8.GetBytes(originalSitemap).Length <= SiteMapValidator.MaxSiteMapSizeInBytes)
            {
                sitemapSegments.Add(originalSitemap);
                return sitemapSegments;
            }

            // If not within the size limit, split it.
            // Note: the namespace URIs in the <urlset> strings below were lost in this copy of the post;
            // they are restored here as the standard sitemap and XHTML namespaces.
            StringBuilder currentSegment = new StringBuilder();
            using (StringReader stringReader = new StringReader(originalSitemap))
            using (XmlReader xmlReader = XmlReader.Create(stringReader))
            {
                while (xmlReader.Read())
                {
                    if (xmlReader.NodeType == XmlNodeType.Element && xmlReader.Name == "urlset")
                    {
                        if (currentSegment.Length > 0)
                        {
                            // Close the previous <urlset> tag
                            currentSegment.AppendLine("</urlset>");
                            sitemapSegments.Add(currentSegment.ToString());
                            currentSegment.Clear();
                        }
                        currentSegment.AppendLine("<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"no\"?>");
                        currentSegment.AppendLine("<urlset xmlns:xhtml=\"http://www.w3.org/1999/xhtml\" xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");
                    }
                    else if (xmlReader.NodeType == XmlNodeType.Element && xmlReader.Name == "url")
                    {
                        if (currentSegment.Length == 0)
                        {
                            throw new InvalidOperationException("Invalid sitemap structure");
                        }

                        string urlElement = xmlReader.ReadOuterXml();

                        // Calculate the size of the segment with the new URL element included
                        StringBuilder tempCurrentSegment = new StringBuilder();
                        tempCurrentSegment.Append(currentSegment);
                        tempCurrentSegment.AppendLine(urlElement);
                        tempCurrentSegment.AppendLine("</urlset>");

                        if (Encoding.UTF8.GetBytes(tempCurrentSegment.ToString()).Length > SiteMapValidator.MaxSiteMapSizeInBytes)
                        {
                            // Close the previous <urlset> tag
                            currentSegment.AppendLine("</urlset>");
                            sitemapSegments.Add(currentSegment.ToString());
                            currentSegment.Clear();
                            tempCurrentSegment.Clear();

                            // Start a new <urlset> tag
                            currentSegment.AppendLine("<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"no\"?>");
                            currentSegment.AppendLine("<urlset xmlns:xhtml=\"http://www.w3.org/1999/xhtml\" xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");
                        }

                        currentSegment.AppendLine(urlElement);
                    }
                }
            }

            if (currentSegment.Length > 0)
            {
                // Close the last <urlset> tag
                currentSegment.AppendLine("</urlset>");
                sitemapSegments.Add(currentSegment.ToString());
            }

            return sitemapSegments;
        }

        // Overrides the existing method to filter out pages marked NOINDEX or disallowed in robots.txt
        protected override StringBuilder BuildMultilanguageSitemap(IEnumerable<Item> childrenTree, SitemapLinkOptions options)
        {
            ItemUrlBuilderOptions urlOptions = GetUrlOptions();
            SitemapLinkOptions sitemapLinkOptions = new SitemapLinkOptions(options.Scheme, urlOptions, options.TargetHostname);
            ItemUrlBuilderOptions itemUrlBuilderOptions = (ItemUrlBuilderOptions)urlOptions.Clone();
            itemUrlBuilderOptions.LanguageEmbedding = LanguageEmbedding.Always;
            SitemapLinkOptions sitemapLinkOptions2 = new SitemapLinkOptions(options.Scheme, itemUrlBuilderOptions, options.TargetHostname);
            ItemUrlBuilderOptions itemUrlBuilderOptions2 = (ItemUrlBuilderOptions)sitemapLinkOptions2.UrlOptions.Clone();
            itemUrlBuilderOptions2.LanguageEmbedding = LanguageEmbedding.Never;
            SitemapLinkOptions options2 = new SitemapLinkOptions(options.Scheme, itemUrlBuilderOptions2, options.TargetHostname);
            List<XElement> list = new List<XElement>();
            HashSet<ID> hashSet = new HashSet<ID>();

            foreach (Item item5 in childrenTree)
            {
                if (IsItemNoIndexedMarked(item5))
                    continue;

                SitemapChangeFrequency sitemapChangeFrequency = item5.Fields[Sitecore.XA.Feature.SiteMetadata.Templates.Sitemap._Sitemap.Fields.ChangeFrequency].ToEnum<SitemapChangeFrequency>();
                if (sitemapChangeFrequency == SitemapChangeFrequency.DoNotInclude)
                {
                    continue;
                }

                List<XElement> list2 = new List<XElement>();
                Language[] languages = item5.Languages;
                foreach (Language language in languages)
                {
                    Item item = item5.Database.GetItem(item5.ID, language);
                    if (item != null && item.Versions.Count > 0)
                    {
                        sitemapLinkOptions2.UrlOptions.Language = language;
                        string fullLink = GetFullLink(item, sitemapLinkOptions2);
                        string name = language.CultureInfo.Name;
                        if (!IsDisallowedInRobotstxt(fullLink))
                        {
                            XElement item2 = BuildAlternateLinkElement(fullLink, name);
                            list2.Add(item2);
                        }
                    }
                }

                if (list2.Count == 1)
                {
                    if (Context.Site.Language == item5.Language.Name)
                    {
                        sitemapLinkOptions.UrlOptions.LanguageEmbedding = LanguageEmbedding.Never;
                    }
                    else
                    {
                        sitemapLinkOptions.UrlOptions.LanguageEmbedding = LanguageEmbedding.Always;
                    }
                    list2.Clear();
                }
                else if (list2.Count >= 2)
                {
                    sitemapLinkOptions.UrlOptions.LanguageEmbedding = LanguageEmbedding.Always;
                    string fullLink2 = GetFullLink(item5, options2);
                    string hreflang = "x-default";
                    if (!IsDisallowedInRobotstxt(fullLink2))
                    {
                        XElement item3 = BuildAlternateLinkElement(fullLink2, hreflang);
                        list2.Insert(0, item3);
                    }
                }

                sitemapLinkOptions.UrlOptions.Language = item5.Language;
                string fullLink3 = GetFullLink(item5, sitemapLinkOptions);
                string updatedDate = GetUpdatedDate(item5);
                string changefreq = sitemapChangeFrequency.ToString().ToLowerInvariant();
                string priority = GetPriority(item5);

                if (list2.Count >= 2 && !hashSet.Contains(item5.ID))
                {
                    sitemapLinkOptions.UrlOptions.LanguageEmbedding = LanguageEmbedding.Never;
                    string fullLink4 = GetFullLink(item5, sitemapLinkOptions);
                    if (!IsDisallowedInRobotstxt(fullLink4))
                        list.Add(BuildPageElement(fullLink4, updatedDate, changefreq, priority, list2));
                    hashSet.Add(item5.ID);
                }

                if (!IsDisallowedInRobotstxt(fullLink3))
                {
                    XElement item4 = BuildPageElement(fullLink3, updatedDate, changefreq, priority, list2);
                    list.Add(item4);
                }
            }

            XDocument xDocument = BuildXmlDocument(list);
            StringBuilder stringBuilder = new StringBuilder();
            using (TextWriter textWriter = new StringWriter(stringBuilder))
            {
                xDocument.Save(textWriter);
            }
            FixDeclaration(stringBuilder);
            return stringBuilder;
        }

        private string _robotstxtcontent = string.Empty;

        // Reads the robots.txt content from the site's Settings item (assumes the start item is named "Home")
        public string Robotstxtcontent
        {
            get
            {
                if (string.IsNullOrEmpty(_robotstxtcontent))
                    _robotstxtcontent = Sitecore.Context.Database.GetItem(Sitecore.Context.Site.StartPath.Replace("/Home", "/Settings"))["RobotsContent"];
                return _robotstxtcontent;
            }
        }

        private bool IsDisallowedInRobotstxt(string fullURL)
        {
            if (string.IsNullOrEmpty(fullURL))
                return false;
            return RobotTxtChecker.IsUrlDisallowed(Robotstxtcontent, fullURL);
        }

        private bool IsItemNoIndexedMarked(Item item)
        {
            return item["Meta Robots NOINDEX"] == "1";
        }
    }
}
CustomSitemapHandler.cs
using System;
using System.Collections;
using System.Collections.Specialized;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using System.Web;
using Microsoft.Extensions.DependencyInjection;
using Sitecore.Data.Items;
using Sitecore.DependencyInjection;
using Sitecore.Diagnostics;
using Sitecore.IO;
using Sitecore.Pipelines.HttpRequest;
using Sitecore.Web;
using Sitecore.XA.Feature.SiteMetadata.Enums;
using Sitecore.XA.Feature.SiteMetadata.Pipelines.HttpRequestBegin;
using Sitecore.XA.Feature.SiteMetadata.Sitemap;
using Sitecore.XA.Foundation.SitecoreExtensions.Extensions;
using Sitecore.XA.Foundation.SitecoreExtensions.Utils;
using SitecoreThinker.Feature.SEO.Sitemap;

namespace SitecoreThinker.Feature.SEO.Pipelines
{
    public class CustomSitemapHandler : SitemapHandler
    {
        public override void Process(HttpRequestArgs args)
        {
            Uri url = HttpContext.Current.Request.Url;
            if (!IsSiteMapRequest(url))
                return;

            if (this.CurrentSite == null || !this.IsUrlValidForSitemapFiles(url))
            {
                Log.Info("SitemapHandler (sitemap.xml) : " + string.Format("cannot resolve site or url ({0})", url), this);
            }
            else
            {
                Item settingsItem = this.GetSettingsItem();
                SitemapStatus sitemapStatus = settingsItem != null
                    ? settingsItem.Fields[Sitecore.XA.Feature.SiteMetadata.Templates.Sitemap._SitemapSettings.Fields.SitemapMode].ToEnum<SitemapStatus>()
                    : SitemapStatus.Inactive;
                string sitemap;

                switch (sitemapStatus)
                {
                    case SitemapStatus.Inactive:
                        Log.Info("SitemapHandler (sitemap.xml) : " + string.Format("sitemap is off (status : {0})", sitemapStatus), this);
                        return;

                    case SitemapStatus.StoredInCache:
                        sitemap = this.GetSitemapFromCache();
                        if (string.IsNullOrEmpty(sitemap))
                        {
                            sitemap = this.GetSitemap(settingsItem);
                            if (!SiteMapValidator.IsSiteMapValid(sitemap))
                            {
                                // Limits exceeded: build the index and nested sitemaps, and cache each one
                                Hashtable siteMapIndexAndSiteMaps = this.GetSitemapIndexAndSiteMaps(settingsItem);
                                sitemap = (string)siteMapIndexAndSiteMaps["sitemap.xml"];
                                foreach (var key in siteMapIndexAndSiteMaps.Keys)
                                {
                                    if (string.Equals((string)key, GetSiteMapFileName(), StringComparison.OrdinalIgnoreCase))
                                        sitemap = Convert.ToString(siteMapIndexAndSiteMaps[key]);
                                    this.StoreSitemapInCache(Convert.ToString(siteMapIndexAndSiteMaps[key]), this.CacheKey.Replace("sitemap.xml", Convert.ToString(key)));
                                }
                            }
                            else
                            {
                                this.StoreSitemapInCache(sitemap, this.CacheKey);
                            }
                        }
                        break;

                    case SitemapStatus.StoredInFile:
                        sitemap = this.GetSitemapFromFile();
                        if (string.IsNullOrEmpty(sitemap))
                        {
                            // Serve the nested sitemap from the file if it exists
                            sitemap = this.GetNestedSitemapFromFile($"{this.CurrentSite.Name}{this.GetSiteMapFileName()}");
                            if (string.IsNullOrEmpty(sitemap))
                            {
                                sitemap = this.GetSitemap(settingsItem); // Default
                                if (!SiteMapValidator.IsSiteMapValid(sitemap))
                                {
                                    // Limits exceeded: build the index and nested sitemaps, and save each one to a file
                                    Hashtable siteMapIndexAndSiteMaps = this.GetSitemapIndexAndSiteMaps(settingsItem);
                                    sitemap = (string)siteMapIndexAndSiteMaps["sitemap.xml"];
                                    foreach (var key in siteMapIndexAndSiteMaps.Keys)
                                    {
                                        if (string.Equals((string)key, GetSiteMapFileName(), StringComparison.OrdinalIgnoreCase))
                                            sitemap = Convert.ToString(siteMapIndexAndSiteMaps[key]);
                                        string filePath = Path.Combine(TempFolder.Folder, this.CurrentSite.Name + key);
                                        Task.Factory.StartNew(
                                            () => this.SaveSitemapToFile(filePath, Convert.ToString(siteMapIndexAndSiteMaps[key])),
                                            CancellationToken.None,
                                            TaskCreationOptions.None,
                                            TaskScheduler.Default);
                                    }
                                }
                            }
                        }
                        break;

                    default:
                        Log.Info("SitemapHandler (sitemap.xml) : unknown error", this);
                        return;
                }

                this.SetResponse(args.HttpContext.Response, sitemap);
                args.AbortPipeline();
            }
        }

        // True for /sitemap.xml, /local-sitemap.xml, and nested sitemap files such as /sitemap1.xml
        protected virtual bool IsSiteMapRequest(Uri url)
        {
            if (!url.PathAndQuery.EndsWith("/sitemap.xml", StringComparison.OrdinalIgnoreCase)
                && !url.PathAndQuery.EndsWith("/local-sitemap.xml", StringComparison.OrdinalIgnoreCase)
                && !SiteMapValidator.IsNestedSiteMap(HttpContext.Current.Request.Url.PathAndQuery))
                return false;
            return true;
        }

        protected override bool IsUrlValidForSitemapFiles(Uri url)
        {
            if (base.IsUrlValidForSitemapFiles(url))
                return true;

            string vurl = HttpContext.Current.Request.Url.PathAndQuery;
            int lastIndex = vurl.LastIndexOf("/");
            if (lastIndex < 0)
                return false;

            if (vurl.Length > lastIndex + 1)
            {
                string sitemapFileName = vurl.Substring(lastIndex + 1);
                return UrlUtils.IsUrlValidForFile(url, this.CurrentSite, $"/{sitemapFileName}");
            }
            return false;
        }

        protected Hashtable GetSitemapIndexAndSiteMaps(Item settings)
        {
            CustomSitemapGenerator service = (CustomSitemapGenerator)ServiceProviderServiceExtensions.GetService<ISitemapGenerator>(ServiceLocator.ServiceProvider);

            // Build the Sitemap index with the local sitemap and external sitemap URLs merged
            NameValueCollection urlParameters = WebUtil.ParseUrlParameters(settings[Sitecore.XA.Feature.SiteMetadata.Templates.Sitemap._SitemapSettings.Fields.ExternalSitemaps]);
            return service.GenerateSitemapIndex(this.GetHomeItem(), urlParameters, this.GetLinkBuilderOptions());
        }

        protected string GetSiteMapFileName()
        {
            string url = HttpContext.Current.Request.Url.PathAndQuery;
            int lastIndex = url.LastIndexOf("/");
            if (lastIndex < 0)
                return "";
            return url.Substring(lastIndex + 1);
        }

        protected string GetNestedSitemapFromFile(string fileName)
        {
            string sitemapFromFile = null;
            if (FileUtil.Exists(Path.Combine(TempFolder.Folder, fileName)))
            {
                using (StreamReader streamReader = new StreamReader((Stream)FileUtil.OpenRead(Path.Combine(TempFolder.Folder, fileName))))
                    sitemapFromFile = streamReader.ReadToEnd();
            }
            return sitemapFromFile;
        }
    }
}
CustomFilterUrlFilesAndExtensions.cs
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Web;
using Microsoft.Extensions.DependencyInjection;
using Sitecore.DependencyInjection;
using Sitecore.Diagnostics;
using Sitecore.Pipelines.PreprocessRequest;
using Sitecore.XA.Foundation.Abstractions.Configuration;
using Sitecore.XA.Foundation.SitecoreExtensions;
using SitecoreThinker.Feature.SEO.Sitemap;

namespace SitecoreThinker.Feature.SEO.Pipelines.PreprocessRequest
{
    public class FilterUrlFilesAndExtensions : FilterUrlExtensions
    {
        public FilterUrlFilesAndExtensions(string allowed, string blocked, string streamFiles, string doNotStreamFiles)
            : base(allowed, blocked, streamFiles, doNotStreamFiles)
        {
        }

        public override void Process(PreprocessRequestArgs args)
        {
            string requestFilePath = this.GetRequestFilePath();
            IEnumerable<string> allowedFileNames = ServiceLocator.ServiceProvider
                .GetService<IConfiguration<SitecoreExtensionsConfiguration>>()
                .GetConfiguration()
                .AllowedFileNames;

            if (allowedFileNames.Contains(requestFilePath))
                return;

            // Let nested sitemap files (for example, sitemap1.xml) through as well
            if (SiteMapValidator.IsNestedSiteMap(HttpContext.Current.Request.Url.PathAndQuery))
                return;

            base.Process(args);
        }

        protected virtual string GetRequestFilePath()
        {
            try
            {
                return Path.GetFileName(HttpContext.Current.Request.FilePath);
            }
            catch (Exception ex)
            {
                Log.Error(ex.Message, ex, this);
                return string.Empty;
            }
        }
    }
}
SiteMapValidator.cs
using Sitecore;
using Sitecore.Data.Items;
using System.Text;
using System.Text.RegularExpressions;

namespace SitecoreThinker.Feature.SEO.Sitemap
{
    public static class SiteMapValidator
    {
        // Maximum number of URLs per sitemap, read from the "Sitemap Limits" item (defaults to 50000)
        public static int MaxURLsPerSiteMap
        {
            get
            {
                string sitemapLimitPath = Sitecore.Context.Site?.StartPath?.Replace("/Home", "/Settings/Sitemap Limits");
                if (!string.IsNullOrWhiteSpace(sitemapLimitPath))
                {
                    Item siteMapLimitItem = Sitecore.Context.Database.GetItem(sitemapLimitPath);
                    if (siteMapLimitItem != null)
                    {
                        string maxURLsValue = siteMapLimitItem["Sitemap max URLs count"];
                        int maxURLs;
                        if (int.TryParse(maxURLsValue, out maxURLs))
                            return maxURLs;
                    }
                }
                return 50000;
            }
        }

        // Maximum sitemap size in bytes, read from the "Sitemap Limits" item and capped at 50MB
        public static long MaxSiteMapSizeInBytes
        {
            get
            {
                long defaultMaxSize = StringUtil.ParseSizeString("50MB");
                string sitemapLimitPath = Sitecore.Context.Site?.StartPath?.Replace("/Home", "/Settings/Sitemap Limits");
                if (!string.IsNullOrWhiteSpace(sitemapLimitPath))
                {
                    Item siteMapLimitItem = Sitecore.Context.Database.GetItem(sitemapLimitPath);
                    if (siteMapLimitItem != null)
                    {
                        string maxSizeValue = siteMapLimitItem["Sitemap max size"];
                        long setSize = StringUtil.ParseSizeString(maxSizeValue);
                        return setSize <= defaultMaxSize ? setSize : defaultMaxSize;
                    }
                }
                return defaultMaxSize;
            }
        }

        public static bool IsSiteMapURLsLimitValid(string siteMap)
        {
            return Regex.Matches(siteMap, "<loc>").Count <= MaxURLsPerSiteMap;
        }

        public static bool IsSiteMapSizeValid(string siteMap)
        {
            long size = Encoding.UTF8.GetByteCount(siteMap);
            return size <= MaxSiteMapSizeInBytes;
        }

        public static bool IsSiteMapValid(string siteMap)
        {
            return IsSiteMapSizeValid(siteMap) && IsSiteMapURLsLimitValid(siteMap);
        }

        // Matches nested sitemap names such as sitemap1.xml or sitemap2_1.xml
        public static bool IsNestedSiteMap(string url)
        {
            // The dot is escaped so that only a literal ".xml" matches
            Regex rg = new Regex(@"(?i)sitemap\d+(_\d+)*\.xml");
            return rg.IsMatch(url);
        }
    }
}
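To see which request paths count as nested sitemap files, the following sketch applies a stricter variant of the file name check. It is not the post's IsNestedSiteMap. NestedSitemapNameSketch is a hypothetical class that, as an assumption, strips the query string and anchors the pattern to the whole file name.

```csharp
using System.Text.RegularExpressions;

public static class NestedSitemapNameSketch
{
    // Assumed naming scheme from the post: sitemap<N>.xml and sitemap<N>_<M>.xml
    private static readonly Regex Pattern = new Regex(
        @"^sitemap\d+(_\d+)*\.xml$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Hypothetical helper: checks only the last path segment, ignoring any query string.
    public static bool IsNestedSitemapFile(string pathAndQuery)
    {
        if (string.IsNullOrEmpty(pathAndQuery))
            return false;

        string path = pathAndQuery.Split('?')[0];
        string fileName = path.Substring(path.LastIndexOf('/') + 1);
        return Pattern.IsMatch(fileName);
    }
}
```

With this variant, /sitemap1.xml and /sitemap2_1.xml are treated as nested sitemaps, while /sitemap.xml (the index) is not.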
RobotTxtChecker.cs
using System;
using System.Text.RegularExpressions;

namespace SitecoreThinker.Feature.SEO.Sitemap
{
    public class RobotTxtChecker
    {
        public static bool IsUrlDisallowed(string robotsTxtContent, string urlToCheck)
        {
            try
            {
                // If the robots.txt content is null or empty, assume crawling is allowed
                if (string.IsNullOrEmpty(robotsTxtContent))
                {
                    return false;
                }

                // Parse the URL to get the absolute path
                Uri urlUri = new Uri(urlToCheck);
                string urlAbsolutePath = urlUri.AbsolutePath;

                string[] lines = robotsTxtContent.Split('\n');
                foreach (string line in lines)
                {
                    if (line.Trim().StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
                    {
                        string disallowedPath = line.Substring("Disallow:".Length).Trim();

                        // Convert the disallowed path to a regex pattern
                        string regexPattern = WildcardToRegex(disallowedPath);

                        // Check if the URL matches the regex pattern
                        if (Regex.IsMatch(urlAbsolutePath, regexPattern, RegexOptions.IgnoreCase))
                        {
                            return true; // Crawling is disallowed
                        }
                    }
                }

                return false; // If no Disallow rule matches, assume crawling is allowed
            }
            catch (Exception ex)
            {
                // Console output is not visible in a web application; consider logging this instead
                Console.WriteLine($"Error occurred in IsUrlDisallowed(): {ex.Message}");
                return false;
            }
        }

        private static string WildcardToRegex(string wildcard)
        {
            // Escape characters that have special meaning in regular expressions
            string escapedWildcard = Regex.Escape(wildcard);

            // Replace escaped asterisks with a pattern that matches any characters (non-greedy)
            string regexPattern = escapedWildcard.Replace("\\*", ".*?");

            // Handle a trailing slash separately to allow for child pages
            if (regexPattern.EndsWith("/"))
            {
                // If the pattern ends with a slash, allow for no characters or any characters after the slash
                regexPattern += "(.*)?";
            }

            return $"^{regexPattern}$";
        }
    }
}
Provide the following Sitecore patch config – SitecoreThinker.Feature.SEO.config.
<?xml version="1.0"?>
<configuration xmlns:patch="http://www.sitecore.net/xmlconfig/">
  <sitecore>
    <pipelines>
      <httpRequestBegin>
        <processor patch:instead="*[@type='Sitecore.XA.Feature.SiteMetadata.Pipelines.HttpRequestBegin.SitemapHandler, Sitecore.XA.Feature.SiteMetadata']"
                   type="SitecoreThinker.Feature.SEO.Pipelines.CustomSitemapHandler, SitecoreThinker.Feature.SEO"
                   resolve="true">
          <CacheExpiration>30</CacheExpiration>
        </processor>
      </httpRequestBegin>
      <preprocessRequest>
        <processor patch:instead="*[@type='Sitecore.XA.Foundation.SitecoreExtensions.Pipelines.PreprocessRequest.FilterUrlFilesAndExtensions, Sitecore.XA.Foundation.SitecoreExtensions']"
                   type="SitecoreThinker.Feature.SEO.Pipelines.PreprocessRequest.FilterUrlFilesAndExtensions, SitecoreThinker.Feature.SEO">
          <param desc="Allowed extensions (comma separated)">aspx, ashx, asmx</param>
          <param desc="Blocked extensions (comma separated)">*</param>
          <param desc="Blocked extensions that stream files (comma separated)">*</param>
          <param desc="Blocked extensions that do not stream files (comma separated)"/>
        </processor>
      </preprocessRequest>
    </pipelines>
    <services>
      <register patch:instead="*[@implementationType='Sitecore.XA.Feature.SiteMetadata.Sitemap.SitemapGenerator, Sitecore.XA.Feature.SiteMetadata']"
                serviceType="Sitecore.XA.Feature.SiteMetadata.Sitemap.ISitemapGenerator, Sitecore.XA.Feature.SiteMetadata"
                implementationType="SitecoreThinker.Feature.SEO.Sitemap.CustomSitemapGenerator, SitecoreThinker.Feature.SEO"
                lifetime="Transient"/>
    </services>
  </sitecore>
</configuration>
Consider installing the NuGet packages from this list. Build the solution and deploy it.
Demo
For demo purposes, we considered a limited number of pages. Hence, we have set different values in the “Sitemap Limits” item.
Sitemap when the URL count limit is set to 10, and the content size is 50MB.
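The count-based split follows simple arithmetic: the number of files is the URL count divided by the limit, rounded up. The sketch below reproduces that chunking on plain strings. SitemapChunkSketch is a hypothetical class, and the file naming (sitemap1.xml, sitemap2.xml, ...) mirrors the scheme used in CustomSitemapGenerator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SitemapChunkSketch
{
    // Hypothetical helper: groups URLs into files of at most maxUrlsPerSitemap entries.
    public static Dictionary<string, List<string>> SplitByUrlCount(
        IReadOnlyList<string> urls, int maxUrlsPerSitemap)
    {
        if (maxUrlsPerSitemap <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxUrlsPerSitemap));

        var result = new Dictionary<string, List<string>>();

        // Number of files = ceil(URL count / limit)
        int sitemapCount = (int)Math.Ceiling((double)urls.Count / maxUrlsPerSitemap);
        for (int i = 0; i < sitemapCount; i++)
        {
            result[$"sitemap{i + 1}.xml"] = urls
                .Skip(i * maxUrlsPerSitemap)
                .Take(maxUrlsPerSitemap)
                .ToList();
        }
        return result;
    }
}
```

For example, 25 URLs with a limit of 10 produce three files holding 10, 10, and 5 URLs.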
Sitemap when URL count limit is set to 10, and content size is 2KB.
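When a count-based chunk is still larger than the size limit, it is split again and the parts get a numeric suffix. The sketch below shows one way to pack entries greedily by UTF-8 byte size. SitemapSizeSplitSketch, PackBySize, and the overheadBytes parameter (an assumed fixed cost for the XML declaration and the urlset tags) are hypothetical, and the post's SplitSitemap works on the XML stream rather than on a list of strings.

```csharp
using System.Collections.Generic;
using System.Text;

public static class SitemapSizeSplitSketch
{
    // Hypothetical helper: packs <url> entries greedily so that each segment,
    // including a fixed wrapper overhead, stays within maxBytes where possible.
    public static List<List<string>> PackBySize(
        IEnumerable<string> entries, long maxBytes, long overheadBytes)
    {
        var segments = new List<List<string>>();
        var current = new List<string>();
        long currentBytes = overheadBytes;

        foreach (string entry in entries)
        {
            long entryBytes = Encoding.UTF8.GetByteCount(entry);

            // Start a new segment when adding this entry would exceed the limit.
            if (current.Count > 0 && currentBytes + entryBytes > maxBytes)
            {
                segments.Add(current);
                current = new List<string>();
                currentBytes = overheadBytes;
            }

            current.Add(entry);
            currentBytes += entryBytes;
        }

        if (current.Count > 0)
            segments.Add(current);
        return segments;
    }

    // Same naming as the post's code: sitemap1.xml becomes sitemap1_1.xml, sitemap1_2.xml, ...
    public static string SegmentName(string sitemapPath, int zeroBasedIndex)
    {
        return sitemapPath.Replace(".xml", $"_{zeroBasedIndex + 1}.xml");
    }
}
```

An entry that is larger than the limit on its own still gets its own segment in this sketch, so the result should be validated separately.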
Page with Meta Robots NOINDEX field checked is excluded from the Sitemap.
Pages that match the Disallow entries in robots.txt are excluded from the Sitemap.
If certain URLs are intentionally excluded from crawling/indexing using robots.txt, we might choose not to include them in the sitemap to avoid redundancy.
Including URLs in a sitemap that are disallowed in robots.txt won’t harm the SEO directly, but it may not provide any additional benefit.
Please conduct comprehensive testing on the code using diverse robots.txt content and URLs to ensure its correctness and reliability in various scenarios. Make necessary code adjustments based on the testing results.
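One case worth including in that testing is prefix matching: under the Robots Exclusion Protocol, a Disallow value is a path prefix, so "Disallow: /private" also covers /private/page. The sketch below is a simplified prefix-based matcher to compare results against. It is not the post's RobotTxtChecker. RobotsRuleSketch is a hypothetical class that, as stated assumptions, ignores User-agent groups and Allow rules, compares case-sensitively, and supports only the * wildcard and a trailing $ anchor.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RobotsRuleSketch
{
    // Hypothetical helper: true when any Disallow rule matches the start of the path.
    public static bool IsPathDisallowed(string robotsTxt, string path)
    {
        if (string.IsNullOrEmpty(robotsTxt) || string.IsNullOrEmpty(path))
            return false;

        foreach (string rawLine in robotsTxt.Split('\n'))
        {
            string line = rawLine.Trim();
            if (!line.StartsWith("Disallow:", StringComparison.OrdinalIgnoreCase))
                continue;

            string rule = line.Substring("Disallow:".Length).Trim();
            if (rule.Length == 0)
                continue; // An empty Disallow value blocks nothing

            if (Regex.IsMatch(path, ToPrefixPattern(rule)))
                return true;
        }
        return false;
    }

    // Converts a rule into a regex anchored at the start only,
    // unless the rule ends with "$", which anchors the end as well.
    private static string ToPrefixPattern(string rule)
    {
        bool anchoredAtEnd = rule.EndsWith("$", StringComparison.Ordinal);
        if (anchoredAtEnd)
            rule = rule.Substring(0, rule.Length - 1);

        string pattern = "^" + Regex.Escape(rule).Replace("\\*", ".*");
        return anchoredAtEnd ? pattern + "$" : pattern;
    }
}
```

Running the same robots.txt content and paths through both this sketch and RobotTxtChecker.IsUrlDisallowed can highlight rules where the two disagree and where a code adjustment may be needed.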
It is also worth reading the Sitecore documentation "Prioritize a page in the search engine sitemap" to manage sitemap settings at the page item level.
Hope this helps. Happy Sitecore Learning!