Wandering down cobblestone streets may feel amazing, but when you need to get to a certain place, it’s better to find it on a map and follow navigation tips. The same applies to search robots—most of the time they explore your site by naturally following links. The problem is, crawlers may never reach some pages this way: either because your website is too big or because pages have no links pointing to them.
That is why sitemaps exist. To make sure they don’t miss any important pages, search crawlers occasionally consult a sitemap—it helps them discover areas of a website they’ve never visited before.
What we are calling a sitemap
A sitemap is a file listing all the website pages both crawlers and users need to be aware of. It is similar to a book's table of contents, except the sections are links.
There are two main types of sitemaps: HTML and XML.
An HTML sitemap is a web page that lists links. Usually, these are links to the most important sections and pages of the website. Here are some nice examples of HTML sitemaps: DHL, Lufthansa, SmartFares.
An HTML sitemap is designed mainly for people rather than robots, and it helps users quickly navigate across the main sections of the site.
An XML sitemap is an XML file (e.g. sitemap.xml) located in the website’s root folder that specifies links, page modification dates, and other parameters that matter for search engines. Since all the parameters are marked with special tags, XML files look pretty similar to a website’s HTML code:
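For instance, a minimal sitemap.xml (the URLs are placeholders) might look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/blog/</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
```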
An XML Sitemap may look unappealing, but there’s great SEO value in it. The file helps crawlers get a holistic view of your website, better understand its structure, quickly discover new content, and much more.
In this post, we’ll go through the benefits an XML sitemap can bring to a website, talk about different sitemap types, and finally figure out how to create a proper sitemap.
What are the benefits of having an XML sitemap?
It is recommended to have a sitemap if you run a huge website or if you’re just starting a new project. In the first case, a sitemap will help Google discover some deeply rooted content. Meanwhile, with a brand-new website, thanks to a sitemap you won’t be waiting for ages for Google to learn that your content even exists. But what if your website is neither large nor new? Should you still consider having a sitemap?
According to Google, you can always benefit from adding a sitemap to your website and never get penalized for having one. Besides, a sitemap can bring you plenty of other tangible benefits:
- XML sitemaps help search engines understand which pages you would like to have indexed: by adding a URL to a sitemap, you signal to Google that the page is a quality one. Mind, though, that Google may still ignore your request: to get indexed, a page needs to comply with Google's quality standards.
- A sitemap can help your website recover if its web pages were hit by the Google Panda update (especially useful for large websites).
- Sitemaps help you control indexing of certain pages in Google Search Console.
- You can tell Google about the regional versions of your pages by listing them in your sitemap along with special hreflang attributes. This is not the only way to properly organize a multilingual website, but some webmasters believe it is the easiest way.
- An XML sitemap is your legal helper in confirming your content rights as it mentions the page publication and update time.
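The hreflang setup mentioned in the list above relies on the xhtml:link extension. Assuming the <urlset> tag also declares the xhtml namespace (xmlns:xhtml="http://www.w3.org/1999/xhtml"), a page with English and German versions can be described like this (URLs are placeholders):

```xml
<url>
  <loc>https://www.example.com/en/page.html</loc>
  <xhtml:link rel="alternate" hreflang="en"
              href="https://www.example.com/en/page.html"/>
  <xhtml:link rel="alternate" hreflang="de"
              href="https://www.example.com/de/page.html"/>
</url>
```

Note that each language version gets its own <url> block, and every block lists all the alternates, including itself.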
How many sitemaps do you need?
Before creating a sitemap, you need to understand how many sitemaps you need. Normally, one is enough. Still, there are a number of cases when you'd have to create several sitemaps.
Splitting large sitemaps
Search engines will only crawl a sitemap with a maximum file size of 50MB when uncompressed and containing no more than 50,000 URLs. Google has imposed such limitations for a reason—they ensure that your web server does not get overloaded when serving very large files.
It is also recommended to compress sitemap files using a tool such as gzip to save bandwidth. When a sitemap is compressed, the .gz extension is added to the filename—e.g. sitemap.xml.gz.
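Compressing a sitemap takes just a few lines; here is a quick sketch in Python using the standard gzip module (the file names are illustrative):

```python
import gzip
import shutil

# A tiny sitemap to compress (the content is illustrative).
sitemap = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
    '  <url><loc>https://www.example.com/</loc></url>\n'
    '</urlset>\n'
)
with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write(sitemap)

# Compress sitemap.xml into sitemap.xml.gz; crawlers fetch the
# .gz file and decompress it on their side, saving bandwidth.
with open("sitemap.xml", "rb") as source, gzip.open("sitemap.xml.gz", "wb") as target:
    shutil.copyfileobj(source, target)
```

Remember that the 50 MB size limit applies to the uncompressed file, so compression saves bandwidth but does not raise the limit.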
So, if you have a huge website and, thus, a huge sitemap file, you’ll have to break it into several smaller ones. Otherwise, you’ll get a Sitemap file size error when submitting a sitemap to your Google Search Console.
Multiple sitemaps for different website sections
It is also a good practice for e-commerce websites to distribute website pages belonging to different categories across several sitemaps. For example, you can split product pages, category pages, blog posts, etc. This allows webmasters to notice if some types of pages have indexing issues. Also, as product pages are updated more often than others, with multiple sitemaps you'll only have to update one product sitemap instead of revamping the sitemap for the whole website.
Finally, using smaller sitemap files for different website sections is also beneficial from a technical standpoint. Even if your sitemap stays under the 50 MB and 50,000-URL limits, the more pages you list, the more unnecessary strain you put on your web server. This can lead to truncated responses or timeouts, and thus to crawling errors. By using smaller sitemaps, you can prevent such issues.
Video, image, and news sitemaps
In addition to sitemaps listing website URLs, Google allows creating custom sitemaps for your image and video content as well as news sitemaps.
The latter will obviously come in handy for news websites: since such websites handle time-sensitive content, it's crucial for Google to discover news articles as fast as possible. To make sure users get up-to-date information, Google only allows fresh articles in a news sitemap, meaning they should have been published within the last two days. Besides, you can include no more than 1,000 articles in a single news sitemap, but since you're supposed to remove older content from it anyway, that is not a big deal. The last, and probably the most important, condition is that your website should be registered with Google News.
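A news sitemap uses Google's news extension namespace. A minimal entry (publication name and URLs are placeholders) can look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://www.example.com/news/article.html</loc>
    <news:news>
      <news:publication>
        <news:name>Example Times</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:publication_date>2024-01-15T08:00:00+00:00</news:publication_date>
      <news:title>Example article title</news:title>
    </news:news>
  </url>
</urlset>
```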
Speaking of image and video sitemaps, they will be of great use for websites that heavily rely on media content, e.g. stock photos, libraries, or streaming platforms. An image sitemap increases your website's chances of getting featured in image search, and a video sitemap helps Google rank your video content.
Here you can provide Google with additional information on your media—for example, you can indicate the image title and caption. For a video, you can specify its length, rating, family-friendliness, and more.
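For example, an image entry uses the image extension namespace. Assuming the <urlset> tag declares xmlns:image="http://www.google.com/schemas/sitemap-image/1.1", a page with an embedded image (URLs are placeholders) can be described like this:

```xml
<url>
  <loc>https://www.example.com/gallery.html</loc>
  <image:image>
    <image:loc>https://www.example.com/photos/sunset.jpg</image:loc>
  </image:image>
</url>
```

One page can list multiple <image:image> blocks, one per image it embeds.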
Still, most websites don’t really need separate image and video sitemaps—to ensure Google spots your critical image and video content, you can simply add their URLs to your regular sitemaps.
Which pages to include in a sitemap?
This part of our site-mapping crash course is extremely important, so read carefully!
One common misconception is that to help Google crawl and index your site, you need to include all your website pages into a sitemap. In fact, it’s the other way around. Google doesn’t need to see all the garbage pages you may have on your site—you only need to tell it about high-quality juicy pages that you believe deserve ranking high. By including some pages into your sitemap, you ask Google to focus on them. It may or may not follow your advice, but that’s a different story.
As a rule of thumb, all the pages you add to a sitemap have to be 200 OK pages filled with high-quality content that serves the users. That means you should exclude all pages that do not meet these criteria while bearing in mind some exceptions.
4XX pages in XML sitemap
4XX response codes mean that the requested page does not exist or has restricted access, so in most cases, you don’t want to include such pages into your sitemap.
4xx that shouldn’t be on your sitemap
404s are deleted pages, so if such pages were removed on purpose, keep them away from your sitemap. The same goes for soft 404s: pages that were removed but still return a 200-level success status code. Normally, those are pages with little or no content, redirects to the homepage, or 404 pages blocked by robots.txt. Soft 404s are generally no good for your SEO, so spend some time fixing this issue.
Remember to create a custom 404 page to ensure a smooth user experience.
Another popular 4xx status code is 401—it means that Google is “not authorized” to access the page. Normally, such pages are intended for logged-in users, and you don’t want Google to index them. Therefore, you don’t need these pages in your sitemap.
4xx you may need in your sitemap
In some cases, you may actually want Google to crawl and index a 401 page. For instance, it happens that you password-protected a page under development and then forgot to lift the restrictions when the page went live. Also, sometimes webmasters restrict access to certain pages to protect them from bad bots or spammers. In such cases, you need to run a DNS lookup to verify that the web crawler accessing your page is really Googlebot.
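The usual verification is a reverse DNS lookup on the requesting IP, a domain check, and then a forward lookup to confirm the IP. A minimal Python sketch (the helper names are mine, not an official API):

```python
import socket

# Hostnames Google's crawlers resolve to (per Google's verification docs).
GOOGLEBOT_DOMAINS = (".googlebot.com", ".google.com")

def is_googlebot_hostname(hostname: str) -> bool:
    """Check whether a reverse-DNS hostname belongs to Google's crawl ranges."""
    return hostname.rstrip(".").endswith(GOOGLEBOT_DOMAINS)

def verify_googlebot(ip_address: str) -> bool:
    """Reverse DNS lookup, domain check, then forward-confirm the IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip_address)   # reverse lookup
        if not is_googlebot_hostname(hostname):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward confirm
        return ip_address in forward_ips
    except (socket.herror, socket.gaierror):
        return False
```

The forward confirmation matters: anyone can set a fake reverse DNS record ending in googlebot.com, but only Google controls the forward records for that domain.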
3XX pages in an XML sitemap
3xx are redirect pages, and you may or may not want them on your sitemap depending on the redirect type.
3xx that shouldn’t be on your sitemap
A 301 response code means that a page has been permanently redirected to a new address and the original page no longer exists. Therefore, such pages shouldn’t be in your XML sitemap. The only thing to remember in this case is to include the destination URL in the sitemap.
3xx you may need in your sitemap
302 pages are temporarily redirected pages. For example, such a redirect is often used for A/B testing—this is when some of the users are sent to a test URL. In this case, you want to keep the original page indexed, so obviously, it should stay in your sitemap. The test page, on the other hand, shouldn't get indexed because it will be a near-duplicate of the original page. So, you need to mark the original URL as canonical and keep the test URL away from your sitemap, just like all non-canonical pages.
5XX pages in XML sitemap
5XX status codes mean that there's a problem on your web server's end. The most common of the 5xx codes is the 503 Service Unavailable error, which says the server is temporarily down. It may occur because the web server is under maintenance or got overloaded.
If the error was spotted just once, there’s nothing for you to worry about as it was probably due to scheduled web server maintenance. If, on the other hand, the problem persists, you’ll have to figure out what’s causing it and fix the issue asap—otherwise, Google may conclude that your website is poorly maintained.
With 5xx pages, it's not really a question of adding them to your sitemap or not, but of fixing the issue to make sure the pages respond with 200 OK.
Non-indexable pages in an XML sitemap
Every website has a number of utility pages that are important for users, but not for search engines—login pages, pages available upon logging in, sorting and filtering pages, etc. A common practice is to block such pages with the robots.txt file, so that Google can't access them. Alternatively, one may let Google crawl the page but restrict its indexing with special directives (noindex or none).
Naturally, all these pages shouldn’t be on your sitemap. If a page cannot be indexed, but is featured on your sitemap, it not only confuses Google, but also wastes your crawl budget. The same goes for pages blocked by robots.txt—Google won’t be able to crawl them.
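Before adding a URL to your sitemap, you can check it against your robots.txt rules programmatically. A small sketch using Python's standard urllib.robotparser (the robots.txt content and URLs are hypothetical):

```python
from urllib import robotparser

# A hypothetical robots.txt blocking utility pages from all crawlers.
robots_txt = """
User-agent: *
Disallow: /login/
Disallow: /search/
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(robots_txt)

def allowed_for_crawlers(url: str) -> bool:
    """Return True if crawlers may fetch the URL; blocked URLs
    should be kept out of the sitemap."""
    return parser.can_fetch("*", url)
```

Running the same check across every URL in a generated sitemap is a cheap way to catch blocked pages before Google flags them.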
If you want Google to deindex a page, apply a noindex tag to it and make sure not to block the page in your robots.txt file: otherwise, Google won't be able to see the tag. You can also keep the page in your sitemap until it falls out of the index.
It is also a good idea to audit your site to make sure the noindex directive and robots.txt file are used properly, and you have not accidentally blocked the wrong pages.
You can easily find all such pages using SE Ranking’s Website Audit tool. After launching an audit, you’ll have to go to the Crawling section of the Issues Report.
By clicking on the number of respective pages, you’ll see a full list of URLs marked as noindex or blocked by robots.txt along with other important parameters such as the number of internal links pointing to the page, all the robots meta tags and x-robots tags, and page title. You’ll also immediately see whether the page is currently included in your sitemap.
If you’re not yet an SE Ranking user, you can test the platform out for free under the 14-day trial.
Non-canonical pages in an XML sitemap
A canonical tag is used to mark which of two or more similar pages is the main one. Google is supposed to index and rank the preferred page and ignore duplicate or near-duplicate ones. Similar pages marked with canonical tags can have totally different URLs or similar URLs (e.g. URLs generated in the process of sorting and filtering, URLs with UTM tags and tracking parameters).
Sometimes, canonical tags are also used when a page is accessible both through HTTP and HTTPS, as well as both with and without the www prefix. In this case, the main version of the page (for instance, the HTTPS non-www variation) is marked with rel=canonical. However, most websites prefer server-side redirects here, as they are more straightforward and guarantee that only the preferred website variation gets indexed.
Canonical tags work differently from redirects: they are more like recommendations than directives, and Google can follow or ignore them. This is why having non-canonical URLs in a sitemap is really confusing from the search engine's viewpoint. By not marking a page as canonical, you tell Google you don't want it indexed; at the same time, by adding the page to your sitemap, you encourage Google to index it. As a result, Google can index all the page versions, and you'll have to deal with keyword cannibalization. Or the search engine may choose to index the non-canonical URL, which is something you don't want either.
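A canonical tag lives in the page's <head>. For example, a filtered category URL can point at its clean version (URLs are placeholders):

```html
<!-- On https://www.example.com/shoes/?sort=price&utm_source=mail -->
<link rel="canonical" href="https://www.example.com/shoes/" />
```

In this setup, only the clean URL (https://www.example.com/shoes/) belongs in the sitemap; the parameterized variants stay out.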
Pages to keep away from your sitemap
Now, let’s quickly summarize, which pages you shouldn’t include in your sitemap:
- deleted 404 and soft 404 pages, password-protected 401 pages
- permanently redirected 301 pages
- non-indexable and non-canonical pages: login pages, pagination pages, filtering and sorting pages, site search result pages, parameter- or session-ID-based URLs, etc.
By only including high-quality pages in your sitemap, you increase your overall site quality in the eyes of Google, which should positively impact your rankings.
Sitemap XML tags and their settings
I’ve already mentioned that along with website URLs, a sitemap features various tags that specify page parameters. Here’s an excerpt from our blog’s XML sitemap for you to see how the tags are organized.
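A representative entry (the values are placeholders, not taken from our actual sitemap) looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/blog/post/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```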
Now, let’s go through every one of them for you to understand which purpose the tags serve and how to use them properly.
- <urlset> is a mandatory element that encapsulates a sitemap and indicates which version of the XML Sitemap protocol standard is used (via the xmlns attribute). The protocol standard is also called a namespace
- <url> is another mandatory element that marks a block with all information pertaining to a single URL
- <loc> is the last mandatory element that indicates the page URL itself. All listed URLs should be fully-qualified—contain site protocol (HTTP or HTTPS) as well as the www prefix if it is included in your domain name
- <changefreq> defines how often a page can change. You can set this parameter as always, hourly, daily, weekly, monthly, yearly or never
- <priority> sets the crawling priority of a page (valid values range from 0.0 to 1.0)
- <lastmod> indicates the last time the page content was updated
Make sure that you use the same syntax when specifying a URL. Also, sitemap files should be UTF-8 encoded.
In the past, Google consulted the <changefreq> and <priority> tags to decide which pages should be prioritized during website crawling, so webmasters had to carefully set these parameters for each page. This hasn't been the case for years now. John Mueller and Gary Illyes have confirmed that Google now ignores the <priority> tag, with Gary Illyes calling it "a bag of noise".
The same goes for the <changefreq> tag: John Mueller confirmed it is not taken into account either.
The only optional tag that still counts is <lastmod>—Google may consult it if the tag is precise enough. By precise, Gary Illyes probably means the tag should only be updated when significant changes are made to the content. Updating the tag to fool Google into thinking your content is fresh won't cut it.
When used properly, <lastmod> helps Google understand when the content was last updated and whether it needs to be recrawled. Besides, the tag helps the search engine figure out who the original publisher was.
Static vs dynamic sitemaps
By now you may be wondering how much time you’ll have to spend updating your sitemap every time you publish a new page or revamp an existing one. Keeping your sitemap up-to-date can really be a daunting task if you choose to create a static sitemap. The good news is that you can easily avoid all the hassle by creating a dynamic sitemap instead.
This kind of sitemap updates automatically the moment you make any changes to your website. For example, whenever you delete a page and it starts returning a 404, the page is removed from the sitemap. If you mark a page as noindex or block it in the robots.txt file, it is also deleted from the sitemap. On the other hand, whenever you create a new page and mark it as canonical, it is immediately added to your sitemap. It really is that easy—all you need is to properly set up the tool that will be generating your dynamic sitemaps.
How to create an XML sitemap?
Finally, we’ve come to the most practical part of our post—let’s figure out how you can actually generate a sitemap.
The easiest way would be to have a sitemap generated by your CMS. Since a CMS contains information about all the website pages and all the adjustments you make, it can feed all the data into a dynamic sitemap.
Some CMSs have sitemap-generating capabilities from the get-go—this is the case for Magento, Shopify, Wix, and Squarespace. With other popular CMSs such as WordPress, Joomla, Drupal, or OpenCart, you'll have to use special plugins.
In the table above, I've listed some popular sitemap plugins for different CMSs. When picking one for your site, pay attention to the plugin's array of features: SEO-friendly solutions will let you exclude 404, redirected, noindex, canonicalized, and other inappropriate pages from your sitemap. Besides, mind customization capabilities: you want to be able to easily amend the list of pages included in the sitemap.
If your website is not CMS-based, you can use one of the dedicated sitemap generator tools. There are plenty of both free and paid options available in the market, so again, make sure to carefully study the tool’s capabilities. The thing is that while you should be able to generate a customizable dynamic sitemap with one of the paid generators, most free solutions are too basic and lack some crucial features. So, you may end up with a static sitemap that features all pages of your site including canonicalized, noindex, and redirect URLs.
Therefore, if using one of the paid solutions and CMS features is not an option, I advise you to generate a sitemap using SE Ranking's Website Audit tool. In just a few minutes, the tool will create a static sitemap for you based on the latest website crawl. By default, the tool only includes 200 OK pages in the sitemap while omitting 4xx, 3xx, noindex, and other pages that you normally want to leave out.
While generating a sitemap with Website Audit, mind your crawl limits—the number of pages on your website should not exceed your set crawl limit. Otherwise, some important pages may not be included in your sitemap.
Finally, if none of the ready-made solutions work for your website, you can create a custom sitemap. Surely, this requires some coding skills, so you’ll need a developer who will craft a proper dynamic sitemap for you.
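At its core, a custom dynamic sitemap generator queries your page records and serializes the eligible ones. A simplified Python sketch (the Page structure and eligibility rules are assumptions about how such a system might be organized):

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape

@dataclass
class Page:
    url: str
    status_code: int    # HTTP status the page currently returns
    noindex: bool       # page carries a noindex directive
    canonical_url: str  # canonical URL the page points to
    last_modified: str  # ISO date of the last content change

def is_sitemap_worthy(page: Page) -> bool:
    """Only 200 OK, indexable, canonical pages belong in a sitemap."""
    return (
        page.status_code == 200
        and not page.noindex
        and page.canonical_url == page.url
    )

def build_sitemap(pages: list[Page]) -> str:
    """Serialize the eligible pages into sitemap XML."""
    entries = []
    for page in pages:
        if not is_sitemap_worthy(page):
            continue
        entries.append(
            "  <url>\n"
            f"    <loc>{escape(page.url)}</loc>\n"
            f"    <lastmod>{page.last_modified}</lastmod>\n"
            "  </url>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(entries)
        + "\n</urlset>\n"
    )
```

Regenerating this output on every page change (or on a schedule) is what makes the sitemap "dynamic"; the filtering step is what keeps redirected, noindex, and non-canonical pages out.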
How to submit your sitemap to Google
Depending on the tool you used to generate your sitemap, you may need to manually add it to your site, or it may already be there—this would be the case for dynamic sitemaps generated by a CMS. Normally, whenever a sitemap is automatically added to a website, it is located at yoursite.com/sitemap.xml.
If you’ve used one of the tools that generated an XML sitemap file, you’ll have to manually upload it to your website’s root folder. You can do this using your cPanel or via an FTP client, for example, Total Commander or FileZilla. A good idea would be to check the sitemap’s validity before uploading, especially if the file was created manually—use one of the free tools like this one to make sure your sitemap is operating as you intend.
It is also a good practice to add a reference to your sitemap to your robots.txt file, which you can find in the root directory of your web server.
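The reference is a single Sitemap line with an absolute URL. A robots.txt might end like this (the paths are illustrative):

```
User-agent: *
Disallow: /login/

Sitemap: https://www.example.com/sitemap.xml
```

The Sitemap directive is independent of any User-agent block and can appear anywhere in the file.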
Once a valid sitemap is on your website, you can submit it to Google Search Console (GSC). To do so, go to the Sitemaps report, click Add a new sitemap, enter your sitemap URL, and hit the Submit button.
Soon, you’ll see if Google was able to properly process your sitemap in the Status column—if everything went well, the status will be Success. If a sitemap was parsed, but Google detected some errors, the status would be Has errors. Finally, if Google fails to crawl your sitemap, you’ll end up with the Couldn’t fetch status.
Make sure to fix all the sitemap errors so that your sitemap status is Success. I’ve compiled a separate guide featuring common sitemap errors to help you out with this matter.
In the same table of your Sitemap report, you’ll see the number of discovered URLs—ideally, it should equal the total number of URLs added to your sitemap.
Finally, by clicking the icon next to the number of discovered URLs you’ll get to the Index Coverage report that will help you better understand how Google crawls and indexes your site. Studying the report will help you remove some low-quality pages from your sitemap and add pages you might have missed. To learn how to do this, consult our sitemap polishing guide.
Submitting several sitemaps to Google
If you decided that you need several sitemap files, you can still submit them all to Google at once. For this purpose, you’ll have to list all your sitemaps in a single file.
The file is called a sitemap index, and it helps Google easily find all your sitemaps. You can list up to 50,000 sitemaps in a single index file, and the other requirement is file size—as you might have guessed, it should not exceed 50 MB.
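A sitemap index uses the same protocol namespace as a regular sitemap, with <sitemapindex> and <sitemap> elements instead of <urlset> and <url>. For example (URLs are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-products.xml</loc>
    <lastmod>2024-01-15</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-blog.xml</loc>
    <lastmod>2024-01-10</lastmod>
  </sitemap>
</sitemapindex>
```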
Once you have your file, submit it to Google the same way you would submit a regular sitemap file.
Congrats! You’ve finished our crash course on SEO site-mapping. Follow site-mapping best practices, and you’ll have no problems with Google finding and crawling all the quality pages of your site. And if you want to get the most out of your sitemap, take a look at this guide on polishing your sitemap.