Few people are aware of it, but the job of the index coverage status report is to show how Google perceives your website’s content. This report can provide us with some great hints; all you need to do is read between the lines.
However, sometimes it can be an overwhelming experience.
How can you cut through the data clutter and focus only on the things that matter?
SEOs are overly obsessed with what John Mueller, Gary Illyes, or other Googlers and ex-Googlers are saying, in the hope that they will accidentally reveal the secret formula to rank well – isn’t that so?
Without underestimating their huge help to the SEO and webmaster community, going over Google Search Console – especially the index coverage report – can reveal in detail what Google really thinks about your website, straight from the horse’s mouth.
If you want to amplify your understanding of how to find duplicate content – keep reading.
The old crawl errors report
The old crawl errors on the Search Console was divided into two main sections:
- Site Errors
- URL Errors
While it gave us a general idea of the errors across the site, the URLs displayed there were too often outdated and lacked a lot of context.
For example, it wasn’t uncommon to see thousands of 404 URLs that existed only because spammy websites were linking to them with typos – links we could safely ignore (as long as they were indeed 404s).
However, the new index coverage report, first spotted in beta back in September 2017, is a whole different ball game.
Let me show you why and how.
But first, what does Google identify and consider as duplicate content?
According to Google documentation (see the irony here?):
“Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin.”
Google makes it clear that they are fully aware that most duplicate content isn’t deceptive, meaning it was created without bad intent. However, that still doesn’t mean you don’t need to audit it thoroughly and manage it properly.
Does duplicate content affect SEO?
The short answer is YES, duplicate content can affect the SEO of your website.
We need to distinguish between two types of duplicate content:
- Internal duplicate content – in this case, the duplicate content can be found on the same site.
According to Google, on-site duplicate content (or in general – low-value-add URLs) can negatively affect a site’s crawling and indexing, and in other words – the site’s crawl budget.
- External duplicate content – in this case, the duplicate content can be found on different external websites (that we don’t own). At scale, duplicate content that is taken from other websites can affect how Google evaluates your domain (I encourage you to read in detail about how Google’s core updates work).
In this post, we will focus on how to identify internal duplicate content and thin content.
How to find (and fix) duplicate content using the index coverage report
I will go over the most important and common types found within the index coverage, what you can conclude from the data you have, and how to resolve the issues, step by step.
1. Indexed, not submitted in sitemap
If your site contains a sitemap.xml (which it should), one of the first and most important checks I conduct is going over the Valid tab and checking for “Indexed, not submitted in sitemap”:
First, you should ask yourself, “Why aren’t these pages in the sitemap in the first place?”
It’s worth mentioning here that if your sitemap.xml is not dynamic (meaning new pages aren’t added to it on the fly), it is quite common to see some pages left out.
However, if the number of “Indexed, not submitted in sitemap” URLs reaches the hundreds or thousands (or more), you must audit them. You are bound to find many types of duplicate and thin content among them.
Follow these steps to check all the URLs within your list:
Step #1 – Go over the report and export it to Excel:
Step #2 – Align the URLs by name, and see if any of the following patterns appear:
- Are there any URLs containing folders that you are not familiar with?
- Are there any URLs containing parameters that you are not familiar with?
- Are there any URLs containing suffix you’re not familiar with? (for example – index.html)
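If the export is large, a short script can surface these patterns faster than scanning by hand. Here is a minimal sketch in Python, assuming the exported URLs have been loaded into a plain list (the sample URLs below are hypothetical):

```python
# Summarize exported URLs by top-level folder, query parameter, and suffix,
# so unfamiliar patterns stand out at a glance.
from collections import Counter
from urllib.parse import urlparse, parse_qs

def summarize_urls(urls):
    """Count top-level folders, query parameter names, and file suffixes."""
    folders, params, suffixes = Counter(), Counter(), Counter()
    for url in urls:
        parsed = urlparse(url)
        parts = [p for p in parsed.path.split("/") if p]
        if parts:
            folders[parts[0]] += 1                      # first path segment
            if "." in parts[-1]:
                suffixes[parts[-1].rsplit(".", 1)[1]] += 1
        for name in parse_qs(parsed.query):
            params[name] += 1                           # each parameter name
    return folders, params, suffixes

# Hypothetical URLs standing in for the exported report:
sample = [
    "https://example.com/blog/post-1",
    "https://example.com/blog/post-1?replytocom=42",
    "https://example.com/tmp/index.html",
]
folders, params, suffixes = summarize_urls(sample)
print(folders.most_common())  # → [('blog', 2), ('tmp', 1)]
```

Sorting the counters by frequency quickly shows which folders, parameters, or suffixes account for the bulk of the unexpected URLs.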
Step #3 – Decide what to do – keep or drop?
- For every parameter or folder, check the pages: are they unique or empty? What is their purpose? Do they serve users in any way?
- For every folder, parameter, or suffix, check for organic traffic in Google Analytics, and check for clicks and impressions in Google Search Console.
- Check whether any of these pages have external links.
Step #4 – Decide what to do with each URL (of course, this is a simplification – you should perform this step with extra caution and care):
- If it doesn’t have any links (external or internal) or traffic – you can block it with the robots.txt file or remove it from the index using a meta robots noindex tag.
- If it’s a duplication (identical or near-identical) but has some link equity (external or internal), make sure to 301 redirect it or deploy a canonical tag pointing to its original page.
- If you find the pages valid (meaning they should be indexed), add them to the sitemap.xml.
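To make these options concrete, here are illustrative snippets (the folder and URLs are hypothetical – adapt them to your own site before deploying):

```text
# robots.txt – block a low-value folder from being crawled:
User-agent: *
Disallow: /low-value-folder/

<!-- In the <head> of a page you want removed from the index: -->
<meta name="robots" content="noindex">

<!-- In the <head> of a duplicate, pointing to its original page: -->
<link rel="canonical" href="https://example.com/original-page/">
```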
Bear in mind that if your site doesn’t have a sitemap.xml, all the indexed pages will show up as “Indexed, not submitted in sitemap”. If you don’t have a sitemap.xml, I encourage you to create one.
2. Crawled – currently not indexed
According to John Mueller, this can happen either because the site accidentally auto-generates too many URLs or because of a poor internal linking structure.
In his own words:
“It could also be a matter of the content on your website maybe not being seen as absolutely critical for our search results. So if you’re auto generating content, if you’re taking content from a database and just putting it all online, then that might be something where we look at that and say well there’s a lot of content here but the pages are very similar or they’re very similar to other things that we already have indexed, it’s probably not worthwhile to kind of jump in and pick all of these pages up and put them into your search results.”
How to handle those URLs?
In this case, it is likely that you wouldn’t want these pages either indexed or crawled by Google in the first place.
The most common practice is blocking them completely using the robots.txt file. This saves crawl budget, unlike a meta robots noindex tag, which still requires Google to crawl these pages.
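For example, a hypothetical robots.txt rule blocking an auto-generated section and a session parameter might look like this (test such patterns carefully before deploying – an overly broad rule can block valid pages):

```text
User-agent: *
Disallow: /auto-generated/
Disallow: /*?sessionid=
```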
3. Duplicate, Google chose different canonical than user
Google thinks another URL makes a better canonical and encourages you to change yours accordingly.
In my experience, Google’s canonical of choice won’t always be the most relevant one, so don’t rely blindly on the algorithm.
Inspect the given URL with the URL inspection tool – another amazing addition – which shows in detail how Google evaluates your URL.
Although some URLs will be categorized as “Duplicate, Google chose different canonical than user”, many times the report won’t display the Google-selected version, for example:
So what should you do?
If Google’s selected canonical is different, it is likely due to mixed signals. Check the following:
- Internal links
- Canonical tags
For example, if the canonical tag on URL A points to URL B, but most of the internal links and the sitemap.xml point to URL A, things can get messy and confusing.
Make sure to keep consistency!
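If you already have crawl data, this consistency check can be automated. Here is a minimal Python sketch, assuming each page record holds its canonical target and its internal link targets (the URLs and data structure are hypothetical):

```python
# Flag URLs that canonicalize elsewhere while still receiving internal
# links or appearing in the sitemap – i.e. pages sending mixed signals.

def find_mixed_signals(pages, sitemap_urls):
    """Return URLs whose canonical tag disagrees with internal links or the sitemap."""
    issues = []
    # Count how often each URL is the target of an internal link.
    link_counts = {}
    for page in pages:
        for target in page["internal_links"]:
            link_counts[target] = link_counts.get(target, 0) + 1
    for page in pages:
        url, canonical = page["url"], page["canonical"]
        if canonical != url:
            # The page canonicalizes elsewhere, yet it is still linked
            # internally or listed in the sitemap: mixed signals.
            if link_counts.get(url, 0) > 0 or url in sitemap_urls:
                issues.append(url)
    return issues

pages = [
    {"url": "/a", "canonical": "/b", "internal_links": ["/a"]},  # canonical says /b, links say /a
    {"url": "/b", "canonical": "/b", "internal_links": []},
]
print(find_mixed_signals(pages, sitemap_urls={"/a"}))  # → ['/a']
```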
4. Duplicate without user-selected canonical
Similar to #3, this is duplicate content per se, but we haven’t handled it yet. Use the URL inspection tool to see the Google-selected canonical URL, then explicitly mark the desired canonical version.
What should you do?
Deploying a canonical tag or using 301 redirects to the relevant URLs are the most common practices.
Bonus – using the ‘URL parameters’ tool
This is an old feature in Google Search Console, which you can still find under “legacy tools and reports”:
Step #1 – Go over the main parameters (listed by affected URLs). Conduct a site: search for each parameter and note how many indexed pages you find under it.
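For instance, combining the site: and inurl: operators gives a rough count of indexed URLs carrying a given parameter (the domain and parameter name here are hypothetical):

```text
site:example.com inurl:replytocom
```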
Step #2 – Check whether the parameters have any importance: do they change the content? Do they have any organic traffic or links? And the bottom line – should you keep them?
For example, pagination parameters are highly common, and many times we would want to keep them.
Step #3 – Decide what to do with the indexed parameters: canonicalize them to the right main version, block them, or redirect them.
You can also ask Google to ignore them and choose the representative URL:
These are the most important blind spots you should watch out for when it comes to thin/duplicate content in the index coverage report:
- Indexed, not submitted in sitemap
- Crawled – currently not indexed
- Duplicate, Google chose different canonical than user
- Duplicate without user-selected canonical
Why should you do it in the first place?
- Improve how Google assesses the general quality of your site’s content
- Optimize the crawl budget
Now it’s your turn – how often do you use these features? Did you find them useful? I’m waiting for your comments!