How to Find (and Fix) Duplicate Content Using the Index Coverage Report

Not many are aware of it, but the index coverage status report reveals how Google perceives your website’s content. It can provide some great hints; all you need to do is read between the lines.

However, sometimes it can be an overwhelming experience.

How can you cut through the data clutter and focus only on the things that matter?

SEOs obsess too much over what John Mueller, Gary Illyes, or other Googlers and ex-Googlers are saying, hoping they will accidentally reveal the secret formula to rank well – isn’t that so?

Without underestimating their huge contribution to the SEO and webmaster community, going over Google Search Console, and especially the index coverage report, can reveal what Google really thinks about your website (in detail) – straight from the horse’s mouth.

If you want to amplify your understanding of how to find duplicate content – keep reading.

The old crawl errors report

The old crawl errors report in Search Console was divided into two main sections:

  • Site Errors
  • URL Errors
Old Google Search Console Site Errors
Old Google Search Console URL errors

While it gave us a general idea of the errors across the site, the URLs displayed there were too often outdated and lacked a lot of context.

For example, it wasn’t uncommon to see thousands of 404 URLs that existed only because spammy websites were linking to them with typos – links that we could obviously just ignore (as long as they’re indeed 404s).

However, the new index coverage report, which was first spotted in beta back in September 2017, is a whole different ball game.

Let me show you why and how.

But first, what does Google identify and consider as duplicate content?

According to Google’s documentation (see the irony here?):

“Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin.”

Google makes it clear that they are fully aware that most duplicate content isn’t deceptive, meaning it was created without any intention to deceive. However, that still doesn’t mean you don’t need to audit it thoroughly and manage it properly.

Does duplicate content affect SEO?

The short answer is YES, duplicate content can affect the SEO of your website.

We need to distinguish between two types of duplicate content:

  • Internal duplicate content – in this case, the duplicate content can be found on the same site.

According to Google, on-site duplicate content (or, more generally, low-value-add URLs) can negatively affect a site’s crawling and indexing – in other words, its crawl budget.

  • External duplicate content – in this case, the duplicate content can be found on different external websites (that we don’t own). At scale, duplicate content that is taken from other websites can affect how Google evaluates your domain (I encourage you to read in detail about how Google’s core updates work).

In this post, we will focus on how to identify internal duplicate content and thin content.

How to find (and fix) duplicate content using the index coverage report

I will go over the most important and common statuses found in the index coverage report, what you can conclude from the data, and how to resolve the issues, step by step.

1. Indexed, not submitted in sitemap

If your site contains a sitemap.xml (which it should), one of the first and most important checks I conduct is going over the Valid tab and checking for “Indexed, not submitted in sitemap”:

valid tab in Google Search Console (submitted and indexed; indexed, not submitted in sitemap)

First, you should ask yourself, “Why aren’t these pages in the sitemap in the first place?”

It’s worth mentioning that if your sitemap.xml is not dynamic (meaning new pages aren’t added to it on the fly), it is completely common to see some pages left out.

However, if the number of URLs under “Indexed, not submitted in sitemap” reaches the hundreds or thousands (or more), you must audit these URLs. You are bound to find many types of duplicate and thin content among them.

Follow these steps to check all the URLs on your list:

Step #1 – Go over the report and export it to Excel:

export Indexed, not submitted in sitemap

Step #2 – Sort the URLs by name and see if any of the following patterns appear (examples follow this list):

  • Are there any URLs containing folders that you are not familiar with?
  • Are there any URLs containing parameters that you are not familiar with?
  • Are there any URLs containing a suffix you’re not familiar with? (for example – index.html)
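
To make this concrete, here are a few hypothetical URL patterns (the folders, parameters and suffixes are made up for illustration) that would each deserve a closer look:

    https://example.com/red-shoes?color=red        (parameter version of the main product page)
    https://example.com/print/red-shoes            (unfamiliar folder, e.g. a print version)
    https://example.com/red-shoes/index.html       (suffix duplicating https://example.com/red-shoes/)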

Step #3 – Decide what to do – keep or drop?

  • For every parameter or folder, check the pages: are they unique or empty? What is their purpose? Do they serve users in any way?
  • For every folder, parameter or suffix, check whether it gets organic traffic in Google Analytics, and check for clicks and impressions in Google Search Console.
  • Check if any of these pages have any external links.

Step #4 –

What should you do with each URL? (Of course, this is a simplification – perform this step with extra caution and care; see the sketch after this list.)

  • If it doesn’t have any links (external or internal) or traffic – you can block it with the robots.txt file or remove it from the index using a meta robots noindex tag.
  • If it’s a duplicate (identical or near-identical) but has some link equity (external or internal), make sure to 301 redirect it or deploy a canonical tag pointing to its original page.
  • If you find the pages to be valid (meaning they should be indexed), add them to the sitemap.xml.
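
As a minimal sketch (the URLs are hypothetical), the two in-page options from the list above look like this: a meta robots noindex tag on a no-value page you want dropped from the index, and a canonical tag on a duplicate pointing to its original version:

    <!-- On a no-value page you want removed from the index -->
    <meta name="robots" content="noindex">

    <!-- On a duplicate page, pointing to the original version -->
    <link rel="canonical" href="https://example.com/original-page/">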

Bear in mind that if your site doesn’t have a sitemap.xml, all of its indexed pages will appear under “Indexed, not submitted in sitemap”. If that’s the case, I encourage you to create one.
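
If you do need to create one, a minimal sitemap.xml (with a hypothetical URL) following the standard sitemaps.org protocol looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/valid-page/</loc>
      </url>
    </urlset>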

2. Crawled – currently not indexed

According to John Mueller, this can happen either because too many URLs are accidentally auto-generated or because of a poor internal linking structure.

In his own words:

“It could also be a matter of the content on your website maybe not being seen as absolutely critical for our search results. So if you’re auto generating content, if you’re taking content from a database and just putting it all online, then that might be something where we look at that and say well there’s a lot of content here but the pages are very similar or they’re very similar to other things that we already have indexed, it’s probably not worthwhile to kind of jump in and pick all of these pages up and put them into your search results.”

How to handle those URLs?

In this case, it is likely that you wouldn’t want these pages either indexed or crawled by Google in the first place.

The most common practice is to block them completely using the robots.txt file. This saves crawl budget, unlike a meta robots noindex tag, which still requires Google to crawl the pages.
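
As a sketch, blocking a hypothetical auto-generated folder and a session parameter in robots.txt could look like this (the paths are made up – adjust them to the patterns you actually find in the report):

    User-agent: *
    Disallow: /auto-generated/
    Disallow: /*?sessionid=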

3. Duplicate, Google chose different canonical than user

Google thinks another URL makes a better canonical and encourages you to change yours accordingly.

From my experience, sometimes Google’s canonical of choice won’t be the most relevant, so don’t rely blindly on the algorithm.

Inspect the given URL with the URL Inspection tool – another amazing addition that shows in detail how Google evaluates your URL.

The downside?

Although some URLs will be categorized as “Duplicate, Google chose different canonical than user”, many times the report won’t display the Google-selected version, for example:

Google-selected canonical in Search Console

So what should you do?

If Google’s selected canonical is different, it likely has to do with mixed signals. Check the following:

  • Sitemap.xml
  • Internal links
  • Canonical tags

For example, if the canonical tag on URL A points to URL B, while most of the internal links and the sitemap.xml point to URL A, things can get messy and confusing.

Make sure to keep consistency!
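
To illustrate (with hypothetical URLs), if URL A is the version you want indexed, every signal should point to it: the canonical tag on the duplicate, the sitemap.xml entry, and your internal links.

    <!-- On the duplicate (URL B): canonical points to URL A -->
    <link rel="canonical" href="https://example.com/url-a/">

    <!-- In sitemap.xml: only URL A is listed -->
    <loc>https://example.com/url-a/</loc>

    <!-- In internal links: always link to URL A -->
    <a href="https://example.com/url-a/">Anchor text</a>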

4. Duplicate without user-selected canonical

Use the URL Inspection tool to see the Google-selected canonical and explicitly mark the desired canonical version yourself.

Similar to #3, this is duplicate content, but in this case no canonical has been declared for it yet.

What should you do?

Deploying a canonical tag or using a 301 redirect to the relevant URL is the most common practice.
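
As a sketch, assuming an Apache server with mod_alias enabled (the paths are hypothetical), a 301 redirect from the duplicate to the original could be added in .htaccess:

    Redirect 301 /duplicate-page/ https://example.com/original-page/

If a redirect isn’t an option (for example, when the duplicate still needs to exist for users), the canonical tag shown earlier sends a similar, though weaker, hint.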

Bonus – using the ‘URL parameters’ tool

This is an old feature in Google Search Console, which you can still find under “legacy tools and reports”:

URL parameters in Old Google Search Console

Step #1 –

Go over the main parameters (sorted by the number of affected URLs) and conduct a site: search to see how many indexed URLs contain each parameter.

For example:

site:tldrseo.com inurl:”color”

Note how many indexed pages you find under each parameter.

Step #2 –

Check whether the parameters have any importance, whether they change the content, and whether they get any organic traffic or links – and, bottom line, whether you should keep them.

For example, pagination parameters are highly common, and many times we will want to keep them.

Step #3 –

Decide what to do with the indexed parameters – canonicalize them to point to the main version, block them, or redirect them.

You can also ask Google to ignore them and choose the representative URL:

Does this parameter change page content seen by the user?

TL;DR

These are the most important blind spots to watch out for when it comes to thin/duplicate content in the index coverage report:

  • Indexed, not submitted in sitemap
  • Crawled – currently not indexed
  • Duplicate, Google chose different canonical than user
  • Duplicate without user-selected canonical

Why should you do it in the first place?

  • Improve how Google assesses the general quality of your site’s content
  • Optimize the crawl budget

Now it’s your turn – how often do you use these features? Did you find them useful? I’m waiting for your comments!

Posted by Roey Skif

4 comments

  1. Will Gary

    Thanks for the great read. I am running into an issue on one of my roughly 50 sites, and no changes as of yet seem to resolve it. When I run the sitemap, every unique URL ends with a trailing /. However, when I look at the actual page on the admin side, there is no trailing / in the permalink (using the permalink manager plus plugin). Additionally, the SEO plugin we are using (Rank Math) is auto-populating the canonical URL field to match the sitemap URL, including a trailing /. What this means for us is that a large portion of the site has pages (the ones ending with a trailing /) that show in the console as “not on Google” with the message “duplicate, selected url is not defined as canonical”, identifying the same URL but without a trailing /. If I then inspect the URL without the trailing /, it will show as “On Google” but with the “indexed but not in sitemap” message also being shown.

    The permalink plugin will not allow a trailing slash to be manually added so that it will match the URL from the sitemap. So, we have manually gone into each page from the sitemap and ensured that the canonical URL matches the URL which appears in the permalink field, yet the sitemap still adds a trailing / to the URL, resulting in no change actually occurring in terms of the Search Console results.

    Can you please explain how this is impacting our site’s performance in the eyes of Google (does it see us as having duplicate content? Are we being negatively impacted?). Additionally, any known fixes would be HUGELY appreciated.

    Thanks again for the great content and I’ll keep my fingers crossed that you know of a solution haha!

    1. Roey Skif

      Hi Will,
      Thanks for the great feedback
      That’s a good question

      I wouldn’t be worried about it – just ensure that you have a self-referential canonical on all the URLs without the / (301 redirects will work as well)
      Later, make sure to update the sitemap.xml

      Google by default will probably avoid indexing these pages – but to send the right signal and not rely solely on Google, the canonical will do the job in this case

      1. Will Gary

        Roey,

        Thanks for getting back to me. It sounds like, in your eyes, the corrective action we have taken to make the canonical URL match the page’s URL (omitting the trailing slash) was the appropriate action to take.

        I’m happy to report that of our 2 most valuable keywords for this site, 2 days after corrective action was taken, we saw a jump from the bottom of page 1, top of page 2 range to positions 5 & 6.

        The sitemap is still adding a trailing / to all URLs and we are working to correct that as well; hopefully, once resolved, we will see further gains.

        Thanks again for your feedback, while it only validated our corrective actions, it was great to get your input. Looking forward to more great, informative content from you!

        1. Roey Skif

          Glad it helped, you’re welcome Will! 🙂
