TL;DR
I conducted this research as a follow-up to the Stone Temple case study from 2015. Three years is a very long time in SEO years, and yes, in 2019 it still works like a charm:
Let’s get into more details.
Why doesn’t Google recommend adding noindex in robots.txt?
First, it doesn’t appear in any official Google documentation. Google recommends blocking indexation only by using the meta robots directive or the X-Robots-Tag HTTP header.
Second, Google’s very own John Mueller has mentioned more than once that he would really avoid using it.
https://twitter.com/JohnMu/status/638644112359604224?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E638644112359604224&ref_url=https%3A%2F%2Fwww.seroundtable.com%2Fgoogle-do-not-use-noindex-in-robots-txt-20873.html
On the other hand, it does not go against Google’s webmaster guidelines, or at least, it has never been stated so.
My best guess is that this directive simply might not work 100% of the time, and if Google themselves cannot vouch for such an important directive (noindex), they won’t recommend it.
That’s it.
Also, here’s Gary Illyes’ explanation:
Technically, robots.txt is for crawling. The meta tags are for indexing. During indexing they'd be applied at the same stage so there's no good reason to have both of them
— Gary "鯨理" Illyes (@methode) April 18, 2019
None of this should prevent you from experimenting with it in some instances.
Also, Gary hinted that Google might change how noindex in robots.txt works. For now, it’s working, but I promise to keep you updated.
It is also important to note that Google is the only search engine that supports this directive (even if not officially).
Why did I decide to carry out this experiment?
Truth be told, I was short on time. That’s my excuse.
Let’s get a bit more into the details:
A local SEO client of mine had thousands of indexed pages targeting every variation of every possible city in the US.
For example, if it were a plumber, it would have looked something like this: plumber in NYC, plumber in Chicago, plumber in Washington D.C., and so on. You get the idea.
The thing was that, besides my client’s three genuine brick-and-mortar local branches, all the other cities were just duplicated landing pages built to target every given local keyword. Needless to say, it didn’t work so well.
So after doing my research, I came up with a final list of around 3,000 redundant pages that needed to be noindexed. Another challenge was that the entire website had a flat URL structure with no hierarchy. Something like this:
- example.com/plumber-in-nyc/
- example.com/plumber-in-chicago/
- example.com/plumber-in-austin/
- example.com/plumber-in-cincinnati/
etc.
It would have been much easier if all the local pages were under the same directory, something like this:
- example.com/cities/nyc/
- example.com/cities/chicago/
- example.com/cities/austin/
- example.com/cities/cincinnati/
If this were the case, I would have advised disallowing the entire /cities/ folder in the robots.txt file and allowing only the few relevant listings that we would like to keep (the easiest and fastest solution).
But, unfortunately, this wasn’t the case.
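For illustration only, here is a minimal sketch of what that folder-based approach could have looked like had the /cities/ structure existed. The helper name build_robots_txt and the three kept city paths are hypothetical (the post does not say which cities the real branches were in). The Allow lines are written before the Disallow line so that parsers using a first-match rule would also honor them.

```python
def build_robots_txt(keep_paths, blocked_folder="/cities/"):
    """Return robots.txt text that blocks one folder but allows a few paths inside it."""
    lines = ["User-agent: *"]
    # Allow rules go first so that first-match parsers honor them as well.
    for path in keep_paths:
        lines.append(f"Allow: {path}")
    # Everything else under the folder is blocked from crawling.
    lines.append(f"Disallow: {blocked_folder}")
    return "\n".join(lines) + "\n"


# Hypothetical paths for the three pages we would want to keep crawlable.
robots_txt = build_robots_txt(["/cities/nyc/", "/cities/chicago/", "/cities/austin/"])
print(robots_txt)
```

Keep in mind that this only controls crawling. As discussed below, Disallow on its own does not remove URLs from the index.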
I had to think outside of the box, and quickly. Why? Because I had less than 24 hours before going on vacation, and de-indexing thousands of pages can take quite some time, at least a few weeks.
Also, bear in mind that this specific client was very limited in its development resources, so finding a developer who could solve this on 24 hours’ notice was like asking a giraffe to grow wings.
So what were my main options?
- Asking the developers to go through every page, one by one, and update the meta robots directive to noindex,nofollow in WordPress. Out of the question, and not realistic.
- Sending the developers a long list of .htaccess X-Robots-Tag rules to implement noindex,nofollow via the HTTP header. The downside, besides the fact that a few thousand lines in the .htaccess file can affect the site’s performance, is that this task is highly sensitive and every small mistake might bring the whole website down. That’s something I can’t allow while I’m away.
- Disallow vs. noindex in robots.txt. First, it’s important to emphasize the difference between the two:
  When disallowing via robots.txt, the URL will not necessarily be removed from the index. It means that Google will simply stop crawling the URL and won’t waste any more crawl budget on it.
  Noindex, on the other hand, means that Google will entirely remove the URL from the index, but it can still be crawled.
  If a specific URL has both a disallow rule and a noindex rule (whether via meta robots or a robots.txt file), Googlebot won’t be able to see the noindex, because it is not allowed to crawl the page!
  So, to sum up, I chose to go with the unconventional approach of noindex via robots.txt for the following reasons:
- I was short on time and had nothing to lose. Worst-case scenario, nothing would happen.
- I was very curious to run this experiment 🙂
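To make the crawling side of the Disallow vs. noindex distinction concrete, here is a small sketch using Python’s standard-library robots.txt parser. It is not Googlebot and says nothing about how Google indexes pages; it only shows that a Disallow rule blocks fetching, while a standard parser that does not know the unofficial Noindex line simply ignores it and still allows the crawl. The helper name can_crawl and the sample URL are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt text allows `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


url = "https://example.com/plumber-in-austin/"

# A Disallow rule: the crawler may not fetch the page at all,
# so it would never get to see a meta robots noindex on it.
disallow_rules = "User-agent: *\nDisallow: /plumber-in-austin/\n"

# The unofficial Noindex line: unknown to this parser, so crawling stays allowed.
noindex_rules = "User-agent: *\nNoindex: /plumber-in-austin/\n"

print(can_crawl(disallow_rules, url))  # False
print(can_crawl(noindex_rules, url))   # True
```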
Shall we continue?
Why do I think it is important?
Every seasoned SEO who has some experience with clients knows that no matter how big they are, development resources can sometimes be hard to find, and above all, development tasks take time to implement.
So, when I think about a given SEO recommendation, I must also take into account the time and resources it will require. For example, uploading a robots.txt file is one of the easiest tasks to handle (it shouldn’t take more than a minute with access to the site’s server), whereas deploying a meta robots tag in the website code for specific pages can be tricky and will require a developer.
Updating a robots.txt file is one of the most cost-effective changes I know. No matter how much work I have on my side, it will take zero development resources.
If we can leverage it, and as long as it’s working and it’s not going against the Google Webmaster guidelines, why the heck not?
What is the difference between the two SEO experiments?
The two case studies differ in several respects:
- Stone Temple experimented with 13 different domains, but only on 13 single URLs within these domains. In our experiment we had only one domain, but with 957 different pages noindexed in the robots.txt.
- Compared to 2015, in 2019 we have the new Google Search Console in hand, so it will be interesting to see how Google treats these URLs in the new index coverage report.
- As I’ve mentioned, the last experiment was carried out in November 2015. I haven’t found any update since then. It seems like the perfect time to put it to the test again.
Noindex in robots.txt – the Experiment
We mapped 957 URLs from the same main domain.
Then I added a Noindex line for every URL under the folder in question.
It looked like this:
Noindex: /example-url/
Take into consideration that unlike Disallow in robots.txt, where you can work with folders and wildcards (like *), Noindex in robots.txt is supposed to handle only specific URLs. Also, as I’ve already mentioned, the URL hierarchy was flat, so folder rules wouldn’t have been useful anyway.
We uploaded the new robots.txt file on February 20th. And waited.
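With hundreds of URLs, writing these lines by hand is error-prone. Below is a minimal sketch of how the lines could be generated from a list of full URLs. This is not the script used in the experiment; the function name noindex_lines and the sample URLs are hypothetical, and the input is assumed to contain absolute URLs including the scheme. Note also that Noindex in robots.txt was never an officially documented directive.

```python
from urllib.parse import urlparse


def noindex_lines(urls):
    """Turn absolute URLs into 'Noindex: /path/' lines, skipping duplicates."""
    seen = set()
    lines = []
    for url in urls:
        # Keep only the path part, since robots.txt rules are relative to the host.
        path = urlparse(url.strip()).path or "/"
        if path not in seen:
            seen.add(path)
            lines.append(f"Noindex: {path}")
    return lines


# Hypothetical sample of the URL list; the real list had 957 entries.
urls = [
    "https://example.com/plumber-in-nyc/",
    "https://example.com/plumber-in-chicago/",
    "https://example.com/plumber-in-nyc/",  # duplicate, emitted only once
]

print("\n".join(noindex_lines(urls)))
```

The resulting lines would then be pasted into the robots.txt file under the relevant User-agent group.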
The Results
As I’ve mentioned above, one of the key differences between the two experiments is that the new Google Search Console wasn’t available 3 years ago. Therefore, it was interesting to see how the index coverage report treats these directives in the robots.txt.
First, it was interesting to see that Google reports noindex as “Submitted URL blocked by robots.txt”:
Moreover, I tested a few of the URLs with the new URL Inspection tool:
It is interesting to see how Google treats the disallow as a directive that doesn’t allow Googlebot to crawl the page (as a reminder, noindex is not the same as disallow). Maybe Google won’t show this option because they don’t want to encourage it or support it.
Ok, and in practice? Have these pages been removed from the index? (or at least some of them?)
Take a look at the following graph:
Important takeaways from the data:
- The new robots.txt file was live on February 22. The first batch of pages removed from the index was on March 1.
- On March 23, we reached 814 pages removed from the index. In other words, in one month we managed to remove 85% of the pages in question. Not bad, right?
  It is important to note that it probably would have been more useful and much faster to also have a sitemap.xml containing the URLs we wanted to remove. This way, Googlebot would have had quick access to crawl these URLs before they were removed from the index.
- But wait, there is more. On April 2, the graph decreased from 814 pages to 660. Why? To be honest, I am not quite sure yet. It is even more difficult to understand which pages are noindexed and whether some of them got re-indexed, because we only get sample URLs, so we don’t have the full data. Does it mean that some of the URLs got re-indexed? I find it VERY hard to believe, but I’ll keep tracking it during the following weeks and promise to update accordingly.
- After Google updated the graph, the number increased to 905, which means almost 95% of the submitted URLs have been removed from the index.
- It is also important to mention that I manually checked the URLs on the list to make sure they were removed from the index and not just blocked by robots.txt (as with Disallow).
  How can you check it? Search with the site: operator for the specific URL.
  If you see this:
  It usually means that the URL is in the index, but it is blocked by robots.txt.
  If you can’t see it in the search results at all, it means it has been removed from the index.
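As a quick sanity check, the removal percentages quoted in the takeaways above can be reproduced from the raw counts. The sketch below only redoes the arithmetic with the numbers reported in this post (957 submitted URLs; 814, 660 and 905 reported as removed at different points); the function name removed_share is just an illustrative choice.

```python
TOTAL_URLS = 957  # URLs listed with a Noindex line in robots.txt


def removed_share(removed, total=TOTAL_URLS):
    """Return the share of removed URLs as a percentage, rounded to one decimal."""
    return round(100 * removed / total, 1)


# Counts reported in the index coverage graph at different dates.
for label, removed in [("March 23", 814), ("April 2", 660), ("After graph update", 905)]:
    print(f"{label}: {removed} of {TOTAL_URLS} removed ({removed_share(removed)}%)")
```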
What is your experience?
In the next few weeks and months, I’ll be putting these results to the test again, with a few more websites and perhaps a different perspective.
I encourage you to conduct your own experiments on this subject. It will be really interesting to see more results, so feel free to share.
Hello Roey! Recently we had a similar issue with a client. It took us more than a month to have pages deindexed. Wonderful reminder about disallow/noindex differences.
Thanks a lot, always interesting to do SEO experiments 🙂
Hi there,
I appreciate you putting in the graph about “submitted URL blocked by robots.txt” but I’d be interested to see if you have another graph of the Excluded by No Index report in GSC.
Good point Julie
In fact there is nothing to show, 0 pages have been affected
Great read indeed, but from now on google is ending all support for noindex in robots.txt
https://searchengineland.com/google-to-stop-supporting-noindex-directive-in-robots-txt-319003
Thanks, Amit!
And yeah, I got the memo too, but it doesn’t mean we can’t keep on going with those experiments 🙂