In the following post, I have curated the most frequently asked questions about crawling and indexing based on Google’s recommendations and guidelines throughout the years.
First, let’s understand: what is crawling, what is indexing, and what are the differences between them?
What is crawling and how does the Google crawler work?
Search engines need to find out what pages are out there on the web, so they use web crawlers: software that helps them discover webpages.
Google’s crawlers constantly follow internal and external links on webpages and add each newly discovered page to their ever-growing list of known pages.
What is the list of known pages?
Each page on that list is there because Google has crawled it at least once.
Bear in mind that crawling covers both finding a new page and revisiting a page that is already stored on Google’s servers in order to update it.
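To make the idea of following links and building a list of known pages more concrete, here is a minimal Python sketch. It is an illustration only, not how Googlebot is implemented: the `LINKS` dictionary is a made-up stand-in for the web, and the `crawl` function and all URLs are hypothetical.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> links found on it.
LINKS = {
    "https://example.com/": ["https://example.com/about", "https://example.com/blog"],
    "https://example.com/about": ["https://example.com/"],
    "https://example.com/blog": ["https://example.com/blog/post-1", "https://other.example/"],
    "https://example.com/blog/post-1": [],
    "https://other.example/": [],
}


def crawl(seed, links):
    """Breadth-first discovery: follow links and record each page once."""
    known = [seed]          # the "list of known pages", in discovery order
    seen = {seed}           # fast membership check to avoid re-adding pages
    queue = deque([seed])   # pages that still need to be crawled
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                known.append(target)
                queue.append(target)
    return known


known_pages = crawl("https://example.com/", LINKS)
```

Starting from a single seed URL, every internal and external link that has not been seen before ends up on the list exactly once, which is the essence of discovery.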
What is indexing and how does Google indexing work?
After the process of crawling, in which a webpage is discovered, Google keeps the page in the search index. According to Google, the search index contains hundreds of billions of webpages.
During the process of indexing, search engines pick up certain signals to understand the content of a given webpage. This includes rendering and analyzing the code, text, images, schema markup and so on. The main goal is to understand what the pages are about in order to rank them organically for a given search query.
A webpage can appear in the search results page only after it is indexed, meaning that every indexed webpage must have been crawled at least once.
The reverse does not hold: a crawled webpage won’t necessarily appear on the SERP, because it can still be blocked from indexing with a meta robots tag set to “noindex, follow”.
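If you want to check a page for that tag yourself, the following Python sketch shows one way to do it with the standard library. The class and function names (`MetaRobotsParser`, `is_noindexed`) are my own, and the sketch only looks at meta tags in the HTML you pass in; it does not cover the `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives of <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() in ("robots", "googlebot"):
            content = attr.get("content") or ""
            # "noindex, follow" -> {"noindex", "follow"}
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def is_noindexed(html):
    """True if the page asks search engines not to index it."""
    parser = MetaRobotsParser()
    parser.feed(html)
    # "none" is shorthand for "noindex, nofollow".
    return "noindex" in parser.directives or "none" in parser.directives


blocked_page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
open_page = '<html><head><meta name="robots" content="index, follow"></head></html>'
```

Here `is_noindexed(blocked_page)` is true while `is_noindexed(open_page)` is false, even though a crawler could fetch both pages equally well.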
How often does Google crawl and index?
The short answer is – it depends.
According to John Mueller, Googlebot has a limit on the number of pages it can crawl on a given domain per day. This number is subject to change: depending on the size and the type of the website, Googlebot automatically calculates the crawl rate.
Of course, not all pages are equal; Googlebot will prefer important pages such as homepages and main category pages. According to Mueller, Google will crawl these top pages every few days, and maybe even more often.
My take: it all depends. A big news publisher can expect Googlebot to come back a few times on the same day, and even to crawl it multiple times within one hour. However, if we are talking about a small business with a four-page website, I wouldn’t expect the same scenario, nor would it make much sense.
To conclude, it depends on how often you update your website (freshness) and on its authority (don’t expect the same crawl rate as The New York Times if you upload 200 new articles a day but have no authority or backlinks).
How can I increase Google’s crawl rate for my website?
There is nothing you can do manually to increase your website’s crawl rate.
In the old Google Search Console (FKA Google Webmaster Tools) you could only limit the crawl rate, and even Google doesn’t recommend that (unless there are server load issues that are definitely caused by Googlebot).
Still, several aspects can help you improve the site’s crawl rate:
- Content – I will just start by saying that adding 100,000 spammy/thin content pages is not going to be a good solution for you. However, if your content is truly valuable and you make sure to update it regularly, Google will naturally understand that the site needs to be crawled more often. If you compare a news website with hourly updates to a small locksmith store in Nashville, it is safe to assume that the first will be crawled more.
- Link building – think about it this way (and sorry for the oversimplification): hypothetically, what if your website suddenly gained backlinks from the homepages (!) of CNN, Reddit and Wikipedia? Ranking aside, every time Googlebot crawls these popular homepages, it will also follow these imaginary links to your website. As simple as that. So, even though you (or I) may never gain links from these websites (at least for now), any link from an authority website counts as a crawl frequency booster.
- Crawl budget optimization – crawl budget optimization is not something to overlook. Find the areas on your website that might cause Googlebot to spend unnecessary crawl resources:
  - Handling duplicate content pages
  - Removing thin content pages
  - Handling URL parameters
  - Improving pages with unusually high load time
Content King has created a great guide explaining how to optimize for crawl budget.
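As one illustration of the URL parameter point above, here is a minimal Python sketch that groups URLs which differ only by parameters that don’t change the page content. The `IGNORED_PARAMS` set is an assumption for the sake of the example; which parameters are safe to ignore depends entirely on your own site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed (hypothetical) parameters that do not change what the page shows.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "ref"}


def canonical_form(url):
    """Drop ignored parameters and sort the rest so equivalent URLs compare equal."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def group_duplicates(urls):
    """Map each canonical form to the crawled URLs that collapse into it."""
    groups = {}
    for url in urls:
        groups.setdefault(canonical_form(url), []).append(url)
    # Only groups with more than one URL waste crawl resources.
    return {key: value for key, value in groups.items() if len(value) > 1}


crawled = [
    "https://example.com/shoes?utm_source=news",
    "https://example.com/shoes",
    "https://example.com/shoes?color=red&ref=home",
    "https://example.com/shoes?color=red",
    "https://example.com/hats",
]
duplicates = group_duplicates(crawled)
```

Running this on a URL list exported from your crawler or log files gives a quick first look at where parameters are multiplying the number of URLs Googlebot has to visit.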
How can I find out when Google last crawled my site?
- Cached version – While checking the page’s cached version is the most common answer, Google has stated more than once over the years that it doesn’t update the cache on each crawl and that the cache is not a mirror of Googlebot’s crawl rate. In other words, the cached version date is not a good indicator of the last Google crawl; it only tells you when the page was last indexed or re-indexed, nothing more. So I would not recommend relying on it.
- URL inspection – the URL inspection tool in Google Search Console is highly useful, including for checking when a page was last crawled.
- Log files – This one is a bit on the technical side. Analyzing your log files will give you by far the most accurate data about your crawled pages, and so much more. You can see how often every page on your website is crawled and, of course, the last date and time Google (or other search engines, for that matter) paid a visit and crawled specific pages. You don’t need to be a rocket scientist to read and analyze log files. I use Screaming Frog’s log analyzer, and it is a highly intuitive tool to work with.
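If you prefer to peek at the raw logs yourself, here is a minimal Python sketch of the log file approach. It assumes your server writes the common “combined” access log format; the sample lines, IP addresses and the `googlebot_crawls` function are made up for illustration. Keep in mind that a user agent string can be faked, so a serious analysis should also verify that the requests really come from Google (for example with a reverse DNS lookup), which this sketch does not do.

```python
import re
from datetime import datetime

# Assumes the "combined" access log format; adjust the pattern to your server.
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Made-up sample lines standing in for a real access log.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Mar/2019:06:25:14 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "%s"' % GOOGLEBOT_UA,
    '203.0.113.7 - - [11/Mar/2019:08:00:00 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '66.249.66.1 - - [12/Mar/2019:09:00:01 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" "%s"' % GOOGLEBOT_UA,
    '66.249.66.1 - - [12/Mar/2019:09:05:30 +0000] "GET /contact HTTP/1.1" 200 1024 "-" "%s"' % GOOGLEBOT_UA,
]


def googlebot_crawls(lines):
    """Return {path: (number of Googlebot hits, time of the latest hit)}."""
    stats = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue  # skip malformed lines and non-Googlebot visitors
        when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        hits, last = stats.get(match.group("path"), (0, when))
        stats[match.group("path")] = (hits + 1, max(last, when))
    return stats


crawl_stats = googlebot_crawls(SAMPLE_LOG)
for path, (hits, last) in sorted(crawl_stats.items()):
    print(f"{path}: crawled {hits} time(s), last on {last.isoformat()}")
```

To run it against a real file, you would pass the open file object (for example `open("access.log")`, where the file name is whatever your server uses) instead of `SAMPLE_LOG`.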
How do I ask Google to crawl my site?
- URL inspection – after checking any given URL, you can click “request indexing” to get Google to re-crawl and index the URL.
- In case Bing is your thing (hey, that’s a nice rhyme) – they recently announced a new feature that allows you to submit up to 10,000 URLs for indexing per day.
What is crawl delay in robots.txt and is it useful?
The main goals of the crawl-delay directive are to:
- Limit the number of server requests
- Limit the load on the server
- Specify how long the crawler needs to wait (in seconds).
While limiting the load on the server might sound like a good idea, most modern servers can handle a large volume of crawler requests per second, with or without a crawl delay.
Let’s have a look at a robots.txt crawl delay example:
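Here is a minimal sketch of what such a file can look like, wrapped in a few lines of Python so you can see how a parser reads it. The robots.txt content below is a hypothetical example of my own (the `bingbot` group, the 10-second delay and the `/admin/` rule are assumptions), and Python’s `urllib.robotparser` only reports what the file says; whether a crawler honors the delay is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask Bing's crawler to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The delay (in seconds) declared for a specific crawler, or None if none applies.
bing_delay = parser.crawl_delay("bingbot")
other_delay = parser.crawl_delay("Googlebot")

print("bingbot crawl-delay:", bing_delay)
print("Googlebot crawl-delay:", other_delay)
```

In this file the delay is declared only in the `bingbot` group, so other crawlers fall back to the `*` group, which sets no delay at all.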
The crawl-delay directive is NOT supported by Google, only by Yahoo, Bing and Yandex.
According to Bing, even though the crawl-delay directive is supported, in most cases it’s still not considered a good idea:
“This means the higher your crawl delay is, the fewer pages BingBot will crawl. As crawling fewer pages may result in getting less content indexed, we usually do not recommend it, although we also understand that different web sites may have different bandwidth constraints.”
According to Yandex, it may speed up the crawling process:
“The Yandex search robot supports fractional values for Crawl-Delay, such as “0.5”. This doesn’t mean that the search robot will access your site every half a second, but it may speed up the site processing.”
The bottom line: unless you are optimizing your website for the Russian market (Yandex search engine), there is no real need or reason to add crawl-delay to the robots.txt file.
It can even decrease your performance with other search engines, so stay alert and check your robots.txt file. As far as Google is concerned, their bots will simply ignore it altogether.
Google Crawl Stats – what if the number of pages crawled per day has dropped or increased drastically?
I’ll start by saying it will be a good idea to monitor your site’s crawl stats in Search Console at least once a month.
Should you be worried about a big increase or decrease in the graph? Well, it depends.
For example, suppose you see a sudden spike.
Let’s start by ruling out what it’s not:
It doesn’t mean a Google algorithm update.
However, it can be related to the switch to mobile-first indexing (read more in the question on mobile first index further down).
And what if it’s not related?
Generally, you shouldn’t worry about it as long as it’s only a spike.
If you see a significant steady increase (not just a spike), check the following:
- Has the number of indexed pages increased? While it’s not always a bad thing, if thousands of thin content pages have been indexed, you should monitor and handle them properly.
- Check the robots.txt to see if any changes were made (maybe an important disallow was removed).
If you see a significant decrease, please check the following:
- Check the robots.txt to see if any changes were made, in this case a line that accidentally blocks the entire website or an entire folder, like this: Disallow: /
- Check the meta robots tag to make sure it’s not noindexing or nofollowing any important pages.
- Check with the development team to see if any important implementations happened, meaning anything that can affect how Google crawls the website. For instance, migrating the website to Angular without taking the right measures.
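The robots.txt check in particular is easy to automate. The sketch below uses Python’s built-in `urllib.robotparser` to test a handful of important URLs against a robots.txt file. The URLs, the two sample files and the `blocked_urls` helper are hypothetical, and Python’s parser may not mirror every matching rule Google applies, so treat the result as a first pass rather than the final word.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical list of pages that must always stay crawlable.
IMPORTANT_URLS = [
    "https://example.com/",
    "https://example.com/products/shoes",
    "https://example.com/blog/",
]

# Two made-up robots.txt versions: a healthy one and one with an accidental "Disallow: /".
HEALTHY_ROBOTS = "User-agent: *\nDisallow: /cart/\n"
BROKEN_ROBOTS = "User-agent: *\nDisallow: /\n"


def blocked_urls(robots_txt, urls, agent="Googlebot"):
    """Return the URLs that the given robots.txt content blocks for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


blocked_when_healthy = blocked_urls(HEALTHY_ROBOTS, IMPORTANT_URLS)
blocked_when_broken = blocked_urls(BROKEN_ROBOTS, IMPORTANT_URLS)

print("Blocked with healthy file:", blocked_when_healthy)
print("Blocked with broken file:", blocked_when_broken)
```

Running a check like this after every deployment would flag a stray `Disallow: /` before it shows up as a drop in the crawl stats graph.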
What about mobile first index? Does it crawl and index differently?
With Google’s mobile first indexing, it is important to mention that there are no two different indexes. There is still only one index to rule them all.
You don’t need to do anything in order to move to the mobile first index; you will just see a notification in Search Console. In addition, if you check your log files, you should look for a significant increase in the crawl rate from Smartphone Googlebot.
Once it stabilizes, you should expect to see more or less the same overall crawl rate, only divided between two user agents, desktop and mobile:
- Googlebot = desktop
- Smartphone Googlebot = mobile
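To see that split in your own log files, you can classify each Googlebot request by its user agent string. The Python sketch below relies on the assumption that Smartphone Googlebot’s user agent contains the word “Mobile” while the desktop one does not; the sample strings and the `googlebot_type` function are illustrative, and other Google crawlers (such as the image crawler) would need extra handling.

```python
from collections import Counter


def googlebot_type(user_agent):
    """Classify a user agent as 'mobile' or 'desktop' Googlebot, or None for anything else."""
    if "Googlebot" not in user_agent:
        return None
    return "mobile" if "Mobile" in user_agent else "desktop"


DESKTOP_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SMARTPHONE_UA = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
REGULAR_VISITOR_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Made-up sample of user agents as they might appear in a log file.
SAMPLE_AGENTS = [DESKTOP_UA, SMARTPHONE_UA, SMARTPHONE_UA, REGULAR_VISITOR_UA]

counts = Counter(kind for kind in map(googlebot_type, SAMPLE_AGENTS) if kind)
print(counts)
```

Tracking these two counters over time should make the shift toward Smartphone Googlebot visible once your site moves over.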
Moz perfectly breaks it down on the following chart:
Indexing and crawling are two concepts that are often treated as interchangeable, even by seasoned marketers. I tried my best to make things as simple as possible (without oversimplifying).
Still feeling concerned about indexing and crawling?
Don’t be shy; I would love to hear your thoughts in the comments.