SEO for PDF files is not a “sexy” topic or buzzword like “Google AMP”, “voice search” or “mobile-first index”.
But in many instances, depending on your website, you might be surprised just how much potential organic traffic you are missing out on.
Let’s dig in.
Does your website contain PDF files?
Step 1 –
The first question to ask is: “Do I have any indexed PDF files on my website?”
So how do you search for PDF files on Google?
Just add the following operators: site:yoursite.com filetype:pdf
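If you need to run this check for several domains, the query can be generated programmatically. The following is a minimal sketch; the helper names are made up for illustration, and the search URL format is an assumption about how Google currently accepts queries.

```python
from urllib.parse import quote_plus


def pdf_index_query(domain: str) -> str:
    """Return the Google query that lists indexed PDFs for a domain."""
    return f"site:{domain} filetype:pdf"


def pdf_index_search_url(domain: str) -> str:
    """Return a Google search URL for that query (URL-encoded)."""
    return "https://www.google.com/search?q=" + quote_plus(pdf_index_query(domain))


print(pdf_index_query("yoursite.com"))
print(pdf_index_search_url("yoursite.com"))
```

Opening the printed URL in a browser should show the same results as typing the operators by hand.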
Step 2 –
Check your Google Search Console -> Search Analytics
And filter pages (URL) by .pdf
As you can see from the number of clicks and impressions here, there is a lot of potential (and even more missed potential).
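To put a number on that potential outside the Search Console interface, you could export the pages report and sum the PDF rows. This is only a sketch: the column names (Page, Clicks, Impressions) and the sample rows are hypothetical, and a real export may be laid out differently.

```python
import csv
import io
from urllib.parse import urlparse

# Hypothetical sample of a Search Console pages export; real exports may
# use different column names.
SAMPLE_EXPORT = """Page,Clicks,Impressions
https://yoursite.com/guide.pdf,120,4300
https://yoursite.com/blog/post,80,900
https://yoursite.com/docs/Manual.PDF?ref=nav,15,2100
"""


def pdf_rows(csv_text):
    """Yield rows whose URL path ends in .pdf (case-insensitive)."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        path = urlparse(row["Page"]).path
        if path.lower().endswith(".pdf"):
            yield row


def totals(rows):
    """Sum clicks and impressions over the given rows."""
    clicks = impressions = 0
    for row in rows:
        clicks += int(row["Clicks"])
        impressions += int(row["Impressions"])
    return clicks, impressions


pdf_clicks, pdf_impressions = totals(pdf_rows(SAMPLE_EXPORT))
print(pdf_clicks, pdf_impressions)
```

Matching on the URL path rather than the full URL means PDF links that carry query parameters are still counted.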
Does Google index and crawl PDF files?
Following the graph above, the answer is a big YES.
Even more interesting, in a recent tweet, John Mueller explained that Google actually converts PDFs (and other similar files) into HTML for indexing purposes.
FWIW we convert PDFs & other similar document types into HTML for indexing too, so theoretically there wouldn’t be too much difference.
— 🍌 John 🍌 (@JohnMu) August 30, 2018
Moreover, Googlebot follows links in PDF files, and those links pass link juice, so you should definitely optimize your PDF documents with internal links and, where needed, external links to relevant sources.
Which websites are likely to receive organic traffic from PDF files?
Generally speaking, almost any website.
However, in my experience the following types have lots of PDFs:
• Hospitals – tons of valuable information is hiding there
• Manufacturers' sites with user manuals (which are almost always in PDF format)
• Almost any company that offers “white papers” or eBooks as a part of its digital marketing efforts
Should you block or noindex your PDF files?
Well, it depends.
As with any SEO starting point, you should first ask yourself:
Do these pages deliver any value to my users?
Are they unique?
One of my clients is in the luxury diamond industry, and their company provides a GIA certificate for every diamond. These pages are essential for potential buyers, who use them to check the diamond's authenticity and grading.
However, these pages have almost zero unique value. Take into account that there are tens of thousands of these pages, and you have got yourself a recipe for potential SEO problems.
Even more important, the text in those PDF files was not machine-readable, which means Google saw them as images.
How can you tell? Try searching for a sentence or a word within the document itself. If you get nothing, Google will probably treat the page as a soft 404, and that is something you definitely want to avoid.
How to prevent Google from crawling or indexing PDF files?
First, we need to distinguish between the two.
If the PDF documents are not indexed yet, we can keep Google from crawling them using the robots.txt file.
You can either block the specific folder or add a rule that matches every URL ending in .pdf.
Just be careful before blocking anything with robots.txt: it stops crawling, but it does not remove pages that are already in the index.
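The exact robots.txt line is not shown in this post. A commonly used rule for this purpose is `Disallow: /*.pdf$`, which relies on the `*` and `$` wildcards that Google's robots.txt parser supports. Treat that rule as an assumption rather than the author's original. The sketch below translates such a pattern into a regular expression so you can preview which paths it would block before deploying it.

```python
import re

# Assumed rule: the original line is missing from the post. Google's
# robots.txt parser supports "*" (any characters) and a trailing "$"
# (end of URL).
ASSUMED_RULE = "Disallow: /*.pdf$"


def rule_to_regex(rule_path):
    """Translate a Google-style robots.txt path pattern into a regex."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    body = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(url_path, rule=ASSUMED_RULE):
    """Return True if the path matches the Disallow pattern."""
    pattern = rule.split(":", 1)[1].strip()
    return rule_to_regex(pattern).match(url_path) is not None


for path in ["/docs/manual.pdf", "/docs/manual.pdf?utm_source=x", "/blog/post"]:
    print(path, is_blocked(path))
```

Note that with the trailing `$`, a PDF URL that carries query parameters is not matched. Dropping the `$` would widen the rule to cover those URLs as well.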
The other option (a bit trickier to implement yourself) is to send an X-Robots-Tag HTTP header with noindex, nofollow, which can be set in the .htaccess file:
<Files ~ "\.pdf$">
  # Requires mod_headers to be enabled
  Header set X-Robots-Tag "noindex, nofollow"
</Files>
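This will keep Google from indexing the PDF files and from following the links inside them. Note that Google must still be able to crawl the files in order to see the header, so do not block the same files in robots.txt.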
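After deploying the header, it is worth confirming that the server really sends it for PDF responses. The sketch below only inspects a dictionary of response headers, which you would obtain from your HTTP client of choice. The function names are illustrative, and it deliberately ignores the optional user-agent prefix form (such as "googlebot: noindex").

```python
def robots_directives(header_value):
    """Split an X-Robots-Tag value into a set of lowercase directives."""
    return {part.strip().lower() for part in header_value.split(",") if part.strip()}


def is_noindexed(headers):
    """Return True if the response headers ask search engines not to index."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            directives = robots_directives(value)
            # "none" is shorthand for "noindex, nofollow".
            if "noindex" in directives or "none" in directives:
                return True
    return False


print(is_noindexed({"Content-Type": "application/pdf",
                    "X-Robots-Tag": "noindex, nofollow"}))
print(is_noindexed({"Content-Type": "application/pdf"}))
```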
How to track organic traffic to your PDF files
Most likely, you will not see any traffic to your PDF files in Google Analytics, because a PDF cannot run the Analytics tracking script, so you will need to configure a few things first.
There are two main options to set up PDF monitoring with Google Analytics:
However, something that is not explained there is the canonicalization issue.
Once we use parameters or UTMs to link both internally and externally, we might expect to see some duplicate versions of the same file, which might also result in loss of precious link equity.
So what is the solution in this case? Is it possible to apply canonical tags (or any other meta tags) to a PDF file?
Yes: a canonical tag in the HTTP header, a format which Google supports.
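For non-HTML files such as PDFs, the canonical is sent as an HTTP `Link` header with `rel="canonical"`. The sketch below builds that header value for a tagged URL. It assumes the canonical version is simply the same URL without query parameters or fragment, which may not hold for every site, and the function name is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_link_header(url):
    """Build a Link header value pointing at the URL without query or fragment."""
    parts = urlsplit(url)
    clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return f'<{clean}>; rel="canonical"'


print("Link: " + canonical_link_header(
    "https://yoursite.com/whitepaper.pdf?utm_source=newsletter"))
```

The server would send this header with every response for the PDF, so that the tagged variants all point back to one clean URL.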
2. The second (most recommended and up-to-date) option is to implement event tags using Google Tag Manager. This technique requires the Google Tag Manager tool, but don’t let that scare you: it is quite straightforward, and it won’t require you to tag hundreds or thousands of PDF documents on your website.
The downside, in both instances, is that you will not see this data aligned with the organic data: it will be counted either as events or as campaigns. Therefore, while it is still feasible to merge the data in Excel sheets, you will not be able to see it in the Google Analytics organic view.
So, what is the best way to display PDFs on a website?
My best advice here is to have both. The user will always have the option to download the PDF version, but will also have access to the “normal” HTML page.
Let’s take the Moz beginner’s guide as an example.
While it provides the user with the full HTML version of the PDF, it also lets them download the PDF as a reference: win-win!
Another good example is Apple’s iPad manuals.
Their site gives us an option to browse the manuals by product, and from there to download the PDFs directly.
This is very smart, because you can see from their top searched keywords that this is exactly what users are searching for, and they will probably land on this page rather than on the PDF file itself:
I have written an in-depth post on branded keywords SEO – if you are a big brand, your users are already looking for you. Don’t forget to give the users what they are looking for.
Also, remember that HTML pages give you far more marketing options:
• You can run a remarketing campaign and show users new items
• You can collect email signups for your newsletter
• The choice is yours 🙂
What is the recommended PDF file size?
While I have not found a specific recommendation regarding the ideal PDF size, I did come across a page speed recommendation by John Mueller, telling us that we should aim for less than 2-3 seconds per page:
There’s no limit per page. Make sure they load fast, for your users. I often check https://t.co/s55K8Lrdmo and aim for <2-3 secs
— 🍌 John 🍌 (@JohnMu) November 26, 2016
You can also compress the file size using tools such as iLovePDF.
Well, dear PDF, we love you too, but on our own terms.
- Check if your site has PDF files with significant organic traffic and powerful backlinks
- Make sure you monitor and track them properly via Google Analytics
- Check what kind of PDF files are indexed and consider if you should keep them or block them / remove them from the index
- Even though Google converts PDF files into HTML pages, it is highly recommended that your landing page be in HTML format and offer the PDF file as a download from there
- In most cases, I recommend making an effort to have the content in both formats – as the HTML version will be the point of reference for Google