In order for Google and other search engines to index your pages, they need to know they exist and where.
This is usually accomplished in one of two ways:
- The crawler follows a link from another page.
- The crawler finds the URL listed in your XML sitemap.
A page without any links to it is called an orphan page.
Because search engines can’t find an orphan page through other links on the end, orphan pages often go unindexed and never show up in search results.
Even if your orphan pages are listed in your XML sitemap, they are still a problem for SEO.
With no internal links, no authority is passed to the pages, and search engines have no semantic or structural context in which to evaluate the page.
Without any way of knowing where the page fits into your site as a whole, it can be more difficult to determine which queries the page is relevant for.
In this post, we’ll explore how to find orphan pages on your site.
1. Identify Your Crawlable Pages
First, you’ll need a list of all of the URLs that currently can be reached by crawling your site’s links.
You will need an SEO spider to do this. I recommend ScreamingFrog.
Whatever crawler you use, make sure it is set to crawl only pages that are indexable by search engines, meaning that it should not crawl pages that are noindexed or pages that are hidden from search engines by robots.txt.
Start the crawl from the homepage of the site, making sure to use the canonical URL, including the proper https or http and www versus non-www.
Once you have crawled your site, export the URLs to a spreadsheet like this:
2. Resolve 2 Common Causes of Orphan Pages
Before checking any tools or sources to find orphan pages, there are two common causes of orphan pages that should be immediately addressed and dealt with.
What both of these causes have in common is that they are essentially page duplicates that should automatically redirect consistently to only one URL.
If they don’t, it’s likely that some versions of the page are not linked to and as a result are orphans.
In this case, the fact that they are orphans isn’t the primary issue, the fact that they are duplicates is.
Still, these will come up later while you are looking for orphan pages, and need to be dealt with, so it’s a good idea to get these out of the way beforehand.
Non-canonical https/http or www/non-www
Every public page on your site should ideally use http or https consistently (preferably https), and www or non-www consistently.
To check if this is the case, try typing all of these variations of your site’s homepage into your browser:
All four variations should redirect automatically to the exact same URL.
You should test this on a few other pages of your site, and check your site’s .htaccess file to make sure that redirects for these are set up properly.
Here is how to force https in .htaccess. If you do this, verify that every page on your site has SSL capabilities, or your users will get a scary browser warning.
Here is how to force www or non-www. Again, verify that this won’t create any server errors.
Another thing to watch out for is consistent use of trailing slashes.
For example, these two URLs may produce the same content, but the URLs are not identical:
Check a few pages on your site both with and without the trailing slash, and make sure that they redirect automatically to the same URL, and that they do so consistently.
Verify that this is set up properly in .htaccess. Here is how to force a trailing slash in .htaccess.
3. Get a List of URLs from Google Analytics
Crawlers, by definition, will have a difficult time finding orphan pages.
So using any SEO tool to find one is bound to be problematic.
The best place to start looking for orphan pages, then, is your own Google Analytics data (or any other analytics packages you use).
As long as the pages in question have Google Analytics installed, if the page has ever been visited, there is a record of it somewhere in Google Analytics.
To get a comprehensive list of URLs, from the left sidebar, select “All Pages” under “Site Content” from the “Behavior” section:
Since our orphan pages are difficult to find, the number of times they have been visited is likely to be quite low.
Click “Pageviews” so that the arrow is pointing upward, indicating that the list of URIs is sorted in ascending order from least to most pageviews.
This will move the pages most likely to be orphans to the top:
To make sure our list is as comprehensive as possible, go to the time range at the top right and set the starting date back to a time before Google Analytics was in place, and click the “Apply button:
Now we will need to expand our list of URLs as much as possible.
In the bottom right, click the “Show rows” dropdown menu and select the highest number of rows.
Our biggest obstacle is that Analytics can only list up to 5,000 URLs at a time:
If you have more than this, you will have to export 5,000 pages at a time until you have all you Google Analytics visitor data.
However, we are sorting pageviews by ascending, so our list should hopefully include all, and will most likely include most orphan URLs that have had a visitor.
It will likely take a bit of time Analytics to fetch all of the data. Be patient and don’t try to rush things or you will risk crashing your browser.
Once the URLs are loaded, head up to the top right, select export, and export a Google Sheet, Excel file, or CSV spreadsheet to get your URLs.
Now copy the URLs from your exported analytics file into your orphan page spreadsheet, like so:
We will need to get these into URL format in order for them to be useful. To do this, insert a new column and paste down the homepage URL, like so:
And use the concat() formula to combine these together into a URL in the next column over:
Then just drag the formula down to get the full list of URLs:
4. Identify Your Orphan URLs
To identify our orphan URLs, we will need to compare the list of “Crawlable URLs” and the list of “Analytics URLs” in our spreadsheet.
In our hypothetical example, it’s obvious that https://example.com/11 is an orphan page, but in reality you will almost always have far more URLs to sift through, and we will need to automate the process of identifying our orphan URLs.
To do this, we need a formula that checks if each URL in our Analytics list is also found in our list of Crawlable URLs.
Here is an example of a formula that will accomplish this:
The “match” formula we have used in cell E2 here is:
This formula checks if the URL in cell D2 is in the range $A$2:$A$11. (If you’re not too familiar with spreadsheets, the dollar signs are there to make sure that when we drag the formula down the column, the range won’t change.)
The value “0” tells Google Sheets that the columns aren’t necessarily sorted. (See the Google Sheets documentation.)
If there is a match, the formula returns its position in the range, which in this case is the first position in the range.
What we’re more interested in, however, is if there isn’t a match.
As you can see, the formula returns the error #N/A for https://example.com/11, because it is not found in our list of Crawlable URLs. This means it is an orphan page.
To get a list of our orphan pages, then, all we need to do is sort our “Match” column to collect all of the #N/A results in one place.
We can then copy our list of orphan URLs and paste them to a new sheet where we can address how to fix them.
5. Other Places to Look for Orphan URLs
You can repeat this process for identifying orphan URLs using data sources other than Google Analytics.
Any of the following tools will have a list of pages crawled from your site:
- Moz Link Explorer
- Raven Tools
I would not recommend signing up for any of them exclusively to look for orphan pages, because they will need to somehow crawl these pages in order to find them.
However, it is possible that in some cases these tools will find pages that aren’t directly crawlable because they were found using other means, usually at some point in history when the page was crawlable:
Also, it’s a good idea to work with your dev team to see if they can get the complete list of URLs on the site directly from the server, since this should be the most complete list available anywhere.
Finally, you can get a list of URLs from the Google Search Console’s Search Analytics report.
Even though these pages are obviously indexed if they are showing up here, you may still find pages that aren’t crawlable from your internal links that will need to be fixed.
Orphan pages can’t be indexed by search engines if they don’t show up in your sitemap – and they can create other SEO issues even if they do.
Use the methods outlined in this post to find your orphan pages and get this issue resolved.
Featured Image: E2M Solutions
All screenshots taken by author, November 2018
Subscribe to SEJ
Get our daily newsletter from SEJ’s Founder Loren Baker about the latest news in the industry!