ip.board SEO: Improving crawling efficiency

October 20, 2021

No matter how good your content is, how accurate your keywords are or how precise your microdata is, inefficient crawling reduces the number of pages Google will read and store from your site.

Search engines need to look at and store as many pages that exist on the internet as possible. There are currently an estimated 4.5 billion web pages active today. That's a lot of work for Google.

It cannot look and store every page, so it needs to decide what to keep and how long it will spend on your site indexing pages.

Right now, Invision Community is not very good at helping Google understand what is important and how to get there quickly. This blog article runs through the changes we've made to improve crawling efficiency dramatically, starting with Invision Community 4.6.8, our November release.

The short version

This entry will get a little technical. The short version is that we remove a lot of pages from Google's view, including user profiles and filters that create faceted pages and remove a lot of redirect links to reduce the crawl depth and reduce the volume of thin content of little value. Instead, we want Google to focus wholly on topics, posts and other key user-generated content.

Let's now take a deep dive into what crawl budget is, the current problem, the solution and finally look at a before and after analysis. Note, I use the terms "Google" and "search engines" interchangeably. I know that there are many wonderful search engines available but most understand what Google is and does.

Crawl depth and budget

In terms of crawl efficiency, there are two metrics to think about: crawl depth and crawl budget. The crawl budget is the number of links Google (and other search engines) will spider per day. The time spent on your site and the number of links examined depend on multiple factors, including site age, site freshness and more. For example, Google may choose to look at fewer than 100 links per day from your site, whereas Twitter may see hundreds of thousands of links indexed per day.

Crawl depth is essentially how many links Google has to follow to index the page. The fewer links to get to a page, is better. Generally speaking, Google will reduce indexing links more than 5 to 6 clicks deep.

The current problem #1: Crawl depth

A community generates a lot of linked content. Many of these links, such as permalinks to specific posts and redirects to scroll to new posts in a topic, are very useful for logged in members but less so to spiders. These links are easy to spot; just look for "&do=getNewComment" or "&do=getLastComment" in the URL. Indeed, even guests would struggle to use these convenience links given the lack of unread tracking until logged in. Although they offer no clear advantage to guests and search engines, they are prolific, and following the links results in a redirect which increases the crawl depth for content such as topics.

The current problem #2: Crawl budget and faceted content

A single user profile page can have around 150 redirect links to existing content. User profiles are linked from many pages. A single page of a topic will have around 25 links to user profiles. That's potentially 3,750 links Google has to crawl before deciding if any of it should be stored. Even sites with a healthy crawl budget will see a lot of their budget eaten up by links that add nothing new to the search index. These links are also very deep into the site, adding to the overall average crawl depth, which can signal search engines to reduce your crawl budget.

Filters are a valuable tool to sort lists of data in particular ways. For example, when viewing a list of topics, you can filter by the number of replies or when the topic was created. Unfortunately, these filters are a problem for search engines as they create faceted navigation, which creates duplicate pages.

The solution

There is a straightforward solution to solve all of the problems outlined above. We can ask that Google avoids indexing certain pages. We can help by using a mix of hints and directives to ensure pages without valuable content are ignored and by reducing the number of links to get to the content. We have used "noindex" in the past, but this still eats up the crawl budget as Google has to crawl the page to learn we do not want it stored in the index.

Fortunately, Google has a hint directive called "nofollow", which you can apply in the code that wraps a link. This sends a strong hint that this link should not be read at all. However, Google may wish to follow it anyway, which means that we need to use a special file that contains firm instructions for Google on what to follow and index.

This file is called robots.txt. We can use this file to write rules to ensure search engines don't waste their valuable time looking at links that do not have valuable content; that create faceted navigational issues and links that lead to a redirect.

Invision Community will now create a dynamic robots.txt file with rules optimised for your community, or you can create custom rules if you prefer.

The new robots.txt generator in Invision Community

Analysis: Before and after

I took a benchmark crawl using a popular SEO site audit tool of my test community with 50 members and around 20,000 posts, most of which were populated from RSS feeds, so they have actual content, including links, etc. There are approximately 5,000 topics visible to guests.

Once I had implemented the "nofollow" changes, removed a lot of the redirect links for guests and added an optimised robots.txt file, I completed another crawl.

Let's compare the data from the before and after.

First up, the raw numbers show a stark difference.

Before our changes, the audit tool crawled 176,175 links, of which nearly 23% were redirect links. After, just 6,389 links were crawled, with only 0.4% being redirection links. This is a dramatic reduction in both crawl budget and crawl depth. Simply by guiding Google away from thin content like profiles, leaderboards, online lists and redirect links, we can ask it to focus on content such as topics and posts.

Note: You may notice a large drop in "Blocked by Robots.txt" in the 'after' crawl despite using a robots.txt for the first time. The calculation here also includes sharer images and other external links which are blocked by those sites robots.txt files. I added nofollow to the external links for the 'after' crawl so they were not fetched and then blocked externally.

As we can see in this before, the crawl depth has a low peak between 5 and 7 levels deep, with a strong peak at 10+.

After, the peak crawl depth is just 3. This will send a strong signal to Google that your site is optimised and worth crawling more often.

Let's look at a crawl visualisation before we made these changes. It's easy to see how most content was found via table filters, which led to a redirect (the red dots), dramatically increasing crawl depth and reducing crawl efficiency.