
A Website in the Dark

There are any number of reasons Google may be unable to effectively crawl and index your site. (Indexing is the process Google uses to gather information about your site, compare it to others, and rank it.) Webpages may have keyword, metadata, or status-code issues, or send signals that block access for Googlebot, the crawler Google uses to discover content. The result: your content doesn't show up, the wrong content renders, or searchers hit the dreaded "404 Page Not Found" error. That was the situation at Wiley Efficient Learning: poor indexing was keeping their website off the search radar.

Uncovering the Roadblocks

The only way to truly gauge where the obstacles were, especially given the sheer size of the Wiley site, was a comprehensive Site Crawl & Indexation Audit.

The exhaustive exploration included, but was not limited to:

URL statuses across the site. One of the first things we look at when crawling a large website is the spread of response codes its URLs return. These status codes matter because they determine whether and how Google crawls your site. In our review, we found a wide range of status codes, some of which can stop crawling altogether.
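Tallying those response codes by family is the first step of the audit. As a minimal sketch of that breakdown (the sample codes below are illustrative, not Wiley's actual data):

```python
from collections import Counter

def status_breakdown(codes):
    """Group raw HTTP status codes into the families a crawl audit reports."""
    buckets = Counter()
    for code in codes:
        if 200 <= code < 300:
            buckets["2xx success"] += 1
        elif 300 <= code < 400:
            buckets["3xx redirect"] += 1
        elif 400 <= code < 500:
            buckets["4xx client error"] += 1
        elif 500 <= code < 600:
            buckets["5xx server error"] += 1
        else:
            buckets["other"] += 1
    return dict(buckets)

# Codes as a crawler might collect them from a site's URLs (sample data)
sample = [200, 200, 301, 404, 404, 500]
print(status_breakdown(sample))
```

A spike in the 4xx or 5xx buckets is what flags a site for the kind of deeper investigation described below.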

Robots.txt files. These are instructions for crawlers like Googlebot. When a crawler arrives at your website, the first thing it checks is your robots.txt file, which tells it which pages it is allowed to visit and where your sitemaps are located. For Wiley, we found broken robots.txt directives resulting in approximately 78 unindexed pages, all of which had valuable content.
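A healthy robots.txt is short and unambiguous. For illustration (the paths and domain here are hypothetical, not Wiley's):

```
# Apply to all crawlers: allow everything except internal search
# results, and point crawlers at the sitemap.
User-agent: *
Disallow: /search/
Sitemap: https://www.example.com/sitemap.xml
```

A "broken" file is typically one where an overly broad Disallow rule, or a syntax error, silently blocks crawlers from pages that should be indexed.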

Sitemaps. Sitemaps tell search engines which pages on your site are most important. The rule of thumb: include only pages that return a 200 status code (the response that indicates a request succeeded) and exclude pages that cannot be indexed. In its sitemap review, Rebel found some sitemaps incorrectly redirecting back to the home page, hundreds of URLs excluded from sitemaps or orphaned, and duplicate and non-indexable URLs.
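A sitemap is a simple XML file following the sitemaps.org protocol. A minimal example (the URL is a hypothetical placeholder):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Only canonical, indexable URLs that return a 200 belong here -->
  <url>
    <loc>https://www.example.com/courses/cpa-review</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
</urlset>
```

Listing a URL that redirects, 404s, or is marked non-indexable sends Google a contradictory signal, which is exactly the class of problem the audit surfaced.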

Canonical tags. Canonical tags suggest to Google which URL is the definitive version of a page and should appear in search results. This is helpful when exact or near duplicates of a page exist across a website. For Wiley, we found 500 missing canonical tags and several pointing to non-indexable URLs.
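A canonical tag is a single line in the page's head. For example (URL hypothetical):

```html
<!-- On every duplicate or near-duplicate page, point Google at the
     one URL that should be indexed -->
<link rel="canonical" href="https://www.example.com/courses/cpa-review" />
```

When this tag is missing, Google picks a canonical on its own; when it points at a page that cannot be indexed, the signal is effectively wasted.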

Framework issues. In the bigger picture, the Wiley site was built with JavaScript. Google can render JavaScript, but doing so takes additional resources, and a JavaScript-built site takes longer to load and carries the risk that its content is never actually indexed.
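The risk is easiest to see in the initial HTML a crawler receives from a fully client-rendered page. A simplified illustration (file names hypothetical):

```html
<!-- What the crawler sees before any JavaScript runs: an empty shell.
     The indexable content only exists after app.js executes. -->
<body>
  <div id="root"></div>
  <script src="/app.js"></script>
</body>
```

If Google defers or fails that rendering step, the page's real content never makes it into the index.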

The point of the Site Crawl & Indexation Audit is that most of us simply don’t know what we don’t know. The audit uncovered hundreds of “hidden” obstacles to an optimized, Google-friendly Wiley Efficient Learning site.