When Google looks over a website, this is known as crawling: the process in which a search engine’s bots work their way through a site, following and recording every link they can find.
As such, understanding how this process works and using it to your advantage is a crucial skill worth learning. Better still, because the factors that shape how your site is crawled sit within the site itself, this can all be handled on-page with in-house website optimisation.
Let’s look at five important ways to work with your crawl budget.
Control Your Robots.txt File
If a robots.txt file is present, crawl bots will use it to determine which pages they can and cannot look at. Because every site has a limited crawl budget (roughly, how many of its pages Google will crawl in a given period), it is worth disallowing pages that are not relevant. You can see how often your website is being crawled in the Crawl Stats report in Google Search Console.
When it comes to optimising your robots.txt file, it is simply a case of knowing which pages should not be crawled. In short, any page that nobody will search for (such as internal search result pages, back-office URLs and other obscure pages) can be disallowed here.
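As a minimal sketch, assuming internal search results live under /search/ and back-office pages under /admin/ (your own URL structure will differ), the relevant robots.txt rules might look like this:

# Applies to all crawlers
User-agent: *
# Keep internal search results and back-office pages out of the crawl
Disallow: /search/
Disallow: /admin/
# Point crawlers at the sitemap discussed in the next section
Sitemap: https://www.example.com/sitemap.xml

The file simply sits at the root of your domain (example.com/robots.txt), where crawlers know to look for it.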
Create An XML Sitemap
As any SEO company will tell you, a robots.txt file is essential in today’s world. Many people, however, still overlook the importance of an XML sitemap.
In addition to simply listing all of the available URLs on a given site, a sitemap can offer additional metadata on each one. By flagging age appropriateness, highlighting video or image content and even supplying details about that on-page media, you help Google store the pages it crawls more accurately, improving your search results.
By and large, XML sitemaps are often seen as necessary only for larger websites. However, there’s no harm in having one, especially if you feel your website has a greater than average number of pages.
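As an illustrative sketch (the URLs and dates here are placeholders), a small sitemap entry with image metadata looks something like this:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <!-- One <url> block per page; lastmod helps crawlers prioritise fresh content -->
  <url>
    <loc>https://www.example.com/services/</loc>
    <lastmod>2021-06-01</lastmod>
    <!-- Image metadata describes on-page media the crawler might otherwise miss -->
    <image:image>
      <image:loc>https://www.example.com/images/services-hero.jpg</image:loc>
    </image:image>
  </url>
</urlset>

Reference the finished file from your robots.txt (as above) or submit it directly in Google Search Console.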
Disallow Incoming Links To Unwanted Pages
Even if you’ve set up your robots.txt file, unwanted pages may still end up in search results if other sites link to them, because those external links nonetheless draw search engines to the page. As such, if unwanted pages keep coming up, you need to ensure Google understands that they aren’t valuable.
The easiest way to do this is with noindex directives. Place a robots meta tag in the head of each respective page, or send the equivalent X-Robots-Tag HTTP response header, to ensure the page isn’t kept in the index (note that the page must not also be blocked in robots.txt, or the crawler will never see the directive). Of course, if such pages are ranking, you can use canonical tags to pass the link equity to a preferred page instead.
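For illustration (the URL is a placeholder), the noindex and canonical options look like this:

<!-- In the <head> of the unwanted page: ask search engines to drop it from the index -->
<meta name="robots" content="noindex">

<!-- The equivalent HTTP response header, useful for PDFs and other non-HTML files -->
X-Robots-Tag: noindex

<!-- Or, if the page is ranking, point its signals at a preferred page instead -->
<link rel="canonical" href="https://www.example.com/preferred-page/">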
Be Careful With JavaScript
JavaScript is a popular option for many websites, as it allows for more dynamic elements. However, despite great efforts on the part of search engines, such elements aren’t always readable by crawl bots. If your entire page relies on JavaScript to render its content, important information might not be picked up by Google. Rendering has improved considerably, but it is still something to be cautious about.
There are a few ways around this. One is to ensure that your key content still sits within the basic HTML (which Google can always read), as this is where it most easily picks up the keywords and copy that determine your rankings.
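A minimal sketch of this HTML-first approach (the page content and script name are placeholders):

<!-- The key copy sits in plain HTML, so crawlers can read it even before any script runs -->
<article>
  <h1>Handmade Oak Dining Table</h1>
  <p>Solid oak, seats six, delivered fully assembled across the UK.</p>
</article>

<!-- JavaScript then enhances the page (galleries, reviews) without replacing the core content -->
<script src="/js/enhancements.js" defer></script>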
Alternatively, you can use pre-rendering tools, which convert the JavaScript elements into static, readable HTML before the user (or crawl bot) arrives on the page. This, again, isn’t guaranteed to catch everything. JavaScript requires a little experimentation, but it is nonetheless important to get right.
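As a rough sketch of how this is often wired up (the host names are placeholders and the exact setup depends on the pre-rendering tool you choose), an nginx configuration might route crawlers to the pre-rendered version like so:

location / {
    set $prerender 0;

    # Flag requests from known crawlers
    if ($http_user_agent ~* "googlebot|bingbot|yandex") {
        set $prerender 1;
    }

    # Crawlers receive a static HTML snapshot from the pre-render service
    if ($prerender = 1) {
        rewrite .* /$scheme://$host$request_uri? break;
        proxy_pass http://prerender-service.internal:3000;
    }

    # Everyone else gets the normal JavaScript-driven application
    if ($prerender = 0) {
        proxy_pass http://app_backend;
    }
}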
Manually Submit Pages
Finally, when all else fails, you can always submit pages manually. Google Search Console’s URL submission feature is easy to use and, while it doesn’t guarantee your page will be crawled or indexed, it certainly increases the chances.
You can use this when your landing pages have recently been updated, for instance, and you want to ensure the changes are noticed. Some websites get crawled daily; others do not. For the latter, this tool is a quick way to get an important page recrawled after a significant change.