By definition, deep crawl describes the ability of a search engine crawler to independently search the subpages of a website for content and to index them. With some search engines, indexing is limited to a certain directory depth of the website. The deep crawl ensures that subpages are also evaluated and indexed as independent pages. For large websites, a deep crawl can take several hours.
Goal of the deep crawl
A deep crawl aims to track down deep links in order to penetrate ever deeper into the structure of a website. In this way, the crawler manages to search all subpages of a domain, evaluate them as independent pages, and then index them. Thanks to this indexing, subpages can rank independently and are displayed as separate components of a website in response to a search query.
How does a deep crawl work?
Basically, the deep crawl ensures that websites are indexed in depth. The crawler searches the various levels of a website for deep links that it can follow. Deep links are references that link deeper subpages with one another rather than pointing to the home page of a website.
For example: the encyclopedia article example.com/lexikon/crawler/, a subpage of this website, links to the subpage example.com/lexikon/algorithmus/. This reference is such a “deep link” that Google follows. The lexicon entry for “Algorithm” in turn contains further links that the crawler examines. In this way, Google moves from page to page and gains an overview of the website.
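The deep-link discovery described above can be sketched in a few lines of Python. This is a simplified illustration, not Google's actual implementation: it extracts only links that point to a subpage of the same domain (the “deep links” a crawler would follow), ignoring the home page and external links. The class name and sample HTML are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class DeepLinkParser(HTMLParser):
    """Collects internal links ("deep links") found in an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.base_host = urlparse(base_url).netloc
        self.deep_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)
        parsed = urlparse(absolute)
        # A deep link points to a subpage of the same domain,
        # not to the home page and not to another site.
        if parsed.netloc == self.base_host and parsed.path not in ("", "/"):
            self.deep_links.append(absolute)

# Hypothetical page content for illustration
html = (
    '<a href="/lexikon/algorithmus/">Algorithm</a>'
    '<a href="/">Home</a>'
    '<a href="https://other.example/page">External</a>'
)
parser = DeepLinkParser("https://example.com/lexikon/crawler/")
parser.feed(html)
print(parser.deep_links)  # only the internal subpage link remains
```

A real crawler would fetch each discovered URL in turn, repeating the extraction on every page it reaches, which is exactly how the crawl penetrates deeper and deeper into the site structure.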
The indexing of the subpages takes place in the familiar way: the source code of the page is read by Google and evaluated against the ranking factors. This is where the ranking criteria of the respective search engine come into play. The page is rated and, ideally, rewarded with a good ranking for relevant search queries.
Role of the crawl budget in the deep crawl
Thanks to the deep crawl, even the deeper subpages of complex, extensive websites have a chance of achieving a good placement in a search engine's results lists. Whether Google crawls a given subpage at all can depend on several factors. One of them is the crawl budget: this value indicates the maximum number of pages that Google searches on a domain. Websites with only a few thousand URLs usually have no problem with an insufficient crawl budget.
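The effect of a crawl budget can be illustrated with a small simulation. The sketch below (a simplification with a made-up site structure, not Google's actual algorithm) performs a breadth-first crawl of an in-memory link graph and stops as soon as the budget is spent, showing how deeper subpages can go uncrawled on a large site.

```python
from collections import deque

def crawl_with_budget(link_graph, start, budget):
    """Breadth-first crawl of a site's internal link graph that stops
    once the crawl budget (max. number of pages) is exhausted."""
    visited = []
    queue = deque([start])
    seen = {start}
    while queue and len(visited) < budget:
        page = queue.popleft()
        visited.append(page)
        for link in link_graph.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

# Hypothetical site structure: home page links to two sections,
# each section links to deeper subpages.
site = {
    "/": ["/lexikon/", "/blog/"],
    "/lexikon/": ["/lexikon/crawler/", "/lexikon/algorithmus/"],
    "/blog/": ["/blog/post-1/", "/blog/post-2/"],
}

print(crawl_with_budget(site, "/", budget=4))
# With a budget of 4, some of the deeper subpages are never reached.
```

This is also why a flat page structure helps: the fewer clicks a subpage is away from the home page, the earlier it is reached within the budget.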
The situation is different for extensive websites, where Google usually crawls only a portion of the subpages. The site operator then risks losing important traffic because subpages remain unindexed. Targeted measures can counteract this problem and prevent Google from overlooking important content during a deep crawl. These include, for example, building a flat page structure or excluding unimportant subpages from crawling.
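Excluding unimportant subpages is typically done via robots.txt. The snippet below is a hypothetical example (the disallowed paths are invented) showing how such rules can be checked with Python's standard-library `urllib.robotparser`; a crawler that respects the rules will spend the crawl budget on content pages instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that excludes unimportant subpages
# (e.g. internal search results and tag archives) from crawling.
robots_txt = """\
User-agent: *
Disallow: /suche/
Disallow: /tag/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/suche/?q=crawler"))
print(rp.can_fetch("Googlebot", "https://example.com/lexikon/crawler/"))
```

The first check returns False (the internal search page is excluded), the second True (the content page may be crawled).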
What is the difference between deep crawls and fresh crawls?
With the fresh crawl and the deep crawl, Google implements two different update cycles. The Google deep crawl examines websites intensively and follows every link in order to capture the entire content of a website as completely as possible. Crawling large websites takes several hours and is often carried out in stages over several days. It can therefore be assumed that the deep crawl does not capture new web pages immediately.
The fresh crawl, by contrast, crawls websites continuously. Google's goal is to constantly search the Internet for new content and to add relevant, current content to its index as quickly as possible. To scour as many websites as possible in a short time, pages are only superficially scanned and checked for changes. How often a web presence is visited by the crawler depends on factors such as the strength of the brand or its update rate.
New indexing structure through Google Caffeine
In order to keep its index even more up to date, Google activated a revised database and index structure in 2010. Instead of dividing websites into levels, as had been customary until then, Google has since used a network-like structure. But what are the advantages of this new infrastructure?
Since then, Google has been able to crawl and index websites and information much faster and, above all, more flexibly. With hundreds of gigabytes of data added every day, this update makes a lot of sense. In addition, only the new Caffeine structure makes it possible to capture content from social media and news platforms in near real time. Google users benefit from continuously updated results lists.