What happens when a search engine spider visits your website?
Search engine spiders, also called web crawlers, are software agents (robots) that crawl millions of
websites for specific purposes. Their main purpose is to gather information from a website in order to
understand its structure and validity. Spiders crawl through a website to discover, index and rank the
information it presents. They form the basis of the major search engines such as Google, Yahoo, Bing and
AltaVista, which use their bots to locate pages containing relevant information throughout the web. That is
why, whenever an end user enters a search query, they are almost immediately given result pages containing
relevant information. This article introduces the working process of a web crawler, i.e. what the crawling
of search engine spiders means for your website.
A search engine spider works on your website in three stages:
- It discovers and retrieves the information presented on the web pages.
- It grades all the words on each page and stores the results in a large database.
- It compares the search query with the stored results and fetches the information it considers most appropriate.
Web Crawling:
Crawling is the process by which crawlers discover the new and updated pages of a website that need to be
added to the index. A search engine spider begins crawling a website with the list of URLs gathered from
earlier crawls, and extends it with the links it finds along the way. Robots tend to reject URLs that they
find to be cheating or misleading end users, for example through hidden text, keyword stuffing, or domains
and subdomains carrying more or less identical content. When a bot locates a page, it picks up all the links
present on that page and queues them for later crawling. Through this technique, a spider can reach every
page of a website that is linked from a page it has already found. This is also how new pages are discovered,
changes to existing pages are picked up, and dead links are noted. Keep a constant check on your URLs and
eliminate duplicates immediately, so that crawlers do not keep visiting the same content again and again.
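The queue-and-follow behaviour described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the PAGES dictionary is a hypothetical in-memory website standing in for real HTTP requests, and the names LinkExtractor and crawl are made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

# Hypothetical in-memory "website": URL -> HTML. A real crawler would fetch over HTTP.
PAGES = {
    "http://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "http://example.com/about": '<a href="/">Home</a> <a href="/about#team">Team</a>',
    "http://example.com/blog": '<a href="/missing">Old post</a> <a href="/about">About</a>',
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, pages):
    """Breadth-first crawl; returns (crawled URLs in order, dead links)."""
    frontier = deque(seed_urls)   # URLs queued for later crawling
    seen = set(seed_urls)         # guards against queuing duplicates
    crawled, dead = [], []
    while frontier:
        url = frontier.popleft()
        html = pages.get(url)
        if html is None:          # page could not be retrieved: note the dead link
            dead.append(url)
            continue
        crawled.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments so the same page
            # is not queued twice under different spellings.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return crawled, dead


crawled, dead = crawl(["http://example.com/"], PAGES)
print("crawled:", crawled)
print("dead links:", dead)
```

Starting from a single seed URL, the sketch visits each page once, even though /about is linked from several places, and it reports /missing as a dead link.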
Indexing:
Bots store the text of the pages they find in a large index database. The index is sorted alphabetically, and
each entry stores the list of documents containing a specific term (keyword), along with the location of that
term in each document and how often it appears on the page in comparison to the other words. In addition,
bots process the keywords used in key content tags and attributes, such as title tags and ALT attributes.
Bots generally have difficulty processing content embedded in rich and dynamic media files. To speed up and
improve the search results, bots tend to ignore common words known as stop words (e.g. is, on, the, an, a, why).
Bots also do not index punctuation or multiple spaces.
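As a rough illustration of such an index, the sketch below builds a small inverted index that maps each term to the documents containing it and to the word positions where it occurs. The stop-word list, the sample documents and the function names are assumptions made for this example; real search engines use far larger lists and more elaborate tokenization.

```python
import re
from collections import defaultdict

# Illustrative stop-word list; real engines use their own, larger lists.
STOP_WORDS = {"is", "on", "the", "an", "a", "why"}


def tokenize(text):
    """Lowercase the text and keep only runs of letters and digits,
    which also discards punctuation and repeated spaces."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to {doc_id: [word positions]}, skipping stop words."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            if word in STOP_WORDS:
                continue
            index[word].setdefault(doc_id, []).append(position)
    return index


# Hypothetical sample pages.
documents = {
    "page1": "The spider crawls the web.   The spider follows links!",
    "page2": "A web index is stored on a server.",
}
index = build_index(documents)

for term in sorted(index):  # list the terms in alphabetical order
    print(term, index[term])
```

The length of each position list gives the number of times the term appears in that document, so both the location and the frequency of a keyword can be read from the same entry.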
Relevancy and Page Ranking:
Whenever an end user enters a search query, the search engine looks through its huge index database to return
the result pages it believes are the most appropriate and relevant. Several factors are considered in
determining relevancy, of which the PageRank of a given page is one. Page ranking is computed on the basis of
various important factors, such as links from other sites, the popularity of the page, the position of the
search terms on the page and how closely together the search terms are placed. To keep the page ranking
system authentic and relevant, bots keep a close check on spam links and the many other tactics employed by
spammers.
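To give a feel for how links from other pages can feed into a ranking score, here is a minimal sketch of the textbook PageRank iteration. It is a simplification covering only the link factor, not the formula any search engine actually uses today; the damping value of 0.85, the iteration count and the four-page link graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to;
    every link target must itself be a key of `links`."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page passes its score on, split evenly among its links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page link graph.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{page}: {score:.3f}")
```

In this graph "home" receives links from every other page and ends up with the highest score, while "orphan", which no page links to, ends up with the lowest.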
The techniques mentioned above not only improve the quality and performance of websites but also help
users to search more efficiently. In other words, web crawling is what lets a search engine provide users
with relevant answers to their queries and determine the order in which the results should be
presented.