An Internet spider is a program designed to “crawl” over the World Wide Web, the portion of the Internet most familiar to general users, and retrieve locations of Web pages. It is sometimes referred to as a web crawler. Many search engines use web crawlers to obtain links, which are filed away in an index. When a user asks for information on a particular subject, the search engine pulls up pages retrieved by the Internet spider. Without spiders, the vast richness of the Web would be all but inaccessible to most users, rather as the Library of Congress would be if the books were not organized. Some search engines are human-based, meaning that they rely on humans to submit links and other information, which the search engine categorizes, catalogues, and indexes.
Most search engines today use a combination of human and crawler input. Crawler-based engines send out spiders, computer programs that have sometimes been likened to viruses because they seem to move from one area of cyberspace to another. In practice a spider only requests pages from Web servers and follows the links it finds; it does not install itself on the sites it visits.
Spiders visit Web sites, record the information there, read the meta tags that identify a site according to subjects, and follow the site’s links to other pages. Because of the many links between pages, a spider can start at almost any point on the Web and keep moving. Eventually it returns the data gathered on its journey to the search engine’s central depository of information, where it is organized and stored. Periodically the crawler will revisit the sites to check for changed information, but until it does so, the material in the search engine’s index remains the same.
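The visit, record, and follow cycle described above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular search engine works: the names `PageParser`, `crawl`, and `PAGES` are invented for the example, the "Web" is a small in-memory dictionary so that nothing is actually downloaded, and the `example.test` addresses are placeholders.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects outgoing links and the keywords meta tag from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.keywords = ""

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            self.keywords = attrs.get("content") or ""


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML, or None for a dead page."""
    index = {}                 # url -> keywords recorded for that page
    queue = deque([start_url])
    seen = {start_url}         # avoids visiting the same page twice
    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:       # page no longer exists: skip it
            continue
        parser = PageParser()
        parser.feed(html)
        index[url] = parser.keywords
        for href in parser.links:
            target = urljoin(url, href)   # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index


# Hypothetical two-page "Web" with one link that leads nowhere.
PAGES = {
    "http://example.test/": (
        '<meta name="keywords" content="home">'
        '<a href="/a">A</a><a href="/gone">Gone</a>'
    ),
    "http://example.test/a": (
        '<meta name="keywords" content="spiders"><a href="/">Home</a>'
    ),
}

index = crawl("http://example.test/", PAGES.get)
```

Because the two pages link to each other, the `seen` set is what keeps the spider from circling forever, and the link to `/gone` is simply skipped when the fetch returns nothing. A real crawler would replace `PAGES.get` with a function that downloads each page and would revisit stored addresses on a schedule.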
This is why a search may at any time turn up “dead” Web pages, ones that can no longer be found. No two search engines are exactly alike, in part because each chooses a different algorithm for searching its index.
Algorithms can be adjusted to weigh how often certain keywords appear, and even to counter attempts at keyword stuffing, or “spamdexing,” the insertion of irrelevant search terms intended simply to draw traffic to a site.
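As a rough illustration of keyword-frequency scoring with a guard against stuffing, the sketch below scores a page by the share of its words that match a search term, and discards the score when that share is implausibly high. The function name `keyword_score` and the `max_density` cutoff of 0.2 are assumptions made for this example; real ranking algorithms are far more elaborate and their details are generally not public.

```python
import re
from collections import Counter


def keyword_score(text, keyword, max_density=0.2):
    """Share of words in text equal to keyword, or 0.0 if the page looks stuffed.

    max_density is an arbitrary illustrative cutoff, not a known industry value.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    density = Counter(words)[keyword.lower()] / len(words)
    # A term that makes up too much of the page is treated as spamdexing.
    return 0.0 if density > max_density else density


honest = "Spiders crawl the Web and follow links between pages"
stuffed = "spiders spiders spiders cheap spiders buy spiders"

honest_score = keyword_score(honest, "spiders")    # 1 match in 9 words
stuffed_score = keyword_score(stuffed, "spiders")  # 5 matches in 7 words
```

Here the honest sentence keeps its modest score, while the stuffed one is scored zero because the term makes up well over a fifth of its words.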