Does anyone know how an Internet spider works? I understand the whole concept that the spider goes out and "crawls" internet sites, looking for meta tags, HTML code, and things of that nature. I know that it follows links from one page to another, depending on how deep the spider software is told to search. My question is: how does it know where to search? How does it actually find the pages?
Since URLs are just convenient handles that resolve to numeric IP addresses, I would guess the spider starts with a valid IP address and incrementally steps through the sequence of possible addresses until it finds a page, then parses the text for more URLs to follow. When it's seen all it can see, it starts again at the next IP address in the sequence.
That sounds efficient on the surface, Pressly, but it isn't. What if no machine sits at that IP? What if (like my home computer) it sits behind a firewall that won't respond to an unapproved request? Then your process hangs there waiting for a response until it times out.
Also, many firewall systems will treat that as a hack attempt.
Not to mention that the majority of IPs do NOT have an HTTP server.
All this adds up to your server wasting a lot of resources doing nothing.
KC,
As well as following a site's internal links, spiders harvest external links and follow those too. Once it gets started, a spider could theoretically crawl forever, hopping from site to site.
Of course, you'll have to give it a list of URLs to start with; try to pick link-rich sites.
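To make that concrete, here's a minimal sketch of that seed-and-follow idea in Python, using only the standard library. The seed URL, the depth limit, and names like LinkExtractor and crawl are just examples I made up for illustration; this isn't how any particular search engine does it.

    # Minimal breadth-first crawler sketch (Python standard library only).
    # Seed URLs and depth limit are arbitrary examples.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen


    class LinkExtractor(HTMLParser):
        """Collects href values from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)


    def crawl(seeds, max_depth=2):
        visited = set()
        queue = deque((url, 0) for url in seeds)    # frontier of (url, depth)
        while queue:
            url, depth = queue.popleft()
            if url in visited or depth > max_depth:
                continue
            visited.add(url)
            try:
                with urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except Exception:
                continue                            # dead link, timeout, firewall, etc.
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                absolute = urljoin(url, href)       # resolve relative links
                if urlparse(absolute).scheme in ("http", "https"):
                    queue.append((absolute, depth + 1))
            print(f"crawled {url} (depth {depth}), found {len(parser.links)} links")
        return visited


    if __name__ == "__main__":
        # Start from a link-rich seed; swap in your own list.
        crawl(["http://www.robotstxt.org/"], max_depth=1)

The visited set is what keeps it from crawling the same page twice, and the depth counter is the "how deep to search" knob KC mentioned.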
Try this site; it has some good info on spiders:
http://www.robotstxt.org/wc/robots.html
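That site covers the robots.txt convention, which well-behaved spiders check before fetching a page. Here's a quick sketch using Python's standard urllib.robotparser; the "MySpider" user-agent string is just a placeholder.

    # Check whether a given user-agent is allowed to fetch a page,
    # per that site's own robots.txt. "MySpider" is a placeholder name.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("http://www.robotstxt.org/robots.txt")
    rp.read()
    print(rp.can_fetch("MySpider", "http://www.robotstxt.org/robotstxt.html"))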