Category: Post a PHP snippet

basic site spider

I have just rehashed an old site spider script that I used to use for building the page index of an internal search engine. I don't use it for that anymore, but nonetheless it has proven useful in the last few days, so I thought I would post it here. As per usual it's quite pointless on its own ;) but it can be useful combined with other classes or functions.

e.g. I recently tied this into a pspell routine which spell-checks all of a given site, including dynamic and aggregated content (a very slow process, but that's another story).

Anyway, the basic class..

<?php
/*
spider class, only spiders within $root !!

TODO
variable spider depth
mime-types to ignore
static calls to external methods
multiple filters (this is slow enough already so perhaps/perhaps not)
prefilters (rules to see if it's even worth passing to filter?)
*/
class spider{
    var $start = ''; #initial page to spider
    var $root = ''; #domain root
    var $caught = array(); #pages found (no root)
    var $_curr_idx = ''; #the current page we are spidering
    var $is_filter = false; #flag if we have set a filter object
    var $filter_res = array(); #if we used an external filter the results of that call are stored in here
    var $filter_callback = null; #the filter object or function name
    var $filter_method = false; #method name for object callbacks, false for regular functions

    #named __construct because PHP 8 no longer treats a method named after the class as the constructor
    function __construct($root, $start, $crawl_now=false){
        $this->root = $root;
        $this->start = $this->_curr_idx = $start;
        $this->caught = array($start);
        if($crawl_now === true){
            $this->crawl_now();
        }
    }

    /**you can start the spidering in the constructor or here, whatever stirs your bucket*/
    function crawl_now(){
        $this->crawl($this->root.'/'.$this->start);
    }

    /**regex could probably be better but this works*/
    function get_links($str){
        $rets = array();
        #preg_quote stops characters such as the dots in $root being read as regex syntax
        $root = preg_quote($this->root, '|');
        preg_match_all("|<a href=\"".$root."\/(.*)\".*>.*\<\/a>|Uis",$str,$rets);
        return $rets[1];
    }

    function crawl( $page ){
        $cnt = file_get_contents($page);
        if($cnt){
            if($this->is_filter===true){
                $this->call_filter($cnt);
            }
            $links = $this->get_links($cnt);
            foreach($links as $l){
                if(!in_array($l,$this->caught)){
                    #record the page before recursing so it is never crawled twice
                    $this->_curr_idx = $this->caught[] = $l;
                    $this->crawl($this->root.'/'.$l);
                }
            }
        }
    }

    /*FILTERS***********/
    /**
    if you want to work on the content of spidered pages, here is a good place to
    do so since we already have the page content to hand. How you store or process
    the external object's results will vary wildly so we don't bother much here except
    to store any results in $this->filter_res

    set the callback
    */
    function set_filter(&$obj){
        $this->is_filter = true;
        $this->filter_method = false; #flag if object callback or a regular function
        if(is_array($obj)){
            $this->filter_callback = $obj[0];
            $this->filter_method = $obj[1];
            return;
        }
        $this->filter_callback = $obj;
    }

    /**
    call the callback, be it a class method or a regular function
    only called if set_filter has been called first
    */
    function call_filter(&$cnt){
        if($this->filter_method === false){
            $filter = $this->filter_callback;
            $this->filter_res[$this->_curr_idx] = $filter($cnt);
        }else{
            $filter = $this->filter_callback;
            $meth = $this->filter_method;
            $this->filter_res[$this->_curr_idx] = $filter->$meth($cnt);
        }
    }
}
?>
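
The comment on get_links admits the regex could be better. As a rough sketch of one alternative (not part of the class above), links can be pulled out with PHP's DOMDocument, which copes with single-quoted attributes and attributes placed before href. The function name dom_links is made up for illustration, and the sketch assumes the dom extension is loaded.

```php
<?php
// Hypothetical helper: collect hrefs under $root with DOMDocument instead of a regex.
function dom_links($html, $root){
    if($html === ''){
        return array(); // loadHTML refuses an empty string
    }
    $found = array();
    $doc = new DOMDocument();
    $prev = libxml_use_internal_errors(true); // real-world HTML is rarely valid, keep the warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($prev);
    $prefix = $root.'/';
    foreach($doc->getElementsByTagName('a') as $a){
        $href = $a->getAttribute('href');
        // keep only absolute links inside $root, stored without the root like $caught
        if(strncmp($href, $prefix, strlen($prefix)) === 0){
            $found[] = (string)substr($href, strlen($prefix));
        }
    }
    return array_values(array_unique($found));
}

// sample markup: one link inside the root (single-quoted, class first), one outside
$html = '<a class="nav" href=\'http://localhost/pixelpushers/about.htm\'>About</a>'
      . '<a href="http://example.com/">Elsewhere</a>';
print_r(dom_links($html, 'http://localhost/pixelpushers'));
```

Like the regex, this only keeps absolute links that start with the root; relative links would still need resolving before they could be crawled.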

Example :: crawl and print the list of pages crawled

<?php
$yaks = new spider('http://localhost/pixelpushers', 'index.htm',true);
print_r($yaks->caught);
?>

Example :: pass crawled pages to an external object method for munging

<?php
class test_filter{
    function filter($str){
        return strlen($str);
    }
}

$yaks = new spider('http://localhost/pixelpushers', 'index.htm',false);
# uses std PHP callback syntax for objects
$obj = array(new test_filter(), 'filter');
# or regular functions (here a PHP function)
//$obj = 'strlen';
$yaks->set_filter($obj);
# in this case we now have to start the crawl ourselves
$yaks->crawl_now();
# see what we got
print_r($yaks->filter_res);
?>
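
The pspell tie-in mentioned at the top is not shown in the post, so the following is only a guess at what such a filter might look like. The function names page_words and misspelt_words are invented for illustration, 'en' is an assumed dictionary, and the check is skipped when the pspell extension is not available. It is passed to set_filter as a regular function name.

```php
<?php
// Hypothetical helper: reduce a page to its list of distinct words.
function page_words($html){
    // swap tags for spaces so words in neighbouring elements do not run together
    $text = html_entity_decode(preg_replace('/<[^>]*>/', ' ', $html));
    preg_match_all("/[A-Za-z]+(?:'[A-Za-z]+)*/", $text, $m);
    return array_values(array_unique($m[0]));
}

// Hypothetical filter: returns the words on a page that pspell does not recognise.
function misspelt_words($html){
    if(!function_exists('pspell_new')){
        return array(); // pspell extension not loaded, nothing to check against
    }
    $dict = pspell_new('en'); // assumed language, change to suit the site
    if($dict === false){
        return array(); // no dictionary installed for that language
    }
    $bad = array();
    foreach(page_words($html) as $w){
        if(!pspell_check($dict, $w)){
            $bad[] = $w;
        }
    }
    return $bad;
}
```

With the spider class loaded, `$f = 'misspelt_words'; $yaks->set_filter($f);` followed by `$yaks->crawl_now();` would leave one list of suspect words per page in `$yaks->filter_res`. Script and style contents are not stripped here, so expect some noise.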

That's about it. There is much more that could be done, but I like small, compact classes where possible; I only added the filter functionality because it is the most common and (I think) most useful addition to the basic idea.
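
One of the TODO items is a variable spider depth. A minimal sketch of how a depth limit could work, kept separate from the class: it walks breadth-first and stops following links once a page sits at the maximum depth. The function crawl_to_depth, its parameters and the in-memory $site array are all hypothetical stand-ins so the idea can be run without fetching anything.

```php
<?php
// Hypothetical sketch: breadth-first walk that stops at $max_depth.
// $get_links is any callable that returns the links found on a page.
function crawl_to_depth($start, $max_depth, $get_links){
    $caught = array($start => 0); // page => depth at which it was first seen
    $queue = array($start);
    while($queue){
        $page = array_shift($queue);
        $depth = $caught[$page];
        if($depth >= $max_depth){
            continue; // deep enough, do not follow this page's links
        }
        foreach($get_links($page) as $l){
            if(!isset($caught[$l])){
                $caught[$l] = $depth + 1;
                $queue[] = $l;
            }
        }
    }
    return $caught;
}

// stand-in for real pages: each key lists the pages it links to
$site = array(
    'index.htm' => array('a.htm', 'b.htm'),
    'a.htm'     => array('c.htm', 'index.htm'),
    'b.htm'     => array(),
    'c.htm'     => array('d.htm'),
);
$links = function($page) use ($site){
    return isset($site[$page]) ? $site[$page] : array();
};
print_r(crawl_to_depth('index.htm', 1, $links));
```

Going breadth-first rather than recursing means each page is recorded at its shallowest depth, so a page reachable in one hop is never cut off just because it was first met down a longer path.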

Note that there is no validation done on paths etc., nor is security something I have even considered, since this script (due to the snail's pace at which it runs) is not really much use for production work; it is more for admin backends etc.

Any suggestions/critiques etc. welcomed.