Category: Regex Programming
Abnormal IMG SRC?

I've wasted too many hours trying to figure this out :(

Can I use a single pattern to extract the (correct) src attribute of image tags (even if there are multiple src attributes)?

E.g. what pattern can get the 2nd src attribute of the 1st tag and the only src attribute of the 2nd tag?


<img onError= src="http://images.play.com/SiteCSS/Play/Live2/2010032301/img/proxy/01m.gif" src="http://images.play.com/covers/10667429m.jpg" alt="Tim Burton's Alice In Wonderland" style="border-width:0px;height:178px;width:117px;" />

<IMG SRC="http://images.play.com/banners/content/Alice 6.jpg " ALT="Alice In Wonderland" />

This is the latest pattern I have tried:


define('IMG_SRC_PATTERN', '#[^onError= ]*src=[\"\']?([^"\']+)#i');

preg_match(IMG_SRC_PATTERN, $tag, $match);
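
A note on why this pattern probably returns the first src rather than the second: `[^onError= ]` is a negated character class, not a "not preceded by onError= " test. The sketch below runs the same pattern against a shortened stand-in for the first tag; the file names `proxy.gif` and `cover.jpg` are placeholders, not the original URLs.

```php
<?php
// Shortened stand-in for the first tag above; the file names are placeholders.
$tag = '<img onError= src="proxy.gif" src="cover.jpg" alt="x" />';

// [^onError= ] means "any single character that is not o, n, E, r, = or a
// space" (case-insensitively, because of the i flag). It does not skip an
// attribute that follows "onError= ".
preg_match('#[^onError= ]*src=[\"\']?([^"\']+)#i', $tag, $match);

echo $match[1], "\n"; // prints "proxy.gif", i.e. the first src
```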


EDIT:
I think I may have stumbled onto the pattern I need, but I'm not sure if it is efficient or not. Can anyone advise?


$pattern = "#(= src=['\"].+[^\"]?)?src=[\"']?([^\"']+)#i";

Using regexps for parsing markup like HTML is generally a bad idea. It's usually advisable to use a proper tag-aware HTML parser for this.
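
For reference, a minimal sketch of the parser route using PHP's bundled DOM extension. The helper name `imgSources` is made up for illustration, and how libxml treats a tag with duplicate src attributes depends on the parser, so treat this as something to test against your own pages rather than a guaranteed fix.

```php
<?php
// Hypothetical helper: collect the src attribute of every <img> in a page.
function imgSources(string $html): array
{
    // Collect parse warnings internally instead of printing them;
    // scraped markup is rarely clean.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        // trim() because some attribute values carry stray spaces
        $sources[] = trim($img->getAttribute('src'));
    }
    return $sources;
}
```

Usage would be along the lines of `imgSources($pageHtml)`, where `$pageHtml` holds the fetched markup.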

Ha :)

I started using preg_ functions, then I changed to DomDocument->loadHTML(), and now I have changed back to preg_ again.

DomDocument is slow and doesn't pick up all of the image tags when they are "abnormal".

I changed primarily due to this thread:

http://forums.devshed.com/php-development-5/screen-scraping-multiple-pages-687448.html

Your "edit" pattern looks fine. I don't even know how they manage to produce image tags like this; I didn't think duplicate src attributes were valid.
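
One thing to watch, as far as I can tell: the `.+` in the optional group is greedy and is not confined to one tag, so the pattern is only safe while $tag holds a single tag. A rough sketch of an alternative that can run over several tags at once follows. It assumes the last src inside a tag is the one you want and that the value is quoted; both are assumptions on my part, not something the markup guarantees.

```php
<?php
// The two tags from the question, one per line.
$html = <<<'HTML'
<img onError= src="http://images.play.com/SiteCSS/Play/Live2/2010032301/img/proxy/01m.gif" src="http://images.play.com/covers/10667429m.jpg" alt="Tim Burton's Alice In Wonderland" style="border-width:0px;height:178px;width:117px;" />
<IMG SRC="http://images.play.com/banners/content/Alice 6.jpg " ALT="Alice In Wonderland" />
HTML;

// [^>]* is greedy but cannot cross a ">", so it backtracks to the last
// src= inside the same tag. Group 1 is the opening quote, group 2 the value.
$pattern = '#<img\b[^>]*\bsrc\s*=\s*(["\'])(.*?)\1#is';

preg_match_all($pattern, $html, $matches);

// The second URL has a trailing space inside the quotes, so trim each value.
$sources = array_map('trim', $matches[2]);

print_r($sources);
```

This should give the covers URL for the first tag and the banners URL for the second. It would still be fooled by a literal ">" inside an attribute value.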

You can also try actually stepping through the string, as distasteful as that may be.
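
One way to read "stepping through the string" is plain string functions instead of a regex. A minimal sketch, assuming $tag holds exactly one tag, the wanted src is the last one, and its value is quoted; `lastSrc` is a made-up name.

```php
<?php
// Hypothetical helper: value of the last src attribute in a single tag,
// or null when no quoted src is found.
function lastSrc(string $tag): ?string
{
    // Case-insensitive search from the right, so SRC= and src= both match.
    $pos = strripos($tag, 'src=');
    if ($pos === false) {
        return null;
    }

    $start = $pos + strlen('src=');
    $quote = $tag[$start] ?? '';
    if ($quote !== '"' && $quote !== "'") {
        return null; // unquoted values are not handled in this sketch
    }

    // Find the matching closing quote.
    $end = strpos($tag, $quote, $start + 1);
    if ($end === false) {
        return null;
    }

    return trim(substr($tag, $start + 1, $end - $start - 1));
}
```

Note that this would also be fooled by the text "src=" appearing inside a later attribute such as alt, so it is no more robust than the regex, just easier to step through in a debugger.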

-Dan









