Screen Scraping made too easy

Recently I had some free time and decided to automate some of my common tasks. And let me tell you honestly, I hate screen scraping. It's an annoying, tedious task: writing a regex for this and a regex for that, only to find out my hours were wasted because that regex won't work on another site.
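To make the regex pain concrete, here's a tiny made-up sketch. The two snippets and the pattern are hypothetical, not taken from any real site; they just show how a regex written against one page's markup can fall over on another's.

```php
<?php
// Hypothetical: the "same" link as two different sites might render it.
$siteA = '<a href="/match/1">16-10</a>';
$siteB = "<a class='score' href='/match/1'>16-10</a>";

// A regex written against site A's exact markup.
$pattern = '/<a href="([^"]+)">([^<]+)<\/a>/';

// preg_match() returns 1 on a match and 0 on no match.
$matchesA = preg_match($pattern, $siteA, $a);
$matchesB = preg_match($pattern, $siteB, $b); // extra attribute + single quotes

var_dump($matchesA, $matchesB);
```

Site B only added a class and switched quote styles, and the pattern already misses it.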

That's a thing of the past.

I'm super excited about this find. Maybe I'm the last to discover it, but it's just too awesome to pass up.

The project is called: PHP Simple HTML DOM Parser.

This takes almost all of the black magic out of screen scraping. Here's an example from a quick-and-dirty script I use to log in and grab my stats from ESEA.

<?php

require_once('simple_html_dom.php'); // get our class

define('COOKIE_JAR','./cookie/cookie'); // cookie jar for cURL
/**
* We need to set up our referrer and user agent, for cURL
**/
$referrer = "http://www.esportsea.com/";
$agent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4";
/**
* str = our direct link to our user page
**/
$str = "http://www.esportsea.com/users/<your user id>";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$str);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_REFERER, $referrer);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch, CURLOPT_COOKIEFILE, COOKIE_JAR);
curl_setopt($ch, CURLOPT_COOKIEJAR, COOKIE_JAR);
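curl_setopt($ch, CURLOPT_TIMEOUT, 10); // timeout in seconds
$result = curl_exec($ch);
if ($result === false) {
	// bail out early instead of handing false to the parser
	die('cURL error: ' . curl_error($ch));
}
curl_close($ch);
// done with cURL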
$html = new simple_html_dom(); // create our HTML object

$html->load($result);

$pug_stats = $html->find('#body-matches-pug table tr'); // load up pug stats

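// this loops through each <tr>
$output = array();
foreach($pug_stats as $stat)
{
	$img   = $stat->find('img',0); // null when the row has no <img>
	$links = $stat->find('a');     // array of every <a> in the row
	// skip rows that lack the image or both links
	if (!$img || count($links) < 2) {
		continue;
	}
	$row = array(); // start fresh for each row
	$row['game'] 	= trim($img->title);
	$row['link'] 	= trim($links[0]->href);
	$row['score']	= trim($links[0]->plaintext);
	$row['srv'] 	= trim($links[1]->href);
	$row['srv_txt'] = trim($links[1]->plaintext);
	$output[] = $row;
}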
echo "<pre>",print_r($output,true),"</pre>";
exit;

?>

And that's it.

Now take a look at that and realize how much I'm not forced to do: no regex, no hand-rolled string slicing, just CSS-style selectors.

I know this isn't some great new invention, loading the source into a DOM object and parsing it, but man, this almost eliminates the need to think about screen scraping entirely.
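For comparison, the same load-it-into-a-DOM idea can be sketched with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. This is only a rough illustration: the markup, the `stats` id, and the field names are made up for the example, and it uses XPath queries rather than the CSS-style selectors shown earlier.

```php
<?php
// Hypothetical markup standing in for a real stats page.
$source = '<div id="stats"><table>'
        . '<tr><td><img title="cs" src="cs.png"></td>'
        . '<td><a href="/match/1">16-10</a></td></tr>'
        . '<tr><td><img title="tf2" src="tf2.png"></td>'
        . '<td><a href="/match/2">3-1</a></td></tr>'
        . '</table></div>';

// Real-world HTML is rarely valid, so collect parser warnings quietly.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = array();
foreach ($xpath->query('//div[@id="stats"]//tr') as $tr) {
    // Query relative to the current row; item(0) is null if nothing matched.
    $img  = $xpath->query('.//img', $tr)->item(0);
    $link = $xpath->query('.//a', $tr)->item(0);
    if (!$img || !$link) {
        continue; // skip rows missing either element
    }
    $rows[] = array(
        'game'  => $img->getAttribute('title'),
        'link'  => $link->getAttribute('href'),
        'score' => trim($link->textContent),
    );
}

print_r($rows);
```

It works, but the selector-style `find()` calls are a lot less typing, which is really the whole appeal.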

Enjoy.


Monday, December 22nd, 2008 PHP, Web Development

1 Comment to Screen Scraping made too easy

  hawt is appropriate here? If so then I say to thee: hawt.

  Peterson on December 22nd, 2008
