
Tuesday, November 30, 2010

PHP: How to Scrape Websites using Simple HTML DOM Parser

Web scraping is a method of collecting data from websites using a program to automate the process. Wikipedia defines it as "a computer software technique of extracting information from websites." Whatever the purpose, writing web-scraping scripts in PHP is much easier with the PHP Simple HTML DOM Parser, a PHP class for parsing HTML content. From the PHP Simple HTML DOM Parser website:

  • A HTML DOM parser written in PHP5+ let you manipulate HTML in a very easy way!

  • Require PHP 5+.

  • Supports invalid HTML.

  • Find tags on an HTML page with selectors just like jQuery.

  • Extract contents from HTML in a single line.
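
As a rough illustration of the last two points, the sketch below parses an inline HTML string with str_get_html() and pulls out link text and URLs. The markup and the ul#links selector are made up for the example, and the script assumes simple_html_dom.php sits in the same directory.

```php
<?php
// Assumes simple_html_dom.php has been downloaded next to this script.
include 'simple_html_dom.php';

// Hypothetical markup, made up for this example.
$markup = '<ul id="links">'
        . '<li><a href="/a">First</a></li>'
        . '<li><a href="/b">Second</a></li>'
        . '</ul>';

$html = str_get_html($markup);

// jQuery-style selector: every <a> inside the list with id "links".
$links = array();
foreach ($html->find('ul#links a') as $a) {
    $links[] = array('text' => $a->plaintext, 'href' => $a->href);
}

// Single-line extraction: the second argument picks one match by index.
$first = $html->find('ul#links a', 0)->plaintext;

print_r($links);
echo $first, "\n";

// Free the parser's memory once done.
$html->clear();
```

Passing an index as the second argument to find() returns a single element instead of an array, which is what makes the one-line form possible.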



I have used this class in several of my projects and it works well. Below is some sample code:

Disclaimer: This is for educational purposes only. I will not be held responsible if you violate the terms and conditions of other websites by running these scripts against them.

Scraping Slashdot!
/** THIS ONE IS A SAMPLE FOR SCRAPING SLASHDOT **/
include("simple_html_dom.php");

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks and collect the text of each part
$articles = array();
foreach($html->find('div.article') as $article) {
    $item = array();
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);
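
One thing to watch for in a loop like the one above: find() with an index returns null when nothing matches, so reading ->plaintext from it fails quietly or raises a notice. A minimal sketch of a guard follows. The first_text() helper and the sample markup are hypothetical and not part of the library.

```php
<?php
// Assumes simple_html_dom.php has been downloaded next to this script.
include 'simple_html_dom.php';

// Hypothetical helper: trimmed text of the first match, or '' if nothing matches.
function first_text($node, $selector)
{
    $match = $node->find($selector, 0);
    return $match ? trim($match->plaintext) : '';
}

// Made-up markup: the second article has no intro block.
$html = str_get_html(
    '<div class="article"><div class="title">One</div><div class="intro">Hello</div></div>'
  . '<div class="article"><div class="title">Two</div></div>'
);

$articles = array();
foreach ($html->find('div.article') as $article) {
    $articles[] = array(
        'title' => first_text($article, 'div.title'),
        'intro' => first_text($article, 'div.intro'),
    );
}

print_r($articles);
```

With this in place, a page whose layout has changed yields empty strings rather than errors, which makes it easier to spot which selector stopped matching.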


Scraping Sitejabber.com and saving the data as CSV rows in a text file:
/** THIS SCRIPT SCRAPES SITEJABBER.COM AND SAVES THE DATA INTO A TEXT FILE **/
include("simple_html_dom.php");
$fp = fopen("websitelist.txt", "w");
$total = 1; // running row number across all pages
for ($page = 1; $page <= 1330; $page++) {
    $url = "http://www.sitejabber.com/reviews?page=" . $page;
    $html = file_get_html($url);
    if (!$html) {
        continue; // skip pages that could not be loaded
    }
    // $count is the index of the current review on the page
    $count = 0;
    foreach ($html->find('div.website_review_container') as $t) {
        $urladdress = trim($html->find('div.url_address', $count)->plaintext);
        $author     = trim($html->find('div.author_name', $count)->plaintext);
        // Drop the "Topics:" label, then remove spaces from the category list
        $categories = trim(str_replace("Topics:", "", $html->find('div.categories', $count)->plaintext));
        $categories = str_replace(" ", "", $categories);
        $review     = htmlentities(trim($html->find('div.website_review_content', $count)->plaintext));
        $date       = trim($html->find('div.review_date', $count)->plaintext);
        $count++;
        print $urladdress . "\n";
        // fputcsv quotes any field that contains commas or quotes
        fputcsv($fp, array($total, $date, $urladdress, $author, $categories, $review));
        $total++;
    }
    // Free the parser's memory before loading the next page
    $html->clear();
    unset($html);
}
fclose($fp);
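
Scraped text often contains commas and quotes, which break a CSV row if the fields are simply joined with commas. PHP's built-in fputcsv() handles the quoting. The sketch below uses a made-up row and an in-memory stream so it can be run without touching any file.

```php
<?php
// Made-up row: the date and the review both contain commas, the review also has quotes.
$row = array(1, 'Nov 30, 2010', 'example.com', 'Jane', 'Shopping', 'Great, "fast" service');

// Write to an in-memory stream so the example leaves no file behind.
$fp = fopen('php://memory', 'w+');
fputcsv($fp, $row);

// Read the raw line back to see the quoting fputcsv applied.
rewind($fp);
$line = stream_get_contents($fp);

// Parse it again to confirm the fields survive a round trip.
rewind($fp);
$parsed = fgetcsv($fp);
fclose($fp);

echo $line;
print_r($parsed);
```

Fields that need it are wrapped in double quotes and embedded quotes are doubled, so a spreadsheet or fgetcsv() reads back the same six fields.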


These scripts are much easier to build than before, when I had to painstakingly try out different regular expressions and use preg_match_all to match the data I needed to scrape. Now we just describe where the data sits in the DOM structure and scrape it from there. Since most websites nowadays use a CMS, and so tend to produce consistently structured pages, this is the quickest solution.
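
For contrast, here is roughly what the older regular-expression approach looks like. This is a sketch on made-up markup, not code from any of my projects.

```php
<?php
// Made-up markup for the comparison.
$markup = '<div class="title">First post</div>'
        . '<div class="intro">Some text</div>'
        . '<div class="title">Second post</div>';

// Regex approach: capture whatever sits between the opening and closing tags.
// The pattern stops matching as soon as the div gains another attribute,
// uses different quoting, or contains a nested div.
preg_match_all('/<div class="title">(.*?)<\/div>/s', $markup, $matches);
$titles = $matches[1];

print_r($titles);

// With Simple HTML DOM the same extraction would be a selector call such as
// $html->find('div.title'), reading ->plaintext from each result.
```

The pattern has to mirror the markup character for character, whereas a selector only names the element and class, which is why the parser version tends to survive small changes to the page.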

Source: http://simplehtmldom.sourceforge.net/
Download: http://sourceforge.net/project/showfiles.php?group_id=218559
