- An HTML DOM parser written in PHP 5+ that lets you manipulate HTML very easily.
- Requires PHP 5+.
- Supports invalid HTML.
- Find tags on an HTML page with selectors just like jQuery.
- Extract contents from HTML in a single line.
I have already used this class in several of my projects and it works well. Below is some sample code:
Disclaimer: This is for educational purposes only. I will not be held responsible if you violate a website's terms and conditions by using this code on it.
Scraping Slashdot!
/** THIS ONE IS A SAMPLE FOR SCRAPING SLASHDOT **/
include("simple_html_dom.php");
// Create DOM from URL
$html = file_get_html('http://slashdot.org/');
// Find all article blocks and collect the title, intro, and details of each
$articles = array();
foreach ($html->find('div.article') as $article) {
    $item = array();
    $item['title'] = $article->find('div.title', 0)->plaintext;
    $item['intro'] = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}
print_r($articles);
Scraping Sitejabber.com and saving the data into a CSV file:
/** THIS SCRIPT SCRAPES SITEJABBER.COM AND SAVES THE DATA AS CSV ROWS IN A TEXT FILE **/
include("simple_html_dom.php");
$fp = fopen("websitelist.txt", "w");
$total = 1; // running row number across all pages
for ($page = 1; $page <= 1330; $page++) {
    $url = "http://www.sitejabber.com/reviews?page=" . $page;
    $html = file_get_html($url);
    if (!$html) {
        continue; // skip pages that could not be downloaded or parsed
    }
    $count = 0; // index of the current review on this page
    // The Nth review container is paired with the Nth match of each selector on the page
    foreach ($html->find('div.website_review_container') as $t) {
        $urladdress = trim($html->find('div.url_address', $count)->plaintext);
        $author = trim($html->find('div.author_name', $count)->plaintext);
        // Strip the "Topics:" label and any spaces from the category list
        $categories = trim(str_replace("Topics:\n", "", $html->find('div.categories', $count)->plaintext));
        $categories = str_replace(" ", "", $categories);
        $review = htmlentities(trim($html->find('div.website_review_content', $count)->plaintext));
        $date = trim($html->find('div.review_date', $count)->plaintext);
        $count++;
        print $urladdress . "\n";
        // Columns: row number, date, (empty), URL, author, categories, review
        fwrite($fp, $total . "," . $date . "," . ",\"" . $urladdress . "\"," . $author . ",\"" . $categories . "\",\"" . $review . "\"\n");
        $total++;
    }
    // Release the DOM before loading the next page to keep memory use down
    $html->clear();
    unset($html);
}
fclose($fp);
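The script above builds each CSV line by hand, so an author name or URL that contains a comma or a double quote could break the columns. A minimal sketch of an alternative, assuming you would rather let PHP's built-in fputcsv() handle the quoting: the helper name review_to_csv_line() and the sample values are made up for illustration and are not part of simple_html_dom or of the script above.

```php
<?php
// Hypothetical helper (name is an assumption): turn one row of scraped
// fields into a single CSV line, letting fputcsv() do the quoting.
function review_to_csv_line(array $fields)
{
    // Write into an in-memory stream so the line can be returned as a string.
    $buffer = fopen('php://memory', 'w+');
    // Separator, enclosure, and escape character are passed explicitly.
    fputcsv($buffer, $fields, ',', '"', '\\');
    rewind($buffer);
    $line = stream_get_contents($buffer);
    fclose($buffer);
    return $line;
}

// Made-up sample row: row number, date, URL, author, categories.
$row = array(1, '2011-01-01', 'example.com', 'Jane "JD" Doe', 'Shopping,Travel');
echo review_to_csv_line($row);
```

Fields containing commas, spaces, or quotes come back wrapped in double quotes, with embedded quotes doubled. In the scraping loop you could also call fputcsv($fp, $row, ',', '"', '\\') directly on the open file handle instead of going through a helper.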
This script was much easier to build than my earlier scrapers, where I had to painstakingly try out different regular expressions and use preg_match_all to match the data I needed. Now we just describe where the data sits in the DOM with selectors and pull it out from there. Since most websites nowadays are generated by a CMS, their pages tend to share a consistent structure, which makes this the quickest solution.
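To make the regex-versus-DOM contrast concrete, here is a small sketch that pulls the same titles out of a snippet both ways. So that it runs without any download, it uses PHP's built-in DOMDocument and DOMXPath rather than simple_html_dom, and the markup is a made-up stand-in for a real page; treat it as an illustration of the idea, not as how the scripts above work.

```php
<?php
// Made-up markup standing in for a downloaded page.
$markup = '<div class="article"><div class="title">First <b>post</b></div></div>'
        . '<div class="article"><div class="title">Second post</div></div>';

// Regex approach: the pattern has to spell out the exact tag text,
// and nested tags must be stripped from each match afterwards.
preg_match_all('#<div class="title">(.*?)</div>#s', $markup, $matches);
$regexTitles = array_map('strip_tags', $matches[1]);

// DOM approach: parse once, then address nodes by their place in the tree.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$domTitles = array();
foreach ($xpath->query('//div[@class="article"]/div[@class="title"]') as $node) {
    // textContent already ignores nested tags such as <b>.
    $domTitles[] = trim($node->textContent);
}

print_r($regexTitles);
print_r($domTitles);
```

Both arrays hold the same two titles for this snippet, but the regex would stop matching as soon as the title div gained an extra attribute or contained a nested div, while the DOM query depends only on the element structure.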
Source: http://simplehtmldom.sourceforge.net/
Download: http://sourceforge.net/project/showfiles.php?group_id=218559