Extracting data from a business directory listing website

Post by **intern3t** » 11 Aug 2013, 10:43

Good day, I just finished my business directory listing website. Is there a way I could automatically extract data from any business directory website and populate my website with it? I tried the HTTrack copier but had no luck. Any suggestions?

Post by **maboroshi** » 11 Aug 2013, 11:52

Hey intern3t good work!

You would need a script to do this. I would suggest using PHP to implement your screen scraper. Lilrofl pointed this out as probably the best approach to developing a screen scraper. PHP has a really good library for this and is relatively simple and fast. Maybe he can comment on what the library is.

I need something similar, to be honest. So if I write something I will pass it along your way so you can modify it. If you write something, maybe you can give me your example as well.

*cheers mabs

Post by **lilrofl** » 12 Aug 2013, 12:22

Screen scraping is more of an art than a science I'm afraid, so I kind of feel like I'm talking in a lot of generalities when discussing it. I used to write scrapers in Python, but I've found that PHP with cURL is more robust and easier to develop in.

So the general outline of a screen scraper is: connect to the target site, download the target information, parse that information, and store it in a format that is useful to you.

Connecting to a site in PHP is very easy using fopen() and fgets() or file().

fopen() and fgets() work together: one makes the connection and the other downloads the web text to your machine. file() is slightly more useful because it returns the captured web text as an array, one element per line. LIB_http.php has its own get functions that allow you to parse off headers during downloading... but I might be getting far afield here.
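As a rough illustration of the two approaches just described, here is a minimal sketch. The function names `fetch_with_fgets` and `fetch_with_file` are made up for this example, and reading a URL this way assumes `allow_url_fopen` is enabled in your PHP configuration.

```php
<?php
// Sketch: two ways to read a page with PHP's stream functions.
// $target may be a URL (needs allow_url_fopen=On) or a local path.

// fopen() opens the connection; fgets() pulls the text down line by line.
function fetch_with_fgets(string $target): string
{
    $handle = fopen($target, 'r');
    if ($handle === false) {
        throw new RuntimeException("Could not open $target");
    }
    $text = '';
    while (($line = fgets($handle)) !== false) {
        $text .= $line;
    }
    fclose($handle);
    return $text;
}

// file() does both steps and returns an array, one element per line.
function fetch_with_file(string $target): array
{
    $lines = file($target, FILE_IGNORE_NEW_LINES);
    if ($lines === false) {
        throw new RuntimeException("Could not read $target");
    }
    return $lines;
}
```

Calling `fetch_with_file('http://example.com/')` (a placeholder address) would give you the page as an array of lines that you can loop over.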

There are a few custom libraries (as Mabs mentioned) that make parsing data with PHP pretty straightforward, and have the added bonus of keeping you as far away from regular expressions as possible. They were put together by a guy named Michael Schrenk for a book, "Webbots, Spiders, and Screen Scrapers". I highly recommend that book BTW, it's probably the most comprehensive resource I've found on screen scraping... either way you'll be able to download his custom libraries, LIB_http.php and LIB_parse.php in particular, from the link below:

http://webbotsspidersscreenscrapers.com/DSP_download.php
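To give a feel for regex-free parsing, here is a small sketch built only on `strpos()` and `substr()`. It is not the code from LIB_parse.php; `text_between` and `all_between` are hypothetical helpers written for this example, and the markup in `$page` is made up.

```php
<?php
// Hypothetical helper (not from LIB_parse.php): returns the text between
// the first $start marker and the next $stop marker, or null if not found.
function text_between(string $html, string $start, string $stop): ?string
{
    $from = strpos($html, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = strpos($html, $stop, $from);
    if ($to === false) {
        return null;
    }
    return substr($html, $from, $to - $from);
}

// Collects every block delimited by $start and $stop, e.g. one per listing.
function all_between(string $html, string $start, string $stop): array
{
    $found = [];
    if ($start === '' || $stop === '') {
        return $found; // empty markers would never advance the search
    }
    $offset = 0;
    while (($from = strpos($html, $start, $offset)) !== false) {
        $from += strlen($start);
        $to = strpos($html, $stop, $from);
        if ($to === false) {
            break;
        }
        $found[] = substr($html, $from, $to - $from);
        $offset = $to + strlen($stop);
    }
    return $found;
}

// Made-up markup standing in for a downloaded directory page.
$page = '<li class="biz"><b>Acme Ltd</b><span>555-0100</span></li>'
      . '<li class="biz"><b>Bolt Co</b><span>555-0199</span></li>';

$listings = [];
foreach (all_between($page, '<li class="biz">', '</li>') as $block) {
    $listings[] = [
        'name'  => text_between($block, '<b>', '</b>'),
        'phone' => text_between($block, '<span>', '</span>'),
    ];
}
```

The real markers depend entirely on the target site's HTML, so expect to adjust them per site.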

Once you've collected and parsed your data you'll upload it to wherever it's going using cURL, and that's when things can get quite vague.

Your bot should upload data using the same method as a browser would. You may need to implement cookies, authentication, or encryption... without really looking at the target download and target upload sites there's very little in the way of specifics I can offer; however, there are a few things to consider:
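A minimal sketch of a browser-style form upload with PHP's cURL extension might look like the following. The function names, the upload URL, the field names and the cookie file path are all assumptions for illustration; the real values come from the form on your own site.

```php
<?php
// Encodes one listing the way a browser encodes a submitted form.
function build_post_body(array $listing): string
{
    return http_build_query($listing);
}

// Sketch of a form POST with cURL. $uploadUrl and the field names in
// $listing are assumptions; copy the real ones from your own site's form.
function upload_listing(string $uploadUrl, array $listing, string $cookieFile): string
{
    $ch = curl_init($uploadUrl);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => build_post_body($listing),
        CURLOPT_RETURNTRANSFER => true,        // return the response instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects after submitting
        CURLOPT_COOKIEJAR      => $cookieFile, // save cookies the site sets
        CURLOPT_COOKIEFILE     => $cookieFile, // send saved cookies on later requests
    ]);
    $response = curl_exec($ch);
    if ($response === false) {
        throw new RuntimeException('Upload failed: ' . curl_error($ch));
    }
    return $response;
}
```

If the upload site needs a login, the same cookie file can be reused for a login POST first, so the session cookie is already saved when the listings go up.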

Add routines to your bot to slow it down. Too many fetches too fast can look like, or actually create, a DoS on your target host... which is precisely how I got banned from Craigslist lol. Change your bot's agent ID so you don't go advertising that you are a bot. In many cases the web pages shown to a bot are organized differently than the ones shown to you, which can jack up your parses if you don't plan for it or avoid the situation entirely (easier).
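Both ideas, slowing down and setting the agent string, can be sketched in a few lines. The function names, the delay range and the agent string below are example values, not recommendations from the thread.

```php
<?php
// Picks a random pause, in seconds, so requests are not evenly spaced.
function polite_delay(int $minSeconds = 2, int $maxSeconds = 6): int
{
    return random_int($minSeconds, $maxSeconds);
}

// Builds a stream context that sends the given User-Agent header.
// The default agent string is only an example value.
function browser_context(string $agent = 'Mozilla/5.0 (X11; Linux x86_64)')
{
    return stream_context_create(['http' => ['user_agent' => $agent]]);
}

// Fetches each URL in turn, pausing before every request after the first.
function fetch_slowly(array $urls): array
{
    $pages = [];
    $context = browser_context();
    $first = true;
    foreach ($urls as $url) {
        if (!$first) {
            sleep(polite_delay());
        }
        $first = false;
        $pages[$url] = file_get_contents($url, false, $context);
    }
    return $pages;
}
```

How long to wait depends on the target host; a few seconds between fetches is only a starting guess.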

Hope that helps a little, it's not as involved as it sounds... but I understand if it's more involved than you were hoping.

Post by **intern3t** » 14 Aug 2013, 11:10

Thanks mabs, thanks a million lilrofl. I have downloaded the necessary libs and I have also acquired the ebook. Guess I'm gonna need a few weeks to fully develop what I need. Thanks once again.

suck-o.com
