Thursday, April 21, 2005

[Tech] Web Scraping Proxy

Interesting idea...

http://www.research.att.com/sw/tools/wsp/


Programmers often need to use information on Web pages as input to other programs. This is done by Web scraping: writing a program that simulates a person viewing a Web site with a browser. These programs are often hard to write because it is difficult to determine which Web requests the simulation needs.

The Web Scraping Proxy (WSP) solves this problem by monitoring the flow of information between the browser and the Web site and emitting Perl LWP code fragments that can be used to write the Web Scraping program. A developer would use the WSP by browsing the site once with a browser that accesses the WSP as a proxy server. He then uses the emitted code as a template to build a Perl program that accesses the site.
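To make the "emit code from observed traffic" idea concrete, here is a minimal sketch in Python rather than Perl LWP. It is not WSP's actual output or implementation: the function name `emit_fragment`, the example URL, and the form fields are all made up for illustration. It takes one request as a proxy might have logged it and prints a code template that replays it.

```python
from urllib.parse import urlencode


def emit_fragment(method, url, form=None):
    """Return Python source text that replays one recorded request.

    Hypothetical helper: a stand-in for the kind of template a
    recording proxy could emit, not what WSP really generates.
    """
    lines = ["import urllib.request", ""]
    if form:
        # Encode the recorded form fields as a URL-encoded request body.
        body = urlencode(form)
        lines.append(f"data = {body!r}.encode('ascii')")
    else:
        lines.append("data = None")
    lines.append(
        f"req = urllib.request.Request({url!r}, data=data, method={method!r})"
    )
    lines.append("with urllib.request.urlopen(req) as resp:")
    lines.append("    page = resp.read()")
    return "\n".join(lines)


# Made-up request, standing in for one captured while browsing the site.
fragment = emit_fragment(
    "POST", "http://example.com/search", {"q": "widgets", "page": "2"}
)
print(fragment)
```

The printed fragment is only a starting point. As with the WSP workflow described above, the developer would still edit it by hand to parse the response and chain further requests.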

2 comments:

Audrey said...

Proxies are an invaluable tool for a person or business scraping another site for business intelligence.

Occasionally, you may not want the site owner to know who you are. However, using your own IP address leaves a trackable link back to your ISP and then to you.

Even if your ISP won't give out your information, the site owner can still get your general location. Some site owners have even gone so far as to subpoena ISPs, demanding the information of parties extracting data from their sites.

By using a proxy, you can hide your own personal IP address. Instead, the site owner will see the IP address of the proxy machine, which is preferably located off your network.

Another hurdle to web scraping is IP blocking. This is especially problematic when you need to access a lot of data from a third-party site in a short time frame.

Some website owners will temporarily ban or even permanently block the IP address issued to you by your ISP. Some sites will even ban the entire range of IP addresses that your ISP uses.

To get around this problem, the company I work for (ScrapeGoat) set up thousands of proxies across the world with randomly changing IP addresses. This allows us to access sites through a broad range of locations and appear as random, natural traffic.
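For readers who want to see the mechanics, here is a rough sketch of sending requests through a pool of proxies using Python's standard library. It is an illustration only, not a description of any company's setup. The proxy addresses are placeholders from a documentation-only IP range, the helper names are invented, and the pool is cycled in a fixed round-robin order rather than at random to keep the example simple.

```python
import itertools
import urllib.request

# Placeholder addresses (TEST-NET-1 documentation range); not real proxies.
PROXIES = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]


def make_opener(proxy_url):
    """Build an opener that routes HTTP and HTTPS traffic via one proxy."""
    handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(handler)


def rotating_openers(proxies):
    """Yield (proxy_url, opener) pairs, cycling through the pool forever."""
    for proxy_url in itertools.cycle(proxies):
        yield proxy_url, make_opener(proxy_url)


# Each next() call hands back the next proxy in the pool.
openers = rotating_openers(PROXIES)
```

A fetch would then look like `proxy_url, opener = next(openers)` followed by `opener.open("http://example.com/", timeout=10)`, so the target site sees the proxy's address instead of yours. Swapping `itertools.cycle` for `random.choice` would give random rather than round-robin selection.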

While this is expensive to do, it can be very beneficial if the value of the data merits the cost. Or, if you'll pardon a shameless plug for my company, we are very good at getting the data on your behalf.

Anonymous said...

Interesting points on extracting data. For simple stuff I use Python to get or simplify data. Data extraction can be a time-consuming process, but for larger projects like documents, files, or the web I tried "extracting data", which worked great. They build quick custom screen scrapers, data extraction programs, and data parsing programs.