Scraping websites with Scrapy

2014-11-26

I found myself in need of scraping a few sites recently.
Normally I would’ve used requests*, but this time I needed a robust framework.

Scraping on a massive scale is not a trivial task, especially if you’re targeting websites that don’t want to be scraped. I started looking into various scraping frameworks and bumped into Scrapy.

Scrapy is an open source and collaborative framework for extracting data from websites, in a fast, simple, yet extensible way. It’s widely used and extremely hackable!

Now, as I said, scraping a website that doesn’t want to get scraped is not easy, and that’s exactly what I needed. Why is that so hard? Mainly because you can get caught and blocked rather quickly. Scraping such sites is like a game of cat and mouse.

I started by reading Scrapy’s tutorial and its avoid-getting-banned guide, and then moved on to scraping over TOR. This wasn’t the ideal solution. Honestly, the best solution IMO is using a proxy like Luminati**, but I needed a quick win, and TOR gave me exactly that.
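For the curious, here is roughly what routing Scrapy through TOR can look like. This is a minimal sketch, not my exact setup. It assumes a local HTTP proxy such as Privoxy is forwarding to TOR (Scrapy's proxy support expects an HTTP proxy, and 127.0.0.1:8118 is Privoxy's default address). The class name `TorProxyMiddleware`, the module path in the settings comment, and the User-Agent pool are all made up for illustration.

```python
import random

# Assumption: Privoxy (or a similar HTTP proxy) runs locally and forwards to TOR.
TOR_HTTP_PROXY = "http://127.0.0.1:8118"

# Illustrative pool of User-Agent strings to rotate through (hypothetical values).
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:33.0) Gecko/20100101 Firefox/33.0",
]


class TorProxyMiddleware(object):
    """Downloader middleware sketch: proxy every request and vary its User-Agent."""

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours the 'proxy' meta key.
        request.meta["proxy"] = TOR_HTTP_PROXY
        # Pick a random User-Agent so requests look less uniform.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to keep processing the request normally.
        return None


# In settings.py (the module path below is hypothetical):
# DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.TorProxyMiddleware": 100}
# DOWNLOAD_DELAY = 2        # wait between requests to the same site
# COOKIES_ENABLED = False   # some sites use cookies to spot crawlers
```

The download delay and disabled cookies are along the lines of what the avoid-getting-banned guide suggests; the exact values are a judgment call and depend on the target site.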

* The only Non-GMO HTTP library for Python, safe for human consumption.

** Luminati is a “business proxy network” built by Hola. They use their VPN users as proxies, which gives them ~20 million proxies worldwide!