I found myself in need of scraping a few sites recently.
Normally I would’ve used requests*, but this time I needed a robust framework.
Scraping on a massive scale is not a trivial task, especially if you’re targeting websites that don’t want to be scraped. I started looking into various scraping frameworks and bumped into Scrapy.
Scrapy is an open source and collaborative framework for extracting data from websites, in a fast, simple, yet extensible way. It’s widely used and extremely hackable!
Now, as I said, scraping a website that doesn’t want to get scraped is not easy, and that’s exactly what I needed. Why is that so hard? Mainly because you can get caught and blocked rather quickly. Scraping such sites is like a game of cat and mouse.
I started by reading Scrapy’s tutorial and its avoid-getting-banned guide, then moved on to scraping over TOR. This wasn’t the ideal solution. Honestly, the best solution IMO is using a proxy like Luminati**, but I needed a quick win, and TOR gave me exactly that.
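To give a feel for what the avoid-getting-banned advice boils down to, here is a minimal sketch of the kind of politeness settings a Scrapy project could use. The setting names are real Scrapy settings, but the values are assumptions chosen for illustration, not numbers from any particular project.

```python
# settings.py (sketch): slow the crawler down and make it less conspicuous.
# All values below are illustrative assumptions; tune them per target site.

# Base wait between consecutive requests to the same site, in seconds.
DOWNLOAD_DELAY = 2.0

# Randomize the wait to between 0.5x and 1.5x DOWNLOAD_DELAY,
# so requests do not arrive at a perfectly regular interval.
RANDOMIZE_DOWNLOAD_DELAY = True

# Some sites use cookies to spot crawler behaviour, so turn them off.
COOKIES_ENABLED = False

# Let Scrapy adjust the delay automatically based on server response times.
AUTOTHROTTLE_ENABLED = True

# Keep only one request in flight per domain at a time.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

None of this routes traffic through TOR by itself; it only makes the crawler quieter.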
* The only Non-GMO HTTP library for Python, safe for human consumption.