Scraping websites with Scrapy

I found myself in need of scraping a few sites recently.
Normally I would’ve used requests*, but this time I needed a robust framework.

Scraping on a massive scale is not a trivial task, especially if you’re targeting websites that don’t want to be scraped. I started looking into various scraping frameworks and bumped into Scrapy.

Scrapy is an open source and collaborative framework for extracting data from websites, in a fast, simple, yet extensible way. It’s widely used and extremely hackable!

Now, as I said, scraping a website that doesn’t want to get scraped is not easy, and that’s exactly what I needed. Why is that so hard? Mainly because you can get caught and blocked rather quickly. Scraping such sites is like playing catch.

I started by reading Scrapy’s tutorial, avoid-getting-banned, and moved over to scraping over TOR. This wasn’t the ideal solution. Honestly, the best solution IMO is using a proxy like Luminati**, but I needed a quick win, and TOR gave me exactly that.
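Here’s a rough sketch of what scraping over TOR could look like on the Scrapy side. It is not the exact setup used here, just one common approach under a few assumptions: an HTTP-to-SOCKS bridge such as Privoxy is listening on 127.0.0.1:8118 and forwarding to a local TOR client, and the project is called myproject (a made-up name). TorProxyMiddleware and POLITE_SETTINGS are illustrative names as well; the setting keys themselves are regular Scrapy settings.

```python
# Assumption: Privoxy (or a similar bridge) listens here and forwards to TOR.
# As far as I know Scrapy expects an HTTP proxy, hence the bridge.
TOR_HTTP_PROXY = "http://127.0.0.1:8118"


class TorProxyMiddleware:
    """Hypothetical downloader middleware that sends every request via the proxy."""

    def process_request(self, request, spider):
        # Scrapy honours the "proxy" key in request.meta when downloading.
        request.meta["proxy"] = TOR_HTTP_PROXY
        return None  # returning None lets Scrapy keep processing the request


# Settings in the spirit of the avoid-getting-banned advice: slow down,
# add jitter, and drop cookies that make the crawler easy to track.
POLITE_SETTINGS = {
    "DOWNLOAD_DELAY": 2,
    "RANDOMIZE_DOWNLOAD_DELAY": True,
    "COOKIES_ENABLED": False,
    "AUTOTHROTTLE_ENABLED": True,
    # "myproject.middlewares" is a placeholder module path.
    "DOWNLOADER_MIDDLEWARES": {"myproject.middlewares.TorProxyMiddleware": 350},
}
```

The dictionary can be copied into a project’s settings.py or assigned to a spider’s custom_settings attribute.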

* The only Non-GMO HTTP library for Python, safe for human consumption.

** Luminati is a “business proxy network” built by Hola. They use their VPN users as proxies, which gives them ~20 million proxies worldwide!

ssh in python

I needed to write some code that involves ssh, and as always, I took the time to research the state of ssh in the python kingdom before writing code.

What did I find?

  • there are two main packages that handle ssh in python: paramiko and conch (twisted)
  • there’s a cool package that mocks ssh connections, called MockSSH.
  • there’s a super awesome utility/package/you-name-it that creates an ssh honeypot in python, called kippo.

choose a strong password (explained)

The media is packed with articles about hackers who gained access to PCs because their owners used weak passwords. The problem seems huge: most people use Password[0-9] as their password. Really smart people even use P@$$w0rd[0-9].
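To see why the “smart” variant buys almost nothing, here’s a small sketch of how cheaply both patterns can be recognized. The substitution table and the function name is_password_variant are made up for illustration; real cracking tools use far larger rule sets.

```python
# Hypothetical, deliberately tiny table of common character substitutions.
LEET = str.maketrans({"@": "a", "$": "s", "0": "o"})


def is_password_variant(candidate: str) -> bool:
    """Return True for Password[0-9]-style passwords, leetspeak included."""
    stem = candidate.rstrip("0123456789")  # drop the trailing digit(s) first
    return stem.lower().translate(LEET) == "password"


for guess in ("Password1", "P@$$w0rd7", "correct horse battery staple"):
    print(guess, is_password_variant(guess))
```

Each pattern only covers ten passwords, so an attacker who knows the trick needs at most twenty guesses for both.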

processing queued netfilter packets with scapy

I found myself in need of writing something that filters packets according to their payload, and it had to be fast.

I wrote a netfilter module that filtered the data so it would meet my performance needs, but I was curious whether it’s possible to hook nfq to scapy.

A quick query and I found someone who already did it! Cool, right?

You can write an iptables rule that redirects packets to a specific queue and process it with scapy!
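As a rough illustration (not the code from the post mentioned above), here’s how that could look with the NetfilterQueue Python bindings and scapy. The queue number, the port in the iptables rule, BLOCKED_PATTERN, and the function names are all assumptions picked for the example. Running it needs root plus the netfilterqueue and scapy packages.

```python
# Example rule that sends incoming HTTP traffic to queue 1 (an assumption):
#   iptables -I INPUT -p tcp --dport 80 -j NFQUEUE --queue-num 1

BLOCKED_PATTERN = b"forbidden"  # hypothetical byte string to filter on


def should_drop(payload: bytes, pattern: bytes = BLOCKED_PATTERN) -> bool:
    """Pure decision function: drop when the payload contains the pattern."""
    return pattern in payload


def run_queue(queue_num: int = 1) -> None:
    """Bind to an NFQUEUE and accept or drop each packet based on its payload."""
    # Imported lazily so should_drop() stays usable without these packages.
    from netfilterqueue import NetfilterQueue
    from scapy.all import IP, Raw

    def handle(pkt):
        ip = IP(pkt.get_payload())  # parse the raw packet bytes with scapy
        data = bytes(ip[Raw].load) if ip.haslayer(Raw) else b""
        if should_drop(data):
            pkt.drop()
        else:
            pkt.accept()

    nfqueue = NetfilterQueue()
    nfqueue.bind(queue_num, handle)
    try:
        nfqueue.run()  # blocks until interrupted
    finally:
        nfqueue.unbind()
```

Calling run_queue() as root starts the loop. Keeping the decision in a plain function makes it easy to test without touching the network, though parsing every packet in Python will be much slower than a kernel module.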

By the way, did you know that iptables is being replaced by the superior nftables?

linux command basics

At my previous workplace I spent 90% of my time writing server-side .NET code, on Windows (no mono). At my current workplace I shifted to a linux-only environment.

Linux is an old buddy, but it was always a fling, not a serious relationship.

It took me a while to adjust to working on linux every day, and to accept that the terminal is now my best friend, as opposed to the GUI on Windows.

The linux command line basics series really helped me get started: part one, part two.

how to create a local PyPI mirror

PyPI is amazing, but it does have its drawbacks:

  • If PyPI is down, you can’t download any packages or run any tests
  • PyPI can be extremely slow at times
  • PyPI doesn’t offer private package support

The solution? Set up a local PyPI mirror, of course.

Python already has a PEP that describes the mirroring infrastructure for PyPI: PEP 381, and the tooling that implements it.

I started by looking at my options, and in the end decided to follow these articles to do so:

During my search endeavours, I found out that Artifactory provides support for PyPI repositories. If you already have it installed, I suggest using it instead.

python metaclasses

Ever heard about python metaclasses?

A metaclass is the class of a class. Just as a class defines how its instances behave, a metaclass defines how a class behaves. A class is an instance of a metaclass.
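A minimal sketch of that idea in Python 3 syntax: RegistryMeta, Plugin, and CsvPlugin are made-up names, and the metaclass simply records every class it creates.

```python
class RegistryMeta(type):
    """Hypothetical metaclass that records every class it creates."""

    registry = {}

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        mcls.registry[name] = cls  # runs at class-definition time
        return cls


class Plugin(metaclass=RegistryMeta):
    pass


class CsvPlugin(Plugin):  # subclasses inherit the metaclass
    pass


print(type(CsvPlugin()) is CsvPlugin)   # an instance's class
print(type(CsvPlugin) is RegistryMeta)  # a class's class, i.e. its metaclass
print(type(int) is type)                # the default metaclass is type
print(sorted(RegistryMeta.registry))
```

Note that nothing ever calls a register function: defining the class is enough, because creating a class means calling its metaclass.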

Metaclasses make a lot of magic possible. This article describes:

  • What metaclasses are
  • Why they exist
  • How to use them (with code snippets!)
  • Why use them

After reading this you’ll be able to understand the magic behind abstract python classes.
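For a taste of that magic, here’s a short sketch using the standard library’s abc module, where ABCMeta is the metaclass doing the work. Shape and Square are illustrative names.

```python
from abc import ABC, ABCMeta, abstractmethod


class Shape(ABC):  # ABC is a helper class whose metaclass is ABCMeta
    @abstractmethod
    def area(self):
        """Subclasses must implement this."""


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2


print(type(Shape) is ABCMeta)  # the metaclass behind abstract classes

try:
    Shape()  # refused: area() is still abstract
except TypeError as exc:
    print("cannot instantiate:", exc)

print(Square(3).area())
```

The instantiation check is driven by the set of abstract methods that ABCMeta collects when the class is created, so it works without any per-class boilerplate.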

P.S: If you’ve ever missed interfaces in python, zope.interface uses ABCs to achieve that. The twisted blog has a post about that: Why Interfaces Are Great

'Must' watch python videos

I’m a reddit addict, and regularly follow /r/python.
A few years ago I came across a post called “Must-watch videos about Python”, and I’ve been following pymust.watch ever since.

The list is well balanced, and covers many important topics.
If you feel that a specific topic is missing, send a pull request to the website’s GitHub repository.

Overwhelmed by the length of the list? Go through my favorites first: