Web scraping in unix

I don’t recall why I was looking into web scraping a few weeks ago, but I felt the need for it. I’d always wondered how it works, and it’s basically a lot of regexps. Fun stuff. It’s a great opportunity to brush up on my regexp knowledge.

Tools

  • Curl

    For unix systems, just use Curl. It is the best for this stuff. Wget is nice too, but I’d prefer to use that when I’m downloading a whole site or a sub-section of one.

  • Sed

    My favourite batch text-editor, Sed is best known for its ‘s/foo/bar/’ command, which uses ‘foo’ as a regexp and substitutes the first match on each line with ‘bar’ (add the g flag, ‘s/foo/bar/g’, to replace every match). Best stuff ever.

    Sed uses POSIX regexps: basic ones by default, extended ones with the -E flag. I’m (like most people, probably) most familiar with Perl-compatible regexps, but the POSIX ones aren’t so difficult to grasp. I actually enjoy doing things in different regexp styles, since it forces me to think about the concept of the regex instead of depending on muscle memory. There’s a small example of the two tools together just after this list.
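
To make that concrete, here’s a minimal sketch of the basic pattern: Curl fetches the page, Sed extracts what you want. The URL is just the reserved example domain, and the title-grabbing regexp assumes the whole <title> tag sits on one line.

    # Fetch a page and pull out its <title> text.
    # -s silences Curl's progress meter; -n plus the p flag makes
    # Sed print only the lines where the substitution matched.
    curl -s https://example.com/ |
      sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p'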

And that’s it. Use Curl. Use Sed. And you’re golden. With that you can selectively download all the ‘.mp4’ files from a webpage, or grab all the table data, re-format it, and save it as text in a file for further consumption by something else.
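
For example, grabbing every ‘.mp4’ link from a page could look something like this. The URL is hypothetical, and the regexp assumes absolute links with at most one href per line, so adjust it to the markup you’re actually scraping.

    # Hypothetical page; tweak the pattern for the real markup.
    curl -s https://example.com/videos.html |
      sed -n 's/.*href="\([^"]*\.mp4\)".*/\1/p' |
      while read -r url; do
        curl -s -O "$url"    # -O saves each file under its remote name
      done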

Conclusion

I was surprised how simple this was. I thought there would be a lot more involved, but it can be done with two of the standard unix tools. Amazing.
