Web Scraping in Python

I've never done anything like web scraping before, so I've been briefly looking into it, but I've not had the time to sit down and learn it through and through. Are there any good sources that you guys could recommend?

Yep, plenty!

First, this is happening for a short while longer, and I would 100% recommend most of those books.

This YouTube playlist should more or less help you with web scraping, but if I remember correctly, it's aimed at people who already know Python syntax.

The creator has plenty of additional playlists centered around Python that you can look into if you want.

Tried updating a Bitbucket repo for you. I wanted to share some code, even though it is in PHP. But I am tired and I have somehow royally f'd up how Git is tracking the files. If you want, I can put it up somewhere else.

Also, take a look at PhantomJS, and then at running CasperJS on top of it. PhantomJS is a headless browser, so it is much more powerful than fetching raw HTML: it actually renders the page and runs its JavaScript.

It would be interesting to see a script that takes a snapshot of the rendered web page and cross-references it with the source to determine which hyperlinks or clickable items are actually active.

Some pages use trickery like hidden links to keep bots from crawling the pages.

PhantomJS can return all links in 15 lines. Not sure if that is what you want though.
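For anyone who wants to stay in Python, a rough equivalent of the "return all links" idea can be sketched with just the standard library. Note that this only parses static HTML and does not run JavaScript the way PhantomJS does. The `LinkCollector` class, the `extract_links` helper, and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors that have no href at all
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page source, used only for this example.
SAMPLE_HTML = """
<html><body>
  <a href="/about">About</a>
  <a href="https://example.org/docs">Docs</a>
  <a name="anchor-only">No href here</a>
</body></html>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML, "https://example.com/"):
        print(link)
```

For a real page you would fetch the HTML first (for example with `urllib.request.urlopen`) and pass the decoded text to `extract_links`.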

One example is using a 1px by 1px image as a link to a script that blocks your IP address. No normal user will ever see that link, but a bot will recognize the link in the page source and try to navigate to it.

It's also common to hide links with CSS.

Just a way to keep bad robots off your page.
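To make the two tricks above concrete, here is a minimal Python sketch of how a scraper might flag such links before following them. It is only a heuristic under stated assumptions: it looks at inline `style` attributes, the `hidden` attribute, and 1x1 (or 0x0) images inside a link. It will not catch links hidden by an external stylesheet or by a hidden parent element, which would need a rendered page. The names `LinkClassifier` and `classify_links` and the sample markup are hypothetical.

```python
import re
from html.parser import HTMLParser

# Inline styles that make an element invisible to a human visitor.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


def looks_hidden(attrs):
    """attrs is a dict of attribute name -> value for a single tag."""
    if "hidden" in attrs:
        return True
    return bool(HIDDEN_STYLE.search(attrs.get("style") or ""))


def is_tiny_image(attrs):
    """True for images declared as 0 or 1 pixel in both dimensions."""
    return attrs.get("width") in ("0", "1") and attrs.get("height") in ("0", "1")


class LinkClassifier(HTMLParser):
    """Sorts link targets into 'visible' and 'suspicious' lists."""

    def __init__(self):
        super().__init__()
        self.visible = []
        self.suspicious = []
        self._href = None      # href of the <a> we are currently inside
        self._suspect = False  # whether that <a> looks like a trap

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._suspect = looks_hidden(attrs)
        elif tag == "img" and self._href is not None and is_tiny_image(attrs):
            self._suspect = True

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            target = self.suspicious if self._suspect else self.visible
            target.append(self._href)
            self._href = None
            self._suspect = False


def classify_links(html):
    """Return (visible_links, suspicious_links) for an HTML string."""
    parser = LinkClassifier()
    parser.feed(html)
    parser.close()
    return parser.visible, parser.suspicious


# Hypothetical page source, used only for this example.
SAMPLE_HTML = """
<a href="/products">Products</a>
<a href="/trap.php"><img src="dot.gif" width="1" height="1"></a>
<a href="/ban-me" style="display: none">secret</a>
<a href="/contact">Contact</a>
"""

if __name__ == "__main__":
    visible, suspicious = classify_links(SAMPLE_HTML)
    print("follow:", visible)
    print("skip:  ", suspicious)
```

A polite bot would only follow the first list, and would also check the site's robots.txt before crawling at all.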

I haven't had to deal with that but I am pretty sure PhantomJS has tooling to help you avoid situations like that.
