Web Scraping in Python

Argon · April 10, 2017, 2:26pm

I've never done anything like web scraping before, so I've been briefly looking into it, but I've not had the time to sit down and learn it through and through. Are there any good sources that you guys could recommend?

SgtAwesomesauce · April 10, 2017, 3:27pm

Yep, plenty!

First, this is happening for a short while longer, and I would 100% recommend most of those books.

This YouTube playlist should more or less help you with web scraping, but if I remember correctly, it's aimed at people who already know Python syntax.

The creator has plenty of additional playlists centered around Python that you can look into if you want.

dot404 · April 11, 2017, 2:28am

Tried updating a Bitbucket repo for you. Wanted to share some code even though it is in PHP. But I am tired and I have somehow royally f'd up how Git is tracking the files. If you want I can put it up somewhere else.

dot404 · April 11, 2017, 2:29am

Also, take a look at PhantomJS and then run CasperJS on top of it. PhantomJS is a headless browser. Much more powerful.

SudoSaibot · April 11, 2017, 2:33am

It would be interesting to see a script that would take a snapshot of the rendered web page, reference that with the source to determine active hyperlinks or clickable items.

Some pages use trickery like hidden links to keep bots from crawling the pages.

dot404 · April 11, 2017, 2:41am

PhantomJS can return all links in 15 lines. Not sure if that is what you want though.
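
Since the thread is about Python, here is a rough sketch of the same "return all links" idea using only the standard library. It is not the PhantomJS script mentioned above, and unlike a headless browser it does not run JavaScript, so it only sees links present in the raw HTML. The names `LinkCollector` and `extract_links` and the example URLs are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors without an href, e.g. <a name="top">
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


sample = (
    '<p><a href="/about">About</a> '
    '<a href="https://example.org/x">X</a> '
    '<a name="top">anchor</a></p>'
)
print(extract_links(sample, "https://example.com/index.html"))
# ['https://example.com/about', 'https://example.org/x']
```

For a live page you would fetch the HTML first (for example with `urllib.request.urlopen`) and pass the decoded text to `extract_links`.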

SudoSaibot · April 11, 2017, 2:47am

One example is using a 1px by 1px image as a link to a script that blocks your IP address. No normal user will ever see that link, but a bot will recognize the link in the page source and try to navigate to it.

It's also common to hide links with CSS.

Just a way to keep bad robots off your page.
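
A minimal Python sketch of how a scraper might spot the two tricks described above, using only the standard library. It is a heuristic under stated assumptions: it only checks inline `style` attributes, the `hidden` attribute, and explicit 0/1 pixel `width`/`height` on an image inside the link. Links hidden through an external stylesheet would need a real rendering engine to detect. The class name `VisibleLinkParser` and the sample markup are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Inline styles that make an element invisible (assumed patterns, not exhaustive).
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


class VisibleLinkParser(HTMLParser):
    """Sorts links into 'visible' and 'suspicious' (likely bot traps)."""

    def __init__(self):
        super().__init__()
        self.visible = []
        self.suspicious = []
        self._href = None      # href of the <a> currently open, if any
        self._flagged = False  # whether that <a> looks hidden

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._flagged = self._is_hidden(attrs)
        elif tag == "img" and self._href is not None:
            # A 0px or 1px image used as the link body is a common trap.
            if attrs.get("width") in ("0", "1") and attrs.get("height") in ("0", "1"):
                self._flagged = True

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            target = self.suspicious if self._flagged else self.visible
            target.append(self._href)
            self._href = None
            self._flagged = False

    @staticmethod
    def _is_hidden(attrs):
        style = attrs.get("style") or ""
        return "hidden" in attrs or bool(HIDDEN_STYLE.search(style))


sample = """
<a href="/products">Products</a>
<a href="/trap.php"><img src="dot.gif" width="1" height="1"></a>
<a href="/ban-me" style="display: none">secret</a>
"""
parser = VisibleLinkParser()
parser.feed(sample)
print(parser.visible)     # ['/products']
print(parser.suspicious)  # ['/trap.php', '/ban-me']
```

A crawler would then follow only the links in `parser.visible` and skip the rest.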

dot404 · April 11, 2017, 2:48am

I haven't had to deal with that but I am pretty sure PhantomJS has tooling to help you avoid situations like that.