Writing Python Web Crawlers To Run In Parallel

Intro

So I’ve started to write a web crawler in Python, using a couple of libraries to help me achieve my goal. So far so good; however, I want to crawl a lot of URLs, so I’m trying to increase performance significantly, and I was hoping there’s a relatively easy way to do this in parallel. It’s only slow as hell at the moment because the back end of the website I’m crawling is drastically slow: a single dynamic page can take up to a minute to render.


Modules Used

Due to the wide range of problems I’m trying to solve with one script, I’m importing a few modules. Here they are:

  • os
  • csv
  • json
  • time
  • datetime
  • requests
  • BeautifulSoup
  • selenium
  • PrettyTable
  • BlockingScheduler

Performance Issue

As I’ve said, it can take as long as a minute for a dynamic page to render, and on some of these pages it can take a further 20 or so seconds for the content to be generated. I don’t know why it’s so slow, as the back end is outsourced; there’s nothing I can do to improve its performance.


The Problem

As I’ve said, the crawler works fine; I’m just struggling to find a way to make multiple crawlers work in parallel. I’ve tried using threads and I’ve also tried using multiprocessing. I would share the source code of my previous attempts, but I had to delete it because I need to keep the software running, so I’ll include some dummy code below to demonstrate what I’ve tried.

I would copy and paste the entire thing here, but it’s turned into quite a beefy script; it has grown from a few functions into a small giant quite quickly.


Code

Here’s what’s currently running:

## run init
init() ## initiate the crawler(s)

## schedule init to run every 8 hours 
scheduler = BlockingScheduler()
scheduler.add_job(init, 'interval', hours=8)
scheduler.start()

This is roughly what I’ve tried to do previously:

import threading

## run init with different args
def runMe () :
	url1 = "http://example.com"
	url2 = "http://otherexample.com"
	threading.Thread(target=init, args=(url1,)).start()
	threading.Thread(target=init, args=(url2,)).start()

## schedule init to run every 8 hours 
scheduler = BlockingScheduler()
scheduler.add_job(runMe, 'interval', hours=8)
scheduler.start()

FYI, in the code I’m currently running, I just keep the URLs in a list elsewhere in the script, so the init function always crawls the same URLs.
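
The multiprocessing version was meant to do the same kind of thing; this is only dummy code to show the idea (assuming init is changed to accept a URL), not what I actually ran:

## run init with different args in separate processes (dummy code)
## note: on Windows this has to live under an if __name__ == "__main__": guard
import multiprocessing

def runProcs () :
	urls = ["http://example.com", "http://otherexample.com"]

	## start every process first, then wait for them all afterwards;
	## joining inside the start loop would make them run one after another
	procs = [multiprocessing.Process(target=init, args=(u,)) for u in urls]
	for p in procs :
		p.start()
	for p in procs :
		p.join()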


Finally

I’m not at all a pro when it comes to Python, so I know that’s a bit of a disadvantage to start with…

I’m aware that Python implements a global interpreter lock, which is why I tried to use multiple processes instead of multiple threads, but they still ran sequentially rather than in parallel.

Does anyone have any idea where I’m going wrong?

Keep in mind that, when I was writing this, it was meant to be a quick and dirty script to update the server-side cache (it’s a totally messed-up system) by crawling the front end.


Rest Of The Code

So I think it’s worth mentioning that this is far from finished. I plan to make the script more dynamic and flexible, and to actually implement good practices, because I know that at the moment this code is VERY, VERY dirty… All jokes aside, this is my first ever Python program, so it’s been more of a play around with the language than anything else. I also know that I’m not using ALL of the stated modules in the code below; that’s purely because I plan to use them all eventually, I’m just trying to get the bare bones working first. :slight_smile:

'''
 As I'm not a professional Python developer, a lot of this code is very...
 Experimental to say the very least, I'm trying to make a crawler crawl
 through dynamically generated pages
'''
###############################################################################
import os
import csv
import json
import time
import datetime
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from prettytable import PrettyTable
from apscheduler.schedulers.blocking import BlockingScheduler
###############################################################################



###############################################################################
## these are some global variables
MONTHS = ["january","february","march","april","may","june",
		"july","august","september","october","november","december"]
URLS = ["example_dump_page","example_two_dump_page","example_three_dump_page"]
BASE_URL = "https://www.demo.com/"
DEV_MODE = False
NUMBER_OF_RUNS = 0
TOTAL_PAGES = 0
TABLE_HEADERS = ['TIMES RUN','URL','BASE URL','PAGE COUNT','MONTH','REMOVE',
		'PRICE','FINISHED']
TABLE = PrettyTable(TABLE_HEADERS)
RAW_DATA = {
	"headings" : TABLE_HEADERS,
	"rows" : []
}
###############################################################################



###############################################################################
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
## the purpose of this function is to get the number of runs
def get_number_of_runs () :
	global NUMBER_OF_RUNS
	return NUMBER_OF_RUNS


## the purpose of this function is to get the number of pages visited
def get_total_pages () :
	global TOTAL_PAGES
	return TOTAL_PAGES

## the purpose of this function is to get the table
def get_table () :
	global TABLE
	return TABLE
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
## the purpose of this function is to increment the number of runs
def inc_number_of_runs () :
	global NUMBER_OF_RUNS
	NUMBER_OF_RUNS += 1

## the purpose of this function is to increment the number of pages visited
def inc_total_pages () :
	global TOTAL_PAGES
	TOTAL_PAGES += 1

## the purpose of this function is to add a row to the table
def inc_table (row) :
	global TABLE
	global RAW_DATA
	TABLE.add_row(row)
	RAW_DATA['rows'].append(row)
	try :
		with open("jsonDump.json", "w") as outfile:
			json.dump(RAW_DATA, outfile)
	except : print("JSON Dump error.")
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
## the purpose of this function is to reset the number of pages visited
def reset_total_pages () :
	global TOTAL_PAGES
	TOTAL_PAGES = 0

## the purpose of this function is to reset the table
def reset_table () :
	global TABLE
	global TABLE_HEADERS
	TABLE = PrettyTable(TABLE_HEADERS)
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
## the purpose of this function is to help the init function below
def date_string_to_numbers(s) :
	arr = s.split("/")

	## this is a helper function to get the month name to an int
	def monthToNum(shortMonth):
		return {'Jan':1,'Feb':2,'Mar':3,'Apr':4,'May':5,'Jun':6,'Jul':7,
			'Aug':8,'Sep':9,'Oct':10,'Nov':11,'Dec':12}[shortMonth]

	## now get the day, month and year as a string
	day = arr[0]
	if (int(day) < 10) :
		day = "0" + str(day)
	month = monthToNum(arr[1][0:3])
	if (month < 10) :
		month = "0" + str(month)
	year = "20" + arr[2]

	## return as an int after concatenating the strings
	return int(str(year) + str(month) + str(day))

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
###############################################################################



###############################################################################
## the purpose of this function is to visit a page and then interact with
## the dom
def interact (url, initUrl, month):
	initTest = True
	price = 0.00
	prTag = ""
	rem = "NO"

	try:
		## find the path to the geckodriver and set it
		path = os.path.dirname((os.path.realpath(__file__)))
		driver = webdriver.Firefox(path)

        ## hide the window when possible
		driver.minimize_window()
		driver.get(BASE_URL + url)

        ## try to find the button
		try :
			btn = driver.find_element_by_name('submit')
			btn.click()

            ## clear the console
			clear = lambda: os.system('cls' if os.name=='nt' else 'clear')
			clear()

            ## ensure that backend has had a chance to catch up
			time.sleep(30.0)

			## try to find the price
			try :
				prTag = driver.find_element_by_class_name("totalPrice")
				initTest = True
				try : price = float(prTag.get_attribute('innerHTML'))
				except : pass
			except :
				initTest = False

        ## just close the browser if it doesn't work
		except:
			driver.close()
			initTest = False

        ## close the browser either way
		driver.close()

	except:
		try:
			driver.close()
			initTest = False
		except:
			initTest = False
			pass

	if (initTest == False or price == 0.0) :
		rem = "YES"

	numbs = str(get_number_of_runs())
	now = datetime.datetime.now()
	formatTime =  str(now.day) + "/"
	formatTime += str(now.month) + "/" + str(now.year) + " - "
	formatTime += str(now.hour) + ":" + str(now.minute) + ":"
	formatTime += str(now.second)
	inc_total_pages()

	## add data to the global table
	inc_table([
		numbs,
		url,
		initUrl,
		get_total_pages(),
		month,
		rem,
		price,
		formatTime
	])

	## print the global table
	print("\n")
	print(get_table())
	print("\n")
###############################################################################



###############################################################################
## this is a function which houses the initial logic that's
## required to make this program run
def init ():

	## set up global variables
	inc_number_of_runs()
	reset_total_pages()
	reset_table()
	current_month = datetime.datetime.today().month - 1

	## create the date check
	now = datetime.datetime.now()

	## get the current month
	tmpMonth = now.month
	if(tmpMonth < 10) :
		tmpMonth = "0" + str(tmpMonth)

	## get the current day
	tmpDay = now.day
	if(tmpDay < 10) :
		tmpDay = "0" + str(tmpDay)

	## add current year onto the above two variables and then parse to int
	date_to_beat = int(str(now.year) + str(tmpMonth) + str(tmpDay))

	## loop through each month
	for i in range(len(MONTHS)) :

		## pointless doing/trying out of date pages
		if (i < current_month) :
			continue

		month = MONTHS[i]

		## loop through each url
		for j in range(len(URLS)) :
			url = BASE_URL + URLS[j] + "_" +  month + "?removeCache=true"
			data = requests.get(url)
			soup = BeautifulSoup(data.content, 'html.parser')
			rows = soup.find_all("div", {"class":"dealcontainer"})

			## loop through each link on the page
			for k in range(len(rows)) :
				row = rows[k]

				## get the link
				atag = row.find("a", {"class":"button-solid"})
				nestedUrl = atag['href']

				## see if it's worth parsing this url or to go onto the next
				## link
				ptag = row.find("p", {"class":"sub-title"})
				dateData = ptag.find("span").text
				row_date = date_string_to_numbers(str(dateData))

				if (row_date < date_to_beat) :
					continue
				interact(nestedUrl.replace("/", "", 1), URLS[j], month)
###############################################################################



###############################################################################
## run init
init()

## schedule init to run every 8 hours
scheduler = BlockingScheduler()
scheduler.add_job(init, 'interval', hours=8)
scheduler.start()
###############################################################################

This is just a thought I had while thinking how I would write a web crawler.

Most pages are static and don’t change often, such as a shopping site when you’re looking at or for a product.
I would read in the HTTP response and compare the text where relevant to know whether there has been a change, then render the page for deeper analysis only when a change is detected.
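
Something like this is what I mean; treat it as pseudo-Python more than anything, since I don’t really know the language (the URL and the seen_hashes cache are made up for illustration):

import hashlib
import requests

# made-up in-memory cache mapping url -> hash of the last response body
seen_hashes = {}

def page_changed(url):
    # fetch the page cheaply and report whether its text has changed
    body = requests.get(url).text
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    changed = seen_hashes.get(url) != digest
    seen_hashes[url] = digest
    return changed

# only fire up the expensive rendering when the cheap check says so
if page_changed("https://www.demo.com/example_dump_page"):
    print("page changed, render it with selenium for deeper analysis")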

Also, crawling links is a good way to get your bot banned; many sites implement hidden links that are meant to detect crawlers.

Another thing I noticed is that you are cycling through every row returned on the page and hitting every link with interact. A lot of cleanup could be done by extracting the <body> element and caching the links to be run later. I’m not sure how well Python likes multiple levels of nested function calls like that.
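
Roughly what I mean by caching the links first; this reuses the soup, URLS, j and month variables from your init function, so it’s only a sketch of how the inner loop could be reshaped:

# collect the candidate links from the page body first...
links = []
for row in soup.body.find_all("div", {"class": "dealcontainer"}):
    atag = row.find("a", {"class": "button-solid"})
    if atag is not None and atag.get("href"):
        links.append(atag["href"])

# ...then hit them in one place afterwards
for href in links:
    interact(href.replace("/", "", 1), URLS[j], month)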

Hope this may trigger some ideas, don’t know python well, sorry.


With Python (CPython) you basically get one thread’s worth of performance, due to this thing called the GIL, unless you use multiprocessing, which makes sharing state more awkward than threads do.

The typical hack to get more performance out of Python is to use threads to parallelize the otherwise blocking IO operations. Python threads are real system threads: while one is waiting on a kernel syscall (e.g. a read or write on the network), another one can be waiting on a different syscall, or doing computation in Python.

Finally you have “green threads”:

It turns out the kernel supports asynchronous reads, writes and other operations, where you can ask it to tell you when something has changed, e.g. there’s data to pick up from a network connection, or the disk has finished a read and you can look at the data. The idea is that your app can issue a massive number of requests in parallel from a single thread, then process the replies as they come in and react to those events.
Things like select() and epoll() are what’s used for that.
This callback-based approach usually ends up turning your app’s structure inside out (arguably less of a problem for Python than for other languages)… so people invented the green-threads concept, where, if you use the right kind of mutex and the right kind of libraries, you can get away with tens of thousands of things programmed just like threads, without blowing up your RAM or your kernel scheduler.
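
To give a taste of that style, here’s a rough sketch with asyncio and aiohttp (aiohttp isn’t in your module list, and this ignores the selenium rendering completely, so it’s illustration only):

import asyncio
import aiohttp

async def fetch(session, url):
  # awaiting the response hands control back to the event loop,
  # so thousands of these can be in flight from a single thread
  async with session.get(url) as resp:
    return url, await resp.text()

async def main(urls):
  async with aiohttp.ClientSession() as session:
    results = await asyncio.gather(*(fetch(session, u) for u in urls))
  for url, body in results:
    print(url, len(body))

asyncio.run(main(["http://example.com", "http://otherexample.com"]))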


My recommendation to you is to stick to threads for now, as they’re very useful to learn:

import threading
import queue

q = queue.Queue()

def crawl():
  while True:
    url = q.get()
    if url is None:
      q.put(None)  # put the sentinel back so the other workers stop too
      break
    if not first_time(url):
      continue
    # first_time() and visit_and_extract_urls() are yours to fill in
    for new_url in visit_and_extract_urls(url):
      q.put(new_url)

thds = [threading.Thread(target=crawl)
        for _ in range(50)]
for t in thds:
  t.start()

# kick off everything by adding a url or a couple into the q
q.put("http://www.example.com")

# main thread falls off a cliff here and the other threads keep running;
# when you want the crawler to stop: q.put(None)

Try filling in the URL extraction from your code, and maybe sprinkle in some logging to see what’s going on; this should get you started.
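
For example, first_time and visit_and_extract_urls could be thin wrappers around things you already have; here’s a rough sketch, with the CSS classes copied from your script (adjust to taste):

import threading
import requests
from bs4 import BeautifulSoup

seen = set()
seen_lock = threading.Lock()

def first_time(url):
  # remember which URLs have already been handed to a worker
  with seen_lock:
    if url in seen:
      return False
    seen.add(url)
    return True

def visit_and_extract_urls(url):
  # plain requests + BeautifulSoup pass, mirroring what init() already does
  soup = BeautifulSoup(requests.get(url).content, "html.parser")
  for row in soup.find_all("div", {"class": "dealcontainer"}):
    atag = row.find("a", {"class": "button-solid"})
    if atag is not None and atag.get("href"):
      yield atag["href"]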


That’s a total pain in the a** with this website I’m working on: in the back end, a session variable gets set as you click on a link to view a product, and if no session exists you get redirected to a ‘session expired’ page. This website is a total mess. Plus, as it’s coming from my work’s I.P. address, I don’t think it’ll actually get banned… You can’t access a product page directly with a simple URL, which I personally find incredibly annoying for problems like this… :upside_down_face: … To be 100% honest, the back end makes very little sense to me, it’s just such a mess…

Me neither! :joy: … No need to say sorry, I appreciate the feedback and the input, and I agree that targeting just the body would probably be more efficient. But to be fair, this is literally my first ever Python attempt. Screw hello world, I know Python is pretty useful for things like this, so I thought, why not?! :slight_smile:

I did actually mention that in my original post, which is why I tried to see if I could achieve a similar implementation with processes rather than threads… :stuck_out_tongue:

I’ve not heard of that term before, sure as hell worth looking into though! Thanks for that tip! :smiley:

I’ll also try playing around with the code you wrote and see what I can do with something like that! Again, thank you for the feedback! :slight_smile: