Intro
So I’ve started writing a web crawler in Python, using a couple of libraries to help me achieve my goal. So far so good; however, I’m trying to improve performance significantly, because I want to crawl a large number of URLs, and I’m hoping there’s a relatively easy way to do this in parallel. It’s only slow as hell at the moment because the back end of the website I’m writing the crawler for is drastically slow: a single dynamic page can take up to a minute to render.
Modules Used
Due to the range of problems I’m trying to solve with one script, I’m using a handful of modules. Here they are (the matching pip packages are shown just after the list):
- os
- csv
- json
- time
- datetime
- requests
- BeautifulSoup
- selenium
- PrettyTable
- BlockingScheduler
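For reference, the first five are in Python’s standard library; the rest should install with the command below (assuming the usual PyPI package names; note that BeautifulSoup is imported from bs4 and BlockingScheduler lives in the apscheduler package):
pip install requests beautifulsoup4 selenium prettytable apscheduler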
Performance Issue
As I’ve said, it can take as long as a minute for a dynamic page to render, and on some of these pages it can then take a further 20 or so seconds for the content to be generated. I don’t know why it’s so slow, as the back end is outsourced; I cannot do anything to improve its performance. The upshot is that the crawler spends nearly all of its time waiting on the remote server, which is why running several crawlers at once feels like the obvious win.
The Problem
As I’ve said, the crawler itself works fine; I’m just struggling to find a way to make multiple crawlers work in parallel. I’ve tried using threads (threading.Thread) and I’ve also tried multiprocessing. I would share the source code of my previous attempts, but I had to delete it because I need the software running right now, so I’ll write some dummy code below to demonstrate what I tried.
Additionally, I would copy and paste the entire thing on here, but it’s turned into quite a beefy script; it has grown from a few functions into a small giant quite quickly.
Code
Here’s what’s currently running:
## run init
init() ## initiate the crawler(s)
## schedule init to run every 8 hours
scheduler = BlockingScheduler()
scheduler.add_job(init, 'interval', hours=8)
scheduler.start()
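(As far as I understand it, BlockingScheduler.start() blocks the calling thread indefinitely, which is why init gets called once manually before the scheduler takes over.)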
This is roughly what I’ve tried to do previously:
## run init with different args
import threading

def runMe():
    url1 = "http://example.com"
    url2 = "http://otherexample.com"
    threading.Thread(target=init, args=(url1,)).start()
    threading.Thread(target=init, args=(url2,)).start()

## schedule runMe to run every 8 hours
scheduler = BlockingScheduler()
scheduler.add_job(runMe, 'interval', hours=8)
scheduler.start()
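If it helps, here’s the same idea written with concurrent.futures, which is roughly how I imagined a tidier version would look (still dummy code, and it assumes init has been changed to accept a single URL argument):

from concurrent.futures import ThreadPoolExecutor

URLS_TO_CRAWL = ["http://example.com", "http://otherexample.com"]

def runMe():
    ## one thread per URL; the with-block won't exit until every crawl
    ## submitted by map() has finished
    with ThreadPoolExecutor(max_workers=len(URLS_TO_CRAWL)) as pool:
        pool.map(init, URLS_TO_CRAWL)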
FYI, in the code I’m currently running, I just keep the URLs in a list elsewhere in the script so that the init function can crawl the same URLs regardless.
Finally
I’m not at all a pro when it comes to Python, so I know that’s a bit of a disadvantage to start with…
I’m aware of the fact that Python implements a global interpreter lock (GIL), which is why I tried using multiple processes instead of multiple threads, but the processes still ran one after another rather than in parallel.
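To show what I mean, here’s a minimal dummy version of the multiprocessing attempt (again assuming init takes a URL, like in the thread example above). As far as I understand it, Pool.map should fan the URLs out across worker processes, yet my real version still behaved sequentially:

import multiprocessing

def runMe():
    urls = ["http://example.com", "http://otherexample.com"]
    ## one worker process per URL; map() blocks until all of them return
    with multiprocessing.Pool(processes=len(urls)) as pool:
        pool.map(init, urls)

## NB: on Windows the runMe() call needs to sit under an
## `if __name__ == "__main__":` guard, or the workers re-import the script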
Does anyone have any idea where I’m going wrong?
Keep in mind that when I wrote this, it was meant to be a quick and dirty script to update the server-side cache (it’s a totally messed-up system) by crawling the front end, basically.
Rest Of The Code
So I think it’s worth mentioning that this is far from finished. I plan to make the script more dynamic and flexible, and to actually implement good practices, because I know that at the moment this code is VERY, VERY dirty… All jokes aside, this is my first ever Python program, so it’s been more of a play around with the language than anything else. I also know that in the code below I’m not using ALL of the stated modules; that’s purely because I plan to use them all eventually, and I’m just trying to get the bare bones working first.
'''
As I'm not a professional Python developer, a lot of this code is very...
Experimental to say the very least, I'm trying to make a crawler crawl
through dynamically generated pages
'''
###############################################################################
import os
import csv
import json
import time
import datetime
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from prettytable import PrettyTable
from apscheduler.schedulers.blocking import BlockingScheduler
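## (NB: the find_element_by_* calls further down are the old selenium API,
## deprecated in selenium 4; newer versions use find_element(By.NAME, ...)
## instead, so this assumes an older selenium install)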
###############################################################################
###############################################################################
## these are some global variables
MONTHS = ["january","february","march","april","may","june",
          "july","august","september","october","november","december"]
URLS = ["example_dump_page","example_two_dump_page","example_three_dump_page"]
BASE_URL = "https://www.demo.com/"
DEV_MODE = False
NUMBER_OF_RUNS = 0
TOTAL_PAGES = 0
TABLE_HEADERS = ['TIMES RUN','URL','BASE URL','PAGE COUNT','MONTH','REMOVE',
                 'PRICE','FINISHED']
TABLE = PrettyTable(TABLE_HEADERS)
RAW_DATA = {
    "headings": TABLE_HEADERS,
    "rows": []
}
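## NB: these are shared mutable globals; if the crawlers ever do run in
## parallel, access to NUMBER_OF_RUNS / TOTAL_PAGES / TABLE / RAW_DATA would
## presumably need a lock (or per-worker copies) to stay consistent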
###############################################################################
###############################################################################
## ----------------------------------------------------------------------------
## the purpose of this function is to get the number of runs
def get_number_of_runs():
    return NUMBER_OF_RUNS
## the purpose of this function is to get the number of pages visited
def get_total_pages():
    return TOTAL_PAGES
## the purpose of this function is to get the table
def get_table():
    return TABLE
## ----------------------------------------------------------------------------
## ----------------------------------------------------------------------------
## the purpose of this function is to increment the number of runs
def inc_number_of_runs():
    global NUMBER_OF_RUNS
    NUMBER_OF_RUNS += 1
## the purpose of this function is to increment the number of pages visited
def inc_total_pages():
    global TOTAL_PAGES
    TOTAL_PAGES += 1
## the purpose of this function is to add a row to the table
def inc_table(row):
    TABLE.add_row(row)
    RAW_DATA['rows'].append(row)
    try:
        with open("jsonDump.json", "w") as outfile:
            json.dump(RAW_DATA, outfile)
    except:
        print("JSON Dump error.")
## ----------------------------------------------------------------------------
## ----------------------------------------------------------------------------
## the purpose of this function is to reset the number of pages visited
def reset_total_pages():
    global TOTAL_PAGES
    TOTAL_PAGES = 0
## the purpose of this function is to reset the table
def reset_table():
    global TABLE
    TABLE = PrettyTable(TABLE_HEADERS)
## ----------------------------------------------------------------------------
## ----------------------------------------------------------------------------
## the purpose of this function is to help the init function below
def date_string_to_numbers(s):
    arr = s.split("/")
    ## this is a helper function to get the month name to an int
    def monthToNum(shortMonth):
        return {'Jan':1,'Feb':2,'Mar':3,'Apr':4,'May':5,'Jun':6,'Jul':7,
                'Aug':8,'Sep':9,'Oct':10,'Nov':11,'Dec':12}[shortMonth]
    ## zero-pad the day and month so the concatenated string compares
    ## correctly (zfill also avoids double-padding a day that already
    ## reads "09")
    day = arr[0].zfill(2)
    month = str(monthToNum(arr[1][0:3])).zfill(2)
    year = "20" + arr[2]
    ## return as an int after concatenating the strings
    return int(year + month + day)
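## quick sanity check of the helper above, with a made-up date string:
##   date_string_to_numbers("5/Jan/24") -> 20240105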
## ----------------------------------------------------------------------------
###############################################################################
###############################################################################
## the purpose of this function is to visit a page and then interact with
## the dom
def interact(url, initUrl, month):
    initTest = True
    price = 0.00
    prTag = ""
    rem = "NO"
    try:
        ## find the path to the geckodriver and set it
        ## (assumes the geckodriver binary sits next to this script)
        path = os.path.dirname(os.path.realpath(__file__))
        driver = webdriver.Firefox(executable_path=os.path.join(path, "geckodriver"))
        ## hide the window when possible
        driver.minimize_window()
        driver.get(BASE_URL + url)
        ## try to find the button
        try:
            btn = driver.find_element_by_name('submit')
            btn.click()
            ## clear the console
            clear = lambda: os.system('cls' if os.name == 'nt' else 'clear')
            clear()
            ## ensure that the back end has had a chance to catch up
            ## (a fixed sleep keeps things simple; an explicit WebDriverWait
            ## on the price element would carry on as soon as it appears)
            time.sleep(30.0)
            ## try to find the price
            try:
                prTag = driver.find_element_by_class_name("totalPrice")
                initTest = True
                try:
                    price = float(prTag.get_attribute('innerHTML'))
                except:
                    pass
            except:
                initTest = False
        ## if the button isn't there, just mark the page as failed
        except:
            initTest = False
        ## quit the browser either way; quit() (rather than close()) also
        ## shuts down the geckodriver process, so they don't pile up
        driver.quit()
    except:
        try:
            driver.quit()
            initTest = False
        except:
            initTest = False
    if (initTest == False or price == 0.0):
        rem = "YES"
    numbs = str(get_number_of_runs())
    now = datetime.datetime.now()
    formatTime = str(now.day) + "/"
    formatTime += str(now.month) + "/" + str(now.year) + " - "
    formatTime += str(now.hour) + ":" + str(now.minute) + ":"
    formatTime += str(now.second)
    inc_total_pages()
    ## add data to the global table
    inc_table([
        numbs,
        url,
        initUrl,
        get_total_pages(),
        month,
        rem,
        price,
        formatTime
    ])
    ## print the global table
    print("\n")
    print(get_table())
    print("\n")
###############################################################################
###############################################################################
## this is a function which houses the initial logic that's
## required to make this program run
def init():
    ## set up global variables
    inc_number_of_runs()
    reset_total_pages()
    reset_table()
    current_month = datetime.datetime.today().month - 1
    ## create the date check
    now = datetime.datetime.now()
    ## get the current month
    tmpMonth = now.month
    if (tmpMonth < 10):
        tmpMonth = "0" + str(tmpMonth)
    ## get the current day
    tmpDay = now.day
    if (tmpDay < 10):
        tmpDay = "0" + str(tmpDay)
    ## add current year onto the above two variables and then parse to int
    date_to_beat = int(str(now.year) + str(tmpMonth) + str(tmpDay))
    ## loop through each month
    for i in range(len(MONTHS)):
        ## pointless doing/trying out of date pages
        if (i < current_month):
            continue
        month = MONTHS[i]
        ## loop through each url
        for j in range(len(URLS)):
            url = BASE_URL + URLS[j] + "_" + month + "?removeCache=true"
            data = requests.get(url)
            soup = BeautifulSoup(data.content, 'html.parser')
            rows = soup.find_all("div", {"class": "dealcontainer"})
            ## loop through each link on the page
            for k in range(len(rows)):
                row = rows[k]
                ## get the link
                atag = row.find("a", {"class": "button-solid"})
                nestedUrl = atag['href']
                ## see if it's worth parsing this url or moving on to the
                ## next link
                ptag = row.find("p", {"class": "sub-title"})
                dateData = ptag.find("span").text
                row_date = date_string_to_numbers(str(dateData))
                if (row_date < date_to_beat):
                    continue
                interact(nestedUrl.replace("/", "", 1), URLS[j], month)
###############################################################################
###############################################################################
## run init
init()
## schedule init to run every 8 hours
scheduler = BlockingScheduler()
scheduler.add_job(init, 'interval', hours=8)
scheduler.start()
###############################################################################