
[ Optimization of this Python code - webscraping and output results to CSV file ]

I am trying to scrape data from a few thousand pages. The code I have works fine for about 100 pages, but then slows down dramatically. I am pretty sure that my Tarzan-like code could be improved, so that the speed of the web scraping process increases. Any help would be appreciated. TIA!

Here is the simplified code:

    import csv
    import urllib.parse
    import urllib.request

    from bs4 import BeautifulSoup

    csvfile = open('test.csv', 'w', encoding='cp850', errors='replace')

    list_url = ["http://www.randomsite.com"]

    for url in list_url:
        base_url_parts = urllib.parse.urlparse(url)
        while True:
            raw_html = urllib.request.urlopen(url).read()
            soup = BeautifulSoup(raw_html, "lxml")

            #### scrape the page for the desired info

            # Zip the data
            output_data = zip(variable_1, variable_2, variable_3, ..., variable_10)

            # Write the observations to the CSV file
            writer = csv.writer(open('test.csv', 'a', newline='', encoding='cp850', errors='replace'))
            writer.writerows(output_data)

            # Construct the URL of the next page
            url_test = base2 + url_part2
            if url_test != None:
                url = url_test


EDIT: Thanks for all the answers, I learned quite a lot from them. I am (slowly!) learning my way around Scrapy. However, I found that the pages are available via bulk download, which will be an even better way to solve the performance issue.

Answer 1

The main bottleneck is that your code is synchronous (blocking). You don't proceed to the next URL until you finish processing the current one.

You need to make the requests asynchronous, either by switching to Scrapy, which solves this problem out of the box, or by building something yourself with, for example, grequests.
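As a minimal sketch of the idea, here is concurrent fetching using the standard library's concurrent.futures as a stand-in for grequests; the URLs and the stubbed fetch function are placeholders, not the asker's real site:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # In real code this would be urllib.request.urlopen(url).read();
    # stubbed here so the sketch runs without network access.
    return "<html>%s</html>" % url

urls = ["http://www.randomsite.com/page%d" % i for i in range(5)]

# Issue several requests at once instead of waiting for each page
# to finish downloading before starting the next one.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, urls))
```

With a real fetch function, the threads spend most of their time blocked on network I/O, so even Python's GIL does not prevent a large speedup here.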

Answer 2

If you want to go really fast without a lot of complex code, you need to A) Segregate the requests from the parsing (because the parsing is blocking the thread you'd otherwise use to make the request), and B) Make requests concurrently and parse them concurrently. So, I'd do a couple of things:

  1. Request all pages asynchronously using eventlet. I've struggled with async HTTP frameworks in Python and find eventlet the easiest to learn.
  2. Every time you successfully fetch a page, store the HTML somewhere. A) You could write it to individual HTML files locally, but you'll have a lot of files on your hands. B) You could store this many records as strings (str(source_code)) and put them in a native data structure, so long as it's hashed (probably a set or dict). C) You could use a super-lightweight but not particularly performant database like TinyDB and stick the page source in JSON files. D) You could use a third-party library's data structures for high-performance computing, like a pandas DataFrame or a NumPy array. They can easily store this much data but may be overkill.
  3. Parse each document separately after it's been retrieved. Parsing with lxml will be extremely fast, so depending on how fast you need to go you may be able to get away with parsing the files sequentially. If you want to go faster, look up a tutorial on multiprocessing in Python. It's pretty darn easy to learn, and you'd be able to concurrently parse X documents, where X is the number of available cores on your machine.

Answer 3

Perhaps this is just an artifact of the simplification, but it looks like you are opening 'test.csv' multiple times and closing it only once. Not sure if that's the cause of the unexpected slowdown when the number of URLs grows above 100, but if you want all the data to go into one CSV file, you should stick to opening the file and creating the writer once at the top, as you're already doing, and not do it inside the loop.

Also, about the logic of constructing the new URL: isn't url_test != None always true? Then how do you exit the loop? On the exception when urlopen fails? If so, you should have a try-except around that. This is not a performance issue, but any kind of clarity helps.
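A corrected skeleton of that part might look like this, with the writer created once and an explicit exit from the loop; the scrape function, its rows, and the next-URL logic are placeholders for the asker's real code:

```python
import csv
import urllib.error

def scrape(url):
    # Placeholder for the real fetch-and-parse step; returns the rows
    # for this page plus the next URL (or None when there isn't one).
    return [("a", 1), ("b", 2)], None

url = "http://www.randomsite.com"

# Open the file and create the writer once, not on every iteration.
with open("test.csv", "w", newline="", encoding="cp850", errors="replace") as csvfile:
    writer = csv.writer(csvfile)
    while url is not None:
        try:
            rows, url = scrape(url)
        except urllib.error.URLError:
            break  # exit explicitly instead of relying on an unhandled exception
        writer.writerows(rows)
```

The `with` block also guarantees the file is closed and flushed exactly once, however the loop ends.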