
[ Python: Download multiple files quickly ]

In Python, how can I download a bunch of files quickly? urllib.urlretrieve() is very slow, and I'm not sure how to go about this.

I have a list of 15-20 files to download, and it takes forever just to download one. Each file is about 2-4 MB.

I have never done this before, and I'm not really sure where to start. Should I use threading to download a few files at a time? Or should I use threading to download pieces of each file, one file at a time? Should I even be using threading at all?

Answer 1


urllib.urlretrieve() is very slow

Really? If you've got 15-20 files of 2-4 MB each, then I'd just line 'em up and download 'em. The bottleneck is going to be the bandwidth between your server and yourself. So IMHO, it's hardly worth threading or trying anything clever in this case...
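For example, a plain sequential loop with urlretrieve would look something like this (a minimal sketch; the URL list and local filenames are placeholders, and it assumes Python 2's urllib like the urlretrieve() in your question):

import os
import urllib

# Hypothetical list of download URLs -- substitute your own.
urls = [
    'http://example.com/files/report1.pdf',
    'http://example.com/files/report2.pdf',
]

for url in urls:
    # Save each file under its basename in the current directory.
    filename = os.path.basename(url)
    urllib.urlretrieve(url, filename)
    print 'downloaded %s' % filename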

Answer 2


One solution (which is not Python-specific) is to save the download URLs to a file and fetch them with a download manager such as wget or aria2. You can invoke the download manager from your Python program.
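For instance, you could write the URLs to a text file and hand it to wget with its -i option (a rough sketch; the URL list and the urls.txt file name are placeholders):

import subprocess

urls = [
    'http://example.com/files/a.zip',
    'http://example.com/files/b.zip',
]

# Write one URL per line so the download manager can read the whole list.
with open('urls.txt', 'w') as f:
    f.write('\n'.join(urls))

# wget -i downloads every URL listed in the given file.
subprocess.call(['wget', '-i', 'urls.txt'])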

But as @Jon mentioned, this is not really necessary in your case; urllib.urlretrieve() is good enough for it!

Another option is to use Mechanize to download the files.
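A bare-bones sketch with Mechanize (assuming the mechanize package is installed; the URL list is again a placeholder) could look like this:

import os
import mechanize

urls = [
    'http://example.com/files/a.zip',
    'http://example.com/files/b.zip',
]

br = mechanize.Browser()
for url in urls:
    # Read the response body and write it out under the URL's basename.
    data = br.open(url).read()
    with open(os.path.basename(url), 'wb') as f:
        f.write(data)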

Answer 3


stream.py is a somewhat experimental, yet cute, interface for parallel Python (via threads or processes) based on ideas from dataflow programming. A URL retriever is provided in its examples.

Since it's short:

#!/usr/bin/env python

"""
Demonstrate the use of a ThreadPool to simultaneously retrieve web pages.
"""

import urllib2
from stream import ThreadPool

URLs = [
    'http://www.cnn.com/',
    'http://www.bbc.co.uk/',
    'http://www.economist.com/',
    'http://nonexistant.website.at.baddomain/',
    'http://slashdot.org/',
    'http://reddit.com/',
    'http://news.ycombinator.com/',
]

def retrieve(urls, timeout=30):
    for url in urls:
        yield url, urllib2.urlopen(url, timeout=timeout).read()

if __name__ == '__main__':
    # Pipe the URL list through a pool of 4 worker threads running retrieve().
    retrieved = URLs >> ThreadPool(retrieve, poolsize=4)
    for url, content in retrieved:
        print '%r is %d bytes' % (url, len(content))
    # URLs that raised an exception are collected on the pool's failure list.
    for url, exception in retrieved.failure:
        print '%r failed: %s' % (url, exception)

You would just need to replace urllib2.urlopen(url, timeout=timeout).read() with urlretrieve....
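For example, the retrieve() generator could be rewritten roughly like this to save each file to disk instead of reading it into memory (the filename scheme below is only an illustration, and note that urllib.urlretrieve() has no timeout parameter):

import os
import urllib

def retrieve(urls):
    for url in urls:
        # Derive a local filename from the URL; this naive scheme is only
        # an illustration and will misbehave for URLs that end in '/'.
        filename = os.path.basename(url.rstrip('/')) or 'index.html'
        urllib.urlretrieve(url, filename)
        yield url, filename

The rest of the example stays the same, except that the pipeline then yields (url, filename) pairs instead of (url, content), so the final print statements need a small tweak.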