A fast way to do multi-URL parsing

Since opening URLs sequentially, especially hundreds of them, is very slow, this is a perfect case for parallel computing. Two components largely determine the speed of this task: opening the URL and reading the content from the website. So, I will briefly describe a fast way to do multi-URL parsing.

First, we should try a multithreading/multiprocessing package. Currently, the three popular ones are multiprocessing, concurrent.futures, and threading. These packages let us open many URLs at the same time, which increases the speed.
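For instance, a minimal sketch with concurrent.futures could look like the following (the ticker URLs and the worker count here are just placeholders, not part of the examples further down); one nice property is that executor.map returns results in the original input order.

import requests
from concurrent.futures import ThreadPoolExecutor

# Placeholder pages; substitute your own URL list
urls = ['http://finance.yahoo.com/quote/AAPL',
        'http://finance.yahoo.com/quote/GOOG']

def fetch(url):
    # Download one page and return its raw content
    return requests.get(url).content

# Run the downloads on a thread pool; map() keeps the input order
with ThreadPoolExecutor(max_workers=10) as executor:
    pages = list(executor.map(fetch, urls))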

More importantly, even after switching to multithreading, if you try to open hundreds of URLs at the same time you will find that urllib.request.urlopen is very slow, and opening and reading the content becomes the most time-consuming part. So if you want to make it even faster, try the requests package: requests.get(url).content is faster than urllib.request.urlopen(url).read().
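As a rough sanity check, something like the sketch below can compare the two calls (the URL and the repetition count are placeholders, and the actual gap depends on the network and the site):

import time
import urllib.request
import requests

# A small hypothetical sample: fetch the same page a few times with each method
urls = ['http://finance.yahoo.com/quote/AAPL'] * 10

start = time.time()
for url in urls:
    data = urllib.request.urlopen(url).read()
print('urllib elapsed: %.2fs' % (time.time() - start))

start = time.time()
for url in urls:
    data = requests.get(url).content
print('requests elapsed: %.2fs' % (time.time() - start))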

So, here I list two examples of fast multi-URL parsing, and they are faster than the other answers. The first example uses the classic threading package and spawns hundreds of threads at the same time. (One trivial shortcoming is that it does not keep the original order of the tickers.)

import time
import threading
import pandas as pd
import requests
from bs4 import BeautifulSoup


ticker = pd.ExcelFile('short_tickerlist.xlsx')
ticker_df = ticker.parse(str(ticker.sheet_names[0]))
ticker_list = list(ticker_df['Ticker'])

start = time.time()

result = []
def fetch(ticker):
    url = ('http://finance.yahoo.com/quote/' + ticker)
    print('Visit ' + url)
    text = requests.get(url).content
    soup = BeautifulSoup(text, 'lxml')
    result.append([ticker, soup])
    print(url +' fetching...... ' + str(time.time()-start))



if __name__ == '__main__':
    process = [None] * len(ticker_list)
    for i in range(len(ticker_list)):
        process[i] = threading.Thread(target=fetch, args=[ticker_list[i]])

    for i in range(len(ticker_list)):    
        print('Start_' + str(i))
        process[i].start()



    # Join all threads so the elapsed time reflects the finished downloads
    for i in range(len(ticker_list)):
        print('Join_' + str(i))
        process[i].join()

    print("Elapsed Time: %ss" % (time.time() - start))

The second example uses the multiprocessing package, and it is a little more straightforward, since you only need to specify the pool size and map the function. The order of the results does not change after fetching the content, and the speed is similar to the first example but much faster than the other methods.

from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import time

os.chdir('file_path')  # change to the folder that contains the Excel file

start = time.time()

def fetch_url(x):
    print('Getting Data')
    myurl = ("http://finance.yahoo.com/q/cp?s=%s" % x)
    html = requests.get(myurl).content
    soup = BeautifulSoup(html, 'lxml')
    out = str(soup)
    listOut = [x, out]
    return listOut

tickDF = pd.read_excel('short_tickerlist.xlsx')
li = tickDF['Ticker'].tolist()    

if __name__ == '__main__':
    p = Pool(5)
    output = p.map(fetch_url, li, chunksize=30)
    print("Time is %ss" %(time.time()-start))
