1/10/2017

#Process Pools

I read this beautiful article by Adam Geitgey about process pools. You can read the full article on his page; I just wanted to keep a quick note here as a reminder to self.

Note: Using a Process Pool requires passing data back and forth between separate Python processes. If the data you are working with can’t be efficiently passed between processes, this won’t work.
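
A quick way to check whether the data can make that trip is to see whether it pickles, since arguments and return values are serialized with pickle on their way to and from the worker processes. A minimal sketch (the function names are just for illustration):

import pickle

def double(x):
    return x * 2

# A plain module-level function and simple data pickle fine,
# so they can be shipped to a worker process.
pickle.dumps(double)
pickle.dumps([1, 2, 3])

# A lambda has no importable name, so it cannot be pickled and
# therefore cannot be handed to a ProcessPoolExecutor.
try:
    pickle.dumps(lambda x: x * 2)
except Exception as exc:
    print("Can't cross the process boundary:", exc)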

####Regular method to create thumbnails in Python (3.6+ for the f-strings)

import glob
import os
from PIL import Image

def make_image_thumbnail(filename):
    # The thumbnail will be named "<original_filename>_thumbnail.jpg"
    base_filename, file_extension = os.path.splitext(filename)
    thumbnail_filename = f"{base_filename}_thumbnail{file_extension}"
    # Create and save thumbnail image
    image = Image.open(filename)
    image.thumbnail(size=(128, 128))
    image.save(thumbnail_filename, "JPEG")
    return thumbnail_filename

# Loop through all jpeg files in the folder and make a thumbnail for each
for image_file in glob.glob("*.jpg"):
    thumbnail_file = make_image_thumbnail(image_file)
    print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")

Time it to get a measurement of the total time taken. Keep in mind that the code above still runs as a single process on a single CPU, so most of the machine's compute power sits idle.

$ time python thumbnails_1.py
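
If you would rather measure inside the script than rely on the shell's time command, time.perf_counter() works just as well. A minimal sketch of the same loop with timing wrapped around it, assuming make_image_thumbnail and the imports from the snippet above are already in scope:

import time

start = time.perf_counter()
for image_file in glob.glob("*.jpg"):
    make_image_thumbnail(image_file)  # the function defined above
elapsed = time.perf_counter() - start
print(f"Took {elapsed:.2f} seconds")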

####Creating thumbnails using Process Pools

import glob
import os
import concurrent.futures
from PIL import Image

def make_image_thumbnail(filename):
    # The thumbnail will be named "<original_filename>_thumbnail.jpg"
    base_filename, file_extension = os.path.splitext(filename)
    thumbnail_filename = f"{base_filename}_thumbnail{file_extension}"
    # Create and save thumbnail image
    image = Image.open(filename)
    image.thumbnail(size=(128, 128))
    image.save(thumbnail_filename, "JPEG")
    return thumbnail_filename

# The __main__ guard keeps worker processes from re-running this block
# when they import the module (needed on Windows and macOS)
if __name__ == "__main__":
    # Create a pool of processes. By default, one is created for each CPU in your machine.
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Get a list of files to process
        image_files = glob.glob("*.jpg")
        # Process the list of files, but split the work across the process pool to use all CPUs!
        for image_file, thumbnail_file in zip(image_files, executor.map(make_image_thumbnail, image_files)):
            print(f"A thumbnail for {image_file} was saved as {thumbnail_file}")

Always time it, so the two versions can be compared.

$ time python thumbnails_2.py
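
executor.map() hands results back in the same order as the inputs. If you want to cap the number of worker processes or handle each result as soon as it finishes, submit() together with as_completed() covers that. A rough sketch along the same lines; the max_workers value is arbitrary, and make_image_thumbnail is the function defined above:

import concurrent.futures
import glob

if __name__ == "__main__":
    # Cap the pool at 4 workers instead of the default of one per CPU
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        # submit() returns a Future per file; as_completed() yields each one
        # as soon as it finishes, rather than in submission order
        futures = {executor.submit(make_image_thumbnail, f): f for f in glob.glob("*.jpg")}
        for future in concurrent.futures.as_completed(futures):
            print(f"A thumbnail for {futures[future]} was saved as {future.result()}")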

####Where is this approach a good fit?

  • Grabbing statistics out of a collection of separate web server log files
  • Parsing data out of a bunch of XML, CSV or JSON files (see the sketch after this list)
  • Pre-processing lots of images to create a machine learning data set
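
To illustrate the CSV case from the list above, the exact same pattern applies; a rough sketch with a made-up per-file task (counting rows) standing in for the real parsing:

import concurrent.futures
import csv
import glob

def count_rows(csv_path):
    # The per-file work that gets farmed out to a worker process
    with open(csv_path, newline="") as f:
        return csv_path, sum(1 for _ in csv.reader(f))

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for path, rows in executor.map(count_rows, glob.glob("*.csv")):
            print(f"{path}: {rows} rows (including the header)")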
