A Guide on Multithreading and Multiprocessing: from the OS Kernel to ML Applications in Python.

A deep dive into multithreading and multiprocessing, their relationship with the OS, and how to use them in Python to speed up Machine Learning tasks.
en
ML
python
multithreading
multiprocessing
Author

Diego Andrés Gómez Polo

Published

February 15, 2024

Important

DRAFT - WORK IN PROGRESS

Motivation

If you’ve worked in software for long enough, you’ve probably heard people in the wild throwing around terms like multithreading, multiprocessing, asynchronous programming, concurrency, parallelism, and so on. Often, and mistakenly, these terms are used interchangeably to refer to stuff that happens ‘at the same time’ in a computer program. You will see in this blog how most of the time this ‘parallel behaviour’ is actually a carefully orchestrated illusion by the OS kernel and the CPU that lies at the heart of your computer. You will also learn how these concepts apply to Python and ML tasks in both CPU-bound and I/O-bound scenarios (don’t worry, we will define these terms later ;D) and how a proper understanding of these concepts can help you speed up your ML tasks.

Personal Motivation:

The contents of this blog are mostly based on the first 2 weeks of the CUDA Cohort by Cohere for AI, an Open Science community which I am part of.

Kudos to the C4AI team for the amazing content and the great community they are building.

Warning

Since this blog is in part a note to myself and in part a guide for others, there will be a mix of didactic and highly technical content. I will try to keep a balance and annotate the most technical parts with a warning like this one, so you can skip them if you are not interested in the nitty-gritty details.

Parallelism vs Concurrency

Many programmers, even experienced ones, think that adding more threads to a program will magically make it faster. It is not rare that when they actually do this, the program becomes slower, and they are left scratching their heads, questioning every piece of knowledge they have received about how a computer should work.

Oftentimes, this confusion stems from a misunderstanding of two concepts: Parallelism and Concurrency.

There are many great resources on the internet which try to explain the difference between the two, but in my opinion many of them get it wrong. A good explanation I’ve found is by Rob Pike in his talk Concurrency is not Parallelism. In this talk, Rob Pike explains that concurrency is about dealing with lots of things at once, while parallelism is about doing lots of things at once.

This may seem rather abstract, so to clear things up, I aspire to very precise wording in my explanations: Concurrency is a property of the structure of a program, while Parallelism talks about the execution of a program. A concurrent program may be run completely sequentially, and most of the time this is the case, but it has a structure such that parts of it can be executed in parallel, potentially making it faster.

This is best seen with the following diagram:

Figure 1: Structure of a concurrent program

In Figure 1 we can see the abstract representation of a program that has to complete many tasks (T_0, T_0^1, T_0^2, \cdots, T_2, T_3) before exiting.

Crucially, this is NOT a representation of how the program runs; it represents the dependencies between the tasks. For example, task T_2^1 requires that T_1^1 be completed before it runs, but doesn’t really care about T_0^3: whether it has already run, will run, or will never run. In the same way, T_0^3 requires T_0 to be completed before it runs, but doesn’t really care about T_2^1.

A Note for the Data Engineers

Those of you who have worked with Directed Acyclic Graphs (DAGs) in the context of data processing frameworks like Apache Airflow, Dagster, or Dask may have already identified the structure of a concurrent program as a DAG. This is not a coincidence, as it is precisely the same thing, where in this case the nodes are the tasks (instructions) of the program and the edges are the dependencies between them.

You will see in the following section that your good old Airflow Scheduler is very similar to the OS Kernel Scheduler, but on a higher level of abstraction.

The important observation here is that because of this lack of dependency between T_0^3 and T_2^1, they are suited for parallel execution and if executed as such, can potentially make the program faster.

For a more intuitive view, you can think of the tasks that a program runs as the functions it calls: some functions require a prior call to other functions to run properly (the function read_file probably requires that we first call a function open_file) and some functions don’t really care about the result of others (you can probably run print("Hello World") at any point in your program without any problem). 1

Understanding what parts of your program constitute a block of tasks suited for parallel execution is the first step to make it faster.
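To make this concrete, here is a minimal sketch (my own illustration, with made-up task names, not code from the original post) of how you could represent the happens-before dependencies of a program as plain Python data and ask which tasks are currently ready to execute:

# A tiny sketch: dependencies of Figure 1-style tasks as a dict.
deps = {
    "T0": [],          # T0 depends on nothing
    "T0_1": ["T0"],    # T0_1 must wait for T0
    "T0_2": ["T0"],    # T0_2 must wait for T0 (independent of T0_1)
    "T1": ["T0_1", "T0_2"],
}

def ready_tasks(deps, done):
    """Tasks whose dependencies are all completed can run (possibly in parallel)."""
    return [t for t, reqs in deps.items()
            if t not in done and all(r in done for r in reqs)]

print(ready_tasks(deps, done=set()))    # ['T0']
print(ready_tasks(deps, done={"T0"}))   # ['T0_1', 'T0_2']  <- suited for parallel execution

Any two tasks returned together by ready_tasks have no dependency on each other, so they are exactly the kind of block suited for parallel execution.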

I hope it is also clear by now that Parallelism is just a synonym for Parallel Execution. So, indeed, this is the proper word to use when you want to refer to something “running at the same time” in a computer program.

Now, let’s step away from the world of theory and actually get to know our computers…

If our CPUs had unlimited parallelism, we could just throw all the tasks suited for parallel execution at it and it would run them all at the same time, making our programs run faster. But, as you will see in the next section, this is not the case, and we need to understand how our computers work to make the most out of them.

I am crossing my fingers so that, when you finish reading the next section, you will not be that poor programmer scratching their head after creating 128 threads on a Core i5 processor.

TL;DR: Concurrency is about the structure of a program, while Parallelism is about its execution

Concurrency, DAGs and Strict Partial Orders

If you have a little bit more of a formal background, you may have already identified that the concept of Concurrent Structure presented earlier can be precisely defined in terms of basic set theory and binary relations. Concretely, the structure of a concurrent program can be seen as a Strict Partial Order where the elements of the set are the tasks t \in T of the program and the relation R \subset T \times T is the happens-before relation, where (t, t') \in R iff t happens-before t'. Moreover, two tasks t and t' are suited for parallel execution if and only if they are not comparable in the partial order, that is, (t, t') \notin R and (t', t) \notin R.
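As a small illustration of this definition (a sketch with hypothetical task names, not from the post), two tasks are suited for parallel execution precisely when neither can reach the other in the happens-before DAG:

# Happens-before edges of a hypothetical task graph.
edges = {"T0": ["T0_1", "T0_3"], "T0_1": ["T1_1"], "T1_1": ["T2_1"], "T0_3": [], "T2_1": []}

def reaches(graph, src, dst):
    """Depth-first search: does `src` happen-before `dst` (transitively)?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return False

def parallelizable(graph, a, b):
    # Not comparable in the partial order: neither reaches the other.
    return not reaches(graph, a, b) and not reaches(graph, b, a)

print(parallelizable(edges, "T0_3", "T2_1"))  # True: not comparable
print(parallelizable(edges, "T0", "T2_1"))    # False: T0 happens-before T2_1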

It is also worth noting that the notation used in Figure 1 is not generalizable to all concurrent programs. If it were, that would imply the existence of a constant time algorithm \mathcal{O}(1) with linear space \mathcal{O}(E) which finds whether two nodes are connected in a Directed Acyclic Graph (DAG) without executing any path traversal algorithm. This would be the greatest solution ever discovered for calculating the Transitive Closure of a Graph… (Indeed, this is me justifying my simplistic notation :p)

Parallelism and Concurrency in your Computer

Identifying which parts of your program are suited for parallel execution is half of the recipe. Let’s see the other half.

The CPU or Central Processing Unit is the core component of your machine that is doing the heavy lifting when you run a program. It is the one2 doing all the math and logical operations necessary for providing you with a fully rendered and interactive screen running the latest versions of Chrome, Safari, Firefox, or, god forbid, Internet Explorer. But how does it actually do this?

Believe me when I tell you, this is close to magic, and for non-technical folks, it is actually magic. I have seen people on the internet saying that humans could never have invented such a complicated device, and that this is proof that aliens exist and have visited us. But, as you will see, it is actually a very well thought out and engineered piece of technology.

Yeah, probably most of the magic is in the Photolithography process, and the aliens are the Dutch, but I am not that kind of engineer, so I will stick to the software side of things.

CPU Basics

Your CPU consists of three main components: the Control Unit, the Arithmetic Logic Unit (ALU), and the Registers. This design follows the von Neumann architecture, which is the basis for most modern computers.

Figure 2: Simple Diagram of CPU with Von Neumann Architecture

Like the conductor of an orchestra swinging the baton, the Control Unit is the part of the CPU that reads and interprets instructions from memory. It then directs the operation of the other units by providing control signals. The Arithmetic Logic Unit (ALU) is the part of the CPU that performs arithmetic and logical operations. The Registers are small, fast storage locations within the CPU that hold data temporarily during processing.

Unlike an orchestra, the tempo or speed in a CPU is not set by the conductor or the Control Unit, but by a specific hardware component called the Clock. The clock is a device that generates a signal at a constant frequency that is used to synchronize the operation of the CPU. The frequency of the clock is measured in Hertz (Hz) and is the number of clock cycles per second.

Your CPU also interacts with external storage and I/O devices via the Buses which are basically wires connecting things. There are many kinds of buses which allow the CPU to talk to different components like the RAM, the SSD, the Keyboard, the Display, and so on.

But why do we care about the Registers, the Clock and so on if we are talking about parallelism and concurrency? Bear with me, we are getting there.

Well, it turns out that even though this design is capable of supporting a complete system running many apps, it is NOT running them all at the same time. In fact, in the simplest case, a given CPU core is only capable of running a single instruction, from a very limited instruction set, at a time.

Everything your computer ever does must be translated, in one way or another, to a series of these simple instructions 3, which are interpreted by the Control Unit and then executed either by the Control Unit itself when they involve control flow operations or by the ALU when they involve arithmetic or logical operations.

But clearly you have seen your computer running many things at the same time, right? The music in the background doesn’t just stop playing if you open a new tab in your browser, right? Well, in some sense, it does, but it is so fast that you don’t notice it.

Figure 3: Interleaving of Program Instructions

To give you the illusion that your computer is running many things at the same time, the OS kernel and the CPU have to do some magic…

In the dark old ages, prior to 2005, there was no parallelism AT ALL in your personal PC, scary times indeed my friends :c. All the consumer CPUs on the market followed the standard single-core architecture we just described. For them to run many applications without one completely freezing while waiting for another to finish, we basically had a single option: to interleave between programs, taking turns executing some of their instructions. A little bit of Word here, OS kernel, some more Excel there, OS kernel, let’s do networking stuff, OS kernel, and so on, really, really, REALLY FAST, and just like that, creating a perfectly orchestrated illusion of parallelism while providing you with a fully interactive experience.

And if you wanted faster CPUs you just needed to increase the frequency of the clock, interleaving ever faster, without worrying about true parallelism and silly things like that.
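The following toy sketch (my own illustration, nothing like real kernel code) mimics that interleaving with plain Python generators: a single loop plays the role of the single core, advancing each "program" one step at a time:

def word_processor():
    for i in range(3):
        yield f"word: keystroke {i}"

def music_player():
    for i in range(3):
        yield f"music: audio chunk {i}"

programs = [word_processor(), music_player()]
while programs:
    prog = programs.pop(0)       # take the next "program" in line
    try:
        print(next(prog))        # let it run one "instruction"
        programs.append(prog)    # put it back at the end of the queue
    except StopIteration:
        pass                     # this program finished, drop it

Run it fast enough and, from the outside, the word processor and the music player appear to run "at the same time".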

Multi-core CPUs

But it turns out that the universe has some physical limitations, and we hit a wall on how many clock cycles we could squeeze into a second back in the early 2000s. This is known as the breakdown of Dennard scaling, and it is the reason why we don’t have 10 GHz CPUs in our laptops: basically, they would melt.

So instead of increasing the frequency of the clock, we started adding more cores to the CPU, striving for True Parallelism.

Thanks to this new design, we no longer had a single core running many things; we had many cores running many things, and the OS kernel has to orchestrate the execution of all of them, giving them time to run their instructions and taking them out of the CPU at specific intervals, to let the rest of the programs execute some of their instructions. What we saw in Figure 3 now looks more like this:

Figure 4: Interleaving Instructions in a Multi-core CPU

Stuff still gets interleaved, but now we have many interleaving streams, allowing for true parallelism.

Clearly, the OS kernel has to be a little bit more sophisticated now, implementing what people call Symmetric Multiprocessing (SMP), but the basic idea is the same: it has to give execution time to every program that wants to run, and it has to do so in a way that is fair and efficient.

‘Fair and efficient’ is a very broad term, and it is actually a very hard problem to solve, but the OS kernel has some tools to help it with this task, and one of the most important is the Scheduler, which assigns execution time to the Threads of a Process. We will talk about all of these terms in the following sections, the problems that arise with true parallelism, and how to solve them.

Hardware Multi-threading

Before getting into Processes, Threads, Context Switches and all that, I shall admit that I’ve been cutting corners while drawing Figure 3 and Figure 4. It turns out that in modern architectures the ALU is not the only Execution Unit; instead, we have a collection of highly specialized circuits that can perform different operations. For example, we can have one circuit for integer operations and another for floating-point operations, called the Floating-Point Unit. The specific execution units depend on the Microarchitecture of the CPU, but the important thing is that we have more than one.

What this implies in practice is that if we only run one instruction per clock cycle, most likely some part of the processor will be idle, and we would not be making the most out of our hardware.

That is why some great geniuses came up with Superscalar Processors and Simultaneous Multi-threading (SMT), which allow for the execution of multiple instructions per clock cycle, by having the CPU dispatch multiple instructions from the Instruction Queue per clock cycle, as long as those instructions are meant to be executed by different circuits. [1]

Figure 5: CPU Core with Two Execution Threads

With this architecture, each CPU core can improve its parallelism, without making too many changes to the overall design, and without increasing the power consumption too much.

On most of today’s consumer-grade CPUs, the CPU exposes 2 execution contexts per core, see Figure 5, which are commonly referred to as Threads, and the OS kernel usually treats them as such, giving them execution time as if they were different cores. [2]
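You can see this from Python: os.cpu_count() reports the logical CPUs (hardware threads) the OS exposes. As a small extra, the third-party psutil package, if you happen to have it installed, can also report physical cores, which on an SMT machine is typically half the logical count:

import os

# os.cpu_count() reports *logical* CPUs, i.e. the hardware threads the OS sees.
print("Logical CPUs (hardware threads):", os.cpu_count())

# Optional: the third-party `psutil` package (pip install psutil) distinguishes
# physical cores from logical CPUs.
try:
    import psutil
    print("Physical cores:", psutil.cpu_count(logical=False))
    print("Logical CPUs  :", psutil.cpu_count(logical=True))
except ImportError:
    pass  # psutil not installed; skip the physical-core count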

Further Reading

We are not Electronics Engineers here, nor chip designers (or maybe you are, who knows), but this is a fascinating topic, so if you want a deeper dive into this subject, I recommend resources like WikiChip and the Intel Developer Manual for a more detailed explanation of the microarchitecture of modern CPUs.

Programs, Processes, and Threads

From now on, we will, mostly, talk about software. From a low level with the OS Kernel, to a very high level with Python, but still, software.

In the last chapter we saw how in most modern CPUs we do have a way of doing actual parallelism, by using multiple CPU cores. Moreover, even inside a single CPU core, we can have multiple Threads of execution, which can be used to improve parallelism. But, we also saw that the OS Kernel has to orchestrate the execution of all of them.

This magic OS Kernel is just another piece of software, but it is a very special one, with all the privileges to do whatever it wants with the hardware, and it is the one that gives execution time to every program that wants to run.

So, for the OS kernel to nicely orchestrate all of this madness, it needs some structure, some way of managing the state of your application and the state of the hardware.

To do so, it creates some very important abstractions for each Program you wish to run, and even for itself, and these are the Processes and Threads.

Remember, we are talking about software here, so this term Thread is not the same as the hardware threads we just talked about, but it is very closely related.

Every piece of information about your program, its state, the things it needs to properly run, to be safely interrupted by the Kernel and to resume execution, are nicely packed in these abstractions, which formally are Data Structures in the Kernel’s memory, but for us, they are just Processes and Threads.

Definitions

Let’s be precise about these concepts, since they are the key parts a software engineer actually needs to understand to properly program concurrent applications.

A Program is a set of instructions that are meant to be executed by the CPU. This is the code you write in your favorite programming language, which is then compiled or, in the case of Languages like Python, interpreted by another program called the Interpreter. Anyways, the important thing is that a Program is just machine code that the CPU can execute (think, for example, of all of those .exe files).

Figure 6: Process Control Block and Thread Control Block

A Process is an instance of a Program that is being executed by the CPU. This is the actual running of the Program, with all the data structures, memory, and resources that it needs to run properly. The OS Kernel creates a Process for every instance of a Program that you run (if you click on a .exe file many times, your OS will create a separate process for each time).

If you open the Task Manager in Windows, the Activity Monitor in MacOS, or run ps or top in Linux, you will see a list of all the Processes that are currently running on your computer, and you will see that there are many of them, even if you are not running many applications. This is because the OS Kernel also creates Processes for itself, for the System Services, for the Drivers, and for every running instance of your favorite browser, text editor, or game.

A Thread is an execution thread within a Process, i.e., a sequence of instructions, taken from the ones that the Process has in memory, that is meant to be executed.

Clearly, each Thread needs to know which specific instructions from the whole set it must execute, and it needs some resources to do so; that is why each Thread also has an independent Program Counter, Registers, and Stack. However, Threads of the same Process share the same Memory Space, including, as said before, the actual code, as well as the File Descriptors and the Heap.

Threads are also the smallest unit of execution, meaning the OS Kernel will Schedule Threads, not Processes, to run in the CPU. This is because Threads are much lighter than Processes, since they share the same Memory Space, and the OS Kernel can easily switch between them, without having to do a lot of work to save and restore the state of the Process.
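A quick way to observe this from Python: all the threads you spawn report the same process ID but different native thread IDs, and those thread IDs are what the Kernel Scheduler actually schedules. A small sketch:

import os
import threading

def show_ids(name):
    # Same process ID for everyone, but a distinct native (kernel-level) thread ID.
    print(f"{name}: pid={os.getpid()} thread_id={threading.get_native_id()}")

show_ids("main")
workers = [threading.Thread(target=show_ids, args=(f"worker-{i}",)) for i in range(2)]
for t in workers:
    t.start()
for t in workers:
    t.join()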

Process Control Block (PCB): The Actual Data Structure of a Process

PCB is the generic term used to refer to the Data Structure that an OS employs in order to manage Processes. In the case of the Linux Kernel (the only major one where we can see the code), it is called the task_struct and it is a very complex structure that holds all the information about a Process. [3]

Thread Control Block (TCB): The Actual Data Structure of a Thread

Likewise, TCB is the generic term used to refer to the Data Structure that an OS employs in order to manage Threads. Linux implements it by kind of cheating: for the Linux Kernel, a Thread is just another task_struct, the same as before, but with a pointer to a parent task_struct, which is the parent Process of the Thread. All the child Threads of a Process will share some resources but will have some differences specified in the thread_info struct, and you can create a new child Thread by calling the clone() system call.

In this sense, you may see that in Linux we have a tree-like structure of tasks, with the root being the first process, i.e., the init process (PID 1), every other process being a descendant of the init process, and the Threads being the leaves of this tree of tasks. [4] [5] [6] [7] [3]

That is how Linux does it. In other OSs, like Windows, Threads are a separate concept from Processes and have their own Data Structure, but the idea is the same: it holds all the information about the Thread, like the Registers, the Stack, the Program Counter, and so on, as specified in Figure 6, while sibling Threads share the same Memory Space, the same File Descriptors, and the same Heap.
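If you are on Linux, you can peek at this "a thread is just another task" view directly: every thread of the current process appears as a directory under /proc/self/task. A minimal sketch (Linux-only):

import os
import threading
import time

# Spawn a few sleeping threads so there is something to look at.
threads = [threading.Thread(target=time.sleep, args=(1,)) for _ in range(3)]
for t in threads:
    t.start()

# On Linux, each entry is a kernel task ID belonging to this process.
print("Kernel task IDs for this process:", os.listdir("/proc/self/task"))

for t in threads:
    t.join()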

Protection Rings and General Hierarchy

The core distinction between the kernel process and all other processes is implemented at the hardware level by using Protection Rings. These are a set of hardware-enforced protection domains that are used to control the CPU’s access to memory, I/O devices, and other resources. The OS kernel runs in the most privileged ring, usually ring 0, while user processes run in a less privileged ring, usually ring 3. This is a security feature that prevents user processes from accessing system resources directly and ensures that the OS kernel has full control over the system.

Figure 7: Rings of Protection. Fig Source: Wikipedia

Modern microarchitectures have many more than 2 protection rings, but mostly for backward-compatibility reasons, most Operating Systems, including Windows, MacOS and Linux, only use 2, the most privileged and the least privileged, where ring 0 is Supervisor Mode and ring 3 is User Mode. [8] [9]

Figure 8: Only 2 Protection Rings are used in modern OSs. Fig Source: Wikipedia

OS Kernel Scheduler

We are clear by now that the OS Kernel is the sole administrator of resources in our computers. There are many types of these resources, from memory and files, to the CPU itself. The latter is our subject matter for this chapter.

Access to the CPU is quite valuable, and most Threads usually want it, so the OS Kernel Scheduler has to be very careful about how it hands it out.

Processes in modern OSs fall into one of, usually, 5 states, commonly known as the 5 State Model. These states are: Init, Ready, Running, Waiting and Finished. [10]

Figure 9: 5 State Model. From [10]
Note

The 5 State Model is an abstraction and simplification of the actual states. For example, the Linux Kernel implements the following states: TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, TASK_STOPPED, TASK_ZOMBIE. [11]

The Scheduler will look for the ones that are ready to run, and will give them execution time in the CPU. The way it gives access to CPU is by doing a Context Switch, which is the process of saving the state of the currently running Thread, and loading the state of the Thread that is going to run. This is a very expensive operation, since it involves saving and restoring the state of the Registers, the Program Counter, the Stack, and so on, but it is necessary to give execution time to every Thread that wants to run.
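On Unix-like systems you can actually count how many times the scheduler has context-switched your own process, using the standard resource module (a small sketch; this module is not available on Windows):

import resource  # Unix-only standard-library module
import time

usage = resource.getrusage(resource.RUSAGE_SELF)
print("Voluntary context switches  :", usage.ru_nvcsw)   # we gave up the CPU (I/O, sleep)
print("Involuntary context switches:", usage.ru_nivcsw)  # the scheduler preempted us

time.sleep(0.1)  # sleeping voluntarily yields the CPU

usage = resource.getrusage(resource.RUSAGE_SELF)
print("Voluntary context switches  :", usage.ru_nvcsw)
print("Involuntary context switches:", usage.ru_nivcsw)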

Time Slices

To have a somewhat simple way of calculating the time a Thread will be allowed to run, the Scheduler has a quantum, a minimum discrete amount of time that it can give, called the Time Unit. So, the assignment of execution time for a given Thread will always be calculated as a fixed integer multiple of this Time Unit.

The number of Time Units of execution a Thread is given is called its Time Slice, and the Scheduler will preemptively take the Thread out of the CPU when it has used up all of its Time Slice, giving the CPU to another Thread that is ready to run. This is the essence of a Preemptive Scheduler, which is the most common type of Scheduler in modern OSs.

Note that a Time Slice is NOT a promise of uninterrupted execution time, but a promise of a minimum amount of total available execution time. The Scheduler can take the Thread out of the CPU at any moment, even before the Time Slice is up, if it decides that another Thread needs to run.
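As a loose analogy at the Python level (this is the interpreter's switch interval, not the OS time slice itself), CPython has its own notion of how often it considers handing control to another Python thread. A tiny sketch:

import sys

# CPython's own "time slice": how often the interpreter considers switching
# between Python threads (on top of whatever the OS scheduler does).
print(sys.getswitchinterval())   # default is 0.005 seconds (5 ms)

# It can be tuned, e.g. to switch less often:
sys.setswitchinterval(0.01)
print(sys.getswitchinterval())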

Scheduling Algorithm

The way the Scheduler decides which Thread to run next is called the Scheduling Algorithm. There are many of these, and they are usually classified in two categories: Preemptive and Non-Preemptive. [12], [13], [14]

A Preemptive Scheduler is pessimistic, and it assumes that any given Thread will run forever if left to its own devices, so they have to be taken forcefully out of the CPU to allow for efficient interleaving of Threads. On the other hand, a Non-Preemptive Scheduler is counting on the goodwill of the Thread to give up its resources from time to time to allow for interleaving, which is not very efficient, but it is much simpler to implement.

Modern Schedulers are preemptive, and they usually use Priority Queues and Round Robin Scheduling to decide which Thread to run next.

Priority Queues and Priority Levels

The Priority level is how the Scheduler assigns importance to all the Threads that are ready to run. The Thread with the highest priority level will be the one scheduled to run next, and the Scheduler will modify the priority level of the Threads as they run, to give execution time to all of them and to ensure that no Thread is left behind.

Figure 10: Simplified Priority Levels. Source [12]

The very specifics of the modifications and updates of priority levels vary by OS kernel implementation. For example, the Linux kernel scheduler has been implemented with CFS, the Completely Fair Scheduler, for the last 15 years; it utilizes a Red-Black Tree to store the Threads, and it uses the Virtual Runtime of the Threads to calculate the priority level.

Earliest eligible virtual deadline first scheduling

EEVDF is the scheduler algorithm set to replace CFS. It was described almost 30 years ago, but was only merged in 2023 to the development version 6.6 of the kernel. [15] [16]

Round Robin Scheduling

When the top priority level has been given to more than one Thread, like in Figure 10, the Scheduler will use Round Robin Scheduling to decide which one to run next. This is a simple algorithm that gives a fixed amount of Time Units to each Thread in a circular fashion, and then starts again. It is designed to resolve disputes between Threads of the same priority level, and to ensure that all Threads get execution time.

Figure: Simplified Round Robin Scheduler. Source [12]

Simulating a Round Robin Scheduler

Round Robin is a very simple, yet effective algorithm, and we can simulate its behaviour even by hand with a toy example:

Exercise: Consider a system with 3 Threads A, B and C. The threads arrive in the order A, B, C, and the arrival time is 0ms for each thread. The time quantum is 2ms. The threads are scheduled using the Round Robin scheduling algorithm. The context switch time is negligible. The CPU burst times for threads A, B and C are 4ms, 8ms and 6ms respectively. Compute the turnaround time for each thread.

Solution: The threads are scheduled as follows:

Figure 11: Example RR

This example exercise is a modification of a very similar one shown to me in the Cuda Mini Cohort at Cohere For AI.

  • Thread A runs from 0ms to 2ms
  • Thread B runs from 2ms to 4ms
  • Thread C runs from 4ms to 6ms
  • Thread A runs from 6ms to 8ms
  • Thread B runs from 8ms to 10ms
  • Thread C runs from 10ms to 12ms
  • Thread B runs from 12ms to 14ms
  • Thread C runs from 14ms to 16ms
  • Thread B runs from 16ms to 18ms

The turnaround time for each thread is the time it takes to complete the execution of the thread from the time it arrives. The turnaround time for threads A, B and C are 8ms, 18ms and 16ms respectively. And the average turnaround time is 14ms.
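To double-check these numbers, here is a small simulation of the same Round Robin schedule (my own helper, not part of the original exercise):

from collections import deque

def round_robin(burst_times, quantum=2):
    """Simulate Round Robin for threads arriving at t=0; return turnaround times."""
    remaining = dict(burst_times)          # thread -> remaining burst time
    queue = deque(burst_times)             # arrival order
    clock, turnaround = 0, {}
    while queue:
        thread = queue.popleft()
        slice_ = min(quantum, remaining[thread])
        clock += slice_
        remaining[thread] -= slice_
        if remaining[thread] > 0:
            queue.append(thread)           # not done: back of the queue
        else:
            turnaround[thread] = clock     # arrived at 0, so turnaround == finish time
    return turnaround

result = round_robin({"A": 4, "B": 8, "C": 6})
print(result)                              # {'A': 8, 'C': 16, 'B': 18}
print(sum(result.values()) / len(result))  # 14.0

The simulated turnaround times and the 14ms average match the figures above.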

Problems that Appear in Concurrent Programs

Concurrency is a powerful tool to harness every bit of your CPU’s processing power, but it comes with a lot of potential problems and hard-to-debug situations for the usual synchronous-minded programmer. Fortunately, the kinds of problems that appear are well documented and there are techniques to solve them. We will cover the most common ones in the next sections.

Race Conditions

A Race Condition occurs when two or more Threads are trying to access and modify the same resource at the same time. As we saw earlier, the scheduler may take a Thread out of the CPU at any moment, so 2 Threads wanting to modify the same resource may end up corrupting it, or worse, corrupting the whole program, because they are not aware of each other and are not guaranteed to run in a specific order.

If I may say so myself, I think that Race Conditions are the root of all evil in concurrent programming. They are the main reason we have Synchronization Primitives (like Mutexes, Semaphores, Barriers, etc.), and the reason we have to be very careful when writing concurrent programs.

When your Threads or Processes don’t share data or resources, you usually won’t have to worry about any of the problems that we will see next, but when they do, you do have to be very careful.

We can get a little bit ahead of ourselves and see a simple example of a forced race condition in Python:

Show source
from threading import Thread
import random
import time


class Balance:
    def __init__(self, balance=200):
        self.balance = balance

    def deposit(self, amount):
        balance = self.balance
1        time.sleep(random.random() / 10)
        self.balance = balance + amount

    def withdraw(self, amount):
        balance = self.balance
        time.sleep(random.random() / 10)
        self.balance = balance - amount

    def get_balance(self):
        time.sleep(random.random() / 10)
        return self.balance


def race_condition():
    balance = Balance()
    t1 = Thread(target=balance.deposit, args=(100,))
    t2 = Thread(target=balance.withdraw, args=(50,))
    t1.start(), t2.start()
    t1.join(), t2.join()
    return balance.get_balance()


if __name__ == "__main__":
    balances = [race_condition() for _ in range(10)]
    print(balances)
1
Simulate possible context switch here
[300, 300, 300, 300, 150, 150, 150, 150, 150, 150]

What is happening here will conceptually be clearer by looking at Figure 12 (a): basically, we have 2 Threads which are trying to modify the same resource, the balance attribute of the Balance class, and they are doing so without any kind of synchronization and without knowing about each other. Here we are forcing a context switch by calling time.sleep(random.random() / 10) in the middle of the method, but in a real-world scenario the context switch can happen at any moment, making the problem really hard to debug and reproduce, since it may happen, literally, in one of every 1000 or 10000 runs.

(a) Unsynchronized Access to Resource
(b) Synchronized Access to Resource
Figure 12: The same race condition as in the code, with and without a synchronization primitive.

Not only that, but even harder to debug is the fact that some of Python’s syntactic sugar may hide the fact that we are dealing with multiple operations on the same resource, like in the case of the += operator, which is actually a shorthand for balance = balance + amount and is not atomic, meaning that it is actually 3 operations: read the value of balance, add amount to it, and write the result back to balance. This implies that a context switch can happen between any of these operations, and the result will be a corrupted balance attribute.
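You can see this for yourself with the dis module (which we will look at more closely in the GIL section): the augmented assignment compiles to several bytecode instructions, and a context switch can land between any two of them. A quick sketch:

import dis

def deposit(balance, amount):
    balance += amount   # looks like one step...
    return balance

# ...but it disassembles into separate load, add, and store instructions.
dis.dis(deposit)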

To solve this type of problems people invented synchronization primitives, like Mutexes, which basically are a way of telling the OS Kernel that a Thread is using a resource, and that no other Thread can use it until the first Thread is done with it.

We can change our code to use one of these primitives, like so:

Show source
from threading import Thread, Lock
import random
import time


class Balance:
    def __init__(self, balance=200):
        self.balance = balance
        self.lock = Lock()

    def deposit(self, amount):
1        with self.lock:
            balance = self.balance
            time.sleep(random.random() / 10)
            self.balance = balance + amount

    def withdraw(self, amount):
        with self.lock:
            balance = self.balance
            time.sleep(random.random() / 10)
            self.balance = balance - amount

    def get_balance(self):
        with self.lock: 
            time.sleep(random.random() / 10)
            return self.balance


def race_condition():
    balance = Balance()
    t1 = Thread(target=balance.deposit, args=(100,))
    t2 = Thread(target=balance.withdraw, args=(50,))
    t1.start(), t2.start()
    t1.join(), t2.join()
    return balance.get_balance()


if __name__ == "__main__":
    balances = [race_condition() for _ in range(10)]
    print(balances)
1
Use a Lock, the Python implementation of a Mutex, to synchronize access to the resource
[250, 250, 250, 250, 250, 250, 250, 250, 250, 250]

Now, the Lock object is used to synchronize access to the balance attribute, and the with statement is used to acquire and release the Lock, making sure that only one Thread can access the resource at a time. This way, we can be sure that the balance attribute will not be corrupted, and that the result of the race_condition function will always be the same. See Figure 12 (b)

Deadlocks

Using Locks solves Race Conditions but creates its own problems. I mean, if you think about it, many entities locking many things all at once should lead to problems, right? Well, indeed it does, namely, Deadlocks.

Deadlocks are situations in which 2 or more entities, in this case Threads, are waiting for each other to release a resource, but none of them will release their resource until the other one does. It is like when you want to give the right of way to the other driver, but the other driver is also trying to give you the right of way, and you both end up stuck in the middle of the intersection.

(a) A Deadlock in a street intersection
(b) A Deadlock with 2 Threads
Figure 13: Illustrative representation of a Deadlock

In computers, this happens only if some specific properties apply to the Locks. The first one is that the Locks acquired are not preemptible, meaning that other Threads, including the Kernel, cannot just take the resource away from the Thread that is holding it; the second one is that the Locks are not acquired in a strict order, so it may happen that Threads want to acquire the same set of Locks but in a different order; and the third one is that the Threads are waiting for each other to release the resources, forming a circular wait.

Given that these 3 conditions are met, the Threads will be stuck in a Deadlock, and the only way to solve it is to restart the program, or to kill the Threads that are stuck. If we don’t want to basically kill the programs, the only way to solve Deadlocks is by attacking one of these necessary conditions. We can attack the first one by making the Locks preemptible, such that the OS Kernel can take the Lock away from a Thread that is holding it, and give it to another Thread that is waiting for it. Another way of tackling this first condition is to use a Timeout when acquiring a Lock, with the intention that if the Lock is not acquired within a specific amount of time, the Thread will give up and try again later.

For tackling the second necessary condition, we can force the Locks to be acquired in a strict order, defining a hierarchy of Locks, and making sure that every Thread acquires the Locks in the same order.

All of these solutions can work, but they are not easy to implement, and they are not always possible, so the best way to avoid Deadlocks is to be very careful when using Locks, and to always think about the consequences of acquiring a Lock.
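To complement the lock-hierarchy example below, here is a minimal sketch of the timeout idea mentioned above: if the second lock cannot be acquired in time, the thread backs off, releasing what it already holds, and tries again later.

from threading import Lock

lock_a = Lock()
lock_b = Lock()

def transfer():
    # Keep trying until we manage to hold both locks at once.
    while True:
        with lock_a:
            # Give up on lock_b after 0.1s instead of waiting forever.
            if lock_b.acquire(timeout=0.1):
                try:
                    print("Got both locks, doing the critical work")
                    return
                finally:
                    lock_b.release()
        # lock_a was released on exiting the `with` block; back off and retry.

transfer()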

Implementing a Lock Hierarchy in Python

In Python, the Lock object does not have a built-in way of defining a hierarchy of Locks, but we can implement it ourselves. Here is an example of how to do it:

Show source
import threading
from time import sleep
from random import random

# Define locks
lock_A = threading.Lock()
lock_B = threading.Lock()
lock_C = threading.Lock()

# Create a lock hierarchy (a fixed acquisition order; a set would not preserve order)
lock_order = [lock_A, lock_B, lock_C]


# Function to acquire locks in order
def acquire_locks():
    sleep(random() / 10)  # Simulate context switch
    for lock in lock_order:
        lock.acquire()


# Function to release locks in reverse order
def release_locks():
    sleep(random() / 10)  # Simulate context switch
    for lock in reversed(lock_order):
        lock.release()


# Example function using the locks
def thread_task(i):
    # Acquire locks in the defined order
    acquire_locks()

    try:
        # Critical section
        print(f"Thread {i} is executing with all locks acquired")
    finally:
        # Always release locks to avoid deadlocks
        release_locks()


# Create multiple threads
threads = []
for i in range(5):
    t = threading.Thread(target=thread_task, args=(i,))
    threads.append(t)
    t.start()

# Wait for all threads to complete
for t in threads:
    t.join()
Thread 4 is executing with all locks acquired
Thread 1 is executing with all locks acquired
Thread 2 is executing with all locks acquired
Thread 0 is executing with all locks acquired
Thread 3 is executing with all locks acquired
Priority Inversion

Finally, we will talk about another problem that is of concern not so much to application developers but to OS Kernel developers: Priority Inversion.

This is a somewhat weird situation, first described back in the ’80s in [17], where a lower priority thread executes first and blocks a higher priority thread because of a resource it is holding, impeding the higher priority thread from executing.

Remember that the priorities of threads are crucial for the scheduler to know which one should run first, as seen in Figure 10, so if a lower priority thread is blocking a higher priority thread, the scheduler is not doing its job properly.

Figure 14: A mild form of Priority Inversion. Not that bad, since the threads are running as designed.

The situation we just described is not that bad, since even if the lower priority thread is executing, it is not doing anything out of the ordinary. I mean, it was programmed to take the lock on the shared resource, and it is doing so; yes, it is blocking a higher priority thread, but it is not doing anything wrong, and when it finishes executing, the higher priority thread will be able to run.

I want to emphasize that when the application developer used the locks in both Thread A and Thread C, they knew in advance that this very situation could happen; that is why it is not that bad.

The really bad situation, the same one that caused glitches in the Mars rover back in 1997 [18], is when there is a medium priority thread involved, which will actually cause some unexpected behaviour of the system that the application developer never thought of.

Figure 15: Priority Inversion Worst Case. Definitely breaking expected behaviour.

Look at Figure 15: first, the lower priority Thread C acquires the resource the same way it did in Figure 14, and because of this, the higher priority Thread A is blocked waiting for the resource to be released. But in this case, a medium priority Thread B preempts Thread C and starts running for as long as it wants, blocking C by priority and consequently, indirectly, blocking the higher priority A. This is a very bad situation, because the system is not behaving as expected, and the higher priority Thread A is being blocked by something that was never designed to block it, leading to possible system failures.

Let’s emphasize again that this is the really bad situation because the application developer never thought of Thread B blocking Thread A, so the code will not be prepared to handle this case, and the system may fail.

And this lack of preparation for the unexpected situation is what caused the Mars Rover to glitch, because in its case, when the higher priority Thread was blocked for a given amount of time, the scheduler detected a failure and went into safe mode, restarting the whole system, and causing a lot of problems. [19].

OS kernels have some ways of solving this problem, like the Priority Inheritance in linux [20] and the Priority Ceiling as described in [21]. The former solves the problem by giving the priority of the higher priority thread, in our case Thread A, to the lower priority thread, Thread C, that is blocking it, and the latter solves the problem by assigning a priority to the shared resource, the Lock in our case, and making sure that the threads that are going to use it have a priority higher than the priority of the resource.

Another approach, used by Windows, is Random Boosting [22], which is a way of giving a random boost to the priority of the lower priority thread, so it can finish faster and the higher priority thread can run.

TL;DR Priority Inversion is an example of weird problems that can happen in concurrent programs, and it is a problem that is of concern to OS Kernel developers, not so much to application developers.

(Off-topic) How Does Your Computer Boot?

Warning

This is a very off-topic section, but I think it is very interesting. If you don’t think so, you can skip it.

We have been talking about Threads and Processes, OS Kernels and Schedulers, under the assumption that they are already running on your PC, but how does your computer actually get to this state? How does pushing a button on your computer actually make it run?

When you press the power button on your computer, a lot of things happen, and most of them are done by the BIOS or UEFI firmware, which is a very small program that is stored in a special memory chip on your motherboard and executed by the CPU as soon as you power on the machine.

When you power on your computer, the CPU starts executing instructions from a predefined memory location known as the reset vector. This vector typically points to a small piece of firmware stored in a ROM or flash memory chip on the motherboard. In x86 systems, this firmware is known as the BIOS (Basic Input/Output System) or UEFI (Unified Extensible Firmware Interface).

The BIOS or UEFI performs a series of hardware checks and initializations to ensure that essential components like RAM, storage devices, and peripherals are functioning correctly. It also identifies and initializes the CPU, sets up interrupt handlers, and performs other low-level hardware configurations.

Once the hardware initialization is complete, the BIOS/UEFI locates and loads the boot loader into memory. The boot loader is a small program responsible for loading the operating system kernel into memory and transferring control to it. In Unix-based systems, the most commonly used boot loader is GRUB (GRand Unified Bootloader).

The boot loader typically resides in a specific location on the disk known as the Master Boot Record (MBR) or the EFI System Partition (ESP) in UEFI systems. The BIOS/UEFI reads the boot loader from this location and executes it. The boot loader’s main job is to locate the kernel image on the disk, load it into memory, and pass control to the kernel.

Once the boot loader transfers control to the kernel, the kernel takes over the boot process. The kernel initializes essential system components such as process management, memory management, device drivers, and file systems. It also detects and configures hardware devices that were not initialized by the BIOS/UEFI.

After initializing the kernel, the Unix-based system creates the first user-space process, known as the init process. The init process has process ID (PID) 1 and serves as the ancestor of all other processes in the system. Its primary role is to initialize the system environment, start system daemons, and maintain system services.

The init process continues the system initialization process by executing startup scripts and configuration files. These scripts set up network interfaces, mount file systems, configure system settings, and launch essential system services such as syslogd, cron, and SSH.

Once the system initialization is complete, the system typically prompts the user to log in either through a text-based login prompt or a graphical login manager, depending on the system configuration. After successful authentication, the user gains access to the system and can start using it for various tasks. 😉👌 [23]

Parallelism and Concurrency in Python

If you have muddled through all of this and still don’t hate me for talking too much, I appreciate it, and to your relief, we are finally going to talk about how to use all of this in Python.

This section onwards is what most programmers will find useful, although I shall defend myself from all the talkative allegations by saying that the previous sections are very interesting and useful for understanding the next ones; also, I did put a warning at the start of the blog 😅.

The Global Interpreter Lock (GIL)

We absolutely cannot, in good conscience, talk about concurrency in Python without talking about the Global Interpreter Lock (GIL), a ‘feature’ which has probably been the most controversial and discussed topic in the Python community since its creation… apart from the migration from Python 2 to Python 3; that’s an ugly one too.

Important

What is going to be discussed in this section only applies to CPython, which is the default, standard implementation of Python that we all know and love. But the Python language itself is a specification, and there are other implementations of it, like Jython, IronPython, PyPy, and MicroPython, which do not have a GIL or have a different implementation of it. But since CPython is the one that 99% of people know and the one that actually has all the libraries and tools, we clearly must focus on that.

Remember that Python is an interpreted language, and that the Python interpreter is a program that runs on your computer. It executes an intermediate, platform-agnostic representation of your code called bytecode, which is usually generated on the fly by the interpreter itself when executing a Python script. You can see the bytecode of a Python function by using the dis module (dis as in disassembly), like so:

Show source
1import dis


def add(a, b):
    return a + b


2dis.dis(add)
1
Import the dis module, which is used to disassemble Python bytecode.
2
Disassemble the bytecode of the add function.
  4           0 RESUME                   0

  5           2 LOAD_FAST                0 (a)
              4 LOAD_FAST                1 (b)
              6 BINARY_OP                0 (+)
             10 RETURN_VALUE

The GIL is a mutex (or lock) that protects access to the Python Interpreter, and it is used to make sure that only one Thread can execute Python bytecode at a time. This is because the Python Interpreter is not thread-safe, meaning that it can not handle multiple Threads executing at the same time without risking corruption of the Python objects and possible memory leaks.

The reason for this incapability lies in the way Python, or at least CPython, handles memory management. Many other programming languages use a technique called Garbage Collection to manage memory, which automatically frees memory that is no longer in use by having a separate process run in the background and release memory when it is no longer needed. Python, instead, primarily uses a different technique called Reference Counting, where every Python object has a counter that keeps track of how many references to it exist in the program, and when the counter reaches zero, the memory is freed.

Show source
import sys

a = [1, 2, 3]
b = a
c = a

1ref_count = sys.getrefcount(a)
2print(ref_count)
1
Get the reference count of the a list.
2
This should print 4, because there are 4 references to the a list: a, b, c, and the one that getrefcount creates when passing a as an argument.
4

This is a very efficient way of managing memory, but it is not thread-safe, because if two Threads are trying to modify the reference count of the same object at the same time, the counter may get corrupted, and the memory may not be freed when it should be, leading to memory leaks and other incredibly hard to debug problems.

The GIL solved this quickly and efficiently, at least for single-threaded programs, but it has been a thorn in the side of Python developers for a long time, because it makes it very hard to write parallel programs in Python. You can no longer rely on lightweight threads to run your code in parallel, and you have to use more complex and less efficient ways of parallelizing your code, like the multiprocessing module, which creates multiple processes instead of multiple threads with the aim of bypassing the GIL, since each process has its own Python Interpreter. But then it becomes harder to communicate between processes, since they do not share memory, and they are by nature more expensive to create and manage than threads.
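As a minimal sketch of that trade-off (timings will vary by machine, and we will benchmark this more carefully later), compare a ThreadPoolExecutor, where all workers contend for one GIL, with a ProcessPoolExecutor, where each worker process has its own interpreter and GIL:

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import time

def cpu_bound(n):
    # Pure Python arithmetic: holds the GIL the whole time it runs.
    return sum(i * i for i in range(n))

def timed(executor_cls, label):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(cpu_bound, [2_000_000] * 4))
    print(f"{label}: {time.perf_counter() - start:.2f}s")

if __name__ == "__main__":
    timed(ThreadPoolExecutor, "threads   (share one GIL)")
    timed(ProcessPoolExecutor, "processes (one GIL each)")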

Moreover, replacing the GIL hasn’t been easy, since almost every solution would involve making single threaded programs, which are the absolute vast majority of Python programs, slower, and that is not a trade-off that the Python Software Foundation is willing to make.

But I am glad to tell you that in recent years, the Python Software Foundation, together with private companies like Microsoft, has been working on a way to make the GIL optional without sacrificing single-threaded performance. The Python Enhancement Proposal (PEP) 703 describes the roadmap for a GIL-optional Python starting in version 3.13, with the aim of slowly making it the default by around version 3.16.

Note

Guess what is one of the official motivators listed in PEP 703 for replacing the GIL? Deploying AI Models!!! 😁👌

Compile Python to Bytecode

Did you know you can compile Python code to bytecode ‘manually’? You can do it with the compile function of the builtins module, like so:

Show source
1code = """
def add(a, b):
    return a + b

print(f'Printing result of add(1, 2): {add(1, 2)}')
"""

2bytecode = compile(code, "<string>", "exec")

print(bytecode)
3exec(bytecode)
1
Define a simple Python function. Instead of a string this could have been a .py file.
2
Compile the code to bytecode. This function returns a code object.
3
Execute the bytecode with the exec function.
<code object <module> at 0x1687c9a30, file "<string>", line 1>
Printing result of add(1, 2): 3

Multithreading in Python

Finally, we will see how to use threads in Python, including their synchronization primitives and things like ThreadPools, and we will do a little benchmark to show how CPU-bound multithreaded applications are heavily affected by the GIL, and why you should probably only use threads in I/O-bound applications.

The threading and concurrent.futures Modules

We have already glimpsed at some multithreading code here and there. This is a more thorough look at the threading module in Python.

The common way to create a Thread in Python is by using the Thread class from the threading module by passing a target function to the constructor. Then you can start the thread by calling the start method, and wait for it to finish by calling the join method. You can also pass arguments to the target function by using the args parameter and keyword arguments by using the kwargs parameter.

Show source
from threading import Thread


def print_numbers(n, name):
    print(f"{name=}")
    for i in range(n):
        print(i, end=" ")


t = Thread(target=print_numbers, args=(5,), kwargs={"name": "Thread 1"})
t.start()
t.join()
name='Thread 1'
0 1 2 3 4 

The Thread class is a very low-level way of creating Threads in Python, and it is not very efficient, because creating a new Thread is an expensive operation; it also lacks the flexibility to handle large numbers of Threads, their creation and deletion, and the communication between them.

For these reasons, Python has a higher-level module called concurrent.futures, which provides a way of creating and managing Threads in a more efficient and flexible way.

The concurrent.futures module provides a class for managing a pool of Threads called ThreadPoolExecutor, which is a subclass of the Executor class.

This class makes it easy to submit tasks to the pool of Threads, and to get the results of the tasks once they are done.

Show source
from concurrent.futures import ThreadPoolExecutor

def print_numbers(n):
    for i in range(n):
        print(i, end=" ")

with ThreadPoolExecutor() as executor:
    executor.submit(print_numbers, 5)
0 1 2 3 4 

Another way of creating multiple threads with this class is by using the map method, which can receive several iterables whose elements will be passed as positional arguments to the target function.

Show source
from concurrent.futures import ThreadPoolExecutor
import time


def task(n, thread_id):
    l = []
    for i in range(n):
        time.sleep(0.1)
        l.append(f"{thread_id=} {i=}")
    return l


with ThreadPoolExecutor() as executor:
1    results = executor.map(task, [5, 5, 5], [1, 2, 3])
    for result in results:
        print(result)
1
The map method takes one iterable per positional argument of the target function, and it returns an iterable with the results of the tasks.
['thread_id=1 i=0', 'thread_id=1 i=1', 'thread_id=1 i=2', 'thread_id=1 i=3', 'thread_id=1 i=4']
['thread_id=2 i=0', 'thread_id=2 i=1', 'thread_id=2 i=2', 'thread_id=2 i=3', 'thread_id=2 i=4']
['thread_id=3 i=0', 'thread_id=3 i=1', 'thread_id=3 i=2', 'thread_id=3 i=3', 'thread_id=3 i=4']

You can also get the Thread object representing the current thread by calling the current_thread function from the threading module.

Show source
from concurrent.futures import ThreadPoolExecutor
from threading import current_thread

def task(n):
    print(f"{current_thread().name=} {n=}")
    return n

with ThreadPoolExecutor(max_workers=3) as executor:
    results = executor.map(task, [1, 2, 3, 4, 5])
    for result in results:
        print(result)
current_thread().name='ThreadPoolExecutor-2_0' n=1current_thread().name='ThreadPoolExecutor-2_1' n=2

current_thread().name='ThreadPoolExecutor-2_1' n=3
current_thread().name='ThreadPoolExecutor-2_1' n=4
current_thread().name='ThreadPoolExecutor-2_1' n=5
1
2
3
4
5

Another cool feature is that you may have data that is local to each thread by using the local class from the threading module, which provides a way of creating a thread-local storage for each thread.

Show source
from threading import local, Thread


def task(n):
1    local_data = local()
    local_data.value = n
    print(f"{local_data.value=}")


t1 = Thread(target=task, args=(1,))
t2 = Thread(target=task, args=(2,))
t1.start(), t2.start()
t1.join()
t2.join()
1
Create a local object to store data that is local to each thread.
local_data.value=1
local_data.value=2

Keep in mind that in the case of the ThreadPoolExecutor class, the threads are reused, so the data that is local to each thread will be shared between all the tasks that are executed by the same thread, unless explicitly deleted or modified. For example:

Show source
from concurrent.futures import ThreadPoolExecutor
from threading import local
import time
local_data = local()

1def task_update(idx):
    local_data.value = idx
    print(f"{idx=} - Updated Value - {local_data.value=}")
2    time.sleep(0.1)

3def task_prior(idx):
    if hasattr(local_data, "value"):
        print(f"{idx=} - Prior Value - {local_data.value=}")
        time.sleep(0.1) 


with ThreadPoolExecutor(max_workers=2) as executor:
4    executor.map(task_update, [0, 1])
5    executor.map(task_prior, [3, 4])
1
Define a function that updates the value of the local_data object.
2
Sleep for a short time to simulate a context switch.
3
Define a function that prints the value of the local_data object.
4
Submit tasks to update the value of the local_data object.
5
Submit tasks to print the value of the local_data object.
idx=0 - Updated Value - local_data.value=0
idx=1 - Updated Value - local_data.value=1
idx=3 - Prior Value - local_data.value=0
idx=4 - Prior Value - local_data.value=1

See that when the second batch of tasks runs the task_prior function, which prints the value of local_data, that value already exists: it was set by the first batch of tasks running task_update on the same (reused) threads.

So be careful when using the local class and reusing threads, you may find yourself with unexpected results if handled incorrectly.

These are the basics of thread creation and management in Python. It is actually surprisingly simple, but very powerful. Where things start to get complicated is when multiple threads access shared resources, at which point it becomes a necessity to use synchronization primitives to avoid race conditions.

We will see how to use these primitives in the next section.

Synchronization Primitives for Multithreading in Python

Python provides a number of synchronization primitives to help you manage access to shared resources in multithreaded programs. These primitives are available in the threading module, and they include:

Lock

A simple mutual exclusion lock that can be used to protect shared resources from being accessed by multiple threads at the same time.

Show source
from threading import Lock, Thread
import time

lock = Lock()


def task(n):
1    with lock:
        for i in range(n):
            print(i, end=" ")
            time.sleep(0.05)


t1 = Thread(target=task, args=(3,))
t2 = Thread(target=task, args=(3,))
t1.start(), t2.start()
t1.join()
t2.join()
1
Use a Lock object as a context manager to protect access to the shared resource, in this case, the whole for loop.
0 1 2 0 1 2 

The prior example uses a Lock object as a context manager, which automatically acquires the lock before entering the block and releases it when exiting the block. Crucially, it also handles exceptions, so the lock is always released even if an exception is raised.

You can acquire and release a Lock manually, although it is not recommended, because it is easy to forget to release the lock, which can lead to deadlocks. Here is the same example as before with manual lock management:

Show source
from threading import Lock, Thread
import time

lock = Lock()


def task(n):
    try:
1        lock.acquire()
        for i in range(n):
            print(i, end=" ")
            time.sleep(0.05)
    except Exception as e:
        print(e)
    finally:
2        lock.release()


t1 = Thread(target=task, args=(3,))
t2 = Thread(target=task, args=(3,))
t1.start(), t2.start()
t1.join()
t2.join()
1
Manually acquiring the lock.
2
When manually managing locks, it is important to release the lock in a finally block to ensure that it is always released, even if an exception is raised.
0 1 2 0 1 2 
RLock

A reentrant lock that can be acquired multiple times by the same thread.

Show source
from threading import RLock


class RecursiveCounter:
    def __init__(self):
        self.count = 0
        self.lock = RLock()

    def increment(self, times):
        with self.lock:
            if times > 0:
                self.count += 1
1                self.increment(times - 1)


# Create a RecursiveCounter instance
counter = RecursiveCounter()

# Increment the counter recursively
counter.increment(5)

# Print the final count
print("Counter final value:", counter.count)
1
Recursively call the increment method. With a normal Lock this would cause a deadlock.
Counter final value: 5

This type of lock is not too common, but it can be useful when dealing with recursion in a multithreaded environment.

Semaphore

A counter that can be used to control access to a shared resource by a fixed number of threads.

Show source
from threading import Semaphore, Thread
import time

# Create a semaphore that allows up to two threads to enter a section of code at once
1semaphore = Semaphore(2)

2def thread_function(num):
    print(f"Thread {num} is waiting for the semaphore")
    with semaphore:
        print(f"Thread {num} has acquired the semaphore")
        time.sleep(0.05)  # Simulate some work
    print(f"Thread {num} has released the semaphore")

# Create 5 threads that will each try to acquire the semaphore
threads = [Thread(target=thread_function, args=(i,)) for i in range(5)]

# Start all threads
for t in threads:
    t.start()

# Wait for all threads to finish
for t in threads:
    t.join()
1
Create a semaphore that allows up to two threads to enter a section of code at once.
2
Use the semaphore as a context manager to control access to the shared resource.
Thread 0 is waiting for the semaphore
Thread 0 has acquired the semaphore
Thread 1 is waiting for the semaphore
Thread 1 has acquired the semaphore
Thread 2 is waiting for the semaphore
Thread 3 is waiting for the semaphore
Thread 4 is waiting for the semaphore
Thread 0 has released the semaphoreThread 2 has acquired the semaphore

Thread 1 has released the semaphore
Thread 3 has acquired the semaphore
Thread 2 has released the semaphore
Thread 4 has acquired the semaphore
Thread 3 has released the semaphore
Thread 4 has released the semaphore

The Semaphore is one of the classic synchronization primitives, and it can be useful when you have a resource that can handle only a limited number of threads at once. Imagine a connection to a DB that can handle 10 parallel connections without freezing: you can use a Semaphore to control access to it.
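To make that scenario a bit more concrete, here is a minimal sketch; the MAX_CONNECTIONS limit and the query function are hypothetical placeholders, not a real database client:

from threading import BoundedSemaphore, Thread
import time

MAX_CONNECTIONS = 10                      # hypothetical limit of the database
db_slots = BoundedSemaphore(MAX_CONNECTIONS)

def query(n):
    with db_slots:                        # at most 10 threads hold a "connection" at once
        # open_connection() / run_query() would go here in real code
        time.sleep(0.01)                  # simulate the round trip to the database
        print(f"query {n} done")

threads = [Thread(target=query, args=(i,)) for i in range(25)]
for t in threads:
    t.start()
for t in threads:
    t.join()

A BoundedSemaphore is used here instead of a plain Semaphore because it raises an error if released more times than it was acquired, which catches bookkeeping bugs early.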

Event

A simple way to communicate between threads using a flag that can be set or cleared.

Show source
from threading import Event, Thread
import time

# Create an Event object
3event = Event()

1def waiting_thread():
    print("Waiting for the event to be set")
    event.wait()
    print("The event has been set, proceeding")

2def setting_thread():
    time.sleep(0.05)  # Simulate some work
    print("Setting the event")
    event.set()

# Create the threads
t1 = Thread(target=waiting_thread)
t2 = Thread(target=setting_thread)

# Start the threads
t1.start()
t2.start()

# Wait for both threads to finish
t1.join()
t2.join()
1
The waiting_thread function waits for the event to be set before proceeding
2
The setting_thread function sets the event after a delay.
3
The Event object allows these two threads to synchronize their actions.
Waiting for the event to be set
Setting the event
The event has been set, proceeding

An Event is a simple synchronization object; the main methods are set() and clear(). If the Event is set, wait() doesn’t do anything. If the Event is not set, wait() will block until the Event is set.

With this primitive, one thread can control the execution flow of another by setting or clearing the Event. It can be useful when a thread is waiting for some other thread to finish something, for example when one thread is going to read from a file that another thread has not finished writing.
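As a hypothetical version of that file scenario, a writer thread can set the Event once the file is complete and a reader thread can wait for it before opening the file. A minimal sketch, using a temporary file path as a placeholder:

from threading import Event, Thread
import os
import tempfile
import time

file_ready = Event()
path = os.path.join(tempfile.gettempdir(), "event_demo.txt")  # hypothetical file

def writer():
    time.sleep(0.05)                      # simulate a slow write
    with open(path, "w") as f:
        f.write("all the data")
    file_ready.set()                      # signal that the file is complete

def reader():
    file_ready.wait()                     # block until the writer is done
    with open(path) as f:
        print("Reader got:", f.read())

t1, t2 = Thread(target=writer), Thread(target=reader)
t1.start(), t2.start()
t1.join()
t2.join()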

Condition

A more advanced synchronization primitive that allows threads to wait for a condition to be met before proceeding.

Show source
from threading import Condition, Thread

# Create a Condition object
condition = Condition()


1def consumer_thread():
    with condition:
        print("Consumer: Waiting for the condition to be met")
        condition.wait()
        print("Consumer: The condition has been met, proceeding")


2def producer_thread():
    with condition:
        print("Producer: Making the condition true")
        condition.notify_all()


# Create the threads
t1 = Thread(target=consumer_thread)
t2 = Thread(target=producer_thread)

# Start the threads
t1.start()
t2.start()

# Wait for both threads to finish
t1.join()
t2.join()
1
The consumer_thread function waits for the condition to be met before proceeding.
2
The producer_thread function sets the condition and notifies the waiting threads.
Consumer: Waiting for the condition to be met
Producer: Making the condition true
Consumer: The condition has been met, proceeding

A Condition object allows one or more threads to wait until they are notified by another thread.

This follows the classic producer-consumer pattern, where one thread produces some data and another thread consumes it. The producer thread sets the condition and notifies the consumer thread, which then proceeds.

You can notify a specific number of threads by using the notify(n) method, or all of them by using the notify_all method. Also, the wait method can take a timeout argument, so it will block only for a certain amount of time waiting for the notification before proceeding.
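One thing to keep in mind: a notification is lost if no thread is waiting at that moment, and the example above only works because the consumer starts waiting before the producer notifies. A more defensive pattern, sketched below with a hypothetical items list as the shared state, guards actual data with the Condition and uses wait_for, which re-checks a predicate:

from threading import Condition, Thread

condition = Condition()
items = []                                 # shared state protected by the condition

def consumer_thread():
    with condition:
        # wait_for re-checks the predicate, so it works even if the
        # producer already ran and the notification was "lost".
        condition.wait_for(lambda: len(items) > 0)
        print("Consumer: got", items.pop())

def producer_thread():
    with condition:
        items.append("some data")
        condition.notify_all()

t1 = Thread(target=consumer_thread)
t2 = Thread(target=producer_thread)
t1.start(), t2.start()
t1.join()
t2.join()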

Barrier

A synchronization primitive that allows a fixed number of threads to wait for each other at a barrier before proceeding.

Show source
from threading import Barrier, Thread
import time

# Create a Barrier for three threads
barrier = Barrier(3)

1def worker_thread(num):
    print(f"Thread {num} is doing some work")
    # Simulate work with a sleep
    time.sleep(num/10)
    print(f"Thread {num} is waiting at the barrier")
    barrier.wait()
    print(f"Thread {num} is proceeding")

# Create three worker threads
threads = [Thread(target=worker_thread, args=(i,)) for i in range(3)]

# Start all threads
for t in threads:
    t.start()

# Wait for all threads to finish
for t in threads:
    t.join()
1
The worker_thread function simulates work and waits at the barrier before proceeding.
Thread 0 is doing some work
Thread 0 is waiting at the barrier
Thread 1 is doing some work
Thread 2 is doing some work
Thread 1 is waiting at the barrier
Thread 2 is waiting at the barrier
Thread 2 is proceeding
Thread 0 is proceeding
Thread 1 is proceeding
Queue

A thread-safe queue that can be used to pass messages between threads.

Show source
from threading import Thread
from queue import Queue

# Create a Queue object
q = Queue()

1def producer_thread():
    for i in range(5):
        q.put(i)
        print(f"Produced: {i}")

2def consumer_thread():
    while True:
        item = q.get()
        if item is None:
            break
        print(f"Consumed: {item}")

# Create the threads
t1 = Thread(target=producer_thread)
t2 = Thread(target=consumer_thread)

# Start the threads
t1.start()
t2.start()

# Wait for the producer thread to finish
t1.join()

# Signal the consumer thread to stop
3q.put(None)

# Wait for the consumer thread to finish
t2.join()
1
The producer_thread function puts items into the queue.
2
The consumer_thread function gets items from the queue and processes them.
3
The None item is used to signal the consumer thread to stop, if this is not done, the consumer thread will block indefinitely.
Produced: 0
Produced: 1
Produced: 2
Produced: 3
Produced: 4
Consumed: 0
Consumed: 1
Consumed: 2
Consumed: 3
Consumed: 4
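
If you prefer not to manage a sentinel value yourself, the queue module also provides task_done and join, which let the producer block until every item has been processed. A minimal sketch, using a daemon consumer thread so its endless loop does not keep the program alive:

from threading import Thread
from queue import Queue

q = Queue()

def consumer_thread():
    while True:
        item = q.get()
        print(f"Consumed: {item}")
        q.task_done()                      # mark this item as fully processed

# daemon=True so the endless consumer does not block program exit
Thread(target=consumer_thread, daemon=True).start()

for i in range(5):
    q.put(i)

q.join()                                   # blocks until task_done was called for every item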

So we can have fun with all of these synchronization primitives, but the use cases for multithreading in Python are very specific, thanks to the GIL. In general, CPU-bound tasks, which are those that spend most of their time actually doing computation on the CPU, are not a good fit for multithreading in Python, because the GIL will prevent the threads from running in parallel, and the performance will be worse than using a single thread.

Let’s see a simple example of performance degradation when using multithreading for a CPU-bound task in Python.

Benchmarking Multithreading in Python in a CPU-Bound Task

In this eval we will compare execution times of a dumb CPU-bound task (one million iterations) when doing all the work in one Thread versus splitting the work in multiple Threads.

Show source
from time import perf_counter_ns # High resolution timer
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def task(n):
    start_time = perf_counter_ns()
1    for _ in range(1_000_000 // n):
        pass
    end_time = perf_counter_ns()
    ellapsed_ms = (end_time - start_time) / 1_000_000
    # print(f"Time taken: {ellapsed_ms} ns")
    return ellapsed_ms


2N_REPETITIONS = 50
MAX_NUM_THREADS = 30

data = {}

with ThreadPoolExecutor(max_workers=MAX_NUM_THREADS) as executor:
    for n in range(1, MAX_NUM_THREADS + 1):
        l = []
        for _ in range(N_REPETITIONS):
3            partial_times = list(executor.map(task, [n] * n))
4            total_time = sum(partial_times)
            l.append(total_time)
        data[str(n)] = l
    

df = pd.DataFrame(data)
5df.to_parquet('resources/threading_benchmark.parquet')
1
Each thread will do 1_000_000 // n iterations (Perfectly splitting the work).
2
We will repeat the experiment 50 times for each number of threads to ensure statistical significance.
3
Run the task n times where n is the number of threads.
4
Add the partial times to get the total time for each group of threads.
5
Save the results to a parquet file.

We can now do a box plot of the results:

Show source
import plotly.graph_objects as go
import numpy as np
import pandas as pd
import plotly.io as pio

pio.renderers.default = "notebook"

N = 30  # Number of boxes

df_org = pd.read_parquet("resources/threading_benchmark.parquet")

print("Dataframe shape:", df_org.shape)

# generate an array of rainbow colors by fixing the saturation and lightness of the HSL
# representation of colour and marching around the hue.
# Plotly accepts any CSS color format, see e.g. http://www.w3schools.com/cssref/css_colors_legal.asp.
c = ["hsl(" + str(h) + ",50%" + ",50%)" for h in np.linspace(0, 360, N)]


# Each box is represented by a dict that contains the data, the type, and the colour.
# Use list comprehension to describe N boxes, each with a different colour and with different randomly generated data:
fig = go.Figure(
    data=[
        go.Box(
            y=df_org.iloc[:, i].values,
            marker_color=c[i],
            name=str(i + 1),
            boxmean=True
        )
        for i in range(int(N))
    ]
)

# format the layout
fig.update_layout(
    xaxis=dict(zeroline=True, gridcolor="white", showgrid=True, zerolinecolor="white", ),
    yaxis=dict(gridcolor="white", zeroline=True, showgrid=True, zerolinecolor="white",),
    paper_bgcolor="rgb(246,246,246)",
    plot_bgcolor="rgb(246,246,246)",
    title=dict(
      text="<b>Time Taken for One Million Iterations<br>With Different Number <br>of Threads</b>",
      x=0.5, 
      y=0.95,
      font=dict(
        family="Times New Roman, serif",
        size=24),
      ),
    xaxis_title="Number of Threads Used",
    yaxis_title="Time (ms)",
    legend_title=dict(
      text="Threads",
      font=dict(
        size=17,
        color="black"
      ),
    ),
    legend=dict(
        font=dict(
            size=11,
            color="black"
        ),
    ),
    margin=dict(l=40, r=40, t=40, b=10),
    font=dict(family="Times New Roman, serif", size=16),
    height=660,
    width=None
)
fig.update_yaxes(minor_ticks="inside",
                  minor=dict(
                    ticklen=6, 
                    tickcolor="black", 
                    tickmode='auto', 
                    nticks=10, 
                    showgrid=True
                    )
                )

fig.show(config={"responsive": True})
Dataframe shape: (50, 30)

It’s clear now that for a CPU-bound task, multithreading in Python undermines performance. The theory is that with more threads there is still no parallelism because of the GIL; however, since we increase the number of threads competing for the CPU and for the GIL, the overhead of context switching between threads grows, and performance degrades.

There is a nice way of testing this theory with hard data. We can use the psutil library to count the number of context switches per process and try to find the correlation between this number and the time taken to complete the tasks for each group of threads.

For such an endeavour, we just need a small modification of the prior code, so that we not only measure the time taken for each number of threads, but also the number of context switches per process.

Show source
from time import perf_counter_ns # High resolution timer
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
import psutil

1def task(n):
    start_time = perf_counter_ns()
    for _ in range(1_000_000 // n): # Each thread will do 1_000_000 // n iterations (Perfectly splitting the work)
        pass
    end_time = perf_counter_ns()
    ellapsed_ms = (end_time - start_time) / 1_000_000
    return ellapsed_ms


N_REPETITIONS = 50
MAX_NUM_THREADS = 30

data = {}
process = psutil.Process()

with ThreadPoolExecutor(max_workers=MAX_NUM_THREADS) as executor:
    for n in range(1, MAX_NUM_THREADS + 1):
        l = []
        switches = []
        for _ in range(N_REPETITIONS):
2            init_ctx_switches = process.num_ctx_switches()
            init_ctx_switches = init_ctx_switches.voluntary + init_ctx_switches.involuntary
            partial_times = list(executor.map(task, [n] * n)) # Run the task n times
3            end_ctx_switches = process.num_ctx_switches()
            end_ctx_switches = end_ctx_switches.voluntary + end_ctx_switches.involuntary
4            ctx_switches = end_ctx_switches - init_ctx_switches
            total_time = sum(partial_times)
            l.append(total_time)
            switches.append(ctx_switches)
        data[str(n)] = l
5        data[f"{n}_ctx_switches"] = switches
    

df = pd.DataFrame(data)
df.to_parquet('resources/threading_benchmark_ctx_switches.parquet')
1
This will remain the same as before. The measurement won’t happen here, since we can only do process-scoped measurements with psutil.
2
We will measure the number of context switches of the running process before the execution of our tasks.
3
We will measure the number of context switches of the running process after the execution of our tasks.
4
The difference between the two will give us the number of context switches that our task caused.
5
We will add this information to the dataframe.

With this information, we can calculate Pearson’s correlation between the time taken by each group of threads and the number of context switches reported by the parent process of all those threads.

Show source
import pandas as pd
import plotly.graph_objects as go
import numpy as np

df = pd.read_parquet("resources/threading_benchmark_ctx_switches.parquet")
NUM_THREADS = df.shape[1] // 2

corrs = []
for n in range(1, NUM_THREADS + 1):
    corrs.append(df[str(n)].corr(df[f"{n}_ctx_switches"]))

x = np.arange(1, len(corrs) + 1)
mean_corr = np.mean(corrs)

fig = go.Figure(data=go.Scatter(x=x, y=corrs, mode='lines+markers',))
fig.add_hline(y=mean_corr, line_dash="dot",
              annotation_text=f"<b>Mean Correlation<br>{mean_corr:.3f}</b>", 
              annotation_position="bottom right",
              annotation_font_size=16
              )
fig.update_layout(
    boxgap=0.3,
    boxmode='group',
    xaxis=dict(zeroline=True, gridcolor="white", showgrid=True, zerolinecolor="white", ),
    yaxis=dict(gridcolor="white", zeroline=True, showgrid=True, zerolinecolor="white"),
    title=dict(
      text="<b>Correlation of<br>Time and Context Switches</b>",
      x=0.5, 
      y=0.93,
      font=dict(
        family="Times New Roman, serif",
        size=24),
      ),
    xaxis_title="Number of Threads Used",
    yaxis_title="Pearson Correlation Coefficient",
    margin=dict(l=40, r=20, t=80, b=40),
    font=dict(family="Times New Roman, serif", size=16)


)
fig.update_yaxes(minor_ticks="inside",
                  minor=dict(
                    ticklen=6, 
                    tickcolor="black", 
                    tickmode='auto', 
                    nticks=6, 
                    showgrid=True
                    )
                )
fig.show(config={"responsive": True})

Even with the noise of not having per-thread data, the correlation is quite clear: the more threads you use, the more context switches you get, and the more context switches you get, the longer it takes to complete the task.

Look at the following graph for an in-depth final look at our data:

Show source
import plotly.express as px
import plotly.graph_objs as go

def flatten_data(df):
    flattened_data = []
    for key in df.keys():
        data_type = "Thread"
        if key.endswith("_ctx_switches"):
            data_type = "Context Switches"
        idx = key.split("_")[0]
        for value in df[key]:
            flattened_data.append({"Thread": idx, "Value": value, "Type": data_type})
    return pd.DataFrame(flattened_data)

# df = pd.read_parquet("resources/threading_benchmark_ctx_switches.parquet")
flattened_df = flatten_data(df)

threads_df = flattened_df[flattened_df.Type == "Thread"]
ctx_switches_df = flattened_df[flattened_df.Type == "Context Switches"]

threads_data = go.Box(x=threads_df["Thread"], y=threads_df["Value"], name="Time<br>Taken", boxmean=True)
ctx_switches_data = go.Box(x=ctx_switches_df["Thread"], y=ctx_switches_df["Value"], name="Context<br>Switches", boxmean=True, yaxis="y2")
fig = go.Figure(data=[threads_data, ctx_switches_data])
fig.update_traces(quartilemethod="inclusive", boxmean=True) 
fig.update_layout(
    boxgap=0.2,
    boxmode='group',
    xaxis=dict(zeroline=True, gridcolor="white", showgrid=True, zerolinecolor="white", ),
    yaxis=dict(gridcolor="white", zeroline=True, showgrid=True, zerolinecolor="white"),
    paper_bgcolor="rgb(248,248,248)",
    plot_bgcolor="rgb(248,248,248)",
    title=dict(
      text="<b>Visual Correlation of<br>Time Taken vs Context Switches</b>",
      x=0.48, 
      y=0.95,
      font=dict(
        family="Times New Roman, serif",
        size=24),
      ),
    xaxis_title="Number of Threads Used",
    yaxis_title="Time (ms)",
    margin=dict(l=40, r=40, t=40, b=40),
    font=dict(family="Times New Roman, serif", size=16),
    height=650,
    width=None,
    yaxis2=dict(
        title="Number of Context Switches",
        overlaying="y",
        side="right",
        showgrid=False,
        zeroline=False,
        type="log"
    )


)
fig.update_yaxes(minor_ticks="inside",
                  minor=dict(
                    ticklen=6, 
                    tickcolor="black", 
                    tickmode='auto', 
                    nticks=10, 
                    showgrid=True
                    ),
                    type="log"
                )
fig.show(config={"responsive": True})
Note

Ideally, we should measure the number of context switches per thread, not per process as we do with psutil. The reason is quite simple: with the prior method, context switches from the main Python thread are also counted, despite the fact that we are only interested in the context switches of the threads we created. This extra count just adds noise to our measurements.

Sadly, measuring context switches per thread is not possible with the psutil library. On Unix-based systems, though, it is quite straightforward to write a small bit of custom code to get those numbers, thanks to the /proc filesystem, where the kernel diligently puts all the info we need in a convenient file per thread that we can open and read. On Windows, however, the necessary steps are waaaaay more complicated and the result is not nearly as good as on Unix systems.

Counting Context Switches Per Thread in Python

For the sake of completeness, I will show what would be required to measure the context switches per thread on Unix systems, together with some hacks to get the best info possible on Windows systems… at least the best I could find.

Although, I will warn now that I shall not reproduce the prior results with these new measurement methods, since it would require a lot of extra platform-dependent code and would not be very useful for the main point of this section, which has already been established.

Unix Systems

In Unix systems, we can get the number of context switches per thread by reading the /proc/[pid]/task/[tid]/status file, where [pid] is the process id and [tid] is the thread id.

In this file we can find the number of voluntary and involuntary context switches for each thread.

Here is a simple Python function that reads the /proc/[pid]/task/[tid]/status file and returns the number of voluntary and involuntary context switches for a given thread:

Show source
def get_context_switches_per_thread(pid, tid):
    with open(f"/proc/{pid}/task/{tid}/status") as f:
        for line in f:
            if line.startswith("voluntary_ctxt_switches"):
                voluntary = int(line.split()[1])
            elif line.startswith("nonvoluntary_ctxt_switches"):
                involuntary = int(line.split()[1])
    return voluntary, involuntary

Quite simple indeed, but very effective.
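
For instance, a thread can read its own counters by combining os.getpid() with threading.get_native_id() (available since Python 3.8). A quick sketch, assuming the function above is defined in the same module and we are on a Linux-style /proc:

import os
import threading

# Read the counters for the calling thread itself.
pid = os.getpid()
tid = threading.get_native_id()            # the kernel-level thread id, not Thread.ident
voluntary, involuntary = get_context_switches_per_thread(pid, tid)
print(f"{voluntary=} {involuntary=}")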

Windows Systems

For Windows we basically need to do black magic to get the number of context switches per thread per second. Yes, per second, because the Windows API only provides the information in this way (well, I was only able to find it in this way, at least).

In this case we will exploit Performance Counters to get the number of context switches per thread per second. We need to use several of these counters, because the one we actually care about does not use process ids and thread ids, but rather process names and instances, odd concepts that can, with some extra work, be mapped to the process ids and thread ids that we have in Python.

Performance Counters are a series of Windows API functions that allow us to get a lot of information about the system, like CPU usage, memory usage, disk usage, etc. In our case the performance counter we are interested in is the Context Switches/sec counter. To access these counters, we need to specify a counter path, a string that tells the API where to find the counter we are interested in.

The general format of the counter path is \\Computer\PerfObject(ParentInstance/ObjectInstance#InstanceIndex)\Counter, where:

  • Computer is the name of the computer we want to get the information from, in our case it will be the local computer.
  • PerfObject is the name of the performance object we are interested in, in our case it will be Thread, although, as said before, we need extra info to know which thread.
  • Counter is the name of the counter we are interested in, in our case it will be Context Switches/sec.

The extra info we need is the ParentInstance/ObjectInstance#InstanceIndex part of the path. This is an alternative way that Windows has to index threads. For example, on my machine I can list all Python thread identifiers as follows:

Show source
import win32pdh
import re

1pattern = r'python\/\d+#?\d?'

python_threads = [re.findall(pattern, s)[0] 
                  for s in win32pdh.ExpandCounterPath('\\Thread(*)\\ID Thread') 
                  if 'python' in s]

3print(python_threads)
1
This regular expression will match all the Python thread identifiers in the format python/[instance]#[index] and also python/[instance].
3
I’ve actually deactivated code execution for these examples since they are platform dependent and I sometimes render my documents on other platforms.

In the case of my Windows computer what the prior cell would print is something like:

['python/0#6', 'python/1#5', 'python/2#5', 'python/3#5', 'python/0#5', 'python/1#4', 'python/2#4', 'python/3#4', ...., 'python/27', 'python/0', 'python/1', 'python/2', 'python/3']

The task is now to decipher which of these identifiers corresponds to the current thread id and process id.

To do so we can use the win32pdh module again, which is a Python wrapper around the Windows Performance Data Helper (PDH) API. This API allows us to access performance counters in Windows. We will first need to use the ID Process and ID Thread counters to get the process id and thread id for each thread.

Let’s define a function that returns the pid and tid for a given parent and instance:

Show source
import win32pdh

def get_pid_tid(parent, instance):
    """
    Get the Process ID (PID) and Thread ID (TID) for a given parent and instance.

    Parameters:
    parent (int): The parent value.
    instance (int): The instance value.

    Returns:
    tuple: A tuple containing the Process ID (PID) and Thread ID (TID).

    Raises:
    Exception: If there is an error while getting the counter data.
    """

    pid_path = f"\\Thread({parent}/{instance})\\ID Process"
    tid_path = f"\\Thread({parent}/{instance})\\ID Thread"

    query_handle = win32pdh.OpenQuery()
    pid_handle = win32pdh.AddCounter(query_handle, pid_path)
    tid_handle = win32pdh.AddCounter(query_handle, tid_path)
    pid, tid = None, None

    try:
        win32pdh.CollectQueryData(query_handle)
        _, pid = win32pdh.GetFormattedCounterValue(pid_handle, win32pdh.PDH_FMT_LONG)
        _, tid = win32pdh.GetFormattedCounterValue(tid_handle, win32pdh.PDH_FMT_LONG)
    except Exception as e:
        print(f"Failed to get counter data: {e}")

    win32pdh.CloseQuery(query_handle)
    return pid, tid

Now we can associate the process ids and thread ids with the thread identifiers we got before, and check which identifier is the one for the current pid and tid:

Show source
from multiprocessing import current_process
from threading import current_thread

current_pid = current_process().pid
current_tid = current_thread().ident


threads_info = {}
current_thread_info = None

for t in python_threads:
    parent, instance = t.split("/")
    pid, tid = get_pid_tid(parent, instance)
    threads_info[t] = dict(pid=pid, tid=tid)

    if pid == current_pid and tid == current_tid:
        current_thread_info = dict(parent=parent, instance=instance, pid=pid, tid=tid)

print(current_thread_info)

The time I executed the prior code on my Windows machine, the info for the current thread was found to be:

{'parent': 'python', 'instance': '0#5', 'pid': 27884, 'tid': 30752}

This was the final piece of the puzzle we needed. Calculating the context switches per second of our threads has become possible with the use of the following function:

Show source
import win32pdh
import win32api
import time

def get_thread_context_switches(parent, instance):
    query_handle = win32pdh.OpenQuery()
    
    # Performance counter path for context switches of a specific thread
    path = f"\\Thread({parent}/{instance})\\Context Switches/sec"
    counter_handle = win32pdh.AddCounter(query_handle, path)
    
    try:
        win32pdh.CollectQueryData(query_handle)
        for i in range(1_000_000):
1          pass # Simulate some work in here
        win32pdh.CollectQueryData(query_handle)
        _, value = win32pdh.GetFormattedCounterValue(counter_handle, win32pdh.PDH_FMT_LONG)
    except Exception as e:
        print(f"Failed to get counter data: {e}")
        value = None
    
    win32pdh.CloseQuery(query_handle)
    return value
1
In a real world scenario we would do some productive stuff in here.

By calling get_thread_context_switches we can get the number of context switches per second for our thread.

Show source
parent, instance = current_thread_info["parent"], current_thread_info["instance"]
context_switches = get_thread_context_switches(parent, instance)
print(context_switches)

When I executed this cell on my Windows PC it printed 80 the first time and 120 the second time, then I got bored :p. Two data points for you, that is more than what you started with ;D

And, again, just like that, we have the number of context switches per second for our running thread.

With the prior information, you have all the tools needed to start using multithreading in Python. Please, just don’t use it for CPU-bound tasks; the whole point of the benchmark we did was to show why not. Rather, use it for I/O-bound tasks, such as making multiple network calls and the like.
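To give you a taste of the I/O-bound case, here is a minimal sketch that downloads a few pages concurrently with a ThreadPoolExecutor; the URLs are just placeholders, any slow network call would do:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Placeholder URLs; swap in whatever endpoints you actually need.
urls = [
    "https://www.python.org",
    "https://docs.python.org/3/",
    "https://pypi.org",
]

def fetch(url):
    # The GIL is released while the thread waits on the network,
    # so these downloads genuinely overlap.
    with urlopen(url, timeout=10) as response:
        return url, len(response.read())

with ThreadPoolExecutor(max_workers=len(urls)) as executor:
    for url, size in executor.map(fetch, urls):
        print(f"{url} -> {size} bytes")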

For CPU-bound tasks, we do have a way in Python of improving their performance and writing truly parallel code, this is by using the multiprocessing module, which we will cover in the next section.

Multiprocessing in Python

Multiprocessing is one of the few ways of writing truly parallel code in Python.

Since the GIL was such a problem, because threads could not run in parallel within the same interpreter, the solution the Python community found was to simply have multiple interpreters running in parallel in multiple processes. Because different processes do not share memory space, each interpreter’s reference-count variables are completely safe from race conditions.

In this section we will explain how to use the multiprocessing and concurrent.futures modules to create, manage and synchronize processes. You will find that most of the APIs for doing these tasks are remarkably similar to the ones used for multithreading.

We will also see some of the many ways to handle interprocess communication, a new problem that arises because our processes do not share memory space and thus don’t have an easy way of exchanging information with each other.

The multiprocessing and concurrent.futures Modules

The Process class from the multiprocessing module is the equivalent of the Thread class from the threading module. It allows you to create and manage processes in Python.

There is, though, a new catch when creating a process with this class. The multiprocessing module uses low-level system calls to start each new process, and these calls differ a little depending on the OS you are running. There are three main start methods that can be used to create a new process, and only one of them is available on all three major OSs (Mac, Linux and Windows). Here is a table with this information:

| Start Method | Mac | Linux | Windows |
|--------------|-----|-------|---------|
| fork         | Yes | Yes   | No      |
| forkserver   | Yes | Yes   | No      |
| spawn        | Yes | Yes   | Yes     |

Additionally, the default start method used by each platform is:

  • fork on Linux
  • spawn on Windows and Mac.
Important

For most of the code in this section I won’t be executing the code cells, but rather just showing their expected results in each execution environment. This is because multiprocessing code is highly platform dependent and, in particular, it plays badly with Jupyter notebooks on Windows (like this blog). For that reason, I would rather take the problem as an opportunity to explain all these platform-dependent nuances.

Also, unless specifically stated, all the code in this section is meant to be run in a Python script, not in a Jupyter notebook. We will have a dedicated section for multiprocessing in Jupyter notebooks later on.

The fork and forkserver start methods are not available on Windows, so the spawn method is used by default on this platform. The spawn method creates a new Python interpreter from scratch. This means that the new process will not inherit the state of the parent process, and the new process will not share memory with the parent process. Also, the newly spawned process will run the main module from the beginning, so you will need to safeguard the main module with an if __name__ == "__main__": clause.

On the contrary, the fork method creates a new process by duplicating the current process. This tends to be faster than spawn and also means that the new process will inherit the state of the parent process: its variables, its memory, its file descriptors, etc. Crucially, the new process will continue executing from the point where the fork method was called, so you don’t need to safeguard the main module with an if __name__ == "__main__": clause, although it is still a good practice to do so.

Finally, the forkserver method is a compromise between the two. It creates a new process called the server process in a way similar to calling spawn, but any other new process after that will be created by calling fork on the server process. [24]

You can change the default start method by using the set_start_method function from the multiprocessing module. This function should be called before creating any new process and must be called only once in your program, otherwise you will get an error of the form RuntimeError: context has already been set.

Let’s see an example of how to change the start method to fork:

Linux, Mac
from multiprocessing import set_start_method

# Always safeguard the main module. It is a good practice.
if __name__ == "__main__":
  set_start_method("fork")
  # Create and manage processes here

The prior code will not work in Windows, since the fork method is not available in this platform. If you try to run this code in Windows, you will get an error of the form ValueError: cannot find context for 'fork'.
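
If you would rather not touch the global default (or you need different start methods in different parts of a program), the multiprocessing module also offers get_context, which returns a context object exposing the same API bound to a specific start method. A minimal sketch:

Linux, Mac, Windows
import multiprocessing as mp
import os

def worker():
    print(f"Worker: My PID is {os.getpid()}", flush=True)

if __name__ == "__main__":
    # A context bound to "spawn"; the global default is left untouched,
    # so there is no "context has already been set" error to worry about.
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=worker)
    p.start()
    p.join()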

Basic Process Creation and Management

The APIs for process creation and management in the multiprocessing module are very similar to the ones in the threading module. You can create a new process by creating an instance of the Process class and calling its start method. The target argument should be a function that the new process will run. You can wait for a process to finish by calling its join method.

Linux, Mac, Windows
from multiprocessing import Process
import os

def worker():
1  print(f"Worker: My PID is {os.getpid()}", flush=True)

if __name__ == "__main__":
  print("This is from the main process")
2  p = Process(target=worker)
3  p.start()
4  p.join()
1
The worker function prints the PID of the current process.
2
The Process class is created with the worker function as the target.
3
The new process is started.
4
The main process waits for the new process to finish.
This is from the main process
Worker: My PID is 12345

The Process class has a name attribute that you can use to give a name to the process. This can be useful for debugging purposes.

Linux, Mac, Windows
from multiprocessing import Process, current_process
import os


def worker():
1    worker_process = current_process()
    name = worker_process.name
    print(f"Worker: My PID is {os.getpid()} and my name is: {name}", flush=True)


if __name__ == "__main__":
    print(f"This is from the main process: {os.getpid()}")
    p = Process(target=worker, name="Worker Process")
    p.start()
    p.join()
1
The current_process function returns the current process object. This object has a name attribute that you can use to get the name of the process.
This is from the main process: 9348
Worker: My PID is 27268 and my name is: Worker Process
Start Method Differences

As we saw before, fork duplicates the main process while spawn creates a new one from scratch. In practice this can lead to some differences when executing the exact same code on different platforms.

See the following example where we have code executing outside of the if __name__ == "__main__": clause:

Windows, Linux, Mac
from multiprocessing import Process, current_process

print(f"Global Scope. Process name: {current_process().name}")


def worker():
  print(f"Worker: My PID is {current_process().pid}", flush=True)


if __name__ == "__main__":
  print(f"This is from the main process: {current_process().pid}")
  p = Process(target=worker, name="Worker Process")
  p.start()
  p.join()

Since the default start method in Windows and Mac is spawn, the output of the prior code on those platforms will be:

Global Scope. Process name: MainProcess
This is from the main process: 16560
Global Scope. Process name: Worker Process
Worker: My PID is 26380

On the other hand, in Linux, the default start method is fork, so the output will be:


Global Scope. Process name: MainProcess
This is from the main process: 16560
Worker: My PID is 26380

You can see that when using fork as start method, the execution flow continues from the point the process was created, while with spawn the execution flow starts from the beginning of the main module because it is imported again in the new process.

Passing Arguments and Keyword Arguments to a Process

You can pass arguments to a process by using the args argument when creating the process. The arguments should be passed as a tuple.

Keywords arguments can also be passed to the target function by using the kwargs argument when creating the process.

Linux, Mac, Windows
from multiprocessing import Process
import os


def worker(name, age, **kwargs):
    print(f"Worker: My name is {name} and my age is {age}", flush=True)
    print(f"Worker: My PID is {os.getpid()}", flush=True)
    print(f"Worker: My kwargs are: {kwargs}", flush=True)


if __name__ == "__main__":
    print(f"This is from the main process: {os.getpid()}")
    p = Process(
        target=worker, args=("John", 25), kwargs={"city": "New York", "country": "USA"}
    )
    p.start()
    p.join()
This is from the main process: 23096
Worker: My name is John and my age is 25
Worker: My PID is 5148
Worker: My kwargs are: {'city': 'New York', 'country': 'USA'}
Extending the Process Class

You can create a new class that extends the Process class and overrides the run method. The run method is the method that the new process will run.

Linux, Mac, Windows
from multiprocessing import Process
import os

class Worker(Process):
    def __init__(self, name, age, **kwargs):
        super().__init__()
        self.name = name
        self.age = age
        self.kwargs = kwargs

    def run(self):
        print(f"Worker: My name is {self.name} and my age is {self.age}", flush=True)
        print(f"Worker: My PID is {os.getpid()}", flush=True)
        print(f"Worker: My kwargs are: {self.kwargs}", flush=True)


if __name__ == "__main__":
    print(f"This is from the main process: {os.getpid()}")
    p = Worker("John", 25, city="New York", country="USA")
    p.start()
    p.join()
This is from the main process: 3316
Worker: My name is John and my age is 25
Worker: My PID is 2000
Worker: My kwargs are: {'city': 'New York', 'country': 'USA'}

The pattern above can be useful when dealing with parallelizable entities, like a list of tasks that can be executed in parallel. In this way, your object-oriented code can be more organized and easier to maintain.

Using a ProcessPoolExecutor from the concurrent.futures Module

The concurrent.futures module provides a high-level interface for managing the creation of processes. The ProcessPoolExecutor class is a subclass of the Executor class and allows you to create a pool of processes that can be used to execute tasks asynchronously.

The ProcessPoolExecutor class has a map method that works similarly to the map method of the ThreadPoolExecutor class. The map method takes a function and an iterable of arguments and returns an iterator that yields the results of applying the function to the arguments.

Linux, Mac, Windows
from concurrent.futures import ProcessPoolExecutor
import os

print("This is from global scope", flush=True)


def worker(name, age):
    print(f"Worker: my PID is {os.getpid()}", flush=True)
    return f"Worker: My name is {name} and my age is {age}"


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=3) as executor:
        results = executor.map(worker, ["John", "Jane"], [25, 30])
        for result in results:
            print(result, flush=True)

As it is, when executed on Windows it will almost always print something like the following:

This is from global scope
This is from global scope
This is from global scope
Worker: my PID is 6416
Worker: my PID is 6416
Worker: My name is John and my age is 25
Worker: My name is Jane and my age is 30

You might notice things like the main module being imported multiple times, but what you should pay attention to here is that the process pool is reusing the same process to execute both tasks. If we execute the same code on Linux, we will almost always get the following output:

This is from global scope
Worker: my PID is 33264
Worker: My name is John and my age is 25
Worker: my PID is 33265
Worker: My name is Jane and my age is 30

Not only is the main module not imported multiple times, but the tasks are also executed in different processes, as expected.

Now, Windows IS capable of using many processes; it just seems that it only does so when it really needs to. For example, if we put a sleep call in the middle of the worker task:

Linux, Mac, Windows
from concurrent.futures import ProcessPoolExecutor
import os
import time

print("This is from global scope", flush=True)


def worker(name, age):
    time.sleep(1)
    print(f"Worker: my PID is {os.getpid()}", flush=True)
    return f"Worker: My name is {name} and my age is {age}"


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=3) as executor:
        results = executor.map(worker, ["John", "Jane"], [25, 30])
        for result in results:
            print(result, flush=True)

When run on Windows, it will consistently print something like the following:

This is from global scope
This is from global scope
This is from global scope
Worker: my PID is 8392
Worker: my PID is 11304
Worker: My name is John and my age is 25
Worker: My name is Jane and my age is 30

This shows that, in this case, it does use multiple processes. The earlier behaviour is likely due to the spawn method again. The reason is simple: creating a new process with spawn is expensive, so for quick tasks like the prior example, the first task may have released the first process before the ProcessPoolExecutor finished creating the second one. In that case, from the point of view of the second task, the first process was already available, so it just used it.

In summary, do not trust that when using a ProcessPoolExecutor the tasks will be run in different OS Processes with different PIDs.

Another way of using the ProcessPoolExecutor class is by calling its submit method. This method takes a function and its arguments and returns a Future object that represents the result of the function call. You can use the result method of the Future object to get the result of the function call.

Linux, Mac, Windows
from concurrent.futures import ProcessPoolExecutor
import os
import time


def worker(name, age):
    time.sleep(1)
    print(f"Worker: my PID is {os.getpid()}", flush=True)
    return f"Worker: My name is {name} and my age is {age}"


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=3) as executor:
        future1 = executor.submit(worker, "John", 25)
        future2 = executor.submit(worker, "Jane", 30)
        print(future1.result(), flush=True)
        print(future2.result(), flush=True)
Worker: my PID is 15808
Worker: my PID is 21816
Worker: My name is John and my age is 25
Worker: My name is Jane and my age is 30
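
If you submit many tasks and want each result as soon as it is ready (rather than in submission order), concurrent.futures also provides the as_completed function. A minimal sketch reusing the same worker:

Linux, Mac, Windows
from concurrent.futures import ProcessPoolExecutor, as_completed
import os
import time


def worker(name, age):
    time.sleep(1)
    return f"Worker (PID {os.getpid()}): My name is {name} and my age is {age}"


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=3) as executor:
        futures = [executor.submit(worker, n, a) for n, a in [("John", 25), ("Jane", 30)]]
        # as_completed yields each future as soon as its result is ready.
        for future in as_completed(futures):
            print(future.result(), flush=True)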
Printing and Logging from different Processes

You may have seen in all the prior example code that when printing we always use the flush=True argument. The reason for this extra code is that the standard output of child processes created with spawn may, on some OSs and in some circumstances, be treated as a buffered stream, so the output may not be printed immediately but only when the buffer is full or when the process finishes. This can be quite annoying when debugging, so we use the flush=True argument to force the output to be printed immediately.

Another BIG consideration to keep in mind is that when you create a new process with spawn, the new process will not inherit the state of the parent process, so it will not inherit the stdout and stderr streams. This means that if you use the print function in the new process, depending on the situation, the output may not be printed to the console, but rather to a file or to nowhere at all. Usually, if you execute your Python script from the command line, the output will be printed to the console, but this is only because the default behaviour of OSs when creating a new process from a console is to set that console as the stdout and stderr streams of the new process, so both the main process and all its children end up pointing to the console.

BUT, keep in mind that although both parent and children point to the same external location (the console where the .py script was launched), the children DO NOT have the same stdout and stderr objects as the parent process, so if you change the stdout or stderr objects in the parent process, the children will not be affected.

We can see this in action by using the redirect_stdout context manager from the contextlib module. This context manager allows you to redirect the stdout stream to a file-like object.

In the following example, we will redirect the stdout stream to a file and then create a new process that prints to the stdout stream.

Linux, Mac, Windows
from multiprocessing import Process
import os
from contextlib import redirect_stdout


class Worker(Process):
    def __init__(self, name, age, **kwargs):
        super().__init__()
        self.name = name
        self.age = age
        self.kwargs = kwargs

    def run(self):
        print(f"Worker: My name is {self.name} and my age is {self.age}", flush=True)
        print(f"Worker: My PID is {os.getpid()}", flush=True)
        print(f"Worker: My kwargs are: {self.kwargs}", flush=True)


if __name__ == "__main__":
    with open("log.txt", "w") as f:
        with redirect_stdout(f):
            print(f"This is from the main process: {os.getpid()}")
            p = Worker("John", 25, city="New York", country="USA")
            p.start()
            p.join()

For Windows, the console output will be:

Worker: My name is John and my age is 25
Worker: My PID is 11712
Worker: My kwargs are: {'city': 'New York', 'country': 'USA'}

and the file output:

log.txt
This is from the main process: 14676

In Linux though, the console output will be empty and the file output will be:

log.txt
This is from the main process: 270
Worker: My name is John and my age is 25
Worker: My PID is 271
Worker: My kwargs are: {'city': 'New York', 'country': 'USA'}

So the stdout redirection in Linux affected not only the parent but also the children, while on Windows it only affected the parent.

This distinction may seem subtle, and indeed it is, but it is the main reason why printing from child processes is basically broken in Jupyter notebooks on Windows.

We will see this in more detail in the next section.

Multiprocessing in Jupyter Notebooks

For the ML community, Jupyter Notebooks are one of the core tools for writing and prototyping Python code. And if you are following this tutorial, you will probably want to use multiprocessing in Jupyter Notebooks.

There are design choices in Jupyter that make multiprocessing a little bit more complicated than in a Python script, even more so when using Windows.

We’ll start with the basics, creating a simple process in a Jupyter Notebook.

To see everything in action, we will first start a Jupyter Notebook server by running the following command in the terminal (of course, you must have jupyter installed):

jupyter notebook

Please don’t use out-of-the-box notebooks like the ones in VS Code or PyCharm for this section. You will see in a moment why it is important to start the server directly from the terminal.

This will open a new tab in your browser with the Jupyter Notebook interface. But crucially, from the terminal you started the server, you will also have access to the server logs, which will be very important for this section.

Example of Jupyter Notebook Server Logs

Create a new Python 3 notebook and run the following code:

Linux, Mac, Windows
import multiprocessing as mp
from time import sleep

def mp_func():
    print('Starting mp_func')
    sleep(1)
    print('Finishing mp_func')

if __name__ == '__main__':
    mp.set_start_method("spawn")
    p = mp.Process(target=mp_func)
    p.start()
    print('Waiting for mp_func to end')
    p.join()
    print('mp_func ended')

This is almost the same code we used before, so it should work fine, shouldn’t it? Well, it doesn’t. If you run this code in a Jupyter Notebook, it will throw the following error:

Show source
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/diegomezp/miniconda3/envs/general311/lib/python3.11/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/diegomezp/miniconda3/envs/general311/lib/python3.11/multiprocessing/spawn.py", line 132, in _main
    self = reduction.pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: Can't get attribute 'mp_func' on <module '__main__' (built-in)>

This happens because there is a major limitation of the multiprocessing module on interactive interpreters like Jupyter Notebooks. Quoting the Python documentation:

Functionality within this package requires that the __main__ module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the multiprocessing.Pool examples will not work in the interactive interpreter. source

This limitation only appears when using spawn as the start method, i.e., by default on Windows and Mac. If you change the start method to fork, the code will work fine on Linux and Mac (but not on Windows). To solve the issue, you must make the mp_func function importable by the children. This can be done by moving the function to a separate module and importing it in the main module.

Create a new Python file called mp_module.py with the following content:

mp_module.py

from time import sleep

def mp_func():
    print('Starting mp_func')
    sleep(1)
    print('Finishing mp_func')

Now, modify the Jupyter Notebook code to import the mp_func function from the mp_module module:

Linux, Mac, Windows
import multiprocessing as mp
from mp_module import mp_func

if __name__ == '__main__':
    mp.set_start_method("spawn")
    p = mp.Process(target=mp_func)
    p.start()
    print('Waiting for mp_func to end')
    p.join()
    print('mp_func ended')

This should be sufficient for a successful execution on Linux, Mac and Windows. But if you are following the code on a Windows machine, you may have noticed that the print statements from the child process are not being printed to the notebook. Here is where the console logs from the Jupyter Notebook server come in handy: if you look at the logs, you will see that the output of the child process is printed there.

Example of Jupyter Notebook Server Logs

As of now, this behaviour is pretty much unavoidable on Windows. The reason involves process inheritance and the way Jupyter Notebooks are designed.

Why Are Child Process Prints Broken in Jupyter Notebooks on Windows?

We need to get acquainted with the way Jupyter Notebooks work. When you run a Jupyter Notebook server, you are actually running a web server that serves the Jupyter Notebook interface. All the code is executed by a background Python process called the kernel. Jupyter intercepts the stdout and stderr streams of the kernel process and redirects them to the web interface; this is how it shows the output of your code cells in the web view.

Jupyter Notebook Architecture Diagram. Source

When you create a new process in a Jupyter Notebook, the new process is a child of the kernel process. The kernel process is the one whose stdout and stderr streams are redirected to the web interface; whether the output of the child process is shown in the notebook depends entirely on whether the child process inherits the stdout and stderr streams of the kernel process.

When those processes are created using fork, by design they are a copy of the parent process, including its file descriptors; in particular, they inherit the stdout and stderr streams of the parent process. This is why the output of the child process is shown in the notebook when using fork as the start method.

But when using spawn, it depends entirely on the OS. On some Unix-based systems, the child process will inherit the stdout and stderr streams of the parent process, so the output of the child process will be shown in the notebook. On Windows, however, the child process will not inherit the stdout and stderr streams of the parent process, so the output of the child process will not be shown in the notebook but wherever its default location was when the process was created.

This default location is usually the console where the Jupyter Notebook server was started, so you can see the output of the child process in the console logs of the Jupyter Notebook server. In fact, even with fork, the output of the child process will be shown in the console logs too, not only in the notebook.

So how do you log stuff in this specific, although not quite uncommon, case? Well, if you want it to be printed in the notebook, you need an alternate way of sending the output to the kernel process. One way of doing this is by using the Queue class from the multiprocessing module.

We will have a separate section on interprocess communication, but I wanted to give a possible solution right here in the same section where the problem was posed.

mp_module.py

from time import sleep

def mp_func(queue):
    try:
        queue.put('Starting mp_func')
        sleep(1)
        queue.put('Finishing mp_func')
        queue.put("mp_func completed successfully")
    except Exception as e:
        queue.put(f"Error in mp_func: {e}")
Linux, Mac, Windows
import multiprocessing as mp
from time import sleep
from mp_module import mp_func

if __name__ == '__main__':
    queue = mp.Queue()
    p = mp.Process(target=mp_func, args=(queue,))
    p.start()
    print('Waiting for mp_func to end')
    p.join()
    # Get output from the queue
    while not queue.empty():
        print(queue.get())
Waiting for mp_func to end
Starting mp_func
Finishing mp_func
mp_func completed successfully

This code works on all platforms and prints the output of the child process in the notebook.

Synchronization Primitives in Multiprocessing

Lock
RLock
Semaphore
Event
Condition
Barrier
Queue
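Each of these will get its own treatment, but to give a first taste, here is a minimal sketch (the worker function, timings, and messages are illustrative) of the most common primitive, Lock, used to keep a critical section from interleaving across processes. As before, run it as a script or keep the worker in an importable module.

import multiprocessing as mp
from time import sleep

def worker(lock, worker_id):
    # Without the lock, the two prints of one worker could interleave with
    # those of another worker; holding the lock makes the block atomic.
    with lock:
        print(f'worker {worker_id}: entering critical section')
        sleep(0.1)
        print(f'worker {worker_id}: leaving critical section')

if __name__ == '__main__':
    lock = mp.Lock()
    procs = [mp.Process(target=worker, args=(lock, i)) for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()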

Interprocess Communication

Pipes
Queues
Shared Memory
Value and Array
Manager
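As a preview of this section, here is a minimal sketch (the message contents are illustrative) of the simplest mechanism, a Pipe, which gives you two connection objects that can send and receive picklable Python objects between processes.

import multiprocessing as mp

def child(conn):
    # Receive a message from the parent, transform it, and send the result back.
    msg = conn.recv()
    conn.send(msg.upper())
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = mp.Pipe()  # duplex by default
    p = mp.Process(target=child, args=(child_conn,))
    p.start()
    parent_conn.send('hello from the parent')
    print(parent_conn.recv())  # prints: HELLO FROM THE PARENT
    p.join()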

Benchmarking Multiprocessing in Python in a CPU-Bound Task

Asynchronous Programming in Python

Practical Examples

(CPU-Bound) Speeding up Image Pre-Processing Tasks

(I/O-Bound) Speeding up Batch Calls to the OpenAI API

Conclusion


Footnotes

  1. If we get really precise, the tasks in Figure 1 are actually CPU-level instructions, but for the sake of simplicity we will think of them as functions. For now…↩︎

  2. Let’s forget about GPUs for now, we will cover that in other blogs. Also, in my defense, we lived perfectly okay without them for half of the digital era. ;p↩︎

  3. That is why we have compilers and why a program needs to be compiled for every different target Instruction Set Architecture, like x86-64, x86, or RISC-V, since those have different instruction sets↩︎