Computers are not as parallel as we think

Matt Rocklin July 7, 2021

, ,

Often users report that their computation isn’t scaling as well as they expected.

This is confusing.  Our machines have more and more cores.  Why aren’t we able to parallelize across all of those cores?  What is getting in the way?

The Myth of the Parallel CPU

Computer scientists have lied to you.  We’ve promoted a perception of reality that your computer is many independent CPU cores, all happy to work independently to solve various problems.  Here is an evocative image from a VentureBeat article on multi-core CPUs

But in reality, there are many other aspects to computer architecture that affect performance and that may not be as parallelizable.  I’ll list a few below and then dive into one in particular, memory hierarchy.

  1. Disk access
  2. Internet access
  3. The Python GIL (although this is less relevant today than most people think)
  4. Locks and global state within your code
  5. Memory hierarchy


Some of these, like disk access and internet access, are well understood.  If you’re reading many files from disk or the internet then you can easily saturate that resource at around 1GB/s today.  Reading more files in parallel doesn’t necessarily help. It just splits up the bandwidth coming from your disk or network card (indeed, it may even hurt).  

Most people seem to get this situation pretty intuitively.  There is a resource apart from your CPU that is not as parallelizable.  Even though your CPUs are ready and willing, they’re bottlenecked by a non-parallelizable resource.  

Memory Hierarchy

This is also true of your memory hierarchy, to a certain extent.  Here is a more realistic image of a CPU.

Credit: Sandy Bridge Core i7 architecture

The cores in a CPU are genuinely parallelizable, but they often share central memory, at least beyond a certain limit. If your application is memory intensive, you may not see the kind of parallelism that you were hoping for.  This leads to unsatisfying performance plots like what we saw at the beginning of this blog post.

If your cores have to pass through many bytes of memory per flop, then they’re all probably waiting on memory rather than getting their jobs done.  Like disk or network access, the memory hierarchy can only do so much at once.  

Typically, poor memory performance is caused by one of two common behaviors:

  1. Random access throughout a large dataset
  2. Excessive copies of data

Avoid random access patterns

If your computation marches through a dataset and does lots of work on each byte then you are unlikely to run into memory performance issues.  In contrast, if your computation has to jump around your dataset a lot, visiting different pieces of data that aren’t stored in order, then you’ll probably not be able to get much speedup from parallel processing.

Common examples of random access patterns include …

  1. Walking through data in an order counter to the way that it is stored, like iterating over columns when your data is stored contiguously by rows
  2. Graph-like algorithms that hop around a dataset
  3. Advanced algorithms like FFTs

The good news though, is that if you’re using out-of-the-box algorithms from high-level libraries like NumPy or Pandas, then they have probably thought of this for you.

Excessive copies of data

The algorithm that produced the poor plot above was the following:

 return (
             -4 * field
             + np.roll(field, -1, axis=-1)
             + np.roll(field, 1, axis=-1)
             + np.roll(field, -1, axis=-2)
             + np.roll(field, 1, axis=-2)

This was an attempt to smooth over a 2d array by making copies of that array that were slightly offset from each other and then summing those copies together.  This is accurate and simple code, but making five copies of the same array stresses out the memory hierarchy.  It’s no wonder that doing this on 36 cores in parallel, each core triggering five copies of data, would stress out the memory hierarchy and result in less performance than we might otherwise expect, much in the same way that trying to download 180 files from the internet simultaneously might not happen quite as quickly as we expect.

The memory hierarchy is parallelizable but not quite to the same level that CPU cores are.  It’s somewhere in between CPU cores and disk/internet access.  


As we start pushing our hardware to higher and higher levels of performance, we find that we need to become increasingly familiar with all aspects of that hardware, and the convenient abstractions of independent parallel processors start to break down.

Fortunately, there are tools to help us as we develop this renewed understanding.  If you want to learn more about how to build memory-efficient algorithms to replace the np.roll algorithm above then I’ll point you to this Dask + Numba example on stencil computations.

Try Coiled Cloud for free

Thanks for reading. If you’re interested in trying out Coiled Cloud, which provides hosted Dask clusters, docker-less managed software, and one-click deployments, you can do so for free today when you click below.

Try Coiled Cloud

Frequently Asked Questions

The term Parallelism refers to techniques to make programs faster by performing several computations at the same time.

Traditionally, software has been written for serial computation: A problem is broken into a discrete series of instructions that are executed sequentially one after another on a single processor.

When we combine different processing units to work on a problem simultaneously, the result can be faster and more efficient than if one unit was doing all of it alone. Parallel computing is an approach that takes this idea into account by breaking down larger problems into smaller parts which are executed simultaneously using shared memory between multiple processors in order to achieve maximum efficiency.

Parallel computing has become an essential component of modern life. Without the speed and power it offers, our digital tasks would be significantly slower than they are now - think back to opening emails on your old phone: 30 seconds or more for every task! Because serial processing is so slow compared with parallel systems like those found in today’s computers, we rely heavily upon them when doing anything that requires intensive calculations such as scientific research and engineering design workflows where you need to access data from multiple sources at once.

Compared to serial computing, parallel computing is much better suited for modeling, simulating, and understanding complex, real-world phenomena. For instance, it is useful for data analysis in fields such as weather forecasting and financial markets, where there is an immense amount of data but limited processing power on individual machines.

Parallel processing is the technical term used for a method in computing in which separate parts of an overall complex task are broken up and run simultaneously on multiple CPU cores, thereby reducing the amount of time for processing. The broader term “parallel computing” refers to this technology as a whole.

A computer scientist will often divide a complex task into multiple smaller tasks with the help of software and assign each part to one processor, then let that individual operate as though they were doing everything on their own. Processors use communication tools so these tasks can continue being executed correctly without getting “out-of-process” due to changes made by other tasks. The final result is computed by reassembling the results of all the small tasks. For example, if we have 10 processors, we can calculate the sum of 100 rows of data by assigning 10 rows to each processor, and then adding all the individual results from each processor. A computer without multiple processors can also be used for parallel processing if it is networked together with other computers that have been configured as a cluster.

In terms of hardware, parallel computer architecture refers to the computer system design that enables parallel processing -- a form of computation in which many calculations can be carried out simultaneously.

Distributed computing refers to the process of using a network of computers to work together on a single task. Distributed systems are used to solve problems that require more resources than a single computer has.

Distributed programming is the process of writing computer programs that use multiple computers to function.

Cloud computing is a general term that refers to the delivery of scalable services, such as computers, databases, data storage, networking, servers, and software, over the Internet on an as-needed, pay-as-you-go basis.

The speed of a parallel algorithm can be increased by optimizing the size of each chunk of task, storing data effectively, avoiding operations that aren’t conducive to parallel execution, etc.

The speedup of an algorithm is the ratio between its execution time on a sequential computer and some other reference. For example, if you have 100 milliseconds execution time with one core on your laptop, and 10 milliseconds with 4 cores, then the “speedup” using 4 cores would be 10.

Data dependencies occur when the input of a task (say A) is the output of another task (say B), which means task A can only be executed after task B.

Ready to get started?

Create your first cluster in minutes.