Computers are not as parallel as we think
• July 7, 2021
Often users report that their computation isn’t scaling as well as they expected.
This is confusing. Our machines have more and more cores. Why aren’t we able to parallelize across all of those cores? What is getting in the way?
The Myth of the Parallel CPU
Computer scientists have lied to you. We’ve promoted a perception of reality that your computer is many independent CPU cores, all happy to work independently to solve various problems. Here is an evocative image from a VentureBeat article on multi-core CPUs
But in reality, there are many other aspects to computer architecture that affect performance and that may not be as parallelizable. I’ll list a few below and then dive into one in particular, memory hierarchy.
- Disk access
- Internet access
- The Python GIL (although this is less relevant today than most people think)
- Locks and global state within your code
- Memory hierarchy
Some of these, like disk access and internet access, are well understood. If you’re reading many files from disk or the internet then you can easily saturate that resource at around 1GB/s today. Reading more files in parallel doesn’t necessarily help. It just splits up the bandwidth coming from your disk or network card (indeed, it may even hurt).
Most people seem to get this situation pretty intuitively. There is a resource apart from your CPU that is not as parallelizable. Even though your CPUs are ready and willing, they’re bottlenecked by a non-parallelizable resource.
This is also true of your memory hierarchy, to a certain extent. Here is a more realistic image of a CPU.
The cores in a CPU are genuinely parallelizable, but they often share central memory, at least beyond a certain limit. If your application is memory intensive, you may not see the kind of parallelism that you were hoping for. This leads to unsatisfying performance plots like what we saw at the beginning of this blog post.
If your cores have to pass through many bytes of memory per flop, then they’re all probably waiting on memory rather than getting their jobs done. Like disk or network access, the memory hierarchy can only do so much at once.
Typically, poor memory performance is caused by one of two common behaviors:
- Random access throughout a large dataset
- Excessive copies of data
Avoid random access patterns
If your computation marches through a dataset and does lots of work on each byte then you are unlikely to run into memory performance issues. In contrast, if your computation has to jump around your dataset a lot, visiting different pieces of data that aren’t stored in order, then you’ll probably not be able to get much speedup from parallel processing.
Common examples of random access patterns include …
- Walking through data in an order counter to the way that it is stored, like iterating over columns when your data is stored contiguously by rows
- Graph-like algorithms that hop around a dataset
- Advanced algorithms like FFTs
The good news though, is that if you’re using out-of-the-box algorithms from high-level libraries like NumPy or Pandas, then they have probably thought of this for you.
Excessive copies of data
The algorithm that produced the poor plot above was the following:
return ( -4 * field + np.roll(field, -1, axis=-1) + np.roll(field, 1, axis=-1) + np.roll(field, -1, axis=-2) + np.roll(field, 1, axis=-2) )
This was an attempt to smooth over a 2d array by making copies of that array that were slightly offset from each other and then summing those copies together. This is accurate and simple code, but making five copies of the same array stresses out the memory hierarchy. It’s no wonder that doing this on 36 cores in parallel, each core triggering five copies of data, would stress out the memory hierarchy and result in less performance than we might otherwise expect, much in the same way that trying to download 180 files from the internet simultaneously might not happen quite as quickly as we expect.
The memory hierarchy is parallelizable but not quite to the same level that CPU cores are. It’s somewhere in between CPU cores and disk/internet access.
As we start pushing our hardware to higher and higher levels of performance, we find that we need to become increasingly familiar with all aspects of that hardware, and the convenient abstractions of independent parallel processors start to break down.
Fortunately, there are tools to help us as we develop this renewed understanding. If you want to learn more about how to build memory-efficient algorithms to replace the np.roll algorithm above then I’ll point you to this Dask + Numba example on stencil computations.
Try Coiled Cloud for free
Thanks for reading. If you’re interested in trying out Coiled Cloud, which provides hosted Dask clusters, docker-less managed software, and one-click deployments, you can do so for free today when you click below.