Conway’s Game of Life has been an inspiration to computer scientists since its creation in 1970. Life is a simulation of cells arranged in a two-dimensional grid. Each cell can be in one of two states, alive or dead. I will leave the full explanation of Life to Wikipedia, and only restate here the rules regarding the interactions between cells:
Any live cell with fewer than two live neighbors dies, as if caused by under-population.
Any live cell with two or three live neighbors lives on to the next generation.
Any live cell with more than three live neighbors dies, as if by over-population.
Any dead cell with exactly three live neighbors becomes a live cell, as if by reproduction.
Note that in our implementation of Life, like many others, the environment wraps around to the other side at the edges, like in Pac-Man.
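To make the rules and the wraparound concrete, here is a minimal sketch of one naive update step on a torus (hypothetical names, not the code from the repository; cells are 0/1 bytes in a flat array):

```c
/* One generation of Life on an h-by-w torus. u is the current universe,
   v receives the next generation; both are flat arrays of 0/1 bytes. */
static void life_step(unsigned char *v, const unsigned char *u, int h, int w) {
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            int n = 0;
            /* Count the eight neighbors, wrapping around at the edges. */
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    if (dy || dx)
                        n += u[((y + dy + h) % h) * w + ((x + dx + w) % w)];
            /* A cell is alive next generation if it has three live
               neighbors, or two live neighbors and is currently alive. */
            v[y * w + x] = (unsigned char)((n == 3) || (n == 2 && u[y * w + x]));
        }
    }
}
```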
This text chronicles my journey in improving the performance of a Life kernel. To make this guide representative of many problems in performance optimization for scientific kernels, I have disallowed myself from pursuing algorithmic optimizations (that said, if you haven’t seen Hashlife, it is worth checking out). Many of the optimization techniques we see here should be applicable to a wide variety of codes; we will focus on optimizing the naive algorithm for a given architecture.
These techniques can make the code go faster, but they increase code complexity by several orders of magnitude and tend to need different tunings for different machines. If someone else has written an optimized version of code that does what you want to do, I would strongly recommend using that code before trying anything you see here. The general advice is to use optimized libraries whenever possible.
Like many logical simulations, Life is fully deterministic. This means that we can determine whether our simulation is correct by comparing our output to a reference implementation. The reference implementation we use will also provide a starting point for optimization. The reference implementation I use has been adapted from the RosettaCode C implementation. Rather than expound on the code for ages, I will let you read it yourself. Explanatory comments are included.
This reference implementation is easy to read and understand, but it is pretty slow. It has lots of conditional and arithmetic logic in the inner loop and it copies the entire universe at every step. The reference code is our starting point, and we will use it to check the correctness of our optimized versions.
Before we start optimizing, let’s write our benchmarking and test code. Having an accurate benchmark that tests a common case for our code gives us the information we’ll need to make optimization decisions. Our benchmark code includes test code as well. Rather than paste the whole file, I only include the highlights here.
Note that in our benchmark, TIMEOUT is set to 0.1 seconds
Before we complicate the reference implementation with our optimizations, let’s simplify it a little bit. Here is the new inner loop:
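In essence (a sketch with hypothetical names, not the exact code from the repository), the three-way conditional logic collapses into one branch-light expression:

```c
/* The four Life rules collapse to a single expression: a cell is live
   next generation iff it has exactly three live neighbors, or exactly
   two live neighbors and it is currently alive. n is the live-neighbor
   count; alive is 0 or 1 for the current cell. */
static unsigned char next_state(int n, unsigned char alive) {
    return (unsigned char)((n == 3) | ((n == 2) & alive));
}
```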
On an 8192x8192 grid for 256 iterations, these optimizations provide a 1.288x speedup over the reference implementation. Not much, but we are just getting started!
Typically, programs perform differently on different platforms and with different compilers and flags. For the record, I am using gcc version 4.9.3 with the compiler flags “-O3 -march=native”. Meet my processor! Here’s the output of the command lscpu on my machine.
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model name: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
CPU MHz: 2600.000
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
This machine has 20 cores (40 hardware threads) across two NUMA domains, so we’ll need to be mindful of the way that we access memory. We’ll also need to write parallel code to take advantage of our multiple CPUs.
One last thing before we jump down this rabbit hole. We need to know the peak performance so that we know when to stop! Notice that there are 12 necessary integer operations in our inner loop (we count the comparisons to 2 and 3, and we don’t count the redundant add and subtract of the cell value itself). Assume that we can do one instruction per clock cycle. The processor runs at 2.6*10^9 clock cycles per second, this machine supports AVX2, so we can operate on 32 8-bit integers at once, and there are 20 cores. Therefore, on average we can advance one cell one iteration in 7.2*10^-12 seconds. Therefore, the theoretical peak time to compute our 8192x8192 test problem over 256 iterations is 1.24*10^-1 seconds. Don’t forget it or you’ll never stop optimizing!
Since our program is so simple, we can be pretty sure where it spends most of its time. Our inner loop includes complicated modular arithmetic on indices and a doubly-nested for loop. Let’s fix this! We can use a technique called “padding”. In short, instead of looking to the other side of the universe to wrap around in the inner loop, we will allocate an array with extra cells on all sides (“ghost cells”) and fill these cells with values from the other side of the array. That way, when the inner loop accesses beyond the edges of the universe, it looks like the universe is wrapping around (and we don’t need to check to see if we are falling off the edge).
Each time we perform an iteration, the outer layer of valid ghost cells becomes invalid (we did not calculate anything on the outermost layer of the array, and this error propagates inward by one cell each iteration). To avoid copying with every iteration, we can pad with multiple ghost cell layers at once, and then run several iterations before each copy.
Note that this code assumes that width is a multiple of sizeof(unsigned).
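The ghost-cell fill for a single padding layer can be sketched like this (hypothetical names; the real code pads with multiple layers and copies whole words at a time, hence the width requirement above):

```c
#include <string.h>

/* Fill a one-cell ghost layer around an (h+2)-by-(w+2) padded grid so that
   reads one past each edge see the opposite side (torus wraparound).
   Interior cells live at rows 1..h, columns 1..w; W = w + 2 is the stride. */
static void fill_ghost(unsigned char *g, int h, int w) {
    int W = w + 2;
    /* Top and bottom ghost rows copy the opposite interior rows. */
    memcpy(g, g + h * W, (size_t)W);            /* row 0   <- row h */
    memcpy(g + (h + 1) * W, g + W, (size_t)W);  /* row h+1 <- row 1 */
    /* Left and right ghost columns copy the opposite interior columns;
       covering the ghost rows too fills in the four corners. */
    for (int y = 0; y < h + 2; y++) {
        g[y * W] = g[y * W + w];                /* col 0   <- col w */
        g[y * W + w + 1] = g[y * W + 1];        /* col w+1 <- col 1 */
    }
}
```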
On an 8192x8192 grid for 256 iterations, this code achieves our first significant speedup of 5.201x over the reference version!
Our calculation progresses linearly across each row, accessing only the rows above and below it. Life needs to access each element nine times (eight times as a neighbor of the surrounding cells, and once to calculate the cell itself). As computation proceeds row by row, these accesses occur in groups of three (one group for each row), and if three rows of the matrix can fit in L1 cache, then the data is only loaded from memory once per iteration. However, if our computation were more data intensive, it might benefit from a technique called blocking.
Blocking is the practice of restructuring the computation so that data in registers and cache is reused before it is evicted from these locations. This keeps the relevant data for a computation in the higher (faster) levels of the memory hierarchy. Register blocking involves rewriting the inner loop of your code to reuse values you have loaded from memory instead of loading them multiple times. Cache blocking involves restructuring the ordering of loops so that the same or nearby values are accessed soon after each other. Typically, we size the cache blocks so that the entire computation fills the L1 cache.
It doesn’t help much in this case (our kernel spends more time computing than loading from memory), but here’s an example of how to restructure the padded inner loop for cache blocking:
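One possible shape for such a blocked loop (a sketch with hypothetical names; `in` and `out` are padded grids with ghost layers already filled, stride `W = w + 2`, cells 0/1 bytes):

```c
/* Blocked traversal of a padded grid: process columns in strips of BLOCK
   cells so that the three rows a strip touches stay resident in L1. */
static void step_blocked(unsigned char *out, const unsigned char *in,
                         int h, int w) {
    enum { BLOCK = 256 };      /* strip width, tuned to the L1 cache size */
    int W = w + 2;             /* padded row stride */
    for (int xb = 1; xb <= w; xb += BLOCK) {
        int xe = (xb + BLOCK <= w + 1) ? xb + BLOCK : w + 1;
        for (int y = 1; y <= h; y++)
            for (int x = xb; x < xe; x++) {
                int n = in[(y-1)*W + x-1] + in[(y-1)*W + x] + in[(y-1)*W + x+1]
                      + in[ y   *W + x-1]                   + in[ y   *W + x+1]
                      + in[(y+1)*W + x-1] + in[(y+1)*W + x] + in[(y+1)*W + x+1];
                out[y*W + x] = (unsigned char)((n == 3) | ((n == 2) & in[y*W + x]));
            }
    }
}
```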
Let’s cram more operations into the inner loop using vectorization. Intel’s SSE vector registers are 128 bits wide, so we can cram 16 uint8_t values into a single vector register and operate on them all at once. To keep the code nice, we require that the width of the input is a multiple of 16. A good resource for Intel vector intrinsics is the Intel Intrinsics Guide. The best way for me to show you what the inner loop looks like at this point would be to write it out:
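A sketch of what that SSE inner loop can look like (hypothetical names, not the exact repository code; `in` and `out` point at the start of the current padded row, `W` is the padded row stride, and cells are 0/1 bytes so neighbor counts fit in a byte):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Update one padded row, 16 cells per vector. */
static void row_sse(unsigned char *out, const unsigned char *in,
                    int W, int w) {
    const __m128i two   = _mm_set1_epi8(2);
    const __m128i three = _mm_set1_epi8(3);
    const __m128i one   = _mm_set1_epi8(1);
    for (int x = 1; x <= w; x += 16) {
        /* Sum the eight neighbor bytes; counts stay <= 8, no overflow. */
        __m128i n = _mm_setzero_si128();
        for (int dy = -W; dy <= W; dy += W)
            for (int dx = -1; dx <= 1; dx++)
                if (dy || dx)
                    n = _mm_add_epi8(n, _mm_loadu_si128(
                            (const __m128i *)(in + dy + dx + x)));
        __m128i self = _mm_loadu_si128((const __m128i *)(in + x));
        /* alive = (n == 3) | ((n == 2) & self); cmpeq yields 0xFF masks,
           so AND with one at the end to get clean 0/1 cells. */
        __m128i r = _mm_or_si128(_mm_cmpeq_epi8(n, three),
                                 _mm_and_si128(_mm_cmpeq_epi8(n, two), self));
        _mm_storeu_si128((__m128i *)(out + x), _mm_and_si128(r, one));
    }
}
```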
This code achieves a speedup of 58.06x over the reference on the 8192x8192 grid for 256 iterations. Not bad.
AVX registers are like double-wide SSE registers. They hold 256 bits (meaning 32 uint8_t), so we require that the width of the input is a multiple of 32. This code achieves a speedup of 72.42x (66.87% of the single CPU peak) over the reference on the 8192x8192 grid for 256 iterations, but because it is so similar to the SSE version we do not include it.
Since we perform 12 operations per byte and an AVX instruction can operate on 32 bytes simultaneously, we might benefit from some memory optimizations. Let’s put a streaming store in the inner loop. A streaming store writes to memory without first reading the value into cache (leaving more room for useful values in the cache). Since we know we do not need to read the value in the new array, this is the perfect operation for us. The last line of our inner loop moves from this:
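In intrinsic form the change is one line. Sketched here with the 128-bit SSE2 equivalents for brevity (the AVX version swaps `_mm256_storeu_si256` for `_mm256_stream_si256` in exactly the same way; names are hypothetical):

```c
#include <emmintrin.h>  /* SSE2; the AVX equivalents live in immintrin.h */

/* Before: a regular store, which pulls the destination line into cache. */
static void store_chunk(unsigned char *out, __m128i cells) {
    _mm_storeu_si128((__m128i *)out, cells);
}

/* After: a streaming (non-temporal) store, which writes around the cache.
   The destination must be 16-byte aligned. */
static void stream_chunk(unsigned char *out, __m128i cells) {
    _mm_stream_si128((__m128i *)out, cells);
}
```

One caveat: streaming stores are weakly ordered, so an `_mm_sfence()` is needed before another thread (or the next phase of the computation) reads the data back.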
And our code now achieves a speedup of 86.44x over the reference on the 8192x8192 grid for 256 iterations. We are now running at 79.82% of the single CPU peak, and I don’t think we’re going to get very many additional speedups. It’s time to go parallel!
We have a pretty decent single core utilization, so why don’t we move to multiple cores? OpenMP is an API that makes it easy to distribute loop iterations among threads. Here’s what it looks like:
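A sketch of the parallelized step (hypothetical names; a single pragma distributes rows among threads, which is safe because each output row is computed independently, and the pragma is harmlessly ignored if OpenMP is disabled):

```c
/* Distribute rows of the padded update among OpenMP threads. */
static void step_omp(unsigned char *out, const unsigned char *in,
                     int h, int w) {
    int W = w + 2;  /* padded row stride */
    #pragma omp parallel for schedule(static)
    for (int y = 1; y <= h; y++)
        for (int x = 1; x <= w; x++) {
            int n = in[(y-1)*W + x-1] + in[(y-1)*W + x] + in[(y-1)*W + x+1]
                  + in[ y   *W + x-1]                   + in[ y   *W + x+1]
                  + in[(y+1)*W + x-1] + in[(y+1)*W + x] + in[(y+1)*W + x+1];
            out[y*W + x] = (unsigned char)((n == 3) | ((n == 2) & in[y*W + x]));
        }
}
```

A static schedule suits this loop: every row costs the same, so equal-sized contiguous chunks of rows per thread also give each thread a contiguous slab of memory.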
Now we are getting the speedups we deserve! This gets a speedup of 375.6x over the reference version on the 8192x8192 grid for 256 iterations, and we are only using 10 of the 40 hardware threads on this machine! Keep in mind that we are currently only running at 17.34% of our theoretical peak processing rate!
Our OpenMP code does not scale beyond a single NUMA domain (where communication is cheap). From the following graph, we see that our code gets no real performance gain beyond 10 threads.
There are a few reasons that this might happen. My guess is that either our processor cannot load memory fast enough to satisfy hungry CPUs, or communication is too costly. We should perform a more thorough analysis of why our implementation isn’t scaling, but I like writing code more than I like running it, so let’s make a brash decision!
If our code were communication-bound, we might benefit from explicitly managing our communication patterns. We can do this by using MPI to run multiple processes that do not share an address space, and by binding processes to physical CPUs. This way, each NUMA domain has a separate MPI task. Our MPI tasks can run OpenMP on their own NUMA domains, and communicate with each other explicitly.
Our MPI program makes several simplifications, assuming that the number of processors is a perfect square, that the height is divisible by the square root of the number of processors, and that the width is divisible by the word size times the square root of the number of processes.
Watch out! This MPI code is more than 10 times longer than our reference code. TL;DR: each process is part of a grid, and instead of copying the ghost cells like we did in previous versions, we will send them to our neighbors.
This implementation runs our 8192x8192 problem for 256 iterations in 0.58 seconds, a speedup of 462.30x over the reference code. We did gain some performance by explicitly managing our memory, but it appears that we have not achieved full utilization of our single processor.
You may be wondering why mpi.c is faster than omp.c. After all, omp.c performs as well at 10 threads as it does at 40 threads, and mpi.c has a lot of extra copying into buffers, etc. I suspect that because the MPI code explicitly manages communication, passing only the ghost zones to its neighbors once every couple of iterations, mpi.c spends less time in communication. The OpenMP version of the code leaves the communication to the cache, which doesn’t know or care about our delicate ghost zones. By keeping two MPI processes on each NUMA node with 10 OpenMP threads each, we can explicitly manage communication between NUMA nodes.
Perhaps an interested reader can help explain to me why the application does not scale beyond 10 or so CPUs. My hypothesis is that we are memory-bound at some level of the cache, meaning that most of the processors are waiting to load life cells from memory instead of staying busy computing life cells.
A benefit of rewriting using MPI is that we can now run our program on an arbitrary number of nodes networked together. Who cares about single processor performance when you can have 400 processors? Future work!
Here is a table showing the speedups and (where reported above) percentages of peak for each code:

Version                  Speedup (over reference.c)   % of peak
simplified               1.288x                       --
padded                   5.201x                       --
SSE                      58.06x                       --
AVX                      72.42x                       66.87% (single core)
AVX + streaming stores   86.44x                       79.82% (single core)
OpenMP                   375.6x                       17.34%
MPI + OpenMP             462.30x                      21.4%
Hopefully you had as much fun reading this code as I did writing it. This writeup is heavy on code and light on analysis, but I hope you enjoyed following me on my journey through life.
Copyright (c) 2016, Los Alamos National Security, LLC
All rights reserved.
Copyright 2016. Los Alamos National Security, LLC. This software was produced under U.S. Government contract DE-AC52-06NA25396 for Los Alamos National Laboratory (LANL), which is operated by Los Alamos National Security, LLC for the U.S. Department of Energy. The U.S. Government has rights to use, reproduce, and distribute this software. NEITHER THE GOVERNMENT NOR LOS ALAMOS NATIONAL SECURITY, LLC MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LIABILITY FOR THE USE OF THIS SOFTWARE. If software is modified to produce derivative works, such modified software should be clearly marked, so as not to confuse it with the version available from LANL.
Additionally, redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of Los Alamos National Security, LLC, Los Alamos National Laboratory, LANL, the U.S. Government, nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY LOS ALAMOS NATIONAL SECURITY, LLC AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL LOS ALAMOS NATIONAL SECURITY, LLC OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.