<h1>Life in the Fast Lane</h1>
<p>Willow Ahrens, 2016-07-06 (<a href="https://willowahrens.io/life-in-the-fast-lane">https://willowahrens.io/life-in-the-fast-lane</a>)</p>
<p>Get the code <a href="https://github.com/peterahrens/LifeInTheFastLane">here</a>!</p>
<p>Conway’s Game of Life has been an inspiration to computer scientists since its creation in 1970. Life is a simulation of cells arranged in a two-dimensional grid. Each cell can be in one of two states, alive or dead. I will leave the <a href="https://en.wikipedia.org/wiki/Conway's_Game_of_Life">full explanation</a> of Life to Wikipedia, and only restate here the rules regarding the interactions between cells:</p>
<ul>
<li>Any live cell with fewer than two live neighbors dies, as if caused by under-population.</li>
<li>Any live cell with two or three live neighbors lives on to the next generation.</li>
<li>Any live cell with more than three live neighbors dies, as if by over-population.</li>
<li>Any dead cell with exactly three live neighbors becomes a live cell, as if by reproduction.</li>
</ul>
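<p>The four rules collapse into one boolean expression: a cell is alive in the next generation if it has exactly three live neighbors, or if it is already alive and has exactly two. Here is a minimal sketch of that reduction; the helper name <code>next_state</code> is made up for illustration.</p>

```c
#include <stdbool.h>

//Hypothetical helper: the next state of one cell, given its current state
//and the number of live cells among its eight neighbors.
static bool next_state(bool alive, unsigned live_neighbors) {
  //Birth and survival both happen at 3; survival also happens at 2.
  return live_neighbors == 3 || (alive && live_neighbors == 2);
}
```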
<p>Note that in our implementation of Life, like many others, the environment wraps around to the other side at the edges, like in Pac-Man.</p>
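<p>One common way to get this wrap-around is to add the grid size before taking the remainder, so that the index never goes negative. This is a small sketch with a hypothetical helper named <code>wrap</code>, and it assumes the coordinate is at most one grid length out of range.</p>

```c
//Hypothetical helper: map a coordinate in [-size, 2*size) onto [0, size).
//Adding "size" first keeps the left operand of % non-negative.
static unsigned wrap(int i, unsigned size) {
  return (unsigned)((i + (int)size) % (int)size);
}
```

<p>With a width of 8, stepping left from column 0 gives <code>wrap(-1, 8)</code>, which is column 7, and stepping right from column 7 gives <code>wrap(8, 8)</code>, which is column 0.</p>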
<p>This text chronicles my journey in improving the performance of a Life kernel. To make this guide representative of many problems in performance optimization for scientific kernels, I have disallowed myself from pursuing algorithmic optimizations (that said, if you haven’t seen <a href="https://en.wikipedia.org/wiki/Hashlife">Hashlife</a>, it is worth checking out). Many of the optimization techniques we see here should be applicable to a wide variety of codes; they focus on optimizing the naive algorithm for a given architecture.</p>
<p>These techniques can make the code go faster, but they increase code complexity by several orders of magnitude and tend to need different tunings for different machines. If someone else has written an optimized version of code that does what you want to do, I would strongly recommend using that code before trying anything you see here. The general advice is to use optimized libraries whenever possible.</p>
<h2 id="referencec">reference.c</h2>
<p>Like many logical simulations, Life is fully deterministic. This means that we can determine if our simulation is correct by comparing our output to a reference implementation. The reference implementation we use will also provide a starting point for optimization. The reference implementation I use has been adapted from the <a href="http://rosettacode.org/wiki/Conway's_Game_of_Life#C">RosettaCode</a> C implementation. Rather than expound on the code for ages, I will let you read it yourself. Explanatory comments are included.</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#include <stdlib.h>
</span>
<span class="kt">unsigned</span> <span class="o">*</span><span class="nf">reference_life</span> <span class="p">(</span><span class="k">const</span> <span class="kt">unsigned</span> <span class="n">height</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">width</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="o">*</span><span class="n">initial</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">iters</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//"universe" is the current game of life grid. We will store "alive" as a 1</span>
<span class="c1">//and "dead" as a 0.</span>
<span class="kt">unsigned</span> <span class="o">*</span><span class="n">universe</span> <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span><span class="o">*</span><span class="p">)</span><span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">unsigned</span><span class="p">)</span> <span class="o">*</span> <span class="n">height</span> <span class="o">*</span> <span class="n">width</span><span class="p">);</span>
<span class="c1">//"new" is a scratch array to store the next iteration as it is calculated.</span>
<span class="kt">unsigned</span> <span class="o">*</span><span class="n">new</span> <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span><span class="o">*</span><span class="p">)</span><span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">unsigned</span><span class="p">)</span> <span class="o">*</span> <span class="n">height</span> <span class="o">*</span> <span class="n">width</span><span class="p">);</span>
<span class="c1">//We must load the initial configuration into the universe memory.</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">width</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">universe</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">width</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">initial</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">width</span> <span class="o">+</span> <span class="n">x</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">//The main loop: a likely target for later optimization.</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">iters</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">width</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//Here we loop over the neighbors and count how many are alive.</span>
<span class="kt">unsigned</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">yy</span> <span class="o">=</span> <span class="n">y</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="n">yy</span> <span class="o"><=</span> <span class="n">y</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="n">yy</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">xx</span> <span class="o">=</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="n">xx</span> <span class="o"><=</span> <span class="n">x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="n">xx</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//This is a redundant way to perform this operation. Since "alive"</span>
<span class="c1">//is represented as 1 and "dead" is represented as 0, we can just</span>
<span class="c1">//add universe[...] to n without the conditional branch.</span>
<span class="k">if</span> <span class="p">(</span><span class="n">universe</span><span class="p">[((</span><span class="n">yy</span> <span class="o">+</span> <span class="n">height</span><span class="p">)</span> <span class="o">%</span> <span class="n">height</span><span class="p">)</span> <span class="o">*</span> <span class="n">width</span>
<span class="o">+</span> <span class="p">((</span><span class="n">xx</span> <span class="o">+</span> <span class="n">width</span><span class="p">)</span> <span class="o">%</span> <span class="n">width</span><span class="p">)])</span> <span class="p">{</span>
<span class="n">n</span><span class="o">++</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">//This statement is to avoid counting a cell as a neighbor of itself.</span>
<span class="k">if</span> <span class="p">(</span><span class="n">universe</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">width</span> <span class="o">+</span> <span class="n">x</span><span class="p">])</span> <span class="p">{</span>
<span class="n">n</span><span class="o">--</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">//This fairly tight logic determines the status of the cell in the next</span>
<span class="c1">//iteration. We have to store this in a new array to avoid modifying</span>
<span class="c1">//the original array as we calculate the new one.</span>
<span class="n">new</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">width</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">n</span> <span class="o">==</span> <span class="mi">3</span> <span class="o">||</span> <span class="p">(</span><span class="n">n</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">&&</span> <span class="n">universe</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">width</span> <span class="o">+</span> <span class="n">x</span><span class="p">]));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">//These loops copy the new state array into the current state array,</span>
<span class="c1">//completing an iteration.</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">width</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">universe</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">width</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">new</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">width</span> <span class="o">+</span> <span class="n">x</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">free</span><span class="p">(</span><span class="n">new</span><span class="p">);</span>
<span class="k">return</span> <span class="n">universe</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>This reference implementation is easy to read and understand, but it is pretty slow. It has lots of conditional and arithmetic logic in the inner loop and it copies the entire universe at every step. The reference code is our starting point, and we will use it to check the correctness of our optimized versions.</p>
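<p>Because Life is deterministic, the correctness check can be a cell-by-cell comparison of two final grids. The sketch below shows one way to write it. The name <code>universes_match</code> is hypothetical, and the sketch assumes both grids use the same row-major <code>unsigned</code> layout that <code>reference_life</code> returns.</p>

```c
#include <stddef.h>

//Hypothetical checker: returns 1 if the two height*width grids agree in
//every cell, and 0 at the first mismatch.
static int universes_match(const unsigned *a, const unsigned *b,
                           unsigned height, unsigned width) {
  const size_t cells = (size_t)height * width;
  for (size_t i = 0; i < cells; i++) {
    if (a[i] != b[i]) {
      return 0;
    }
  }
  return 1;
}
```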
<h2 id="benchc">bench.c</h2>
<p>Before we start optimizing, let’s write our benchmarking and test code. Having an accurate benchmark that tests a common case for our code gives us the information we’ll need to make optimization decisions. Our benchmark code includes test code as well. Rather than paste the whole file, I only include the highlights here.</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="c1">//Return the time of day as a double-precision floating point value.</span>
<span class="kt">double</span> <span class="nf">wall_time</span> <span class="p">(</span><span class="kt">void</span><span class="p">)</span> <span class="p">{</span>
<span class="k">struct</span> <span class="n">timeval</span> <span class="n">t</span><span class="p">;</span>
<span class="c1">//It is important to use a timer with good resolution. Many common functions</span>
<span class="c1">//that return the time are not precise enough for timing code. Since timers</span>
<span class="c1">//are typically system-specific, research timers for your system. I have</span>
<span class="c1">//found that omp_get_wtime() is usually quite good and is available</span>
<span class="c1">//everywhere there is OpenMP.</span>
<span class="n">gettimeofday</span><span class="p">(</span><span class="o">&</span><span class="n">t</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">1</span><span class="p">.</span><span class="mi">0</span><span class="o">*</span><span class="n">t</span><span class="p">.</span><span class="n">tv_sec</span> <span class="o">+</span> <span class="mf">1.0e-6</span><span class="o">*</span><span class="n">t</span><span class="p">.</span><span class="n">tv_usec</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
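<p><code>gettimeofday</code> reports calendar time, which can jump if the system clock is adjusted during a run. On POSIX systems, <code>clock_gettime</code> with <code>CLOCK_MONOTONIC</code> is one alternative. The sketch below is an assumption about what such a timer could look like, not the timer used in bench.c, and <code>monotonic_time</code> is a hypothetical name.</p>

```c
//Ask for the POSIX.1-2008 declarations if no feature level is set yet.
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <time.h>

//Hypothetical alternative to wall_time(): seconds since an arbitrary fixed
//point, read from a clock that never runs backwards.
static double monotonic_time(void) {
  struct timespec t;
  clock_gettime(CLOCK_MONOTONIC, &t);
  return (double)t.tv_sec + 1.0e-9 * (double)t.tv_nsec;
}
```

<p>Only differences between two readings are meaningful, which is all a benchmark needs.</p>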
<p>Note that in our benchmark, <code class="language-plaintext highlighter-rouge">TIMEOUT</code> is set to 0.1 seconds:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="kt">double</span> <span class="n">test_time</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">unsigned</span> <span class="o">*</span><span class="n">test</span><span class="p">;</span>
<span class="c1">//We must run the benchmarking application for a sufficient length of time to</span>
<span class="c1">//avoid variations in processing speed. We do this by running an increasing</span>
<span class="c1">//number of trials until it takes at least TIMEOUT seconds.</span>
<span class="k">for</span> <span class="p">(</span><span class="n">trials</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">test_time</span> <span class="o"><</span> <span class="n">TIMEOUT</span><span class="p">;</span> <span class="n">trials</span> <span class="o">*=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//Unless we want to measure the cache warm-up time, it is usually a good</span>
<span class="c1">//idea to run the problem for one iteration first to load the problem</span>
<span class="c1">//into cache.</span>
<span class="n">test</span> <span class="o">=</span> <span class="n">life</span><span class="p">(</span><span class="n">height</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">initial</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">free</span><span class="p">(</span><span class="n">test</span><span class="p">);</span>
<span class="c1">//Benchmark "trials" runs of life.</span>
<span class="n">test_time</span> <span class="o">=</span> <span class="o">-</span><span class="n">wall_time</span><span class="p">();</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">trials</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">){</span>
<span class="n">test</span> <span class="o">=</span> <span class="n">life</span><span class="p">(</span><span class="n">height</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">initial</span><span class="p">,</span> <span class="n">iters</span><span class="p">);</span>
<span class="n">free</span><span class="p">(</span><span class="n">test</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">test</span> <span class="o">=</span> <span class="n">life</span><span class="p">(</span><span class="n">height</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">initial</span><span class="p">,</span> <span class="n">iters</span><span class="p">);</span>
<span class="n">test_time</span> <span class="o">+=</span> <span class="n">wall_time</span><span class="p">();</span>
<span class="p">}</span>
<span class="n">trials</span> <span class="o">/=</span> <span class="mi">2</span><span class="p">;</span>
<span class="n">test_time</span> <span class="o">/=</span> <span class="n">trials</span><span class="p">;</span></code></pre></figure>
<h2 id="simplec">simple.c</h2>
<p>Before we complicate the reference implementation with our optimizations, let’s
simplify it a little bit. Here is the new main loop:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">iters</span><span class="p">;</span> <span class="n">i</span> <span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">width</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">unsigned</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">yy</span> <span class="o">=</span> <span class="n">y</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="n">yy</span> <span class="o"><=</span> <span class="n">y</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="n">yy</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">xx</span> <span class="o">=</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="n">xx</span> <span class="o"><=</span> <span class="n">x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="n">xx</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//Directly add "universe" values to "n"</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">universe</span><span class="p">[((</span><span class="n">yy</span> <span class="o">+</span> <span class="n">height</span><span class="p">)</span> <span class="o">%</span> <span class="n">height</span><span class="p">)</span> <span class="o">*</span> <span class="n">width</span>
<span class="o">+</span> <span class="p">((</span><span class="n">xx</span> <span class="o">+</span> <span class="n">width</span><span class="p">)</span> <span class="o">%</span> <span class="n">width</span><span class="p">)];</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">n</span> <span class="o">-=</span> <span class="n">universe</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">width</span> <span class="o">+</span> <span class="n">x</span><span class="p">];</span>
<span class="n">new</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">width</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">n</span> <span class="o">==</span> <span class="mi">3</span> <span class="o">||</span> <span class="p">(</span><span class="n">n</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">&&</span> <span class="n">universe</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">width</span> <span class="o">+</span> <span class="n">x</span><span class="p">]));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">//Instead of copying "new" into universe every time, just swap the pointers</span>
<span class="kt">unsigned</span> <span class="o">*</span><span class="n">tmp</span> <span class="o">=</span> <span class="n">universe</span><span class="p">;</span>
<span class="n">universe</span> <span class="o">=</span> <span class="n">new</span><span class="p">;</span>
<span class="n">new</span> <span class="o">=</span> <span class="n">tmp</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>On an 8192x8192 grid for 256 iterations, these optimizations provide a 1.288x speedup over the reference implementation. Not much, but we are just getting started!</p>
<h2 id="environment">Environment</h2>
<p>Typically, programs perform differently on different platforms and with different compilers and flags. For the record, I am using gcc version 4.9.3 with the compiler flags “-O3 -march=native”. Meet my processor! Here’s the output of the command <code class="language-plaintext highlighter-rouge">lscpu</code> on my machine.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
Stepping: 2
CPU MHz: 2600.000
BogoMIPS: 5209.96
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
</code></pre></div></div>
<p>This machine has 20 physical cores (40 hardware threads) across two <a href="https://en.wikipedia.org/wiki/Non-uniform_memory_access">NUMA</a> domains, so we’ll need to be mindful of the way that we access memory. We’ll also need to write parallel code to take advantage of our multiple cores.</p>
<p>One last thing before we jump down this rabbit hole. We need to know the peak performance so that we know when to stop! Notice that there are 12 necessary integer operations in our inner loop (we count the comparisons to 2 and 3, and we don’t count the redundant add and subtract of the cell value itself). Assume that we can do one instruction per clock cycle. The processor runs at 2.6*10<sup>9</sup> clock cycles per second, and this machine supports AVX2, so we can operate on 32 8-bit sized integers at once, and there are 20 cores. Therefore, on average we can advance one cell one iteration in 7.2*10<sup>-12</sup> seconds. Therefore, the theoretical peak time to compute our 8192x8192 test problem over 256 iterations is 1.24*10<sup>-1</sup> seconds. Don’t forget it or you’ll never stop optimizing!</p>
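<p>The arithmetic above can be checked with a few lines of C. This sketch only restates the figures from the paragraph (12 operations per cell, 2.6*10<sup>9</sup> cycles per second, 32 lanes, 20 cores); the function names are hypothetical.</p>

```c
//Seconds to advance one cell by one iteration at the assumed peak rate of
//one instruction per cycle on every lane of every core.
static double seconds_per_cell(void) {
  const double ops_per_cell = 12.0; //integer operations per cell update
  const double clock_hz = 2.6e9;    //clock cycles per second
  const double lanes = 32.0;        //8-bit integers per 256-bit AVX2 register
  const double cores = 20.0;        //physical cores
  return ops_per_cell / (clock_hz * lanes * cores);
}

//Peak time for a whole run: one update per cell per iteration.
static double peak_seconds(double height, double width, double iters) {
  return height * width * iters * seconds_per_cell();
}
```

<p>Calling <code>peak_seconds(8192, 8192, 256)</code> gives roughly 0.124 seconds, matching the estimate above.</p>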
<h2 id="paddingc">padding.c</h2>
<p>Since our program is so simple, we can be pretty sure where it spends most of its time. Our inner loop includes complicated modular arithmetic on indices and a doubly-nested for loop. Let’s fix this! We can use a technique called “padding”. In short, instead of looking to the other side of the universe to wrap around in the inner loop, we will allocate an array with extra cells on all sides (“ghost cells”) and fill these cells with values from the other side of the array. That way, when the inner loop accesses beyond the edges of the universe, it looks like the universe is wrapping around (and we don’t need to check to see if we are falling off the edge).</p>
<p>Each time we perform an iteration, the outer layer of valid ghost cells becomes invalid (we did not calculate anything on the outermost layer of the array, and this error propagates inward by one cell each iteration). To avoid copying with every iteration, we can pad with multiple ghost cell layers at once, and then run several iterations before each copy.</p>
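<p>To make the idea concrete, here is a minimal sketch of filling a single layer of ghost cells and then counting neighbors without any modular arithmetic. It is a simplified illustration rather than the padding.c code: it assumes exactly one ghost cell on every side, with no word alignment and no multi-layer padding, and the names <code>fill_ghost_cells</code> and <code>count_neighbors</code> are hypothetical.</p>

```c
#include <stdint.h>

//"grid" has (height + 2) rows of (width + 2) cells. The interior occupies
//rows 1..height and columns 1..width; everything else is a ghost cell.
static void fill_ghost_cells(uint8_t *grid, unsigned height, unsigned width) {
  const unsigned pw = width + 2;
  //Left and right ghost columns copy the opposite interior columns.
  for (unsigned y = 1; y <= height; y++) {
    grid[y * pw] = grid[y * pw + width];
    grid[y * pw + width + 1] = grid[y * pw + 1];
  }
  //Top and bottom ghost rows copy whole padded rows. Because the ghost
  //columns are already filled, this also sets the four corners.
  for (unsigned x = 0; x < pw; x++) {
    grid[x] = grid[height * pw + x];
    grid[(height + 1) * pw + x] = grid[pw + x];
  }
}

//Count the live neighbors of interior cell (y, x), using padded coordinates.
//Reads past the interior land on ghost cells, so no wrapping is needed.
static unsigned count_neighbors(const uint8_t *grid, unsigned width,
                                unsigned y, unsigned x) {
  const unsigned pw = width + 2;
  unsigned n = 0;
  for (unsigned yy = y - 1; yy <= y + 1; yy++) {
    for (unsigned xx = x - 1; xx <= x + 1; xx++) {
      n += grid[yy * pw + xx];
    }
  }
  return n - grid[y * pw + x];
}
```

<p>With one ghost layer, <code>fill_ghost_cells</code> has to run before every iteration. Padding with more layers trades extra redundant computation for fewer copies.</p>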
<p>Note that the code below assumes that <code class="language-plaintext highlighter-rouge">width</code> is a multiple of <code class="language-plaintext highlighter-rouge">sizeof(unsigned)</code>.</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#include <stdlib.h>
#include <stdint.h>
</span>
<span class="cp">#define WORD sizeof(unsigned)
</span><span class="c1">//OUT_GHOST is the width of the valid ghost cells after copying IN_GHOST ghost</span>
<span class="c1">//cell values to the border and then executing one iteration. The kernel will</span>
<span class="c1">//copy IN_GHOST ghost cells, then run IN_GHOST iterations before copying ghost</span>
<span class="c1">//cells again. OUT_GHOST can be any value greater than or equal to 0.</span>
<span class="cp">#define OUT_GHOST 0
#define IN_GHOST (OUT_GHOST + 1)
#define X_IN_GHOST ((OUT_GHOST/WORD + 1) * WORD)
#define Y_IN_GHOST IN_GHOST
#define X_IN_GHOST_WORDS (X_IN_GHOST/WORD)
</span>
<span class="c1">//There are platform specific aligned malloc implementations, but it is</span>
<span class="c1">//instructive to see one written out explicitly. Allocates memory, then rounds</span>
<span class="c1">//it to a multiple of WORD. Stores a pointer to the original memory to free it.</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">aligned_malloc</span><span class="p">(</span><span class="kt">int</span> <span class="n">size</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">char</span> <span class="o">*</span><span class="n">mem</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span> <span class="o">+</span> <span class="n">size</span> <span class="o">+</span> <span class="n">WORD</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="kt">void</span> <span class="o">**</span><span class="n">ptr</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span><span class="o">**</span><span class="p">)(((</span><span class="kt">uintptr_t</span><span class="p">)(</span><span class="n">mem</span> <span class="o">+</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span> <span class="o">+</span> <span class="n">WORD</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="o">&</span> <span class="o">~</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)(</span><span class="n">WORD</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)));</span>
<span class="n">ptr</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">mem</span><span class="p">;</span>
<span class="k">return</span> <span class="n">ptr</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">aligned_free</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">)</span> <span class="p">{</span>
<span class="n">free</span><span class="p">(((</span><span class="kt">void</span><span class="o">**</span><span class="p">)</span><span class="n">ptr</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]);</span>
<span class="p">}</span>
<span class="kt">unsigned</span> <span class="o">*</span><span class="nf">life</span> <span class="p">(</span><span class="k">const</span> <span class="kt">unsigned</span> <span class="n">height</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">width</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="o">*</span> <span class="k">const</span> <span class="n">initial</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">iters</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//Padding makes things ridiculously complicated. These constant values</span>
<span class="c1">//make life a little easier.</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">padded_height</span> <span class="o">=</span> <span class="n">height</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">padded_width</span> <span class="o">=</span> <span class="n">width</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">X_IN_GHOST</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">width_words</span> <span class="o">=</span> <span class="n">width</span><span class="o">/</span><span class="n">WORD</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">padded_width_words</span> <span class="o">=</span> <span class="n">padded_width</span><span class="o">/</span><span class="n">WORD</span><span class="p">;</span>
<span class="c1">//Oh! The careful reader will notice that I am allocating an array of</span>
<span class="c1">//byte-size ints! In addition to preparing us for vectorization later, this</span>
<span class="c1">//also reduces memory traffic.</span>
<span class="c1">//Also, this memory is aligned. Aligned memory access is typically faster</span>
<span class="c1">//than unaligned. To keep the memory aligned on each row, we have to pad</span>
<span class="c1">//to a multiple of the word size. We also assume the input matrix has a width</span>
<span class="c1">//that is a multiple of the word size.</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">universe</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uint8_t</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">padded_height</span> <span class="o">*</span> <span class="n">padded_width</span><span class="p">);</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">new</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uint8_t</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">padded_height</span> <span class="o">*</span> <span class="n">padded_width</span><span class="p">);</span>
<span class="c1">//Pack unsigned into the padded working array of uint8_t.</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">height</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="n">X_IN_GHOST</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">width</span> <span class="o">+</span> <span class="n">X_IN_GHOST</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">universe</span><span class="p">[(</span><span class="n">y</span> <span class="o">*</span> <span class="n">padded_width</span><span class="p">)</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">initial</span><span class="p">[(</span><span class="n">y</span> <span class="o">-</span> <span class="n">Y_IN_GHOST</span><span class="p">)</span> <span class="o">*</span> <span class="n">width</span> <span class="o">+</span> <span class="n">x</span> <span class="o">-</span> <span class="n">X_IN_GHOST</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">iters</span><span class="p">;</span> <span class="n">i</span> <span class="o">+=</span> <span class="n">IN_GHOST</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//Copy the ghost cells once every IN_GHOST iterations. I have not only</span>
<span class="c1">//simplified much of the logic (no more mod operations!), I have also</span>
<span class="c1">//reduced the number of instructions necessary to copy by casting the</span>
<span class="c1">//uint8_t array to unsigned and working with these larger values of a size</span>
<span class="c1">//the system is used to working with.</span>
<span class="kt">unsigned</span> <span class="o">*</span><span class="n">universe_words</span> <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span><span class="o">*</span><span class="p">)</span><span class="n">universe</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">padded_height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//Top left</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">universe_words</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">universe_words</span><span class="p">[(</span><span class="n">y</span> <span class="o">+</span> <span class="n">height</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">width_words</span><span class="p">];</span>
<span class="p">}</span>
<span class="c1">//Top middle</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">width_words</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">universe_words</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">universe_words</span><span class="p">[(</span><span class="n">y</span> <span class="o">+</span> <span class="n">height</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">];</span>
<span class="p">}</span>
<span class="c1">//Top right</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="n">width_words</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">padded_width_words</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">universe_words</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">universe_words</span><span class="p">[(</span><span class="n">y</span> <span class="o">+</span> <span class="n">height</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">-</span> <span class="n">width_words</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">y</span> <span class="o"><</span> <span class="n">height</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//Middle left</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">universe_words</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">universe_words</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">width_words</span><span class="p">];</span>
<span class="p">}</span>
<span class="c1">//Middle right</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="n">width_words</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">padded_width_words</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">universe_words</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">universe_words</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">-</span> <span class="n">width_words</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">//Bottom left</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">universe_words</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">universe_words</span><span class="p">[(</span><span class="n">y</span> <span class="o">-</span> <span class="n">height</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">width_words</span><span class="p">];</span>
<span class="p">}</span>
<span class="c1">//Bottom middle</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">width_words</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">universe_words</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">universe_words</span><span class="p">[(</span><span class="n">y</span> <span class="o">-</span> <span class="n">height</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">];</span>
<span class="p">}</span>
<span class="c1">//Bottom right</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="n">width_words</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">padded_width_words</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">universe_words</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">universe_words</span><span class="p">[(</span><span class="n">y</span> <span class="o">-</span> <span class="n">height</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">-</span> <span class="n">width_words</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">//The valid ghost zone shrinks by one with each iteration.</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">IN_GHOST</span> <span class="o">&&</span> <span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o"><</span> <span class="n">iters</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="p">(</span><span class="n">Y_IN_GHOST</span> <span class="o">-</span> <span class="n">OUT_GHOST</span><span class="p">);</span> <span class="n">y</span> <span class="o"><</span> <span class="n">height</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span> <span class="o">+</span> <span class="n">OUT_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">-</span> <span class="n">OUT_GHOST</span><span class="p">);</span> <span class="n">x</span> <span class="o"><</span> <span class="n">width</span> <span class="o">+</span> <span class="n">X_IN_GHOST</span> <span class="o">+</span> <span class="n">OUT_GHOST</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//The inner loop gets much simpler when you pad the array, doesn't it?</span>
<span class="c1">//This is the main reason people pad their arrays before computation.</span>
<span class="kt">unsigned</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">u</span> <span class="o">=</span> <span class="n">universe</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width</span> <span class="o">+</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="c1">//Note that constant offsets into memory are faster.</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="n">u</span> <span class="o">+=</span> <span class="n">padded_width</span><span class="p">;</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="kt">unsigned</span> <span class="n">alive</span> <span class="o">=</span> <span class="n">u</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="n">u</span> <span class="o">+=</span> <span class="n">padded_width</span><span class="p">;</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="n">new</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">padded_width</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">n</span> <span class="o">==</span> <span class="mi">3</span> <span class="o">||</span> <span class="p">(</span><span class="n">n</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">&&</span> <span class="n">alive</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">tmp</span> <span class="o">=</span> <span class="n">universe</span><span class="p">;</span>
<span class="n">universe</span> <span class="o">=</span> <span class="n">new</span><span class="p">;</span>
<span class="n">new</span> <span class="o">=</span> <span class="n">tmp</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">//Unpack uint8_t into output array of unsigned.</span>
<span class="kt">unsigned</span> <span class="o">*</span><span class="n">out</span> <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span><span class="o">*</span><span class="p">)</span><span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">unsigned</span><span class="p">)</span> <span class="o">*</span> <span class="n">height</span> <span class="o">*</span> <span class="n">width</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">height</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="n">X_IN_GHOST</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">width</span> <span class="o">+</span> <span class="n">X_IN_GHOST</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">out</span><span class="p">[(</span><span class="n">y</span> <span class="o">-</span> <span class="n">Y_IN_GHOST</span><span class="p">)</span> <span class="o">*</span> <span class="n">width</span> <span class="o">+</span> <span class="n">x</span> <span class="o">-</span> <span class="n">X_IN_GHOST</span><span class="p">]</span> <span class="o">=</span> <span class="n">universe</span><span class="p">[(</span><span class="n">y</span> <span class="o">*</span> <span class="n">padded_width</span><span class="p">)</span> <span class="o">+</span> <span class="n">x</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">new</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">universe</span><span class="p">);</span>
<span class="k">return</span> <span class="n">out</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>On an 8192x8192 grid for 256 iterations, this code achieves our first significant speedup of 5.201x over the reference version!</p>
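<p>The padding idea is easier to see in isolation. Here is a minimal sketch, assuming a ghost border only one cell wide and no alignment requirements. The helpers <code class="language-plaintext highlighter-rouge">pad_torus</code> and <code class="language-plaintext highlighter-rouge">count_neighbors</code> are hypothetical names that do not appear in the repository, and for brevity the copy uses <code class="language-plaintext highlighter-rouge">%</code>, which the code above avoids:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Copy a height x width torus into a (height + 2) x (width + 2) array whose
// one-cell border (the ghost cells) mirrors the opposite edge.
// Hypothetical helper; the caller releases the result with free().
uint8_t *pad_torus(const uint8_t *cells, unsigned height, unsigned width) {
  assert(height > 0 && width > 0);
  const unsigned pw = width + 2;
  uint8_t *padded = (uint8_t*)malloc((size_t)(height + 2) * pw);
  if (padded == NULL) return NULL;
  for (unsigned y = 0; y < height + 2; y++) {
    // Map the padded row back to a source row, wrapping at the edges.
    const unsigned sy = (y + height - 1) % height;
    for (unsigned x = 0; x < pw; x++) {
      const unsigned sx = (x + width - 1) % width;
      padded[(size_t)y * pw + x] = cells[(size_t)sy * width + sx];
    }
  }
  return padded;
}

// Count the live neighbors of cell (y, x), given in unpadded coordinates.
// Thanks to the ghost cells, no wrap-around logic is needed here.
unsigned count_neighbors(const uint8_t *padded, unsigned width,
                         unsigned y, unsigned x) {
  const unsigned pw = width + 2;
  // Cell (y, x) lives at padded (y + 1, x + 1), so its top-left neighbor
  // lives at padded (y, x).
  const uint8_t *u = padded + (size_t)y * pw + x;
  return u[0] + u[1] + u[2]
       + u[pw] + u[pw + 2]
       + u[2 * pw] + u[2 * pw + 1] + u[2 * pw + 2];
}
```

<p>The wrap-around cost is paid once in the copy, so the neighbor count can read straight through the edges of the grid.</p>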
<h2 id="blockedc">blocked.c</h2>
<p>Our calculation progresses linearly across each row, accessing only the rows above and below it. Life needs to access each element nine times (eight times while counting it as a neighbor of the surrounding cells, and once to calculate the cell itself). As computation proceeds row by row, these accesses occur in groups of three (one group for each row), and if three rows of the matrix can fit in L1 cache, then each element only has to be brought into L1 cache once per iteration. However, if our computation were more data intensive, it might benefit from a technique called blocking.</p>
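<p>As a rough check of the three-row condition, the sketch below compares the bytes in three padded rows against the size of an L1 data cache. The helper names are hypothetical, and the cache size and padding used in the note after it are assumptions for illustration rather than values taken from this code. It also ignores the row of <code class="language-plaintext highlighter-rouge">new</code> that is being written at the same time:</p>

```c
#include <assert.h>
#include <stddef.h>

// Bytes read while sweeping one row: the row itself plus the rows above and
// below it, at one byte per cell.
size_t three_row_bytes(size_t padded_width) {
  return 3 * padded_width;
}

// Returns 1 if the three-row working set fits in a cache of l1_bytes,
// and 0 otherwise.
int rows_fit_in_l1(size_t padded_width, size_t l1_bytes) {
  assert(l1_bytes > 0);
  return three_row_bytes(padded_width) <= l1_bytes;
}
```

<p>Assuming 16 bytes of padding on each side, a padded row of the 8192-wide grid is 8224 bytes, so three rows take 24672 bytes. With an assumed 32 KiB L1 data cache, <code class="language-plaintext highlighter-rouge">rows_fit_in_l1(8224, 32768)</code> returns 1.</p>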
<p>Blocking is the practice of restructuring the computation so that data in registers or cache is reused before it is evicted. This keeps the relevant data for a computation in the higher (faster) levels of the memory hierarchy. Register blocking involves rewriting the inner loop of your code to reuse values you have loaded from memory instead of loading them multiple times. Cache blocking involves restructuring the ordering of loops so that the same or nearby values are accessed soon after each other. Typically, we size the cache blocks so that the data touched by one block fits in the L1 cache.</p>
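<p>To make register blocking concrete, here is one possible sketch (not code from the repository). It keeps a sliding window of three column sums in local variables, so each step along a row loads only the three cells of the new right-hand column instead of all nine. The function name and its parameters are hypothetical, and it assumes a padded array in which row <code class="language-plaintext highlighter-rouge">y</code> and the requested columns have valid neighbors on every side:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Advance columns [x_begin, x_end) of row y by one generation, reusing the
// column sums of the previous step instead of reloading them.
// `stride` is the padded row length in bytes. Hypothetical helper.
void life_row_register_blocked(const uint8_t *universe, uint8_t *next,
                               unsigned stride, unsigned y,
                               unsigned x_begin, unsigned x_end) {
  assert(y >= 1 && x_begin >= 1 && x_end < stride);
  if (x_begin >= x_end) return;
  const uint8_t *above = universe + (size_t)(y - 1) * stride;
  const uint8_t *here = universe + (size_t)y * stride;
  const uint8_t *below = universe + (size_t)(y + 1) * stride;
  // Prime the window with the sums of the two leftmost columns.
  unsigned left = above[x_begin - 1] + here[x_begin - 1] + below[x_begin - 1];
  unsigned alive = here[x_begin];
  unsigned mid = above[x_begin] + alive + below[x_begin];
  for (unsigned x = x_begin; x < x_end; x++) {
    // Only the right-hand column is loaded from memory on each step.
    const unsigned right_alive = here[x + 1];
    const unsigned right = above[x + 1] + right_alive + below[x + 1];
    // The three column sums include the cell itself, so subtract it.
    const unsigned n = left + mid + right - alive;
    next[(size_t)y * stride + x] = (n == 3 || (n == 2 && alive));
    // Slide the window one column to the right.
    left = mid;
    mid = right;
    alive = right_alive;
  }
}
```

<p>Whether this beats the straightforward nine-load loop depends on the compiler and the machine, so it is worth measuring before adopting it.</p>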
<p>Although blocking doesn’t help much in this case (our kernel spends more time computing than loading from memory), here’s an example of how to restructure the padded inner loop for cache blocking:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#define X_BLOCK (WORD * 256)
#define Y_BLOCK 256</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">IN_GHOST</span> <span class="o">&&</span> <span class="n">i</span> <span class="o">+</span> <span class="n">j</span> <span class="o"><</span> <span class="n">iters</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//Now the outer loops progress block by block.</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="p">(</span><span class="n">Y_IN_GHOST</span> <span class="o">-</span> <span class="n">OUT_GHOST</span><span class="p">);</span> <span class="n">y</span> <span class="o"><</span> <span class="n">height</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span> <span class="o">+</span> <span class="n">OUT_GHOST</span><span class="p">;</span> <span class="n">y</span> <span class="o">+=</span> <span class="n">Y_BLOCK</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">-</span> <span class="n">OUT_GHOST</span><span class="p">);</span> <span class="n">x</span> <span class="o"><</span> <span class="n">width</span> <span class="o">+</span> <span class="n">X_IN_GHOST</span> <span class="o">+</span> <span class="n">OUT_GHOST</span><span class="p">;</span> <span class="n">x</span> <span class="o">+=</span> <span class="n">X_BLOCK</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//The inner loops progress one by one.</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">yy</span> <span class="o">=</span> <span class="n">y</span><span class="p">;</span> <span class="n">yy</span> <span class="o"><</span> <span class="n">y</span> <span class="o">+</span> <span class="n">Y_BLOCK</span> <span class="o">&&</span> <span class="n">yy</span> <span class="o"><</span> <span class="n">height</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span> <span class="o">+</span> <span class="n">OUT_GHOST</span><span class="p">;</span> <span class="n">yy</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">xx</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span> <span class="n">xx</span> <span class="o"><</span> <span class="n">x</span> <span class="o">+</span> <span class="n">X_BLOCK</span> <span class="o">&&</span> <span class="n">xx</span> <span class="o"><</span> <span class="n">width</span> <span class="o">+</span> <span class="n">X_IN_GHOST</span> <span class="o">+</span> <span class="n">OUT_GHOST</span><span class="p">;</span> <span class="n">xx</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">unsigned</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">u</span> <span class="o">=</span> <span class="n">universe</span> <span class="o">+</span> <span class="p">(</span><span class="n">yy</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width</span> <span class="o">+</span> <span class="n">xx</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="n">u</span> <span class="o">+=</span> <span class="n">padded_width</span><span class="p">;</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="kt">unsigned</span> <span class="n">alive</span> <span class="o">=</span> <span class="n">u</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="n">u</span> <span class="o">+=</span> <span class="n">padded_width</span><span class="p">;</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="n">n</span> <span class="o">+=</span> <span class="n">u</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>
<span class="n">new</span><span class="p">[</span><span class="n">yy</span> <span class="o">*</span> <span class="n">padded_width</span> <span class="o">+</span> <span class="n">xx</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">n</span> <span class="o">==</span> <span class="mi">3</span> <span class="o">||</span> <span class="p">(</span><span class="n">n</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">&&</span> <span class="n">alive</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">tmp</span> <span class="o">=</span> <span class="n">universe</span><span class="p">;</span>
<span class="n">universe</span> <span class="o">=</span> <span class="n">new</span><span class="p">;</span>
<span class="n">new</span> <span class="o">=</span> <span class="n">tmp</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
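<p>Blocking only changes the order in which cells are visited, so the result should not depend on the block size. The following self-contained sketch illustrates this under simplifying assumptions: a ghost border one cell wide that is already filled in, and block sizes passed as parameters instead of macros. The name <code class="language-plaintext highlighter-rouge">life_step_blocked</code> is hypothetical:</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Compute one generation for the interior of a padded array, visiting the
// cells block by block. The padded array is (height + 2) x (width + 2).
// Hypothetical helper; y_block and x_block are tuning parameters.
void life_step_blocked(const uint8_t *universe, uint8_t *next,
                       unsigned height, unsigned width,
                       unsigned y_block, unsigned x_block) {
  assert(y_block > 0 && x_block > 0);
  const unsigned stride = width + 2;
  // The outer loops progress block by block.
  for (unsigned y = 1, y_end; y < height + 1; y = y_end) {
    // Clamp the block so a partial block at the edge stays in bounds.
    y_end = (height + 1 - y < y_block) ? height + 1 : y + y_block;
    for (unsigned x = 1, x_end; x < width + 1; x = x_end) {
      x_end = (width + 1 - x < x_block) ? width + 1 : x + x_block;
      // The inner loops progress one by one.
      for (unsigned yy = y; yy < y_end; yy++) {
        for (unsigned xx = x; xx < x_end; xx++) {
          const uint8_t *u = universe + (size_t)(yy - 1) * stride + xx - 1;
          unsigned n = u[0] + u[1] + u[2];
          u += stride;
          n += u[0] + u[2];
          const unsigned alive = u[1];
          u += stride;
          n += u[0] + u[1] + u[2];
          next[(size_t)yy * stride + xx] = (n == 3 || (n == 2 && alive));
        }
      }
    }
  }
}
```

<p>Calling it on the same input with, say, 2x3 blocks and with blocks larger than the grid should write identical output arrays, which makes a handy regression test while tuning <code class="language-plaintext highlighter-rouge">X_BLOCK</code> and <code class="language-plaintext highlighter-rouge">Y_BLOCK</code>.</p>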
<h2 id="sse2c">sse2.c</h2>
<p>Let’s cram more operations into the inner loop using vectorization. Intel’s <a href="https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions">SSE</a> vector registers are 128 bits wide, so we can pack 16 <code class="language-plaintext highlighter-rouge">uint8_t</code> values into a single vector register and operate on them all at once. To keep the code simple, we require that the width of the input is a multiple of 16. A good resource for Intel vector intrinsics is the <a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#">Intel Intrinsics Guide</a>. The best way for me to show you what the inner loop looks like at this point would be to write it out:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#define WORD (128/8)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">iters</span><span class="p">;</span> <span class="n">i</span><span class="o">+=</span> <span class="n">IN_GHOST</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//Because we assume the width is a multiple of the size of a SSE register,</span>
<span class="c1">//we can use aligned loads and stores.</span>
<span class="n">__m128i</span> <span class="o">*</span><span class="n">universe_words</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m128i</span><span class="o">*</span><span class="p">)</span><span class="n">universe</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">padded_height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm_store_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm_load_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">height</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">width_words</span><span class="p">));</span>
<span class="p">}</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">width_words</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm_store_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm_load_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">height</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">));</span>
<span class="p">}</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="n">width_words</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">padded_width_words</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm_store_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm_load_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">height</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">-</span> <span class="n">width_words</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">y</span> <span class="o"><</span> <span class="n">height</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm_store_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm_load_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">width_words</span><span class="p">));</span>
<span class="p">}</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="n">width_words</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">padded_width_words</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm_store_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm_load_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">-</span> <span class="n">width_words</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm_store_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm_load_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">height</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">width_words</span><span class="p">));</span>
<span class="p">}</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">width_words</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm_store_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm_load_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">height</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">));</span>
<span class="p">}</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="n">width_words</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">padded_width_words</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm_store_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm_load_si128</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="n">height</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">-</span> <span class="n">width_words</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">IN_GHOST</span> <span class="o">&&</span> <span class="n">j</span> <span class="o">+</span> <span class="n">i</span> <span class="o"><</span> <span class="n">iters</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//Set up a vector of ones</span>
<span class="k">const</span> <span class="n">__m128i</span> <span class="n">ones</span> <span class="o">=</span> <span class="n">_mm_set_epi8</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span>
<span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="c1">//Set up a vector of twos</span>
<span class="k">const</span> <span class="n">__m128i</span> <span class="n">twos</span> <span class="o">=</span> <span class="n">_mm_slli_epi32</span><span class="p">(</span><span class="n">ones</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="c1">//Set up a vector of threes</span>
<span class="k">const</span> <span class="n">__m128i</span> <span class="n">threes</span> <span class="o">=</span> <span class="n">_mm_or_si128</span><span class="p">(</span><span class="n">ones</span><span class="p">,</span> <span class="n">twos</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="p">(</span><span class="n">Y_IN_GHOST</span> <span class="o">-</span> <span class="n">Y_OUT_GHOST</span><span class="p">);</span> <span class="n">y</span> <span class="o"><</span> <span class="n">height</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span> <span class="o">+</span> <span class="n">Y_OUT_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">-</span> <span class="n">X_OUT_GHOST</span><span class="p">);</span> <span class="n">x</span> <span class="o">+</span> <span class="n">WORD</span> <span class="o"><=</span> <span class="n">width</span> <span class="o">+</span> <span class="n">X_IN_GHOST</span> <span class="o">+</span> <span class="n">X_OUT_GHOST</span><span class="p">;</span> <span class="n">x</span> <span class="o">+=</span> <span class="n">WORD</span><span class="p">)</span> <span class="p">{</span>
<span class="n">__m128i</span> <span class="n">n</span><span class="p">;</span>
<span class="n">__m128i</span> <span class="n">alive</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">u</span> <span class="o">=</span> <span class="n">universe</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width</span> <span class="o">+</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="c1">//This is an unaligned load</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm_lddqu_si128</span><span class="p">((</span><span class="n">__m128i</span><span class="o">*</span><span class="p">)</span><span class="n">u</span><span class="p">);</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">_mm_load_si128</span><span class="p">((</span><span class="n">__m128i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">_mm_lddqu_si128</span><span class="p">((</span><span class="n">__m128i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">2</span><span class="p">)),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">u</span> <span class="o">+=</span> <span class="n">padded_width</span><span class="p">;</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">_mm_lddqu_si128</span><span class="p">((</span><span class="n">__m128i</span><span class="o">*</span><span class="p">)</span><span class="n">u</span><span class="p">),</span> <span class="n">n</span><span class="p">);</span>
<span class="c1">//This is an aligned load</span>
<span class="n">alive</span> <span class="o">=</span> <span class="n">_mm_load_si128</span><span class="p">((</span><span class="n">__m128i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">1</span><span class="p">));</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">_mm_lddqu_si128</span><span class="p">((</span><span class="n">__m128i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">2</span><span class="p">)),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">u</span> <span class="o">+=</span> <span class="n">padded_width</span><span class="p">;</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">_mm_lddqu_si128</span><span class="p">((</span><span class="n">__m128i</span><span class="o">*</span><span class="p">)</span><span class="n">u</span><span class="p">),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">_mm_load_si128</span><span class="p">((</span><span class="n">__m128i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm_add_epi8</span><span class="p">(</span><span class="n">_mm_lddqu_si128</span><span class="p">((</span><span class="n">__m128i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">2</span><span class="p">)),</span> <span class="n">n</span><span class="p">);</span>
<span class="c1">//The operation we are performing here is the same, but it looks</span>
<span class="c1">//very different when written in SIMD instructions</span>
<span class="n">_mm_store_si128</span><span class="p">((</span><span class="n">__m128i</span><span class="o">*</span><span class="p">)(</span><span class="n">new</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">padded_width</span> <span class="o">+</span> <span class="n">x</span><span class="p">),</span>
<span class="n">_mm_or_si128</span><span class="p">(</span>
<span class="c1">//We need to and with the ones vector here because the result of</span>
<span class="c1">//comparison is either 0xFF or 0, and we need 1 or 0.</span>
<span class="n">_mm_and_si128</span><span class="p">(</span><span class="n">ones</span><span class="p">,</span> <span class="n">_mm_cmpeq_epi8</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">threes</span><span class="p">)),</span>
<span class="n">_mm_and_si128</span><span class="p">(</span><span class="n">alive</span><span class="p">,</span> <span class="n">_mm_cmpeq_epi8</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">twos</span><span class="p">))));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">tmp</span> <span class="o">=</span> <span class="n">universe</span><span class="p">;</span>
<span class="n">universe</span> <span class="o">=</span> <span class="n">new</span><span class="p">;</span>
<span class="n">new</span> <span class="o">=</span> <span class="n">tmp</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span></code></pre></figure>
<p>This code achieves a speedup of 58.06x over the reference on the 8192x8192 grid for 256 iterations. Not bad.</p>
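<p>To check that the branch-free expression at the end of the inner loop really encodes the rules of Life, here is a minimal sketch that applies it to 16 cells at once and compares the result against a scalar version of the rule. The function names are made up for illustration and are not part of the repository. The sketch uses <code class="language-plaintext highlighter-rouge">_mm_set1_epi8</code> and unaligned loads for brevity, and it falls back to the scalar loop when SSE2 is unavailable.</p>

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar form of the rule: a cell is alive in the next generation if it has
 * exactly three live neighbors, or if it is alive now and has exactly two.
 * (Hypothetical helper, not taken from the repository.) */
uint8_t life_rule_scalar(uint8_t alive, uint8_t n) {
    assert(alive <= 1 && n <= 8);
    return (uint8_t)((n == 3) | (alive & (n == 2)));
}

/* Apply the rule to 16 cells at once. alive[k] is 0 or 1 and n[k] is the
 * number of live neighbors (0 to 8) of cell k. */
void life_rule_16(const uint8_t alive[16], const uint8_t n[16], uint8_t out[16]) {
#if defined(__SSE2__)
    const __m128i ones = _mm_set1_epi8(1);
    const __m128i twos = _mm_set1_epi8(2);
    const __m128i threes = _mm_set1_epi8(3);
    const __m128i va = _mm_loadu_si128((const __m128i *)alive);
    const __m128i vn = _mm_loadu_si128((const __m128i *)n);
    /* A comparison yields 0xFF or 0x00 per byte. ANDing with ones turns that
     * into 1 or 0; ANDing with va (already 0 or 1) masks and normalizes. */
    const __m128i born = _mm_and_si128(ones, _mm_cmpeq_epi8(vn, threes));
    const __m128i survives = _mm_and_si128(va, _mm_cmpeq_epi8(vn, twos));
    _mm_storeu_si128((__m128i *)out, _mm_or_si128(born, survives));
#else
    /* Portable fallback for targets without SSE2 */
    for (int k = 0; k < 16; k++) out[k] = life_rule_scalar(alive[k], n[k]);
#endif
}

/* Returns 1 if the 16-wide version agrees with the scalar rule for every
 * combination of alive state and neighbor count, and 0 otherwise. */
int life_rule_16_matches_scalar(void) {
    uint8_t alive[16], n[16], out[16];
    for (uint8_t a = 0; a <= 1; a++) {
        for (uint8_t base = 0; base <= 8; base++) {
            for (int k = 0; k < 16; k++) {
                alive[k] = a;
                n[k] = (uint8_t)((base + k) % 9);
            }
            life_rule_16(alive, n, out);
            for (int k = 0; k < 16; k++) {
                if (out[k] != life_rule_scalar(alive[k], n[k])) return 0;
            }
        }
    }
    return 1;
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">life_rule_16_matches_scalar()</code> exercises both states of a cell against all nine possible neighbor counts, which is the whole input space of the rule.</p>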
<h2 id="avx2c">avx2.c</h2>
<p><a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions">AVX</a> registers are twice as wide as SSE registers. They hold 256 bits (32 <code class="language-plaintext highlighter-rouge">uint8_t</code> values), so we require that the width of the input is a multiple of 32. The byte-wise integer instructions we need at this width were introduced with AVX2, hence the file name. This code achieves a speedup of 72.42x over the reference (66.87% of the single CPU peak) on the 8192x8192 grid for 256 iterations, but because it is so similar to the SSE version we do not include it.</p>
<h2 id="streamingc">streaming.c</h2>
<p>Since we perform only 12 operations per byte and an AVX instruction can operate on 32 bytes simultaneously, computation is now cheap relative to memory traffic, so we might benefit from some memory optimizations. Let’s put a <a href="https://software.intel.com/sites/default/files/article/326703/streaming-stores-2.pdf">streaming store</a> in the inner loop. A streaming store writes to memory without first reading the value into cache (leaving more room for useful values in the cache). Since we never read the old contents of the <code class="language-plaintext highlighter-rouge">new</code> array before overwriting them, this is the perfect operation for us. The last line of our inner loop moves from this:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="n">_mm256_store_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">new</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">padded_width</span> <span class="o">+</span> <span class="n">x</span><span class="p">),</span>
<span class="n">_mm256_or_si256</span><span class="p">(</span>
<span class="n">_mm256_and_si256</span><span class="p">(</span><span class="n">ones</span><span class="p">,</span> <span class="n">_mm256_cmpeq_epi8</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">threes</span><span class="p">)),</span>
<span class="n">_mm256_and_si256</span><span class="p">(</span><span class="n">alive</span><span class="p">,</span> <span class="n">_mm256_cmpeq_epi8</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">twos</span><span class="p">))));</span></code></pre></figure>
<p>To this:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="n">_mm256_stream_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">new</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">padded_width</span> <span class="o">+</span> <span class="n">x</span><span class="p">),</span>
<span class="n">_mm256_or_si256</span><span class="p">(</span>
<span class="n">_mm256_and_si256</span><span class="p">(</span><span class="n">ones</span><span class="p">,</span> <span class="n">_mm256_cmpeq_epi8</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">threes</span><span class="p">)),</span>
<span class="n">_mm256_and_si256</span><span class="p">(</span><span class="n">alive</span><span class="p">,</span> <span class="n">_mm256_cmpeq_epi8</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">twos</span><span class="p">))));</span></code></pre></figure>
<p>And our code now achieves a speedup of 86.44x over the reference on the 8192x8192 grid for 256 iterations. We are now running at 79.82% of the single CPU peak, and I don’t think we’re going to squeeze much more out of a single core. It’s time to go parallel!</p>
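<p>Before moving on, here is a small standalone sketch of the streaming store pattern, using the 128-bit <code class="language-plaintext highlighter-rouge">_mm_stream_si128</code> so that it compiles without AVX support. The function names and the copy workload are assumptions for illustration, not code from the repository. Two details are worth knowing: the destination address must be aligned to the register width, and streaming stores are weakly ordered, so an <code class="language-plaintext highlighter-rouge">_mm_sfence</code> is needed before another thread relies on the written data.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Write n bytes from src to dst with streaming stores, so that dst is not
 * pulled into the cache first. Both pointers must be 16-byte aligned and n
 * must be a multiple of 16. (Hypothetical helper for illustration.) */
void store_row_streaming(uint8_t *dst, const uint8_t *src, size_t n) {
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0 && n % 16 == 0);
#if defined(__SSE2__)
    for (size_t k = 0; k < n; k += 16) {
        /* Stands in for the computed result of the Life rule */
        __m128i v = _mm_load_si128((const __m128i *)(src + k));
        _mm_stream_si128((__m128i *)(dst + k), v);
    }
    /* Make the weakly ordered streaming stores visible before returning */
    _mm_sfence();
#else
    /* Portable fallback for targets without SSE2 */
    for (size_t k = 0; k < n; k++) dst[k] = src[k];
#endif
}

/* Round a pointer up to the next 16-byte boundary */
static uint8_t *align16(uint8_t *p) {
    return (uint8_t *)(((uintptr_t)p + 15) & ~(uintptr_t)15);
}

/* Returns 1 if the streamed data matches the source, and 0 otherwise. */
int streaming_store_demo(void) {
    uint8_t raw_src[64 + 15], raw_dst[64 + 15];
    uint8_t *src = align16(raw_src);
    uint8_t *dst = align16(raw_dst);
    for (size_t k = 0; k < 64; k++) {
        src[k] = (uint8_t)(k & 1);
        dst[k] = 0xFF;
    }
    store_row_streaming(dst, src, 64);
    for (size_t k = 0; k < 64; k++) {
        if (dst[k] != src[k]) return 0;
    }
    return 1;
}
```

<p>On a buffer this small the cache behavior is irrelevant; the benefit only shows up when the output is too large to stay in cache, as with our 8192x8192 grid.</p>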
<h2 id="ompc">omp.c</h2>
<p>We have pretty decent single-core utilization, so why don’t we move to multiple cores? <a href="https://en.wikipedia.org/wiki/OpenMP">OpenMP</a> is a set of compiler directives and runtime routines that makes it easy to distribute loop iterations among threads. Here’s what it looks like:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="cp">#pragma omp parallel
</span> <span class="p">{</span>
<span class="c1">//To avoid race conditions, each thread keeps their own copy of the</span>
<span class="c1">//universe and new pointers</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">my_universe</span> <span class="o">=</span> <span class="n">universe</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">my_new</span> <span class="o">=</span> <span class="n">new</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">IN_GHOST</span> <span class="o">&&</span> <span class="n">j</span> <span class="o">+</span> <span class="n">i</span> <span class="o"><</span> <span class="n">iters</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">//We distribute the loop over y, not x, because we want to avoid writing</span>
<span class="c1">//to the same cache lines</span>
<span class="cp">#pragma omp for
</span> <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="p">(</span><span class="n">Y_IN_GHOST</span> <span class="o">-</span> <span class="n">Y_OUT_GHOST</span><span class="p">);</span> <span class="n">y</span> <span class="o"><</span> <span class="n">height</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span> <span class="o">+</span> <span class="n">Y_OUT_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">-</span> <span class="n">X_OUT_GHOST</span><span class="p">);</span> <span class="n">x</span> <span class="o">+</span> <span class="n">WORD</span> <span class="o"><=</span> <span class="n">width</span> <span class="o">+</span> <span class="n">X_IN_GHOST</span> <span class="o">+</span> <span class="n">X_OUT_GHOST</span><span class="p">;</span> <span class="n">x</span> <span class="o">+=</span> <span class="n">WORD</span><span class="p">)</span> <span class="p">{</span>
<span class="n">__m256i</span> <span class="n">n</span><span class="p">;</span>
<span class="n">__m256i</span> <span class="n">alive</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">u</span> <span class="o">=</span> <span class="n">my_universe</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">padded_width</span> <span class="o">+</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_lddqu_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">u</span><span class="p">);</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_add_epi8</span><span class="p">(</span><span class="n">_mm256_load_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_add_epi8</span><span class="p">(</span><span class="n">_mm256_lddqu_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">2</span><span class="p">)),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">u</span> <span class="o">+=</span> <span class="n">padded_width</span><span class="p">;</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_add_epi8</span><span class="p">(</span><span class="n">_mm256_lddqu_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">u</span><span class="p">),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">alive</span> <span class="o">=</span> <span class="n">_mm256_load_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">1</span><span class="p">));</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_add_epi8</span><span class="p">(</span><span class="n">_mm256_lddqu_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">2</span><span class="p">)),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">u</span> <span class="o">+=</span> <span class="n">padded_width</span><span class="p">;</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_add_epi8</span><span class="p">(</span><span class="n">_mm256_lddqu_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">u</span><span class="p">),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_add_epi8</span><span class="p">(</span><span class="n">_mm256_load_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_add_epi8</span><span class="p">(</span><span class="n">_mm256_lddqu_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">2</span><span class="p">)),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">_mm256_stream_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">my_new</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">padded_width</span> <span class="o">+</span> <span class="n">x</span><span class="p">),</span>
<span class="n">_mm256_or_si256</span><span class="p">(</span>
<span class="n">_mm256_and_si256</span><span class="p">(</span><span class="n">ones</span><span class="p">,</span> <span class="n">_mm256_cmpeq_epi8</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">threes</span><span class="p">)),</span>
<span class="n">_mm256_and_si256</span><span class="p">(</span><span class="n">alive</span><span class="p">,</span> <span class="n">_mm256_cmpeq_epi8</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">twos</span><span class="p">))));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">tmp</span> <span class="o">=</span> <span class="n">my_universe</span><span class="p">;</span>
<span class="n">my_universe</span> <span class="o">=</span> <span class="n">my_new</span><span class="p">;</span>
<span class="n">my_new</span> <span class="o">=</span> <span class="n">tmp</span><span class="p">;</span>
<span class="p">}</span>
<span class="cp">#pragma omp single
</span> <span class="p">{</span>
<span class="c1">//Again to avoid race conditions, a single thread (it doesn't matter which,</span>
<span class="c1">//since all the threads hold the same pointer values) writes its copies</span>
<span class="c1">//of the universe and new pointers back to the shared copies for the</span>
<span class="c1">//next time</span>
<span class="n">universe</span> <span class="o">=</span> <span class="n">my_universe</span><span class="p">;</span>
<span class="n">new</span> <span class="o">=</span> <span class="n">my_new</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="err">}</span></code></pre></figure>
<p>Now we are getting the speedups we deserve! This gets a speedup of 375.6x over the reference version on the 8192x8192 grid for 256 iterations, and we are only using 10 of the available 40 processors on this CPU! Keep in mind that we are currently only running at 17.34% of our theoretical peak processing rate!</p>
<p>Our OpenMP code does not scale beyond a single <a href="https://en.wikipedia.org/wiki/Non-uniform_memory_access">NUMA</a> domain (where communication is cheap). From the following graph, we see that our code gets no real performance gain beyond 10 threads.</p>
<p><img src="/assets/images/life-in-the-fast-lane-omp-plot.png" alt="Oh no my graph didn't load!" title="NUMA Problems" /></p>
<p>There are a few reasons this might happen. My guess is that either our memory system cannot feed data to the hungry CPUs fast enough, or communication between NUMA domains is too costly. We should perform a more thorough analysis of why our implementation isn’t scaling, but I like writing code more than I like running it, so let’s make a brash decision!</p>
<h2 id="mpic">mpi.c</h2>
<p>If our code were communication-bound, we might benefit from explicitly managing our communication patterns. We can do this with <a href="https://en.wikipedia.org/wiki/Message_Passing_Interface">MPI</a>, running multiple processes that do not share an address space and binding each process to physical CPUs. This way, each NUMA domain has a separate MPI task. Our MPI tasks can run OpenMP on their own NUMA domains, and communicate with each other explicitly.</p>
<p>Our MPI program makes several simplifying assumptions: that the number of processes is a perfect square, that the height is divisible by the square root of the number of processes, and that the width is divisible by the word size times the square root of the number of processes.</p>
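<p>To make those assumptions concrete, here is a minimal sketch of a check one could run before starting the simulation. The names <code>tile_layout</code> and <code>make_layout</code> are hypothetical and do not appear in the real code; only <code>WORD</code> and the tile-size arithmetic mirror the listing.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WORD (256 / 8) /* bytes in one 256-bit vector, as in the listings */

/* Hypothetical struct: the per-process tile implied by the assumptions. */
typedef struct {
  unsigned side;      /* processes along each edge of the process grid */
  unsigned my_height; /* rows owned by each process */
  unsigned my_width;  /* columns owned by each process */
} tile_layout;

/* Hypothetical helper: returns true and fills *out only if the number of
 * processes is a perfect square, the height divides evenly among the rows
 * of the process grid, and the width divides into whole words per process. */
bool make_layout(unsigned height, unsigned width, unsigned nprocs,
                 tile_layout *out) {
  assert(out != NULL);
  /* Integer floor(sqrt(nprocs)), avoiding floating-point rounding. */
  unsigned side = 0;
  while ((unsigned long long)(side + 1) * (side + 1) <= nprocs) {
    side++;
  }
  if (side == 0 || side * side != nprocs) return false;
  if (height % side != 0) return false;
  if (width % (WORD * side) != 0) return false;
  out->side = side;
  out->my_height = height / side;
  out->my_width = width / side;
  return true;
}
```

<p>For example, an 8192x8192 grid on 4 processes passes the check and gives each process a 4096x4096 tile, while 3 processes fail because 3 is not a perfect square.</p>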
<p>Watch out! This MPI code is more than 10 times longer than our reference code. TL;DR: each process is part of a grid, and instead of copying the ghost cells like we did in previous versions, we will send them to our neighbors.</p>
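<p>Before diving in, it may help to see the neighbor arithmetic on its own. The sketch below assumes ranks are laid out row by row on a <code>side</code> by <code>side</code> grid that wraps around at the edges, which is how the listing computes its destinations. The function <code>neighbor_rank</code> is a hypothetical helper for illustration; the real code writes these expressions inline.</p>

```c
#include <assert.h>

/* Hypothetical helper: rank of the neighbor at offset (dy, dx), where each
 * offset is -1, 0, or 1, on a side x side process grid with wraparound.
 * "Up" is dy = -1 and "left" is dx = -1, matching the listing's convention. */
int neighbor_rank(int rank, int side, int dy, int dx) {
  assert(side > 0 && rank >= 0 && rank < side * side);
  assert(dy >= -1 && dy <= 1 && dx >= -1 && dx <= 1);
  int my_y = rank / side; /* row of this process in the grid */
  int my_x = rank % side; /* column of this process in the grid */
  /* Adding side before taking the remainder keeps the value non-negative. */
  int y = (my_y + side + dy) % side;
  int x = (my_x + side + dx) % side;
  return y * side + x;
}
```

<p>On a 3x3 grid, the top left neighbor of rank 0 wraps around to rank 8, so <code>neighbor_rank(0, 3, -1, -1)</code> returns 8. Whatever a process sends toward its top left lands in the bottom right ghost region of the receiver.</p>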
<figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#include <stdlib.h>
#include <stdint.h>
#include <immintrin.h>
#include <math.h>
#include <omp.h>
#include <mpi.h>
</span>
<span class="cp">#define WORD (256/8)
#define OUT_GHOST 7
#define X_OUT_GHOST (((OUT_GHOST - 1)/WORD + 1) * WORD)
#define Y_OUT_GHOST OUT_GHOST
#define IN_GHOST (OUT_GHOST + 1)
#define X_IN_GHOST ((OUT_GHOST/WORD + 1) * WORD)
#define Y_IN_GHOST IN_GHOST
#define X_IN_GHOST_WORDS (X_IN_GHOST/WORD)
</span>
<span class="c1">//Here are the tags we will use to distinguish where the data is coming from</span>
<span class="c1">//and going to. Notice that the top left corner is sent to the bottom right</span>
<span class="c1">//corner of the top left neighbor.</span>
<span class="cp">#define TOP_LEFT_SEND 0
#define BOTTOM_RIGHT_RECV 0
#define TOP_SEND 1
#define BOTTOM_RECV 1
#define TOP_RIGHT_SEND 2
#define BOTTOM_LEFT_RECV 2
#define RIGHT_SEND 3
#define LEFT_RECV 3
#define BOTTOM_RIGHT_SEND 4
#define TOP_LEFT_RECV 4
#define BOTTOM_SEND 5
#define TOP_RECV 5
#define BOTTOM_LEFT_SEND 6
#define TOP_RIGHT_RECV 6
#define LEFT_SEND 7
#define RIGHT_RECV 7
</span>
<span class="kt">void</span> <span class="o">*</span><span class="nf">aligned_malloc</span><span class="p">(</span><span class="kt">int</span> <span class="n">size</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">char</span> <span class="o">*</span><span class="n">mem</span> <span class="o">=</span> <span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span> <span class="o">+</span> <span class="n">size</span> <span class="o">+</span> <span class="n">WORD</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span>
<span class="kt">void</span> <span class="o">**</span><span class="n">ptr</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span><span class="o">**</span><span class="p">)(((</span><span class="kt">uintptr_t</span><span class="p">)(</span><span class="n">mem</span> <span class="o">+</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span> <span class="o">+</span> <span class="n">WORD</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span> <span class="o">&</span> <span class="o">~</span><span class="p">((</span><span class="kt">uintptr_t</span><span class="p">)(</span><span class="n">WORD</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)));</span>
<span class="n">ptr</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">mem</span><span class="p">;</span>
<span class="k">return</span> <span class="n">ptr</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">aligned_free</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ptr</span><span class="p">)</span> <span class="p">{</span>
<span class="n">free</span><span class="p">(((</span><span class="kt">void</span><span class="o">**</span><span class="p">)</span><span class="n">ptr</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]);</span>
<span class="p">}</span>
<span class="kt">unsigned</span> <span class="o">*</span><span class="nf">life</span> <span class="p">(</span><span class="k">const</span> <span class="kt">unsigned</span> <span class="n">height</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">width</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="o">*</span> <span class="k">const</span> <span class="n">initial</span><span class="p">,</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">iters</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">rank</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">size</span><span class="p">;</span>
<span class="n">MPI_Comm_rank</span><span class="p">(</span><span class="n">MPI_COMM_WORLD</span><span class="p">,</span> <span class="o">&</span><span class="n">rank</span><span class="p">);</span>
<span class="n">MPI_Comm_size</span><span class="p">(</span><span class="n">MPI_COMM_WORLD</span><span class="p">,</span> <span class="o">&</span><span class="n">size</span><span class="p">);</span>
<span class="c1">//We will be arranging our processes in a grid. We are assuming that the</span>
<span class="c1">//problem width is divisible by the square root of the number of processes</span>
<span class="c1">//times the width of a word, and that the height is divisible by the</span>
<span class="c1">//square root of the number of processes, and the number of processes</span>
<span class="c1">//is a perfect square</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">side</span> <span class="o">=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">sqrt</span><span class="p">((</span><span class="kt">double</span><span class="p">)</span><span class="n">size</span><span class="p">);</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">my_height</span> <span class="o">=</span> <span class="n">height</span><span class="o">/</span><span class="n">side</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">my_width</span> <span class="o">=</span> <span class="n">width</span><span class="o">/</span><span class="n">side</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">my_y</span> <span class="o">=</span> <span class="p">(</span><span class="n">rank</span> <span class="o">/</span> <span class="n">side</span><span class="p">);</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">my_x</span> <span class="o">=</span> <span class="p">(</span><span class="n">rank</span> <span class="o">%</span> <span class="n">side</span><span class="p">);</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">my_padded_height</span> <span class="o">=</span> <span class="n">my_height</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">my_padded_width</span> <span class="o">=</span> <span class="n">my_width</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">X_IN_GHOST</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">my_width_words</span> <span class="o">=</span> <span class="n">my_width</span><span class="o">/</span><span class="n">WORD</span><span class="p">;</span>
<span class="k">const</span> <span class="kt">unsigned</span> <span class="n">my_padded_width_words</span> <span class="o">=</span> <span class="n">my_padded_width</span><span class="o">/</span><span class="n">WORD</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">universe</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uint8_t</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">my_padded_height</span> <span class="o">*</span> <span class="n">my_padded_width</span><span class="p">);</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">new</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uint8_t</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">my_padded_height</span> <span class="o">*</span> <span class="n">my_padded_width</span><span class="p">);</span>
<span class="k">const</span> <span class="n">__m256i</span> <span class="n">ones</span> <span class="o">=</span> <span class="n">_mm256_set_epi8</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span>
<span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span>
<span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span>
<span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="k">const</span> <span class="n">__m256i</span> <span class="n">twos</span> <span class="o">=</span> <span class="n">_mm256_slli_epi32</span><span class="p">(</span><span class="n">ones</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="k">const</span> <span class="n">__m256i</span> <span class="n">threes</span> <span class="o">=</span> <span class="n">_mm256_or_si256</span><span class="p">(</span><span class="n">ones</span><span class="p">,</span> <span class="n">twos</span><span class="p">);</span>
<span class="c1">//We start by sending the data to all the processes. The data is first</span>
<span class="c1">//partitioned into a grid of rectangles (one for each processor).</span>
<span class="c1">//Here we first break up the initial data into rectangles.</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">scatter_buffer_send</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">scatter_buffer_recv</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uint8_t</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">my_height</span> <span class="o">*</span> <span class="n">my_width</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">rank</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">scatter_buffer_send</span> <span class="o">=</span> <span class="p">(</span><span class="kt">uint8_t</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">height</span> <span class="o">*</span> <span class="n">width</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">their_y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">their_y</span> <span class="o"><</span> <span class="n">side</span><span class="p">;</span> <span class="n">their_y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">their_x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">their_x</span> <span class="o"><</span> <span class="n">side</span><span class="p">;</span> <span class="n">their_x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">my_height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">my_width</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">scatter_buffer_send</span><span class="p">[(</span><span class="n">their_y</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="n">their_x</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">my_width</span> <span class="o">*</span> <span class="n">my_height</span><span class="p">)</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">my_width</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span>
<span class="n">initial</span><span class="p">[(</span><span class="n">their_y</span> <span class="o">*</span> <span class="n">my_height</span> <span class="o">+</span> <span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">width</span> <span class="o">+</span> <span class="n">their_x</span> <span class="o">*</span> <span class="n">my_width</span> <span class="o">+</span> <span class="n">x</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Scatter</span><span class="p">((</span><span class="k">const</span> <span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">scatter_buffer_send</span><span class="p">,</span>
<span class="n">my_height</span> <span class="o">*</span> <span class="n">my_width</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">scatter_buffer_recv</span><span class="p">,</span>
<span class="n">my_height</span> <span class="o">*</span> <span class="n">my_width</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="mi">0</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">);</span>
<span class="c1">//Now that the data has been scattered, we copy our personal rectangle into</span>
<span class="c1">//our local universe.</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span> <span class="o">+</span> <span class="n">my_height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="n">X_IN_GHOST</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST</span> <span class="o">+</span> <span class="n">my_width</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">universe</span><span class="p">[(</span><span class="n">y</span> <span class="o">*</span> <span class="n">my_padded_width</span><span class="p">)</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="n">scatter_buffer_recv</span><span class="p">[(</span><span class="n">y</span> <span class="o">-</span> <span class="n">Y_IN_GHOST</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_width</span> <span class="o">+</span> <span class="n">x</span> <span class="o">-</span> <span class="n">X_IN_GHOST</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">//There's a bunch of send and receive buffers, aren't there?</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_top_left_send</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">);</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_top_send</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span> <span class="n">my_width</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">);</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_top_right_send</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">);</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_right_send</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">my_height</span> <span class="p">);</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_bottom_right_send</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">);</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_bottom_send</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span> <span class="n">my_width</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">);</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_bottom_left_send</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">);</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_left_send</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">my_height</span> <span class="p">);</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_bottom_right_recv</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">);</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_bottom_recv</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span> <span class="n">my_width</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">);</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_bottom_left_recv</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">);</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_left_recv</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">my_height</span> <span class="p">);</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_top_left_recv</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">);</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_top_recv</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span> <span class="n">my_width</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">);</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_top_right_recv</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">);</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">ghost_buffer_right_recv</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">aligned_malloc</span><span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">my_height</span> <span class="p">);</span>
<span class="n">MPI_Request</span> <span class="n">top_left_req</span><span class="p">;</span>
<span class="n">MPI_Request</span> <span class="n">top_req</span><span class="p">;</span>
<span class="n">MPI_Request</span> <span class="n">top_right_req</span><span class="p">;</span>
<span class="n">MPI_Request</span> <span class="n">right_req</span><span class="p">;</span>
<span class="n">MPI_Request</span> <span class="n">bottom_right_req</span><span class="p">;</span>
<span class="n">MPI_Request</span> <span class="n">bottom_req</span><span class="p">;</span>
<span class="n">MPI_Request</span> <span class="n">bottom_left_req</span><span class="p">;</span>
<span class="n">MPI_Request</span> <span class="n">left_req</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">iters</span><span class="p">;</span> <span class="n">i</span><span class="o">+=</span> <span class="n">IN_GHOST</span><span class="p">)</span> <span class="p">{</span>
<span class="n">__m256i</span> <span class="o">*</span><span class="n">universe_words</span> <span class="o">=</span> <span class="p">(</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">universe</span><span class="p">;</span>
<span class="c1">//Here are all of the sends to our eight neighbors (diagonals included).</span>
<span class="c1">//The sends are nonblocking, so that we can move right on to the next send</span>
<span class="c1">//without waiting for our neighbors to receive.</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">ghost_buffer_top_left_send</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">Y_IN_GHOST</span> <span class="o">+</span> <span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">x</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Isend</span><span class="p">(</span><span class="n">ghost_buffer_top_left_send</span><span class="p">,</span>
<span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="p">((</span><span class="n">my_y</span> <span class="o">+</span> <span class="n">side</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">)</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="p">((</span><span class="n">my_x</span> <span class="o">+</span> <span class="n">side</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">),</span>
<span class="n">TOP_LEFT_SEND</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="o">&</span><span class="n">top_left_req</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">my_width_words</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">ghost_buffer_top_send</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">my_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">Y_IN_GHOST</span> <span class="o">+</span> <span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">x</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Isend</span><span class="p">(</span><span class="n">ghost_buffer_top_send</span><span class="p">,</span>
<span class="n">my_width</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="p">((</span><span class="n">my_y</span> <span class="o">+</span> <span class="n">side</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">)</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="n">my_x</span><span class="p">,</span>
<span class="n">TOP_SEND</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="o">&</span><span class="n">top_req</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">ghost_buffer_top_right_send</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">my_width_words</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Isend</span><span class="p">(</span><span class="n">ghost_buffer_top_right_send</span><span class="p">,</span>
<span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="p">((</span><span class="n">my_y</span> <span class="o">+</span> <span class="n">side</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">)</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="p">((</span><span class="n">my_x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">),</span>
<span class="n">TOP_RIGHT_SEND</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="o">&</span><span class="n">top_right_req</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">my_height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">ghost_buffer_right_send</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">my_width_words</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Isend</span><span class="p">(</span><span class="n">ghost_buffer_right_send</span><span class="p">,</span>
<span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">my_height</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="n">my_y</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="p">((</span><span class="n">my_x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">),</span>
<span class="n">RIGHT_SEND</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="o">&</span><span class="n">right_req</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">ghost_buffer_bottom_right_send</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">my_height</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">my_width_words</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Isend</span><span class="p">(</span><span class="n">ghost_buffer_bottom_right_send</span><span class="p">,</span>
<span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="p">((</span><span class="n">my_y</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">)</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="p">((</span><span class="n">my_x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">),</span>
<span class="n">BOTTOM_RIGHT_SEND</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="o">&</span><span class="n">bottom_right_req</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">my_width_words</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">ghost_buffer_bottom_send</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">my_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">my_height</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Isend</span><span class="p">(</span><span class="n">ghost_buffer_bottom_send</span><span class="p">,</span>
<span class="n">my_width</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="p">((</span><span class="n">my_y</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">)</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="n">my_x</span><span class="p">,</span>
<span class="n">BOTTOM_SEND</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="o">&</span><span class="n">bottom_req</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">ghost_buffer_bottom_left_send</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">my_height</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Isend</span><span class="p">(</span><span class="n">ghost_buffer_bottom_left_send</span><span class="p">,</span>
<span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="p">((</span><span class="n">my_y</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">)</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="p">((</span><span class="n">my_x</span> <span class="o">+</span> <span class="n">side</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">),</span>
<span class="n">BOTTOM_LEFT_SEND</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="o">&</span><span class="n">bottom_left_req</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">my_height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">ghost_buffer_left_send</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Isend</span><span class="p">(</span><span class="n">ghost_buffer_left_send</span><span class="p">,</span>
<span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">my_height</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="n">my_y</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="p">((</span><span class="n">my_x</span> <span class="o">+</span> <span class="n">side</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">),</span>
<span class="n">LEFT_SEND</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="o">&</span><span class="n">left_req</span><span class="p">);</span>
<span class="c1">//Now we receive ghost zones from all of our neighbors. Since we need to</span>
<span class="c1">//process our received data immediately, the receives are blocking.</span>
<span class="n">MPI_Recv</span><span class="p">((</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">ghost_buffer_bottom_right_recv</span><span class="p">,</span>
<span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="p">((</span><span class="n">my_y</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">)</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="p">((</span><span class="n">my_x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">),</span>
<span class="n">BOTTOM_RIGHT_RECV</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span> <span class="o">+</span> <span class="n">my_height</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">my_width_words</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">ghost_buffer_bottom_right_recv</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">x</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Recv</span><span class="p">((</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">ghost_buffer_bottom_recv</span><span class="p">,</span>
<span class="n">my_width</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="p">((</span><span class="n">my_y</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">)</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="n">my_x</span><span class="p">,</span>
<span class="n">BOTTOM_RECV</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">my_width_words</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span> <span class="o">+</span> <span class="n">my_height</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">ghost_buffer_bottom_recv</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">my_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Recv</span><span class="p">((</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">ghost_buffer_bottom_left_recv</span><span class="p">,</span>
<span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="p">((</span><span class="n">my_y</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">)</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="p">((</span><span class="n">my_x</span> <span class="o">+</span> <span class="n">side</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">),</span>
<span class="n">BOTTOM_LEFT_RECV</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span> <span class="o">+</span> <span class="n">my_height</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">ghost_buffer_bottom_left_recv</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">x</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Recv</span><span class="p">((</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">ghost_buffer_left_recv</span><span class="p">,</span>
<span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">my_height</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="n">my_y</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="p">((</span><span class="n">my_x</span> <span class="o">+</span> <span class="n">side</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">),</span>
<span class="n">LEFT_RECV</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">my_height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">ghost_buffer_left_recv</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">x</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Recv</span><span class="p">((</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">ghost_buffer_top_left_recv</span><span class="p">,</span>
<span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="p">((</span><span class="n">my_y</span> <span class="o">+</span> <span class="n">side</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">)</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="p">((</span><span class="n">my_x</span> <span class="o">+</span> <span class="n">side</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">),</span>
<span class="n">TOP_LEFT_RECV</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">ghost_buffer_top_left_recv</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">x</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Recv</span><span class="p">((</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">ghost_buffer_top_recv</span><span class="p">,</span>
<span class="n">my_width</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="p">((</span><span class="n">my_y</span> <span class="o">+</span> <span class="n">side</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">)</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="n">my_x</span><span class="p">,</span>
<span class="n">TOP_RECV</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">my_width_words</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">ghost_buffer_top_recv</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">my_width_words</span> <span class="o">+</span> <span class="n">x</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Recv</span><span class="p">((</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">ghost_buffer_top_right_recv</span><span class="p">,</span>
<span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">Y_IN_GHOST</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="p">((</span><span class="n">my_y</span> <span class="o">+</span> <span class="n">side</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">)</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="p">((</span><span class="n">my_x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">),</span>
<span class="n">TOP_RIGHT_RECV</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">my_width_words</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">ghost_buffer_top_right_recv</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">x</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Recv</span><span class="p">((</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">ghost_buffer_right_recv</span><span class="p">,</span>
<span class="n">X_IN_GHOST</span> <span class="o">*</span> <span class="n">my_height</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="n">my_y</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="p">((</span><span class="n">my_x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">side</span><span class="p">),</span>
<span class="n">RIGHT_RECV</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">,</span>
<span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">my_height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST_WORDS</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">_mm256_store_si256</span><span class="p">(</span><span class="n">universe_words</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_padded_width_words</span> <span class="o">+</span> <span class="n">x</span> <span class="o">+</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">my_width_words</span><span class="p">,</span>
<span class="n">_mm256_load_si256</span><span class="p">(</span><span class="n">ghost_buffer_right_recv</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">X_IN_GHOST_WORDS</span> <span class="o">+</span> <span class="n">x</span><span class="p">));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">//The inner loop is the same.</span>
<span class="cp">#pragma omp parallel
</span> <span class="p">{</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">my_universe</span> <span class="o">=</span> <span class="n">universe</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">my_new</span> <span class="o">=</span> <span class="n">new</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">j</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">j</span> <span class="o"><</span> <span class="n">IN_GHOST</span> <span class="o">&&</span> <span class="n">j</span> <span class="o">+</span> <span class="n">i</span> <span class="o"><</span> <span class="n">iters</span><span class="p">;</span> <span class="n">j</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="cp">#pragma omp for
</span> <span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="p">(</span><span class="n">Y_IN_GHOST</span> <span class="o">-</span> <span class="n">Y_OUT_GHOST</span><span class="p">);</span> <span class="n">y</span> <span class="o"><</span> <span class="n">my_height</span> <span class="o">+</span> <span class="n">Y_IN_GHOST</span> <span class="o">+</span> <span class="n">Y_OUT_GHOST</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="p">(</span><span class="n">X_IN_GHOST</span> <span class="o">-</span> <span class="n">X_OUT_GHOST</span><span class="p">);</span> <span class="n">x</span> <span class="o">+</span> <span class="n">WORD</span> <span class="o"><=</span> <span class="n">my_width</span> <span class="o">+</span> <span class="n">X_IN_GHOST</span> <span class="o">+</span> <span class="n">X_OUT_GHOST</span><span class="p">;</span> <span class="n">x</span> <span class="o">+=</span> <span class="n">WORD</span><span class="p">)</span> <span class="p">{</span>
<span class="n">__m256i</span> <span class="n">n</span><span class="p">;</span>
<span class="n">__m256i</span> <span class="n">alive</span><span class="p">;</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">u</span> <span class="o">=</span> <span class="n">my_universe</span> <span class="o">+</span> <span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_padded_width</span> <span class="o">+</span> <span class="n">x</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_lddqu_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">u</span><span class="p">);</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_add_epi8</span><span class="p">(</span><span class="n">_mm256_load_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_add_epi8</span><span class="p">(</span><span class="n">_mm256_lddqu_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">2</span><span class="p">)),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">u</span> <span class="o">+=</span> <span class="n">my_padded_width</span><span class="p">;</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_add_epi8</span><span class="p">(</span><span class="n">_mm256_lddqu_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">u</span><span class="p">),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">alive</span> <span class="o">=</span> <span class="n">_mm256_load_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">1</span><span class="p">));</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_add_epi8</span><span class="p">(</span><span class="n">_mm256_lddqu_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">2</span><span class="p">)),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">u</span> <span class="o">+=</span> <span class="n">my_padded_width</span><span class="p">;</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_add_epi8</span><span class="p">(</span><span class="n">_mm256_lddqu_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)</span><span class="n">u</span><span class="p">),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_add_epi8</span><span class="p">(</span><span class="n">_mm256_load_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">_mm256_add_epi8</span><span class="p">(</span><span class="n">_mm256_lddqu_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">u</span> <span class="o">+</span> <span class="mi">2</span><span class="p">)),</span> <span class="n">n</span><span class="p">);</span>
<span class="n">_mm256_stream_si256</span><span class="p">((</span><span class="n">__m256i</span><span class="o">*</span><span class="p">)(</span><span class="n">my_new</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">my_padded_width</span> <span class="o">+</span> <span class="n">x</span><span class="p">),</span>
<span class="n">_mm256_or_si256</span><span class="p">(</span>
<span class="n">_mm256_and_si256</span><span class="p">(</span><span class="n">ones</span><span class="p">,</span> <span class="n">_mm256_cmpeq_epi8</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">threes</span><span class="p">)),</span>
<span class="n">_mm256_and_si256</span><span class="p">(</span><span class="n">alive</span><span class="p">,</span> <span class="n">_mm256_cmpeq_epi8</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">twos</span><span class="p">))));</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">uint8_t</span> <span class="o">*</span><span class="n">tmp</span> <span class="o">=</span> <span class="n">my_universe</span><span class="p">;</span>
<span class="n">my_universe</span> <span class="o">=</span> <span class="n">my_new</span><span class="p">;</span>
<span class="n">my_new</span> <span class="o">=</span> <span class="n">tmp</span><span class="p">;</span>
<span class="p">}</span>
<span class="cp">#pragma omp single
</span> <span class="p">{</span>
<span class="n">universe</span> <span class="o">=</span> <span class="n">my_universe</span><span class="p">;</span>
<span class="n">new</span> <span class="o">=</span> <span class="n">my_new</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">//Before we start another iteration and start sending again, let's make sure</span>
<span class="c1">//that everyone has received our messages.</span>
<span class="n">MPI_Wait</span><span class="p">(</span><span class="o">&</span><span class="n">top_left_req</span><span class="p">,</span> <span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="n">MPI_Wait</span><span class="p">(</span><span class="o">&</span><span class="n">top_req</span><span class="p">,</span> <span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="n">MPI_Wait</span><span class="p">(</span><span class="o">&</span><span class="n">top_right_req</span><span class="p">,</span> <span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="n">MPI_Wait</span><span class="p">(</span><span class="o">&</span><span class="n">right_req</span><span class="p">,</span> <span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="n">MPI_Wait</span><span class="p">(</span><span class="o">&</span><span class="n">bottom_right_req</span><span class="p">,</span> <span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="n">MPI_Wait</span><span class="p">(</span><span class="o">&</span><span class="n">bottom_req</span><span class="p">,</span> <span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="n">MPI_Wait</span><span class="p">(</span><span class="o">&</span><span class="n">bottom_left_req</span><span class="p">,</span> <span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="n">MPI_Wait</span><span class="p">(</span><span class="o">&</span><span class="n">left_req</span><span class="p">,</span> <span class="n">MPI_STATUS_IGNORE</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">unsigned</span> <span class="o">*</span><span class="n">out</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="c1">//This part is very similar to the Scatter. We now have all of the final</span>
<span class="c1">//configurations, and need to send them to the master process so that we</span>
<span class="c1">//can return a matrix.</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="n">Y_IN_GHOST</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">Y_IN_GHOST</span> <span class="o">+</span> <span class="n">my_height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="n">X_IN_GHOST</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">X_IN_GHOST</span> <span class="o">+</span> <span class="n">my_width</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">scatter_buffer_recv</span><span class="p">[(</span><span class="n">y</span> <span class="o">-</span> <span class="n">Y_IN_GHOST</span><span class="p">)</span> <span class="o">*</span> <span class="n">my_width</span> <span class="o">+</span> <span class="n">x</span> <span class="o">-</span> <span class="n">X_IN_GHOST</span><span class="p">]</span> <span class="o">=</span> <span class="n">universe</span><span class="p">[</span><span class="n">y</span> <span class="o">*</span> <span class="n">my_padded_width</span> <span class="o">+</span> <span class="n">x</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">MPI_Gather</span><span class="p">((</span><span class="k">const</span> <span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">scatter_buffer_recv</span><span class="p">,</span>
<span class="n">my_height</span> <span class="o">*</span> <span class="n">my_width</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">scatter_buffer_send</span><span class="p">,</span>
<span class="n">my_height</span> <span class="o">*</span> <span class="n">my_width</span><span class="p">,</span>
<span class="n">MPI_UNSIGNED_CHAR</span><span class="p">,</span>
<span class="mi">0</span><span class="p">,</span>
<span class="n">MPI_COMM_WORLD</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">rank</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">out</span> <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span><span class="o">*</span><span class="p">)</span><span class="n">malloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">unsigned</span><span class="p">)</span> <span class="o">*</span> <span class="n">height</span> <span class="o">*</span> <span class="n">width</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">their_y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">their_y</span> <span class="o"><</span> <span class="n">side</span><span class="p">;</span> <span class="n">their_y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">their_x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">their_x</span> <span class="o"><</span> <span class="n">side</span><span class="p">;</span> <span class="n">their_x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">y</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">y</span> <span class="o"><</span> <span class="n">my_height</span><span class="p">;</span> <span class="n">y</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">x</span> <span class="o"><</span> <span class="n">my_width</span><span class="p">;</span> <span class="n">x</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">out</span><span class="p">[(</span><span class="n">their_y</span> <span class="o">*</span> <span class="n">my_height</span> <span class="o">+</span> <span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">width</span> <span class="o">+</span> <span class="n">their_x</span> <span class="o">*</span> <span class="n">my_width</span> <span class="o">+</span> <span class="n">x</span><span class="p">]</span> <span class="o">=</span>
<span class="n">scatter_buffer_send</span><span class="p">[(</span><span class="n">their_y</span> <span class="o">*</span> <span class="n">side</span> <span class="o">+</span> <span class="n">their_x</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">my_width</span> <span class="o">*</span> <span class="n">my_height</span><span class="p">)</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="n">my_width</span> <span class="o">+</span> <span class="n">x</span><span class="p">];</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">new</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">universe</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_top_left_send</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_top_send</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_top_right_send</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_right_send</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_bottom_right_send</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_bottom_send</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_bottom_left_send</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_left_send</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_top_left_recv</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_top_recv</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_top_right_recv</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_right_recv</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_bottom_right_recv</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_bottom_recv</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_bottom_left_recv</span><span class="p">);</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">ghost_buffer_left_recv</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">rank</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">scatter_buffer_send</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">aligned_free</span><span class="p">(</span><span class="n">scatter_buffer_recv</span><span class="p">);</span>
<span class="k">return</span> <span class="n">out</span><span class="p">;</span>
<span class="p">}</span></code></pre></figure>
<p>This implementation runs our 8192x8192 problem for 256 iterations in 0.58 seconds, a speedup of 462.30x over the reference code. We did gain some performance by explicitly managing our memory, but it appears that we have not achieved full utilization of our single processor.</p>
<p>You may be wondering why mpi.c is faster than omp.c. After all, omp.c performs as well at 10 threads as it does at 40 threads, and mpi.c does a lot of extra copying into buffers. I suspect that because the MPI code explicitly manages communication, passing only the ghost zones to its neighbors once every few iterations, mpi.c spends less time communicating. The OpenMP version of the code leaves the communication to the cache, which doesn’t know or care about our delicate ghost zones. By keeping two MPI processes on each NUMA node with 10 OpenMP threads each, we can explicitly manage communication between NUMA nodes.</p>
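<p>The idea of trading a deeper ghost zone for less frequent communication can be checked on a toy problem. The following is a minimal serial sketch, not the code from mpi.c: it assumes a one-dimensional ring where each cell becomes the XOR of its two neighbors (a stand-in for the Life rule), splits the ring into P blocks with G ghost cells on each side, and copies ghost cells only once every G steps. All names here (N, P, W, G, ring_step, ghost_blocking_matches) are hypothetical and chosen for illustration.</p>

```c
#include <assert.h>
#include <string.h>

/* Toy sizes (assumptions): ring of N cells, P blocks of W cells each,
   G ghost cells on each side of a block. Requires G <= W. */
enum { N = 30, P = 3, W = N / P, G = 4, LEN = W + 2 * G };

/* One step on the periodic ring: each cell becomes the XOR of its
   left and right neighbors. */
void ring_step(const unsigned char *cur, unsigned char *next) {
  for (int i = 0; i < N; i++)
    next[i] = cur[(i + N - 1) % N] ^ cur[(i + 1) % N];
}

/* Runs iters steps on the whole ring and on the padded blocks, and
   returns 1 if the block interiors agree with the ring at the end. */
int ghost_blocking_matches(int iters) {
  unsigned char ref[N], tmp[N];
  unsigned char blk[P][LEN], nxt[LEN];
  assert(iters >= 0);

  /* Arbitrary starting pattern, copied into each block's interior. */
  for (int i = 0; i < N; i++)
    ref[i] = (unsigned char)((i * i) % 7 < 3);
  for (int p = 0; p < P; p++)
    memcpy(blk[p] + G, ref + p * W, W);

  for (int i = 0; i < iters; i += G) {
    /* Exchange: fill both ghost zones from the neighbors' interiors. */
    for (int p = 0; p < P; p++) {
      int left = (p + P - 1) % P, right = (p + 1) % P;
      memcpy(blk[p], blk[left] + W, G);          /* left ghost  */
      memcpy(blk[p] + G + W, blk[right] + G, G); /* right ghost */
    }
    /* Up to G steps with no communication. Stale values creep inward
       from both ends by one cell per step, so after G steps they have
       consumed the ghost zone but not the interior. */
    for (int j = 0; j < G && i + j < iters; j++) {
      for (int p = 0; p < P; p++) {
        nxt[0] = blk[p][0];
        nxt[LEN - 1] = blk[p][LEN - 1];
        for (int k = 1; k < LEN - 1; k++)
          nxt[k] = blk[p][k - 1] ^ blk[p][k + 1];
        memcpy(blk[p], nxt, LEN);
      }
      ring_step(ref, tmp);
      memcpy(ref, tmp, N);
    }
  }

  for (int p = 0; p < P; p++)
    if (memcmp(blk[p] + G, ref + p * W, W) != 0)
      return 0;
  return 1;
}
```

<p>Calling ghost_blocking_matches(256) is expected to return 1, since each block exchanges G cells per side once every G steps instead of one cell per side every step. The total volume of data moved is similar, but the number of synchronization points drops by a factor of G, at the cost of some redundant computation in the ghost zones.</p>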
<p>Perhaps an interested reader can help explain to me why the application does not scale beyond 10 or so CPUs. My hypothesis is that we are memory-bound at some level of the cache, meaning that most of the processors are waiting to load life cells from memory instead of staying busy computing life cells.</p>
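<p>One rough way to probe the memory-bound hypothesis is to estimate how much memory traffic the kernel has to sustain. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes each cell update moves at least one byte in and one byte out of memory (cells are stored as uint8_t), and it ignores cache reuse and the redundant work in the ghost zones. The function names are hypothetical.</p>

```c
#include <assert.h>

/* Cell updates per second for a width x height grid advanced iters
   times in the given wall-clock time. */
double cell_updates_per_second(double width, double height,
                               double iters, double seconds) {
  assert(seconds > 0.0);
  return width * height * iters / seconds;
}

/* Lower bound on memory traffic in GB/s (1 GB = 1e9 bytes).
   ASSUMPTION: every update reads one byte and writes one byte. */
double min_traffic_gb_per_s(double updates_per_second) {
  const double bytes_per_update = 2.0; /* 1 read + 1 write, assumed */
  return updates_per_second * bytes_per_update / 1e9;
}
```

<p>Plugging in the 8192x8192 grid, 256 iterations, and the 0.58 seconds reported for mpi.c gives roughly 3.0e10 cell updates per second, or about 59 GB/s under the two-bytes-per-update assumption. Comparing that figure against the bandwidth the machine actually sustains (for example, from a STREAM-style benchmark) would be one way to test whether memory is the bottleneck.</p>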
<p>A benefit of rewriting using MPI is that we can now run our program on an arbitrary number of nodes networked together. Who cares about single processor performance when you can have 400 processors? Future work!</p>
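<p>Scaling out relies on the same neighbor arithmetic that appears in the receives above, such as ((my_y + side - 1) % side) * side + my_x for the top neighbor. Here is a minimal sketch of that arithmetic in one place, assuming ranks are laid out row by row on a side x side grid of processes (rank = my_y * side + my_x, as the gather loop suggests) and that the grid wraps around at the edges. The helper neighbor_rank is hypothetical and does not appear in the original code.</p>

```c
#include <assert.h>

/* Rank of the process at offset (dx, dy) from grid position
   (my_x, my_y) on a side x side process grid with wraparound.
   dx and dy are expected to be -1, 0, or 1. */
int neighbor_rank(int my_x, int my_y, int dx, int dy, int side) {
  assert(side > 0);
  assert(dx >= -1 && dx <= 1 && dy >= -1 && dy <= 1);
  /* Adding side before taking the remainder keeps the operand
     non-negative, so the % operator behaves like a true modulus. */
  int x = (my_x + side + dx) % side;
  int y = (my_y + side + dy) % side;
  return y * side + x;
}
```

<p>For example, on a 3x3 grid the top neighbor of rank 0 is neighbor_rank(0, 0, 0, -1, 3), which wraps around to rank 6.</p>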
<h2 id="conclusion">Conclusion</h2>
<p>Here is a table showing the various times, speedups, and percentages of peak for each code:</p>
<table>
<thead>
<tr>
<th style="text-align: left">Code</th>
<th style="text-align: left">Time (seconds)</th>
<th style="text-align: left">Speedup (over reference.c)</th>
<th style="text-align: left">% of peak</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">reference.c</td>
<td style="text-align: left">268.33</td>
<td style="text-align: left">1.00</td>
<td style="text-align: left">0.05</td>
</tr>
<tr>
<td style="text-align: left">simple.c</td>
<td style="text-align: left">208.35</td>
<td style="text-align: left">1.29</td>
<td style="text-align: left">0.06</td>
</tr>
<tr>
<td style="text-align: left">padded.c</td>
<td style="text-align: left">51.59</td>
<td style="text-align: left">5.20</td>
<td style="text-align: left">0.24</td>
</tr>
<tr>
<td style="text-align: left">sse2.c</td>
<td style="text-align: left">4.62</td>
<td style="text-align: left">58.06</td>
<td style="text-align: left">2.68</td>
</tr>
<tr>
<td style="text-align: left">avx2.c</td>
<td style="text-align: left">3.71</td>
<td style="text-align: left">72.42</td>
<td style="text-align: left">3.34</td>
</tr>
<tr>
<td style="text-align: left">streaming.c</td>
<td style="text-align: left">3.10</td>
<td style="text-align: left">86.44</td>
<td style="text-align: left">3.99</td>
</tr>
<tr>
<td style="text-align: left">omp.c</td>
<td style="text-align: left">0.71</td>
<td style="text-align: left">375.60</td>
<td style="text-align: left">17.34</td>
</tr>
<tr>
<td style="text-align: left">mpi.c</td>
<td style="text-align: left">0.58</td>
<td style="text-align: left">462.30</td>
<td style="text-align: left">21.34</td>
</tr>
</tbody>
</table>
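<p>The speedup column is the reference time divided by the time of each code. A minimal sketch of that calculation follows; the function name is hypothetical.</p>

```c
#include <assert.h>

/* Speedup of a run relative to the reference run, both in seconds. */
double speedup(double reference_seconds, double seconds) {
  assert(reference_seconds > 0.0 && seconds > 0.0);
  return reference_seconds / seconds;
}
```

<p>Recomputing from the rounded times shown in the table gives values that differ slightly in the last digits (about 462.6 for mpi.c against the listed 462.30), presumably because the listed speedups were computed from unrounded timings.</p>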
<p>Hopefully you had as much fun reading this code as I did writing it. This writeup is heavy on code and light on analysis, but I hope you enjoyed following me on my journey through life.</p>
<h2 id="copyright">Copyright</h2>
<p>Copyright (c) 2016, Los Alamos National Security, LLC</p>
<p>All rights reserved.</p>
<p>Copyright 2016. Los Alamos National Security, LLC. This software was produced under U.S. Government contract DE-AC52-06NA25396 for Los Alamos National Laboratory (LANL), which is operated by Los Alamos National Security, LLC for the U.S. Department of Energy. The U.S. Government has rights to use, reproduce, and distribute this software. NEITHER THE GOVERNMENT NOR LOS ALAMOS NATIONAL SECURITY, LLC MAKES ANY WARRANTY, EXPRESS OR IMPLIED, OR ASSUMES ANY LIABILITY FOR THE USE OF THIS SOFTWARE. If software is modified to produce derivative works, such modified software should be clearly marked, so as not to confuse it with the version available from LANL.</p>
<p>Additionally, redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:</p>
<ol>
<li>Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.</li>
<li>Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.</li>
<li>Neither the name of Los Alamos National Security, LLC, Los Alamos National Laboratory, LANL, the U.S. Government, nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.</li>
</ol>
<p>THIS SOFTWARE IS PROVIDED BY LOS ALAMOS NATIONAL SECURITY, LLC AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL LOS ALAMOS NATIONAL SECURITY, LLC OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.</p>Willow AhrensAn example of multiprocessor optimization.