
Buy the (beta) ebook!?

Hello, brave and true reader!

Someone recently informed me that it's not completely crazypants to charge a sustainable amount for high quality programming content. This was news to me, but I thought, hey, why not give it a shot?

The First Part's Free

For this book on parallel programming, I'm releasing the Introduction and Part 1 for free (I guess I haven't really absorbed the lesson?). Part 1 is a practical tutorial that will teach you how to use reducers; I want every Clojurist to be able to use this useful tool. If you want Parts 2 and 3, you'll need to buy the (beta) ebook.

The ebook is where the awesome is

One of the reasons you love Clojure is that it makes advanced (but relevant) programming concepts and techniques accessible. It is mentally stimulating and fun and actually useful. In Parts 2 and 3, you will explore exciting new programming vistas. Recall how you felt learning about Clojure's state model, or learning about programming with pure functions. You'll get that sense of wow! and holy schnitzel, that's amazing! in the paid parts of the book.

Parallel programming has become more and more relevant because of the inexorable lurch toward multi-core processors. You'll learn about parallel programming concepts and techniques in Part 2, adding an invaluable tool to your mental toolkit. These ideas are universal; you can apply them outside Clojure.

In Part 3 (unreleased; still in progress), you'll look at how the reducers library is implemented, and in the process explore some grade-A functional programming voodoo. You'll also see how protocols and reification in Clojure can be put to heart-breakingly elegant use. Once I release Part 3, I'll send you an email and you'll be able to download an update. I plan on finishing the book by October 2017. The book is currently about 68 pages, and I plan on adding 50-70 more.

Writing this stuff is a ton of work, and if you like what you read, want to learn more, and want to help me finish this book, then please purchase an ebook. When Gumroad sends me the email notifying me of your purchase, I'll show my thanks by printing out your email address and drawing a heart around it. Thank you for your support!

— Daniel Higginbotham, programmer, author, and heart drawer

(This Article is Barely About) Reducers

Have you ever done multiple things at the same time? Don’t be silly, of course you have. You’ve made a sandwich while guiltily watching The Real Housewives of Orange County or texted while driving or fantasized about which yacht to buy with your vast book proceeds while vacuuming. (I don’t know where I got these examples. They just came to my mind for no reason whatsoever.)

Point is, life is full of doing multiple things at once. Until recently, though, we programmers haven’t had to deal with this unpleasant fact while programming. Alas, the halcyon days of purely serial (non-concurrent, single-threaded) code are over. It’s time to adapt to the new reality, a reality where you have to know how to write code for multiple processors if you want your programs to have acceptable performance.

In Clojure for the Brave and True I wrote about the state-management difficulties that you can run into when doing concurrent programming, and how Clojure can help you deal with them. But that’s only half the story. If you really want to master the art of doing multiple things at once, you need to understand parallelism.

And hey, guess what! It just so happens that this article is about parallelism! In the pages (screens? not-yet-scrolled-to portion?) ahead, you’ll learn about Clojure’s core.reducers library, a great option for doing parallel computation. Whereas clojure.core provides only pmap to parallelize mapping, I’ll show you how to use reducers to parallelize take, filter, and more.

My goal, though, is for you to understand the reducers library completely; if I only show you how to use reducers, I’ll have failed you as an author and a gentleman. The world of parallel programming is fascinating, and this article will take you on a thorough tour of it so you’ll understand how the reducers library fits into the broader computing landscape. You’ll understand not just what the reducers library does, but why and how.

To get there, I’ll start you off with a tutorial that will give you a good practical understanding of the library. You’ll learn about what the functions do, when to use them, and unexpected behavior to watch out for. I’ll also compare reducer performance to that of their serial counterparts.

This will give you something to hang your hat on: a good, concrete reference point for making sense of the more abstract discussion that follows. Plus, it will inspire the kind of pulse-pounding edge-of-your-seat suspense that you haven’t felt since Lost went off the air as your brain scrambles to answer the question But how do reducers do that? (Probably.) And if you somehow manage to stop reading after the tutorial, you’ll still have learned enough to improve your code.

After the tutorial you’ll jump into the more conceptual portion by learning all about parallel performance. You’ll learn more about why it matters, along with some general performance strategies. Next, you’ll dig deep into data parallelism, where you’ll learn about the work-span model, one of the theoretical models used to reason about parallel performance. It’s not all theory, though; you’ll also learn about practical approaches to writing parallel programs. I’ll discuss how you can balance minimizing overhead against keeping the workload evenly distributed, using thread management, granularity, parallel slack, tiling, and fusion. You’ll learn about the executors in Java’s java.util.concurrent package, which Clojure uses extensively.

All of this fun prep work will have you primed to understand the fork/join framework. You’ll learn how fork/join implements many of the techniques mentioned above and adds a couple more to the mix, including recursive decomposition and work stealing.

And then we’ll be ready to circle back to the reducers library. We’ll revisit the reducers examples and add a few more, and we’ll peek at the Clojure implementation. After going through the implementation code, your brain, which is a parallel processing machine, will fully understand Clojure’s parallel processing library. You will enjoy a moment of smug assurance that your brain’s capabilities vastly exceed the computer’s, until a nagging doubt worms its way into your consciousness: But for how long?

Sounds super fun! Let’s get started!

Know Your Reducers

In one episode of Man vs. Wild, Will Ferrell, Hollywood actor and comedian, accompanies Bear Grylls, professional survivalist, deep inside the Arctic Circle. The very first thing Grylls makes Ferrell do is rappel from a helicopter hovering 150 feet above the ground. As Ferrell makes his descent, he has only one word for the camera: "Mommy!"

That’s about how I felt when I first tried to learn about reducers: plunked down far from home in a completely foreign and uncomfortable landscape. Let’s avoid creating that feeling in you by starting with something familiar: seq functions.

Reducers vs. Seq Functions

The reducers library (in the clojure.core.reducers namespace) has alternative implementations of map, filter, and other seq functions. These alternative functions are called reducers, and you can apply almost everything you know about seq functions to reducers. Just like seq functions, the purpose of reducers is to transform collections. For example, if you wanted to increment every number in a vector, filter out the even ones, and sum the numbers, the seq and reducer versions look virtually identical:

[[seq functions and reducers are identical in many ways]]

---
;; seq version
(->> (range 10)
     (map inc)
     (filter even?)
     (reduce +))
;=> 30

;; reducer version
(require '[clojure.core.reducers :as r])
(->> (range 10)
     (r/map inc)
     (r/filter even?)
     (r/reduce +))
;=> 30
---

In both examples, (range 10) returns a seq of the numbers 0 through 9. In the first example, map is a seq function. In the second, r/map is a reducer. In both examples, map has the same purpose: transforming an input collection to an output by applying a function to each element of the input. Reducers are just another means of transforming collections, and for the most part they’re used just like the seq functions you already know and love.

One Difference between Reducers and Seq Functions

One way that reducers are not like seq functions is that you can’t call seq functions like first on their return values: something like (first (r/map inc [1 2])) won’t work. You’ll learn why soon.

So why bother with them? Reducers are designed to perform better (often dramatically better) than seq functions under some circumstances by performing eager computation, eliminating intermediate collections and allowing parallelism.

In fact, as far as I can tell, parallelization is the main reason Rich Hickey, Lambda Whisperer, added reducers to the language. The goal of executing collection transformations in parallel completely determined how reducers are designed, and even explains why we use the term reducers: reduce operations can be trivially parallelized (you’ll learn how by the end), and every collection transformation can be expressed in terms of reduce (there’s a big juicy math paper on that).

Thus, the reducers library is built around using reduce to transform collections, and it’s designed this way to make parallelization possible. Both eager computation and the elimination of intermediate collections are secondary - but useful - consequences of this design. (By the way: "elimination of intermediate collections" is a freaking mouthful, so I’m going to start using EIC).

Let’s look at these three strategies more closely. After that, I’ll lay out the rules for how to ensure that your code makes use of each of them, without going into much detail about why they work as they do. Last, I’ll give some examples of reducers at work, showing how their performance compares to the alternatives.

The Strategies

Understanding the reducers library’s three performance strategies - eager computation, EIC, and parallelism - will give you the rationale behind reducers, and that will help you understand their implementation later on. Plus, these ideas are pretty groovy in their own right. I know you’re eager to learn about eager computation, so let’s go!

Eager Computation

Eager computation stands in contrast to Clojure’s core sequence functions, which produce and consume lazy sequences. You may remember that lazy seqs don’t compute their members until you try to access them. (For a refresher on lazy sequences, check out the lazy seqs section in CFTBAT.) This is often a good performance strategy because it saves your program from doing unnecessary computations, but in cases where you know you have to realize (compute the elements of) the entire seq, laziness actually introduces some unnecessary overhead. Reducers perform eager computation: they always compute every member of a collection, and that can improve performance slightly.
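To see the difference, here’s a minimal sketch; the println is a stand-in for real work, and note that Clojure’s chunked seqs may realize more than one element at a time:

---
;; lazy: calling map computes nothing yet
(def lazy-result (map #(do (println "computing" %) (inc %)) (range 3)))

;; work happens only when you access the result
(first lazy-result)
; computing 0
; computing 1
; computing 2
; => 1

;; eager: the reducer version computes every element in one pass,
;; with no lazy-seq bookkeeping, as soon as you reduce
(r/reduce + (r/map #(do (println "computing" %) (inc %)) (range 3)))
; computing 0
; computing 1
; computing 2
; => 6
---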

Because reducers are eager, you shouldn’t use them with infinite sequences unless you want something useless to happen. The reducer would try to realize the entire sequence, which isn’t possible, and consume all available memory in the process.

This isn’t to say that reducers can’t work on lazy seqs. They totally can! They just fail if the lazy seq is infinite.

(Story time! When my brother and I were teenagers, we honed our technique in using "The Lazy", which was like the opposite of The Force from Star Wars. In the same way that The Force had a vaguely Eastern grounding in chi or whatever, The Lazy had a vaguely Eastern grounding in the tao or whatever; it was the art of doing by not doing. For example, I was ostensibly supposed to do my own laundry, but I found that if I left it by the washing machine long enough then my mom would get disgusted and just do it.

It didn’t always work. Once, I had been writing in bed. I don’t remember what (probably terrible poetry or some other teenagery dreck), but I had been balling up my paper and throwing it at my waste basket, and missing. My brother came into my room and I innocently asked him, "Hey, could you pick up that piece of paper?" and he did. Then, "Could you put it in that trash can?" And he almost did. He leaned over to drop it, then realized he was being Lazied and threw it at my face instead.)

Eliminating Intermediate Collections

Reducers don’t produce intermediate collections. An intermediate collection is a collection produced by one sequence transforming function and passed as an argument to another sequence transforming function. Let’s look at a kind of stupid example:

---
(->> (list "Shake" "Bake")         ; (1)
     (map #(str % " it off"))        ; (2)
     (map clojure.string/lower-case) ; (3)
     (into []))                      ; (4)
---

This code just takes a list of words <1> and performs two map operations. First, it appends " it off" to each string <2>, then it lowercases each one <3>. Finally, it puts the result in a vector <4>.

Silly code that would never really exist. What we care about here is that the map at <2> constructs a new collection (a lazy seq, as it happens), and so does the map at <3>. Each of these intermediate collections takes resources to construct and traverse. How inefficient! Haters are going to hate on this inefficiency, but luckily you can shake it off with reducers. (Note to self: is there some way to make this attempt at a pun even more forced and cringeworthy? Another note to self: What am I doing quoting Taylor Swift yet again? Is this some desperate bid to stave off an unconscious fear of growing older? For god’s sake, man! You’re only 30! Get a hold of yourself!)

Looking at this code you can deduce that it would be more efficient to just combine the two string functions into one:

[[fusing two elemental functions]]

---
(->> (list "Shake" "Bake")
     (map (comp clojure.string/lower-case #(str % " it off")))
     (into []))
---

This eliminates one of your intermediate collections, and makes your code faster. In parallel programming terminology, you’d say that you fused two elemental functions into one. An elemental function is just a normal function, like lower-case, but we give it the qualifier elemental to communicate that we’re talking about it in the context of collection transformation; it’s applied to each element of the collection. Fusion is the composition of elemental functions to avoid producing intermediate collections.

It would be cumbersome to have to rewrite your code to achieve this kind of fusion, though. Ideally, you want your collection transformations to compose into one fused function without any manual intervention on your part. You want to be able to write idiomatic Clojure code (like the following) and have it "just work":

[[you dream of a paradise where this composes through fusion]]

---
(->> (range 1000)
     (r/filter even?)
     (r/map inc)
     (into []))
---

And hey, guess what! Reducers just work! They’re actually designed to compose into a single fused elemental function when you chain them like in the example above, without any extra work on your part. In the above example, it’s as if you’re telling Clojure, "Take a seq of the numbers 0 through 999. For every element of that seq, apply a function that does three things: keeps the element only if it’s even, increments it if it was kept, and then places the result in a vector."

In a few minutes, you’ll see just how much this can improve performance. But first, let’s look at the final performance strategy that reducers enable: parallelism.

TODO add note about transducers, include this:

---
(into [] (comp (filter even?) (map inc)) (range 1000000))
---

Parallelism

Parallelism is the simultaneous execution of tasks on two or more processors.

Task is a nebulous word, so let’s define it. I use task to refer to function calls, but from the perspective of the processor instead of the programmer; execution instead of semantics. That’s still nebulous, so let’s look at a concrete example.

Let’s say we’re creating a new industry-disrupting app, iFacePalm. To use it, you position your phone so that its camera faces your palm and takes a picture. Then it predicts your future. Fortune telling industry: disrupted!!!

Now, you may not know this, but the fortune telling industry has no quality control standards. There is no Six Sigma of fortune telling. I mean gosh, sometimes it seems like these people are just making things up! We can do better. We can develop an accurate palm-based human life prediction model. But to do that, we’re going to need to process a shit-ton of data, comparing people’s palm stats with their entire life history to reveal the connections that without a doubt exist.

Here’s some Clojure code we could write that would get us started:

[[A bright beginning to a game-changing app]]

---
(defn lifeline-and-years-lived
  "Given a human subject, return vector of lifeline ratio and years
  person lived"
  [{:keys [palm-stats life-history] :as subject}]
  [(:lifeline-ratio palm-stats) (:years-lived life-history)])
---

This function extracts a person’s lifeline ratio (length of lifeline compared to hand size) and age at death, and puts the two in a vector. Here’s how you’d use it:

[[A bright beginning to a game-changing app]]

---
(lifeline-and-years-lived {:palm-stats   {:lifeline-ratio 0.5}
                           :life-history {:years-lived 75}})
---

From the programmer’s perspective, we’d call this a function call. We care about the return value and how to relate that to other functions according to business needs. We care about the meaning of the function call.

From the execution perspective, we’d call it a task. It’s just some work that needs to get done, and we don’t care about its meaning or how it relates to business needs.

So, if I said to you, "Let’s speed this program up by putting this task on another thread," you’d know that I’m not talking about changing the meaning of the program. I’m just talking about some work that needs to get done, and how to shuffle the work around for better performance.

Now let’s say we wrote something like this:

[[Task is used at any level of granularity]]

---
(map lifeline-and-years-lived subjects)
---

In the same way that this is one function call, map, which results in many applications of lifeline-and-years-lived, it’s one task that results in the execution of many sub-tasks. You can treat the larger task as a single, black-box task, just as you can treat the function as a black box.

Learn more about Concurrency and Parallelism

For a more detailed explanation of concurrency and parallelism, check out the first two sections of Clojure for the Brave and True’s chapter The Sacred Art of Concurrent and Parallel Programming. The chapter uses the classic Lady Gaga Telephone example to explain the concepts, and describes how they’re implemented on the JVM.

So, parallelism is all about executing tasks simultaneously on multiple processors. Given the right conditions (explained in the next section), reducers will transparently subdivide their "transform a collection" task into subtasks of "transform this partitioned portion of a collection". These subtasks will execute in parallel, and the results are recombined, also transparently. So, for example, if you have this code:

[[parallelism example]]

---
(r/fold + (vec (range 10000000)))
---

Then Clojure will divide your vector of ten million numbers into subvectors of 512 elements, run (reduce + subvector) on those subvectors in parallel, and combine the results.

This divide/parallelize/re-combine process relies on the fork/join framework, and much of the rest of this article builds up to explaining how it works.

First, though, let’s look at how to actually use the reducers library and get a real sense of the performance bounty it yields.

How to Use Reducers

You’ve seen how, for the most part, you can use reducers just like seq functions. For example, this produces the same result as the seq counterpart:

[[quick reducer example]]

---
(require '[clojure.core.reducers :as r])
(->> [1 2 3 4]
     (r/filter even?)
     (r/map inc)
     (r/reduce +))
; => 8
---

There are a couple of rules to keep in mind, though. First, if you want to make use of parallelism, you have to use the r/fold function with a foldable collection. r/fold is a parallel implementation of reduce.

I need to explain parallelism a bit more before explaining what foldable means, but for now the relevant fact is that vectors and maps are the only foldable collections. So if you wrote code like this, you might expect it to run in parallel, but it wouldn’t:

[[only foldable collections can be parallelized]]

---
(require '[clojure.core.reducers :as r])
(->> '(1 2 3 4)
     (r/filter even?)
     (r/map inc)
     (r/fold +))
; => 8
---

Lists aren’t foldable, so reducers can’t operate on them in parallel. Instead, the code falls back to serial reduce.

The second rule is that reducers actually don’t produce any collections. It’s awesome that r/map and r/filter compose without creating intermediate collections, but the catch is that they behave a little differently from seq functions. Here’s one way you could get tripped up:

[[sad broken code]]

---
(first (r/map inc [1 2 3]))
; => java.lang.IllegalArgumentException: Don't know how to create ISeq from: clojure.core.reducers$folder$reify__17186
---

This happens because r/map and the other reducers don’t actually return a new collection. Instead, they return a reducible.

A reducible is like a recipe for how to produce a new collection, along with a reference to a source collection. In the above example, the recipe is "produce a new collection by incrementing each element of the source collection", and the source collection is the vector [1 2 3].

Another way to put it: a reducible is an elemental function, along with the collection whose elements you want to apply the function to.

And yet another way to put it: It’s as if you were one of Santa’s worker elves (or some other mythical creature of labor), and the foreman handed you a piece of paper with your instructions for the day: "Go to the plastic shed out back, get all the plastic lumps and turn them into toy whatevers to feed to the insatiable, gaping maw of consumerism."

If I then walked up to you and said, "give me the first of that piece of paper", you would look at me in confusion, because the request wouldn’t make any sense. The piece of paper is not a collection that you can take the first of. Similarly, when you tell Clojure (first (r/map inc [1 2 3])) it expresses its confusion with an exception because the request doesn’t make sense.

You might initially think you’re telling Clojure "give me the first element of the collection returned by r/map", but you’re actually saying "give me the first element of the reducible returned by r/map". A reducible is a recipe for how to transform a given collection, but it’s not a collection itself, so Clojure can’t fulfill your request.

This design decision to have functions like r/map return a reducible instead of a collection is what allows reducers to seamlessly compose, building up a single fused elemental function. I’ll explain exactly how this happens later in the article; for now I’ll rely on the power of metaphor so that you have a strong intuitive understanding of how it works.

But you need to get collections somehow. If you want to use reducer functions to produce another collection, you have to explicitly realize the result collection by calling r/foldcat or clojure.core/into. r/foldcat calls r/fold internally, and into calls reduce. Here are two examples:

[[successful reducer collection realization]]

---
(->> [1 2 3 4]
     (r/map inc)
     (into [])
     (first))
; => 2

(->> [1 2 3 4]
     (r/map inc)
     (r/foldcat)
     (first))
; => 2
---

You should use into when you want to explicitly specify the collection type. In the example above, you’re realizing the collection as a vector. into is serial.

r/foldcat executes parallelly (is that a word?) and returns a collection that acts pretty much like a vector. The collection is foldable, so you can use it with reducers in further parallel computations. It’s also seqable, like a vector, so you can call functions like first on it.

To summarize the rules:

  1. If you want to parallelize your reduction, use r/fold and a foldable collection (a vector or map).

  2. If you want your transformations to return a collection, use into or r/foldcat. into executes serially and lets you specify the collection type, while r/foldcat is parallel and returns a vector-like collection.
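As a quick sketch of rule 2, into lets you realize the result as whatever collection type you like - a set, for instance:

---
(->> [1 2 3 4]
     (r/map inc)
     (into #{}))
; => #{2 3 4 5}
---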

Soooo now that you’ve spent all this time learning about reducers' performance strategies and how to use them, let’s wrap everything together by comparing reducer performance to seq functions. If I’ve done my job right, you won’t be surprised by the differences. If I haven’t done my job right, then, well…​ sorry?

A Peek at Performance

Reducer performance comes down to two variables:

  1. Whether you’re producing intermediate collections

  2. Whether you’re using r/fold with foldable collections

To see the performance impact of these variables, we’re going to look at similar computations performed using r/fold (for parallelism and EIC), r/reduce (for EIC alone), and clojure.core/reduce as a baseline. We’ll also perform the computations on foldable collections (vectors) and on unfoldable collections (lazy seqs).

You can find the code for this section at …​ After doing the performance measurements for a while, I wanted to make the process a bit less tedious, so I wrote the macro times and its helper functions pretty-time and runs-and-pauses to abstract the process of running a snippet multiple times, timing each run, averaging those time values, and usefully printing the result. This lets you do something like this:

[[times macro example]]

---
(let [x (vec (doall (range 1000000)))]
  (times (r/fold + x)
         (r/reduce + x)
         (reduce + x)))
"    6.62 (r/fold + x)"
"   20.94 (r/reduce + x)"
"   16.05 (reduce + x)"
---

The code for times is the first thing to greet you, but I’m not going to go over it here, for fear that you would (justifiably) shake your fist at me from behind your monitor for veering off topic.

Here are our first comparisons:

[[Comparing performance for computationally simple ops]]

---
;;----- Simple operations

(def snums (range 10000000))
(def snumsv (vec snums))

(defn t1 [x]
  (times (r/fold + x)
         (r/reduce + x)
         (reduce + x)))

(defn t2 [x]
  (times (->> x (r/map inc) (r/fold +))
         (->> x (r/map inc) (r/reduce +))
         (->> x (map inc) (reduce +))))

(defn t3 [x]
  (times (->> x (r/map inc) (r/filter even?) (r/fold +))
         (->> x (r/map inc) (r/filter even?) (r/reduce +))
         (->> x (map inc) (filter even?) (reduce +))))
---

First, we define two collections of the numbers 0 through 9,999,999: snums is an unfoldable seq and snumsv is a foldable vector.

Next, we define three functions, t1, t2, and t3. Each function takes a collection (we’ll use snums and snumsv), transforms it using r/fold, r/reduce, and reduce, and prints out how much time each transformation takes. Enough with the yik yak, let’s actually use them:

[[timing a simple reduce]]

---
(t1 snums)
; => "  127.28 (r/fold + x)"
; => "  131.67 (r/reduce + x)"
; => "  119.67 (reduce + x)"

(t1 snumsv)
; => "   59.15 (r/fold + x)"
; => "  164.32 (r/reduce + x)"
; => "  148.15 (reduce + x)"
---

When we call t1 with snums, the three reduction functions have roughly the same performance. Put another way: we don’t get any performance benefits from EIC or parallelism. This makes sense because we’re not doing a map, filter, or some other function that semantically returns a new collection. We’re also using the unfoldable snums, and in that case r/fold quietly degrades to r/reduce.

When we call t1 with snumsv, though, we see a significant speedup from r/fold. My computer has four cores, so ideally r/fold would take 25% as much time as the serial versions, but it’s rare to meet that ideal because parallelization has overhead costs.

Let’s see what happens when we have an intermediate transformation (map):

[[reduce with an intermediate transformation]]

---
(t2 snums)
; => "  166.43 (->> x (r/map inc) (r/fold +))"
; => "  169.32 (->> x (r/map inc) (r/reduce +))"
; => "  322.66 (->> x (map inc) (reduce +))"
---

Dayumn! The reducer versions take nearly fifty percent less time! There’s no difference between r/fold and r/reduce, though, because we’re not using a foldable collection. Here’s what happens when you change that:

[[reduce with an intermediate collection and foldable source]]

---
(t2 snumsv)
"   74.45 (->> x (r/map inc) (r/fold +))"
"  214.04 (->> x (r/map inc) (r/reduce +))"
"  374.91 (->> x (map inc) (reduce +))"
---

Here, r/fold is five times faster than reduce, thanks to EIC and parallelism. (Interestingly, it seems serial reduce takes a bit longer with vectors.)

What happens if you have two intermediate collections? This happens:

[[reduce with two intermediate collections]]

---
(t3 snums)
"  184.07 (->> x (r/map inc) (r/filter even?) (r/fold +))"
"  181.03 (->> x (r/map inc) (r/filter even?) (r/reduce +))"
"  431.53 (->> x (map inc) (filter even?) (reduce +))"

(t3 snumsv)
"   71.96 (->> x (r/map inc) (r/filter even?) (r/fold +))"
"  207.47 (->> x (r/map inc) (r/filter even?) (r/reduce +))"
"  478.25 (->> x (map inc) (filter even?) (reduce +))"
---

When you call r/fold with a vector, it’s almost seven times faster than the serial version. Better yet, reducer performance doesn’t degrade as you pile on intermediate transformations the way seq function performance does; the extra r/filter has no noticeable effect. A computationally intensive transformation would slow things down, of course, but the cost would come from the transformation itself, not from constructing an intermediate collection.

Performance & Parallelism

TODO

The Three Performance Walls

The reason we need to care about concurrent and parallel programming techniques is that computer hardware manufacturers have run into three fundamental limitations, imposed by physics, that won’t be overcome any time soon — if ever. Because of these limitations, we can no longer …​ The limitations are known as:

  • The Power Wall

  • The Memory Wall

  • The Instruction-Level Parallelism Wall

The Power Wall is a limitation on CPU clock speeds. You’ve probably noticed that clock speeds have barely inched forward over the last decade, compared to the rapid progress of previous decades, when clock speeds kept pace with Moore’s law, roughly doubling every eighteen months. The reason for this near halt in progress is that chip designs have reached a point where increasing clock speed results in exponential increases in power consumption and heat, and no one wants to buy a computer that costs as much to run as a UNIVAC.

Even if clock speeds could be increased, the hardware would still have to contend with the Memory Wall, which is the extreme disparity between memory access time and CPU performance — CPUs can process instructions much faster than they can fetch them from main memory. Increasing clock speed would be like TODO analogy.

TODO mention this is why we need explicit parallel techniques TODO that the code we write is often serial even though it can be considered parallel

The final limitation, the Instruction-Level Parallelism (ILP) Wall, is a limit on the parallelism that can be extracted from serial (non-parallel) instructions. Much of the hullabaloo around parallelism has focused on the fact that we’re stuffing more cores into CPUs, but even old-timey single-core machines have parallel aspects to their architectures: the hardware can automatically run serial instructions in parallel, to an extent.

In an ideal world, hardware would be smart enough to automatically parallelize everything that can be parallelized, but the fact is they can’t, and it looks like there won’t be any significant improvements any time soon.

Because of these three limitations, chip manufacturers have focused on developing multi-core processors instead of increasing clock speed. In order to get the most performance out of these processors, we have to structure our applications differently.

Concurrent and Parallel Programming

TODO explain the "task" abstraction

Concurrent and Parallel programming refer to the tools and techniques you use to program for multiple processors. Concurrency refers to a system’s ability to manage more than one task at a time, while parallelism refers to a system’s ability to execute more than one task at a time. From this perspective, parallelism is a sub-category of concurrency.

Programmers usually use the term concurrency when referring to multiple, independent tasks with access to shared state. For example, TODO example. Parallelism usually refers to decomposing a collection of data into smaller chunks, processing those chunks, and reassembling the results. In this situation, there’s no logical need for shared access to state. Of course, you still have to keep track of all of the different computations.

TODO talk about threading and scheduling

Performance

So far, I’ve been talking about performance without defining it, relying on the shared general sense of performance as the thing we want to improve to the point that users don’t say "This is slow and I hate it." In this section, I’ll break down performance, defining its most relevant aspects. I’ll also describe the high-level strategies we use to improve it.

Performance Aspects

Latency is the amount of time it takes to complete a task, and is what we usually care about most because it has the most direct impact on user experience. One example is network latency, or the amount of time it takes for a packet to reach its destination. If you’re measuring the amount of time it takes to open a file or execute a function or generate a report, those are all latency.

You can measure latency at any level of granularity. For example, if you make web sites you’ve probably measured the total time it takes to load a web page to decide if it needs optimization. At first, you only care about the "load the page" task as a whole. If you discover it’s too slow, then you can drill down to individual network requests to see what’s causing problems. Drilling down further, you might find that your SQL queries are taking a long time because your tables aren’t indexed properly, or something like that.

Most of this article focuses on how to effectively reduce latency with parallel programming.

Throughput is the number of tasks per second that your system can perform. Your web server, for example, might be able to complete 1,000 requests per second.

TODO EXPAND There’s a direct relationship between latency and throughput. Let’s say you’re running the world’s lousiest web server, and it can only handle one request per second. If a thousand people make a request to this server at the same time, then on average it will take 500 seconds to respond to a request.
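Here’s the arithmetic behind that claim, assuming the requests are served one at a time in arrival order (request i finishes after i seconds):

---
;; mean response time across the 1000 queued requests
(let [completion-times (map inc (range 1000))]
  (/ (reduce + completion-times) 1000.0))
; => 500.5
---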

Utilization is the degree to which a resource is used. It has two flavors, capacity-based utilization and time-based utilization. In this article we only care about the time-based flavor, which is a measure of how busy a resource is over a given unit of time. Specifically, we care about CPU utilization, which is the percentage of time that your CPU is doing work divided by some unit of time.

One of the challenges with parallel programming is figuring out how to make efficient use of resources by reducing unnecessary CPU idle time. Later in the article, you’ll learn about techniques that help you do this, including the powerful fork/join framework.

TODO Speedup

General Performance Strategies

There are three concurrent/parallel programming general strategies you can use to help improve performance: latency hiding, functional decomposition, and data parallelism. Guess what’s coming next! That’s right, I’m going to explain those things!

Latency Hiding

TODO betterify definition Latency hiding is a fancy term for something you do all the time. You’re hiding latency whenever you move a task that’s in a waiting state to the background and focus on something else. Examples abound, not just in programming but in real life.

If you use Clojure’s future function to kick off a task in a separate thread so that the main task can continue unimpeded, you’re hiding latency. I’ve used future on web servers to send an email without increasing the overall response time for a user’s request.
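Here’s a minimal sketch of that pattern; send-welcome-email! is a hypothetical stand-in for whatever slow task the user shouldn’t have to wait on:

---
(defn handle-signup
  [user]
  ;; fire off the slow (hypothetical) task on another thread; the
  ;; response returns immediately instead of waiting on the email
  (future (send-welcome-email! user))
  {:status 200 :body "Welcome!"})
---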

Latency hiding is often a cheap and easy way to get quick performance gains. On the other hand, forgetting to employ it can lead to some dire consequences, as this comic illustrates:

TODO image

You probably already use latency hiding all the time, even if you don’t call it that. Though you may be an old hand at it, I think it’s useful to have a name for it and to place it within the larger performance context.

Functional Decomposition

Functional decomposition is the term for when a multicultural group of teenagers combine their powers to summon an avatar of the earth to fight pollution:

TODO image

TODO not just different threads. Different servers. Different spaces/processes.

TODO already used this trope

Cough uh, I mean, functional decomposition is the practice of running logically independent program modules in parallel on separate threads. As it turns out, all Java programs (including Clojure programs) already do this: every Java program has a garbage collector running on a separate thread.

Another common example of functional decomposition is putting long-running tasks on a queue so that a background thread can process them without impeding the main thread. One of my side projects does this: the main thread runs a web server, and the background thread (launched with future) constantly works through a queue of RSS feeds, checking for updates and putting the results in a database.
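A minimal sketch of that queue-plus-background-thread pattern; check-feed! is hypothetical, standing in for the real fetch-and-store work:

---
(import 'java.util.concurrent.LinkedBlockingQueue)

(def feed-queue (LinkedBlockingQueue.))

;; the background worker; .take blocks until a feed is queued
(future
  (loop []
    (check-feed! (.take feed-queue)) ; hypothetical per-feed work
    (recur)))

;; the main thread just enqueues work and moves on
(.put feed-queue "http://example.com/feed.rss")
---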

Functional decomposition will only give you a constant factor speedup. When you split your code base into two modules and run them on separate threads, you don’t get any additional benefits if you increase the number of cores on your machine.

If you squint a little bit, this strategy looks a lot like something you do all the time on a larger scale. You run your web server and database on separate machines. On a single machine, you run logically independent modules as separate processes, also known as programs. Like latency hiding, functional decomposition might be something you’re familiar with; it’s just fun to know words for things and their place in the greater order of the cosmos.

In the next section, I’m going to start venturing into unfamiliar territory. Best grab your mosquito repellant and machete.

Data Parallelism

With the data parallelism strategy, you divide a task into sub-tasks that don’t have side effects and that can be executed simultaneously. A dramatic example of this is seen in the Terminator movies, where the task "destroy humankind" is divided into subtasks of "destroy the humans you encounter" and executed by hunky killing machines with Austrian accents. You can think of it as a kind of parallelized reduce operation.

TODO terminator image

Another oft-used example is the map function. On an abstract level, map derives a new collection from an existing one by applying a function to every element of the original collection. There’s nothing in map's semantics that requires the function applications to happen in any particular order, and there’s no shared state, so it’s a perfect candidate for parallelization. (In the literature, this kind of easily-parallelized operation is called embarrassing parallelism, which absolutely tickles me. "Pardon me! It appears that I’ve been parallelized yet again, and in public no less!")

One hallmark of data parallelism is that it’s scalable. If you increase the amount of work that needs to get done, it’s possible to reduce the amount of time needed to do it by throwing more hardware at it. In the Terminator example, Skynet can eliminate mankind more quickly by producing more terminators. In the map example, you can complete the mapping more quickly by adding more cores.

Part of the fun in learning about data parallelism is discovering how it can be used in cases beyond a simple map. The scan operation, for example, has a data dependency not present with map. Let’s look at how scan should work, then give it a definition and unpack how it differs from map. There’s no scan function in Clojure, but here’s how it should behave:

[[scan behavior]]

---
(scan [1 2 3 4])
; => (1 3 6 10)

(scan [1 0 0])
; => (1 1 1)
---

Scan works by "rolling up" values as it traverses a sequence such that each element of the resulting sequence is derived from the previous elements. Let’s call the initial sequence x and the result y. In the first example, x1 and y1 are identical. y2 is the result of adding x1 and x2. y3 is x1 + x2 + x3, and so on. You can see why scan is also known as "cumulative sum" - each element of the result is the sum of all elements of the original sequence up to that point.

The reason why it’s not obvious how to parallelize this is that each function application depends on the result of the previous application. This becomes obvious when you look at a naive implementation:

[[naive scan implementation]]

---
(defn scan
  [[x y & xs]]
  (if y
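    ;; each step threads the running total (+ x y) into the next call,
    ;; so no application can start before the previous one finishes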
    (cons x (scan (cons (+ x y) xs)))
    (list x)))
---

This implementation is completely serial, with no hope of running in parallel. Don’t worry, though - you’re going to learn how to accomplish this parallelization.

TODO explain why there’s no hope. TODO "in a concurrent universe" is a little weak

Now that you understand how data parallelism compares to the other main strategies for achieving performance in a concurrent universe, let’s really dig into it so that you can understand it completely.

Data Parallelism

At this point you probably have a vague intuition about how you might be able to speed up your programs by using data parallelism. In this section, you’re going to refine your understanding by learning about the work-span model, the premier theoretical model for understanding and predicting parallel performance. You’re also going to learn about the concrete implementation concerns you’ll encounter when trying to write parallel code.

Work-Span Model

  • TODO kind of a jump, now we’re suddenly talkng about algorithms

  • TODO mention that it assumes an ideal machine with infinite processors

The work-span model is concerned with two aspects of a parallel algorithm: its work and its span. Work is the total number of tasks that need to be completed, and span is the length of the longest path of work that has to be done serially. Take a look at this diagram:

In the example on the left, the work is 9 because there are 9 tasks that need to be completed, and the span is 5 because there are 5 tasks that have to be performed serially.

The span describes the upper limit to the amount of speedup you can expect; past that, no amount of additional hardware will help your algorithm run faster.
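Put numerically: if the work is the time an algorithm takes on one processor and the span is the time it takes on infinitely many, speedup can never exceed work divided by span. For the example above:

---
;; maximum possible speedup = work / span
(/ 9 5.0)
; => 1.8
---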

The work-span model reveals an important difference between serial and parallel algorithms. With serial algorithms, performance is determined by the total amount of work that needs to be done; you improve performance by reducing work. By contrast, parallel performance is determined by the span. In fact, with some algorithms you can actually improve performance by performing more work. You can see this in the diagram for scan:

In the serial version of scan, there are 7 tasks. In the parallel version, there are 11, but the span is only 3, so the parallel version would complete before the serial version.

Implementation

TODO explain "potential parallelism" The work-span model assumes an ideal machine where there are no costs involved in running tasks in parallel. If you tried to write your Clojure program as if there were no penalty for creating threads, then you’d quickly run into trouble. From Java Concurrency in Practice:

Thread creation and teardown are not free. The actual overhead varies across platforms, but thread creation takes time, introducing latency into request processing, and requires some processing activity by the JVM and OS. If requests are frequent and lightweight, as in most server applications, creating a new thread for each request can consume significant computing resources.

— Brian Goetz

TODO mention that abundant threads leads to slowdown

TODO metaphor time, maybe group project in school Because threads are expensive, you might say to yourself, "No problem! I’ll just create one thread per core and divide the work among them." But that can cause load balancing problems. Imagine that you have four cores, and a map operation divided among threads A, B, C, and D. Now imagine that for some reason thread A is taking a significant amount of time on one of the map function applications. Threads B, C, and D just have to wait idly for A to finish.

So, in real programs, you’re concerned with balancing two competing needs: limiting the overhead of thread creation, scheduling, and synchronization on one hand, and keeping the workload evenly distributed on the other.

Some of the approaches available to deal with these concerns are thread management, granularity, parallel slack, and fusion. Let’s look at those.

Thread Management

One of the best ways to avoid the overhead from creating a thread is - prepare to have your mind blown - to never create the thread in the first place. What is this, some kind of zen koan? You didn’t sign up for this!

No, I’m not trying to force your awareness to gain sudden insight into the limitations of rational consciousness. I’m talking about using thread pools to allow thread reuse.

Thread pools are a layer of indirection between tasks and threads. Rather than create a thread for a task directly, you submit your task to a thread pool, and the thread pool handles it. It might create a new thread if none are available, or it might reuse an existing thread.

Thread pools can also enforce a thread limit, in which case your task can be queued. This is useful in avoiding the problems that arise when the scheduler has to switch between too many threads.

You can learn more about thread pools by investigating java.util.concurrent executors.

Thread pools are the most common implementation of executors.
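Here’s a minimal sketch of using an executor directly; Clojure functions implement Callable, so you can submit them as-is:

---
(import '[java.util.concurrent Callable ExecutorService Executors])

;; a pool of four threads, reused across every task you submit
(def pool (Executors/newFixedThreadPool 4))

;; the type hints pick the Callable overload of .submit
(def task (.submit ^ExecutorService pool
                   ^Callable (fn [] (reduce + (range 1000)))))

@task ; => 499500; the returned java.util.concurrent.Future supports deref

(.shutdown pool) ; let the pool's threads exit once they're idle
---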

TODO give a little introduction to executors

Granularity

Parallel programming involves decomposing a task into subtasks and executing those in parallel. We use the term granularity to refer to the size of the subtasks. If the granularity is too small, you risk losing any parallelization gains to overhead; if it’s too large, you risk load balancing problems.

If your grain size is too small, you can combine subtasks into larger tasks, a technique called tiling. The subtasks run serially within the larger tasks, and the larger tasks are run in parallel. This helps you reduce the ratio of time spent on parallel overhead.
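Clojure’s r/fold exposes grain size directly: its optional first argument n (512 by default) is the approximate chunk size below which it stops subdividing and reduces serially. A quick sketch:

---
(def numsv (vec (range 10000000)))

;; subdivide until chunks are ~8192 elements, then reduce each tile
;; serially; bigger tiles mean proportionally less parallel overhead
(r/fold 8192 + + numsv)
; => 49999995000000
---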

Parallel Slack

You want to overdecompose your subtasks so that your program can continue reaping performance benefits if it’s run on more cores.

You want algorithms or language support for doing this automatically. You can probably see that the amount of parallel slack is directly related to your grain size.

Fusion

Fusion refers to the process of combining multiple transforming functions into one function, allowing you to loop over a collection just once instead of having to loop over collections once per transformation.

Fork/Join

Now you’re ready to learn about one of the most versatile parallel programming strategies, fork/join. It employs tiling, parallel slack, and fusion, and the Java version handles thread management. Fork/join also adds two new strategies to the mix: recursive decomposition and work stealing.

Basics

The fork/join strategy works by recursively breaking a task down into subtasks until some base condition is met, adding those subtasks to a queue, and then combining the subtasks’ results. Fork/join refers to this process of splitting and combining.

You want to include all transformations in the base case, performing fusion. Tiling is handled through recursive decomposition.
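To make that shape concrete, here’s a minimal sketch of recursive decomposition written directly against the fork/join classes - not how the reducers library itself is implemented, but the same basic pattern:

---
(import '[java.util.concurrent ForkJoinPool RecursiveTask])

(defn sum-task
  [v]
  (proxy [RecursiveTask] []
    (compute []
      (if (<= (count v) 512)
        (reduce + v) ; base case: the tile is small, reduce it serially
        (let [half  (quot (count v) 2)
              left  (sum-task (subvec v 0 half))
              right (sum-task (subvec v half))]
          (.fork left)           ; queue the left half for other workers
          (+ (.invoke right)     ; compute the right half ourselves
             (.join left)))))))  ; wait for (or help finish) the left half

(.invoke (ForkJoinPool/commonPool) (sum-task (vec (range 100000))))
; => 4999950000
---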

Work stealing is interesting: the fork/join framework employs a double-ended queue (deque) per worker, ordered roughly by task complexity. Each worker thread pulls from the least-complex end of its own deque. If a worker runs out of work, it "steals" from the most-complex end of another worker’s deque.

Reducers

Reducers use the fork/join framework. Here’s some of the most important code to show you what’s happening.

Clojure’s reducers library manages the splitting of your tasks.

Other options

  • Claypoole

  • Tesser

  • Manually interacting with fork/join

References

  • Systems Performance: Enterprise and the Cloud

  • Structured Parallel Programming

  • Ebook on queueing systems

TODO

  • Explain how it differs from laziness

  • No intermediate collections

  • Talk about performance first?

  • TODO lookup where I got my definition of efficiency as how well it makes use of computing resources / soak the cores

  • TODO mention that we’re naming things you already do

  • TODO mention that existing software needs to be able to run faster on new hardware

  • TODO mention that reader should read the first section of concurrency chapter

  • MENTION that ideally, performance improves with hardware improvements

  • Get definite answer about what it means for a collection to reduce itself