
TL;DR

As I promised in my previous post, I've made TLSnappy balance and handle requests a little bit better.

Data flow

To leverage all available CPUs, TLSnappy runs multiple threads, each picking tasks off its own dispatch queue and processing them one by one. Tasks are created from node's event loop in the following cases:

  • Data comes from the client and should be decrypted
  • Data from the server should be encrypted

So, as you can see, each thread receives data from its inputs (either encrypted or clear) and/or emits data to its outputs. This pattern obviously involves a lot of data transfer to and from worker threads, and that data needs to be stored (buffered) somewhere temporarily before it is processed.
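Roughly, the worker side of this pattern looks like the sketch below. This is only an illustration of the idea, not TLSnappy's actual code: the Task, DispatchQueue and WorkerLoop names are made up here, and a real worker would block on a condition variable instead of spinning.

```cpp
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

typedef std::function<void()> Task;  // e.g. "decrypt this chunk", "encrypt that one"

class DispatchQueue {
 public:
  void Push(Task task) {
    std::lock_guard<std::mutex> guard(mutex_);
    tasks_.push(std::move(task));
  }

  bool Pop(Task* out) {
    std::lock_guard<std::mutex> guard(mutex_);
    if (tasks_.empty()) return false;
    *out = std::move(tasks_.front());
    tasks_.pop();
    return true;
  }

 private:
  std::mutex mutex_;
  std::queue<Task> tasks_;
};

// Each worker keeps draining its own queue, one task at a time.
// Usage (from the event-loop side, hypothetical):
//   DispatchQueue queue;
//   bool running = true;
//   std::thread worker(WorkerLoop, &queue, &running);
//   queue.Push([] { /* decrypt an incoming chunk */ });
void WorkerLoop(DispatchQueue* queue, const bool* running) {
  Task task;
  while (*running) {
    if (queue->Pop(&task)) task();
    // A real implementation would sleep on a condition variable here
    // instead of busy-waiting.
  }
}
```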

To my mind, the best structure to fit these needs is a circular (ring) buffer: it's fast, and it can grow when more data than its current capacity needs to be held.

The naive version of it was good enough to try things out, but it wasn't meant to run in a multi-threaded, concurrent environment: all access to the buffer had to happen inside a critical section, meaning that at any time only one thread could touch the ring's methods or properties. You might think this doesn't make a difference but, according to Amdahl's law, reducing the time spent in the non-parallelizable (sequential) parts of an application matters much more for overall performance than speeding up the parallel parts (with a parallel fraction p and n CPUs the maximum speedup is 1 / ((1 - p) + p / n), so the sequential fraction quickly becomes the bottleneck).
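For contrast, here is roughly what the naive, locked version boils down to. This is a simplified sketch rather than the real implementation: the real buffer grows dynamically, while this one just wraps a fixed-size array, and every operation serializes on one mutex.

```cpp
#include <cstddef>
#include <mutex>

class LockedRing {
 public:
  // Copy up to `size` bytes in; returns how many actually fit.
  size_t Write(const char* data, size_t size) {
    std::lock_guard<std::mutex> guard(mutex_);  // the whole ring is a critical section
    size_t written = 0;
    while (written < size && Used() < kCapacity) {
      buffer_[woffset_++ % kCapacity] = data[written++];
    }
    return written;
  }

  // Copy up to `size` bytes out; returns how many were available.
  size_t Read(char* out, size_t size) {
    std::lock_guard<std::mutex> guard(mutex_);
    size_t read = 0;
    while (read < size && Used() > 0) {
      out[read++] = buffer_[roffset_++ % kCapacity];
    }
    return read;
  }

 private:
  size_t Used() const { return woffset_ - roffset_; }

  enum { kCapacity = 64 * 1024 };
  char buffer_[kCapacity];
  size_t woffset_ = 0;
  size_t roffset_ = 0;
  std::mutex mutex_;
};
```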

Lock-less ring buffer

Removing the locks seemed essential for achieving better performance; however, a special structure is needed to make a ring buffer work correctly across multiple CPUs. Here is the structure I chose:

[Figure: ring buffer structure]

The ring consists of pages forming a circular linked list. Each page has two offsets: a reader offset (roffset) and a writer offset (woffset). There are also two special pages (which may actually be the same page): the reader head (rhead) and the writer head (whead).

Initially the ring contains only one page, which is both rhead and whead. When the producer wants to put data in, it goes to whead, copies the data into the page and increments woffset; if the page is full, it creates a new page, or reuses an old one that doesn't contain any unread data. The consumer takes rhead, reads up to woffset - roffset bytes from it, increments roffset, and moves to the next page once roffset is equal to the size of the page.
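Below is a minimal sketch of that structure, assuming a single producer and a single consumer per ring (presumably the event loop on one side and a worker thread on the other). The names, page size and memory-ordering details are my illustration, not the actual TLSnappy code, and the sketch never frees pages. The point is that the producer only ever writes whead/woffset, the consumer only ever writes rhead/roffset, and the per-page woffset is the only value they have to agree on.

```cpp
#include <atomic>
#include <cstddef>
#include <cstring>

static const size_t kPageSize = 16 * 1024;  // arbitrary for this sketch

// One page of the ring: raw bytes plus the two offsets described above.
struct Page {
  char data[kPageSize];
  std::atomic<size_t> woffset;  // written by the producer, read by the consumer
  size_t roffset;               // touched only by the consumer
  Page* next;

  Page() : woffset(0), roffset(0), next(this) {}
};

class LocklessRing {
 public:
  LocklessRing() {
    Page* page = new Page();  // initially rhead == whead == the only page
    whead_.store(page, std::memory_order_relaxed);
    rhead_.store(page, std::memory_order_relaxed);
  }

  // Producer side: copy `size` bytes in, growing the ring when the next page
  // still holds unread data.
  void Write(const char* data, size_t size) {
    Page* whead = whead_.load(std::memory_order_relaxed);
    while (size > 0) {
      size_t off = whead->woffset.load(std::memory_order_relaxed);
      if (off == kPageSize) {
        if (whead->next == rhead_.load(std::memory_order_acquire)) {
          // The next page is still being read - insert a fresh one instead.
          Page* page = new Page();
          page->next = whead->next;
          whead->next = page;
        }
        whead = whead->next;
        whead_.store(whead, std::memory_order_release);
        continue;
      }
      size_t chunk = size < kPageSize - off ? size : kPageSize - off;
      std::memcpy(whead->data + off, data, chunk);
      // Publish the new bytes to the consumer.
      whead->woffset.store(off + chunk, std::memory_order_release);
      data += chunk;
      size -= chunk;
    }
  }

  // Consumer side: read up to `size` bytes, following pages as they fill up.
  size_t Read(char* out, size_t size) {
    Page* rhead = rhead_.load(std::memory_order_relaxed);
    size_t total = 0;
    while (total < size) {
      size_t woff = rhead->woffset.load(std::memory_order_acquire);
      size_t avail = woff - rhead->roffset;
      if (avail == 0) {
        // Page fully consumed and the producer has moved on - recycle it.
        if (woff == kPageSize && rhead != whead_.load(std::memory_order_acquire)) {
          rhead->woffset.store(0, std::memory_order_relaxed);
          rhead->roffset = 0;
          rhead = rhead->next;
          rhead_.store(rhead, std::memory_order_release);
          continue;
        }
        break;  // nothing more to read right now
      }
      size_t chunk = size - total < avail ? size - total : avail;
      std::memcpy(out + total, rhead->data + rhead->roffset, chunk);
      rhead->roffset += chunk;
      total += chunk;
    }
    return total;
  }

 private:
  std::atomic<Page*> whead_;  // producer's current page
  std::atomic<Page*> rhead_;  // consumer's current page
};
```

Note how the ring only grows when the producer catches up with unread data, which is exactly the "create a new page, or reuse an old one" rule described above.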

So here are the benchmarks:

Without lock-less ring:

Transactions:                 200000 hits
Availability:                 100.00 %
Elapsed time:                  47.90 secs
Data transferred:             585.37 MB
Response time:                  0.02 secs
Transaction rate:            4175.37 trans/sec
Throughput:                    12.22 MB/sec
Concurrency:                   98.79
Successful transactions:      200000
Failed transactions:               0
Longest transaction:            0.09
Shortest transaction:           0.00

With lock-less ring:

Transactions:                 200000 hits
Availability:                 100.00 %
Elapsed time:                  47.37 secs
Data transferred:             585.37 MB
Response time:                  0.02 secs
Transaction rate:            4222.08 trans/sec
Throughput:                    12.36 MB/sec
Concurrency:                   98.83
Successful transactions:      200000
Failed transactions:               0
Longest transaction:            0.12
Shortest transaction:           0.00

As you can see, performance hasn't improved much; the difference is almost within statistical error (which means the results are nearly the same). However, these results are for small 3kb pages; let's try sending some big 100kb buffers.

Without lock-less ring:

Transactions:                 100000 hits
Availability:                 100.00 %
Elapsed time:                  64.06 secs
Data transferred:            9536.74 MB
Response time:                  0.06 secs
Transaction rate:            1561.04 trans/sec
Throughput:                   148.87 MB/sec
Concurrency:                   98.59
Successful transactions:      100000
Failed transactions:               0
Longest transaction:            1.93
Shortest transaction:           0.00

With lock-less ring:

Transactions:                 100000 hits
Availability:                 100.00 %
Elapsed time:                  58.73 secs
Data transferred:            9536.74 MB
Response time:                  0.06 secs
Transaction rate:            1702.71 trans/sec
Throughput:                   162.38 MB/sec
Concurrency:                   98.98
Successful transactions:      100000
Failed transactions:               0
Longest transaction:            0.19
Shortest transaction:           0.00

Wow! That's much better - about 9% performance improvement.

Instruments

Still, TLSnappy's performance wasn't even close to what nginx is capable of (~5100 requests per second). So it was necessary to continue the investigation, and this is where Instruments.app comes into play: it's basically a UI for some very useful dtrace scripts. I ran the CPU Sampler utility, and this is what the call tree looked like:

[Figure: call tree, original node]

Obviously, it spends almost 30% of its time on synchronization between threads, particularly in the CRYPTO_add_lock function:

[Figure: old CRYPTO_add_lock]

So I modified the code to use atomic operations (which are supported by almost every CPU nowadays):

[Figure: new CRYPTO_add_lock]

After the patch, the call tree looked like this:

[Figure: call tree, patched node]
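To give a rough idea of the change (this is just an illustration of the technique, not the actual node/OpenSSL patch), compare taking a shared lock around every reference-count bump with doing the same thing in a single atomic instruction:

```cpp
#include <atomic>
#include <mutex>

// Locked flavour: every refcount change goes through a shared lock, which is
// roughly what the CRYPTO_add_lock() path was doing via the locking callback.
static std::mutex ref_lock;

int add_ref_locked(int* refcount, int amount) {
  std::lock_guard<std::mutex> guard(ref_lock);
  *refcount += amount;
  return *refcount;
}

// Atomic flavour: same effect, no lock, a single instruction on most CPUs.
int add_ref_atomic(std::atomic<int>* refcount, int amount) {
  return refcount->fetch_add(amount, std::memory_order_relaxed) + amount;
}
```

With many threads constantly bumping refcounts on shared SSL objects, the difference between these two shows up exactly where the sampler pointed: in time spent waiting on that lock.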

Results

I've opened a pull request for node.js and sent the same patches to the openssl-dev mailing list. With the patched node and the latest tlsnappy, these are the benchmark results:

[Charts: requests per second; average load]

And here are the results without the patches:

[Charts: requests per second; average load]

A little comment about the curve names here:

  • default - one tlsnappy process with 16 threads
  • hybrid - 4 tlsnappy processes with 4 threads each
  • cluster - 16 tlsnappy processes with 1 thread each
  • http - 16 node.js processes in cluster

The work isn't finished yet, but now I know that OpenSSL doesn't really behave well when used in a multi-threaded application.

Created at: Thu Oct 11 2012