Measuring and (slightly) Improving Post-Quantum Handshake Performance

2024-12-17

To defend against the potential advent of "Cryptographically Relevant Quantum Computers", there is a move towards "hybrid" key exchange algorithms. These glue together a widely-deployed classical algorithm (like X25519) and a new post-quantum-secure algorithm (like ML-KEM), and treat the result as a single TLS-level key exchange algorithm (like X25519MLKEM768).
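
To make the "gluing" concrete, here is a minimal sketch (not rustls's implementation) of how an X25519MLKEM768 client key share and shared secret are formed, assuming the component sizes and ML-KEM-first ordering from the draft that defines X25519MLKEM768. The zeroed arrays are placeholder key material; real code would generate them with a cryptography library.

```rust
// Component sizes for X25519MLKEM768, per the draft that defines it. The
// ML-KEM-768 part comes first in both the key share and the shared secret.
const MLKEM768_ENCAP_KEY_LEN: usize = 1184;
const MLKEM768_SHARED_SECRET_LEN: usize = 32;
const X25519_PUBLIC_KEY_LEN: usize = 32;
const X25519_SHARED_SECRET_LEN: usize = 32;

/// The `key_exchange` field of an X25519MLKEM768 client key share:
/// ML-KEM-768 encapsulation key followed by X25519 public key (1216 bytes).
fn hybrid_client_key_share(
    mlkem_encap_key: &[u8; MLKEM768_ENCAP_KEY_LEN],
    x25519_public: &[u8; X25519_PUBLIC_KEY_LEN],
) -> Vec<u8> {
    let mut share = Vec::with_capacity(MLKEM768_ENCAP_KEY_LEN + X25519_PUBLIC_KEY_LEN);
    share.extend_from_slice(mlkem_encap_key);
    share.extend_from_slice(x25519_public);
    share
}

/// The hybrid shared secret, in the same order: ML-KEM-768 secret followed by
/// X25519 secret (64 bytes). TLS feeds this into its key schedule exactly as
/// it would a classical shared secret.
fn hybrid_shared_secret(
    mlkem_secret: &[u8; MLKEM768_SHARED_SECRET_LEN],
    x25519_secret: &[u8; X25519_SHARED_SECRET_LEN],
) -> Vec<u8> {
    let mut secret = Vec::with_capacity(MLKEM768_SHARED_SECRET_LEN + X25519_SHARED_SECRET_LEN);
    secret.extend_from_slice(mlkem_secret);
    secret.extend_from_slice(x25519_secret);
    secret
}

fn main() {
    // Zeroed placeholder key material; real code would generate these with a
    // cryptography library.
    let share =
        hybrid_client_key_share(&[0u8; MLKEM768_ENCAP_KEY_LEN], &[0u8; X25519_PUBLIC_KEY_LEN]);
    assert_eq!(share.len(), 1216); // versus 32 bytes for a plain X25519 key share

    let secret = hybrid_shared_secret(&[0u8; 32], &[0u8; 32]);
    assert_eq!(secret.len(), 64);
}
```

The hybrid client key share is 1216 bytes, compared with 32 bytes for plain X25519, which is a large part of why the handshake messages grow so much.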

In this report, first we'll measure the additional cost of post-quantum-secure key exchange. Then we'll describe and measure an optimization we have implemented.

Headline measurements

All these measurements are taken on our amd64 benchmarking machine, which has a Xeon E-2386G CPU. We'll compare:

All three are taken on the same hardware, and the latter measurements are from our previous report -- which also contains reproduction instructions and describes what the benchmarks measure.

One important thing to note is that post-quantum key exchange involves sending and receiving much larger messages than classical key exchange. Our benchmark design only covers CPU costs -- it does not include networking -- so real-world performance will be worse than these measurements suggest.

client handshake performance results on amd64 architecture

server handshake performance results on amd64 architecture

The cost of X25519MLKEM768 post-quantum key exchange is clearly visible for both clients and servers.

We can see that the performance headroom rustls has built up means it can almost completely absorb the extra cost of post-quantum key exchange, while still performing better than (post-quantum-insecure) OpenSSL -- with the exception of client resumption.

We will do further comparative benchmarking in this area when OpenSSL gains post-quantum key exchange support.

Sharing X25519 setup costs

Background

In TLS1.3, the client starts the key exchange in its first message (the ClientHello). The ClientHello includes both a description of which algorithms the client supports, and zero or more presumptive "key shares".

The server then evaluates which algorithms it is willing to use, and either uses one of the presumptive key shares, or replies with a HelloRetryRequest which instructs the client to send a new ClientHello with a specific, mutually-acceptable key share.

A HelloRetryRequest can be expensive, because it introduces an additional round trip into the handshake. It also means any work the client did for its presumptive key shares is wasted.
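
As a simplified sketch of that decision (one possible server policy, not rustls's actual code): the server walks its own preference list, uses the first group for which the client already sent a key share, and otherwise falls back to a HelloRetryRequest for the best mutually-supported group. The code point shown for X25519MLKEM768 is the one in use at the time of writing.

```rust
/// TLS NamedGroup code points: X25519 is 0x001d; 0x11ec is the code point
/// used for X25519MLKEM768 at the time of writing.
const X25519: u16 = 0x001d;
const X25519MLKEM768: u16 = 0x11ec;

enum ServerAction {
    /// Use this key share and complete the handshake in one round trip.
    UseKeyShare(u16),
    /// Ask the client to retry with a key share for this group,
    /// at the cost of an extra round trip.
    HelloRetryRequest(u16),
    /// No mutually-acceptable group at all.
    Abort,
}

/// Simplified selection logic: `supported` and `key_shares` come from the
/// ClientHello; `server_preference` is the server's ordered preference list.
fn choose(supported: &[u16], key_shares: &[u16], server_preference: &[u16]) -> ServerAction {
    for group in server_preference {
        if key_shares.contains(group) {
            return ServerAction::UseKeyShare(*group);
        }
    }
    for group in server_preference {
        if supported.contains(group) {
            return ServerAction::HelloRetryRequest(*group);
        }
    }
    ServerAction::Abort
}

fn main() {
    // A client that sent only an X25519MLKEM768 key share, connecting to a
    // server that only supports X25519, forces a HelloRetryRequest.
    let action = choose(
        &[X25519MLKEM768, X25519], // client's supported groups
        &[X25519MLKEM768],         // key shares actually sent
        &[X25519],                 // server's supported groups, in preference order
    );
    assert!(matches!(action, ServerAction::HelloRetryRequest(X25519)));
}
```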

It's therefore advantageous for a client to avoid HelloRetryRequests by offering presumptive key shares for the key exchange algorithms the server is most likely to select.

For a client offering only X25519MLKEM768, the key shares in its ClientHello would look like this:

diagram of TLS1.3 client key exchange with X25519MLKEM768

At least for a transitional period, we want to avoid a HelloRetryRequest round trip when connecting to a server that hasn't been upgraded to support X25519MLKEM768. That means also offering a separate X25519 key share:

diagram of TLS1.3 client key exchange with X25519MLKEM768 and X25519

However, this arrangement is not optimal. While X25519 setup is very fast, we are doing it twice and then we are guaranteed to throw away half of that work, because the server can only ever select one key share to use.

Instead, we can generate a single X25519 key pair and use its public key in both the X25519MLKEM768 key share and the standalone X25519 key share:

diagram of TLS1.3 optimized client key exchange with X25519MLKEM768 and X25519

This report measures the benefit of that optimization.

This optimization is described further in draft-ietf-tls-hybrid-design section 3.2.
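
Here is a minimal sketch of the shared arrangement, under the same assumptions as the earlier snippet (component sizes and code points per the relevant drafts, zeroed placeholder key material): one X25519 public key appears both inside the hybrid share and as the standalone share.

```rust
const MLKEM768_ENCAP_KEY_LEN: usize = 1184;
const X25519_PUBLIC_KEY_LEN: usize = 32;

/// TLS NamedGroup code points: X25519 is 0x001d; 0x11ec is the code point
/// used for X25519MLKEM768 at the time of writing.
const X25519: u16 = 0x001d;
const X25519MLKEM768: u16 = 0x11ec;

/// Build both ClientHello key share entries from *one* X25519 public key: the
/// X25519 setup work is done once, and its result appears twice on the wire,
/// once inside the hybrid share and once as the standalone share.
fn shared_key_shares(
    mlkem_encap_key: &[u8; MLKEM768_ENCAP_KEY_LEN],
    x25519_public: &[u8; X25519_PUBLIC_KEY_LEN],
) -> Vec<(u16, Vec<u8>)> {
    let mut hybrid = Vec::with_capacity(MLKEM768_ENCAP_KEY_LEN + X25519_PUBLIC_KEY_LEN);
    hybrid.extend_from_slice(mlkem_encap_key);
    hybrid.extend_from_slice(x25519_public);

    vec![(X25519MLKEM768, hybrid), (X25519, x25519_public.to_vec())]
}

fn main() {
    // Zeroed placeholder key material; a real client generates one X25519 key
    // pair and one ML-KEM-768 key pair, instead of two X25519 key pairs.
    let shares =
        shared_key_shares(&[0u8; MLKEM768_ENCAP_KEY_LEN], &[0u8; X25519_PUBLIC_KEY_LEN]);
    assert_eq!(shares[0].1.len(), 1216); // hybrid share
    assert_eq!(shares[1].1.len(), 32); // standalone X25519 share
}
```

Whichever share the server selects, the client completes the X25519 computation with the same private key, so the X25519 setup work is done exactly once.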

Micro benchmarking

First, we can micro-benchmark the time taken to construct and serialize a ClientHello in a variety of key share configurations.

We run this on two machines that cover both amd64 (Xeon E-2386G) and aarch64 (Ampere Altra Q80-30) architectures.
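
The micro-benchmarks themselves live in the rustls repository; as a rough, self-contained approximation of the same idea, the sketch below times ClientHello construction and serialization by repeatedly creating a ClientConnection and draining its first write. It assumes a rustls 0.23 release recent enough that the aws-lc-rs provider exposes kx_group::X25519MLKEM768; names may need adjusting for other versions.

```rust
// Rough micro-benchmark: time ClientHello construction and serialization by
// creating a fresh ClientConnection and writing its first flight to a sink.
//
// Assumed dependency: rustls = "0.23" (default features include the
// aws-lc-rs provider), on a version that provides kx_group::X25519MLKEM768.

use std::io;
use std::sync::Arc;
use std::time::Instant;

use rustls::crypto::{aws_lc_rs, SupportedKxGroup};
use rustls::pki_types::ServerName;
use rustls::{ClientConfig, ClientConnection, RootCertStore};

fn config_with_groups(kx_groups: Vec<&'static dyn SupportedKxGroup>) -> Arc<ClientConfig> {
    // Start from the default provider and restrict the offered key exchange groups.
    let mut provider = aws_lc_rs::default_provider();
    provider.kx_groups = kx_groups;

    let config = ClientConfig::builder_with_provider(Arc::new(provider))
        .with_protocol_versions(&[&rustls::version::TLS13])
        .unwrap()
        .with_root_certificates(RootCertStore::empty())
        .with_no_client_auth();
    Arc::new(config)
}

fn time_client_hello(label: &str, config: &Arc<ClientConfig>) {
    let name = ServerName::try_from("example.com").unwrap();
    const ITERATIONS: u32 = 1_000;

    let start = Instant::now();
    for _ in 0..ITERATIONS {
        // Creating the connection builds the ClientHello; writing it out
        // forces serialization. The bytes are discarded.
        let mut conn = ClientConnection::new(config.clone(), name.clone()).unwrap();
        conn.write_tls(&mut io::sink()).unwrap();
    }
    println!("{label}: {:?} per ClientHello", start.elapsed() / ITERATIONS);
}

fn main() {
    time_client_hello(
        "X25519 only",
        &config_with_groups(vec![aws_lc_rs::kx_group::X25519]),
    );
    time_client_hello(
        "X25519MLKEM768 + X25519",
        &config_with_groups(vec![
            aws_lc_rs::kx_group::X25519MLKEM768,
            aws_lc_rs::kx_group::X25519,
        ]),
    );
}
```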

micro benchmark results on amd64 architecture

micro benchmark results on arm64 architecture

From this we can see the relative cost of each key share configuration, and the saving from sharing the X25519 key share between the two offers.

Whole handshakes

Next, let's measure the same scenarios in the context of whole client handshakes. The remaining measurements are only done on our amd64 benchmark machine.

The above optimization only affects the client's first message, so now we'll see whether its effect is still meaningful when compared to the rest of the computation a client must do.

client handshake performance results on amd64 architecture

The difference is visible but small, as it has been diluted by other parts of the handshake. It is approximately 4.3% for resumptions, 2.8% for full RSA handshakes, and 2.6% for ECDSA handshakes.