Milk Sad: Update #8 - Custom Rust Library Optimizations

This is a developer-focused research update on code optimizations for secp256k1 libraries.

Speeding up Research Tasks
Faster secp256k1 Derivations in Rust
Summary & Outlook

Speeding up Research Tasks

For our research into weak wallet usage, we wrote Rust programs that crunch through potential wallets of different types and investigate their usage on the blockchain. Notably, we’re still doing the necessary calculations purely on the CPU, as opposed to offloading some computations to a graphics card (GPU) to speed them up.

Due to this design decision, we had to invest a significant amount of energy into selecting fast libraries and doing various optimizations in order to keep an acceptable performance while searching through large key ranges. The requirement for fast code is a challenge and complicates development, but learning how to make good use of the available CPU time is a skill that applies well beyond this specific research work as well, so it’s not a lost effort.

Checking a key candidate for some on-chain usage with a certain coin involves a lot of costly steps, but there are two particular speed bottlenecks:

Performing many HMAC-SHA512 hashing steps for BIP39
- relevant when going from the original weak random entropy input to the master private key of a given wallet
Calculating a lot of secp256k1 key derivations for BIP32
- relevant for determining public keys (-> addresses) for commonly used derivation paths

This research update is specifically about speeding up the secp256k1 derivations.

Faster `secp256k1` Derivations in Rust

Generic Improvements

Getting good performance out of a given computer setup usually involves a lot of “generic” optimization decisions, such as

Building on top of a programming language that has good runtime performance
Picking libraries that are already optimized for the task
Using all available resources and features of the local machine
Determining and enabling relevant compile-time & runtime optimization flags

In our case, this meant relying on the bitcoin and secp256k1 Rust crates, which under the hood uses the well-known libsecp256k1 C library for the heavy lifting. We distribute wallet calculation tasks in parallel to all available logical CPU cores using rayon.

A number of Cargo.toml based Rust optimization settings can help to push the performance beyond the normal --release baseline, for example:

Using faster panic abort behavior
Link time optimization (lto)
Using less codegen-units
RUSTFLAGS="-C target-cpu=native"
stripping symbols

This moves the needle towards slightly faster runtime, at the cost of slower compilation and less debug options for the release target.

Ultimately, these optimizations don’t do very much on this particular workload, though. Even with all Rust toolchain-related generic optimizations in place and all CPU cores at near 100% during runtime, we felt that things were still not going as fast as they could. Therefore we started taking a closer look at what work was being done behind the scenes, profiling it, and checking if we actually needed all of it to happen - “work smarter, not harder”.

Doing the same Work with Less Steps

During our search for optimizations, we discovered the old libsecp256k1_fast_unsafe fork from 2016 that showcased a number of ways the cryptographic code could be reworked to obtain significant speedups.

In the end, the target-specific optimizations we applied mostly fell into three categories:

Improve C code generation (the good)
- libsecp256k1 pre-generates certain data tables. Switch to much larger but faster tables via increased ECMULT_GEN_PREC_BITS. This requires raising internal hardcoded limits.
- Pick the fastest local C compiler and optimization level.
- Ensure machine-specific C compiler optimizations are active via -march=native.
Speedups through insecure code (the bad)
- Cryptographic libraries use special, slower computations designed to be constant-time to avoid leaking sensitive information. All the keys we handle are already weak enough to be compromised already, so switch to the faster variable-time function versions wherever possible.
- Skip zeroing out sensitive memory.
- Allow the compiler to “short circuit” some evaluations to speed them up.
Breaking API changes to skip work (the ugly)
- The bitcoin crate always computes a special fingerprint after each derivation step, which we don’t end up using. Skipping this calculation is a breaking change, but this isn’t a problem for our use case.

Additionally, secp256k1_fast_unsafe has the concept of Batched Key Serialization, which we didn’t use so far.

Open Source the Improvements

We’ve now published these optimizations as custom forks 📦🎉:

Please note that the changes deliberately break important security mechanisms and functionality, making this code completely unsuitable for any type of production use. We will also not provide support, maintenance or warranties of any kind. This code is meant be a helpful point-in-time reference for other security researchers, and we don’t recommend any other usage or depending on it to work right. You’ve been warned 😉

Optimizations

Fair and representative benchmarking is an art form, and we’re too short on time to do the matter justice. Therefore we’ll just stick to some rough ballpark numbers for a real workload:

AMD Ryzen 7950X3D based machine running Linux
calculating a Bitcoin address with derivation path m/44'/0'/0'/0/0 for each wallet
crunching through 2^32 = 4.29 billion different wallets
perform address lookups against a bloom filter in memory
do some other work and I/O during program startup

In this scenario, exchanging the standard libraries with the optimized library variants results in a speedup of over 6x and brings down the total runtime from about 540 minutes (9 hours) to about 90 minutes, which is a great improvement!

Potential Future Work

In the world of performance optimizations, there’s always more that can be done, although often with diminishing returns. For example, we’ve noticed that the hashing functions in rust-bitcoin are implemented as pure Rust code without the use of CPU architecture specific optimized assembler code, unlike RustCrypto hashes. This leaves room for improvements.

On the side of the “generic” optimizations, we haven’t made use of Profile-guided optimization. This feature allows the C and Rust compiler to create more fine-tuned code that is often a few percent faster, but requires additional effort to provide suitable profile data.

Perhaps we’ll revisit this area in the future.

Summary & Outlook

In this research update, we discussed speeding up wallet private key & public key derivations through the deliberate removal of security mechanisms and non-essential functionality in two popular Rust cryptocurrency libraries. Although this approach is generally a bad idea in most other contexts, these unsafe optimizations may help other security researchers to work more easily with many wallets or derivations.

If you would like to see more articles with developer-level details of our work, let us know!

Table of Contents