This is a developer-focused research update on code optimizations for secp256k1
libraries.
Table of Contents
Speeding up Research Tasks
For our research into weak wallet usage, we wrote Rust programs that crunch through potential wallets of different types and investigate their usage on the blockchain. Notably, we’re still doing the necessary calculations purely on the CPU, as opposed to offloading some computations to a graphics card (GPU) to speed them up.
Due to this design decision, we had to invest a significant amount of energy into selecting fast libraries and doing various optimizations in order to keep an acceptable performance while searching through large key ranges. The requirement for fast code is a challenge and complicates development, but learning how to make good use of the available CPU time is a skill that applies well beyond this specific research work as well, so it’s not a lost effort.
Checking a key candidate for some on-chain usage with a certain coin involves a lot of costly steps, but there are two particular speed bottlenecks:
- Performing many HMAC-SHA512 hashing steps for BIP39
- relevant when going from the original weak random entropy input to the master private key of a given wallet
- Calculating a lot of
secp256k1
key derivations for BIP32- relevant for determining public keys (-> addresses) for commonly used derivation paths
This research update is specifically about speeding up the secp256k1
derivations.
Faster secp256k1
Derivations in Rust
Generic Improvements
Getting good performance out of a given computer setup usually involves a lot of “generic” optimization decisions, such as
- Building on top of a programming language that has good runtime performance
- Picking libraries that are already optimized for the task
- Using all available resources and features of the local machine
- Determining and enabling relevant compile-time & runtime optimization flags
In our case, this meant relying on the bitcoin and secp256k1 Rust crates, which under the hood uses the well-known libsecp256k1 C library for the heavy lifting. We distribute wallet calculation tasks in parallel to all available logical CPU cores using rayon.
A number of Cargo.toml
based Rust optimization settings can help to push the performance beyond the normal --release
baseline, for example:
- Using faster
panic
abort behavior - Link time optimization (
lto
) - Using less
codegen-units
RUSTFLAGS="-C target-cpu=native"
- stripping symbols
This moves the needle towards slightly faster runtime, at the cost of slower compilation and less debug options for the release target.
Ultimately, these optimizations don’t do very much on this particular workload, though. Even with all Rust toolchain-related generic optimizations in place and all CPU cores at near 100% during runtime, we felt that things were still not going as fast as they could. Therefore we started taking a closer look at what work was being done behind the scenes, profiling it, and checking if we actually needed all of it to happen - “work smarter, not harder”.
Doing the same Work with Less Steps
During our search for optimizations, we discovered the old libsecp256k1_fast_unsafe fork from 2016 that showcased a number of ways the cryptographic code could be reworked to obtain significant speedups.
In the end, the target-specific optimizations we applied mostly fell into three categories:
- Improve C code generation (the good)
libsecp256k1
pre-generates certain data tables. Switch to much larger but faster tables via increasedECMULT_GEN_PREC_BITS
. This requires raising internal hardcoded limits.- Pick the fastest local C compiler and optimization level.
- Ensure machine-specific C compiler optimizations are active via
-march=native
.
- Speedups through insecure code (the bad)
- Cryptographic libraries use special, slower computations designed to be constant-time to avoid leaking sensitive information. All the keys we handle are already weak enough to be compromised already, so switch to the faster variable-time function versions wherever possible.
- Skip zeroing out sensitive memory.
- Allow the compiler to “short circuit” some evaluations to speed them up.
- Breaking API changes to skip work (the ugly)
- The
bitcoin
crate always computes a special fingerprint after each derivation step, which we don’t end up using. Skipping this calculation is a breaking change, but this isn’t a problem for our use case.
- The
Additionally, secp256k1_fast_unsafe
has the concept of Batched Key Serialization, which we didn’t use so far.
Open Source the Improvements
We’ve now published these optimizations as custom forks 📦🎉:
Please note that the changes deliberately break important security mechanisms and functionality, making this code completely unsuitable for any type of production use. We will also not provide support, maintenance or warranties of any kind. This code is meant be a helpful point-in-time reference for other security researchers, and we don’t recommend any other usage or depending on it to work right. You’ve been warned 😉
Optimizations
Fair and representative benchmarking is an art form, and we’re too short on time to do the matter justice. Therefore we’ll just stick to some rough ballpark numbers for a real workload:
- AMD Ryzen 7950X3D based machine running Linux
- calculating a Bitcoin address with derivation path
m/44'/0'/0'/0/0
for each wallet - crunching through
2^32
= 4.29 billion different wallets - perform address lookups against a bloom filter in memory
- do some other work and I/O during program startup
In this scenario, exchanging the standard libraries with the optimized library variants results in a speedup of over 6x and brings down the total runtime from about 540 minutes (9 hours) to about 90 minutes, which is a great improvement!
Potential Future Work
In the world of performance optimizations, there’s always more that can be done, although often with diminishing returns.
For example, we’ve noticed that the hashing functions in rust-bitcoin
are implemented as pure Rust code without the use of CPU architecture specific optimized assembler code, unlike RustCrypto hashes. This leaves room for improvements.
On the side of the “generic” optimizations, we haven’t made use of Profile-guided optimization. This feature allows the C and Rust compiler to create more fine-tuned code that is often a few percent faster, but requires additional effort to provide suitable profile data.
Perhaps we’ll revisit this area in the future.
Summary & Outlook
In this research update, we discussed speeding up wallet private key & public key derivations through the deliberate removal of security mechanisms and non-essential functionality in two popular Rust cryptocurrency libraries. Although this approach is generally a bad idea in most other contexts, these unsafe optimizations may help other security researchers to work more easily with many wallets or derivations.
If you would like to see more articles with developer-level details of our work, let us know!