Milk Sad: Update #3 - Bloom Filter, Dataset, Canaries

This research update describes the Bloom filter mechanism and the public blockchain address data we used to find weak Bitcoin wallets. Using this technique, we were able to check several billion potential wallets for actual usage on the blockchain without running a Bitcoin full node or flooding other Bitcoin servers and APIs with excessive network requests.

We also describe some artificially created wallets that we’ve placed to track the real-world theft behavior in one of the weak ranges.

Bloom Filter Explanation and Address Data Source
Canary Wallet Observations
Summary & Outlook

Bloom Filter Explanation and Address Data Source

When searching through billions of algorithm-generated data chunks that could reveal a few interesting private keys, efficient filtering becomes very important. In our original publication, we briefly described this as follows:

We used a publicly available list of all Bitcoin addresses historically seen by the Bitcoin network and constructed a bloom filter with a very low false positive rate on the data set. Using this filter, we were able to do quick address lookups to query and discard many unused wallet candidates, for which the relevant derived accounts were never seen by the network, without doing costly lookups to a Bitcoin full node.

A Bloom filter is a probabilistic data structure that provides quick membership checks against previously added elements. Unlike a hash table or other common lossless, read-optimized structures, the Bloom filter deliberately trades off some lookup accuracy for space efficiency. This makes lookups in RAM possible for datasets that would otherwise be too large. Depending on the settings used when creating the filter and on the number of inserted items, the lookup will falsely report an item as being in the original set - a false positive - for a certain percentage of queries. The reverse error does not occur: an item that was added is never reported as missing. In return for the false positives, only a fraction of the original data footprint has to be kept in memory, which made the structure very attractive for our optimization needs.
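To make the trade-off concrete, here is a minimal Python sketch of the data structure. It is not the code we ran: the class name, the SHA-256 based double hashing and the sizing in the constructor are illustrative assumptions, chosen to keep the example short and dependency-free.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter: a bit array plus k bit positions per item."""

    def __init__(self, expected_items: int, fp_rate: float):
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing (an assumption of this sketch): derive k positions
        # from two 64-bit halves of a single SHA-256 digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: bytes) -> None:
        # Set every bit position derived from the item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # True if all derived bits are set: "possibly present".
        # False as soon as one bit is unset: "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


# Usage: insert two known addresses, then query.
seen = BloomFilter(expected_items=1000, fp_rate=1e-6)
seen.add(b"13KqxkrmsPKy8gyYwochCQTuPHC7Lp8bFU")
seen.add(b"1NxkqwmsQMTqv4SrggPv4vGHDzJKR52S2f")
hit = b"13KqxkrmsPKy8gyYwochCQTuPHC7Lp8bFU" in seen
```

A lookup only touches a handful of bits, and the original strings are never stored. That is where both the memory savings and the possibility of false positives come from.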

In the first days of our research, we experimented with a Python proof of concept to test this data structure for our tasks. After converging on Rust as the main language for our tooling, the bloomfilter crate became our tool of choice. This library is very fast, but fairly minimal, and doesn’t have a built-in mechanism to export pre-generated Bloom filters to disk and import them again. For this reason, we wrote some serialization code ourselves, as seen in the published code for the bloom-filter-generator and its use in the lookup server process. For the research code, we’re using the Rayon library to parallelize our worker threads. All workers share a single Bloom filter object, which avoids memory duplication and matters when dealing with several dozen threads.
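The idea behind exporting and importing a filter can be sketched in a few lines of Python. This is not the file format of our published Rust tooling: the header layout, the function names and the file name below are hypothetical. A real format also has to persist whatever hash seeds or keys the filter uses, otherwise the reloaded bitmap cannot be queried correctly.

```python
import os
import struct
import tempfile

# Hypothetical header layout: number of bits (u64) and number of hashes (u32),
# little-endian, followed by the raw bitmap bytes.
HEADER = struct.Struct("<QI")


def save_filter(path, bits, num_bits, num_hashes):
    """Write the filter parameters and the bitmap to disk."""
    with open(path, "wb") as fh:
        fh.write(HEADER.pack(num_bits, num_hashes))
        fh.write(bits)


def load_filter(path):
    """Read the parameters and bitmap back, checking that the sizes agree."""
    with open(path, "rb") as fh:
        num_bits, num_hashes = HEADER.unpack(fh.read(HEADER.size))
        bits = fh.read()
    if len(bits) != (num_bits + 7) // 8:
        raise ValueError("bitmap length does not match header")
    return bits, num_bits, num_hashes


# Round trip through a temporary file.
bitmap = bytes([0b10100000, 0b00000001])
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "addresses.bloom")
    save_filter(path, bitmap, num_bits=16, num_hashes=3)
    restored = load_filter(path)
```

Building the filter once and loading it at startup avoids re-hashing the full address list on every run.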

To check if the wallet addresses we derived from the generated weak private keys were previously used, we need a collection of addresses that were used on-chain, ideally covering every address ever seen publicly. For a blockchain like Bitcoin, which has a long history and frequent changes of receive/change addresses, this is a lot of data. We considered using only addresses seen after a certain date (such as the first bx code commit with the vulnerable mechanism). But we had some resources to spare and decided against this additional restriction, to ensure we wouldn’t miss other wallet keys that were older than expected or came from other generation sources.

The most comprehensive and up-to-date public collection of Bitcoin Mainnet addresses that we could find to build our filter is from blockchair.com, via https://blockchair.com/dumps. Due to download speed limits and the split nature of the data, we did not use this download source directly. Instead, we went with a derivative of this data.

User LoyceV from the bitcointalk.org forum distributes regularly updated data sets via http://alladdresses.loyce.club/ which, as far as we’ve understood from public forum posts, are assembled from the individual blockchair.com data dump snippets. This was just what we needed for Bitcoin, and a valuable resource to kickstart our research, so we’re thankful it’s publicly hosted without any barriers 👍.

Our all_Bitcoin_addresses_ever_used_sorted.txt.gz list snapshot from ca. 2023-08-01, which we used for our initial searches, comes in at ca. 42 Gigabytes in uncompressed form and has ca. 1.19 billion individual Bitcoin addresses. The corresponding Bloom filter that we built from it reduced this to ca. 7.3 Gigabytes in size (with a false positive rate of 0.00000000001 for searches), which is far less data to keep in RAM. These numbers should explain why we were interested in a fast lookup mechanism with a reduced memory footprint compared to the original data. Since false positives are still annoying to deal with in later processing stages, we’ve further reduced the false positive rate in our later research by a factor of 100, which has worked out quite well.
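The filter size can be roughly cross-checked with the textbook sizing formula for Bloom filters, m = -n * ln(p) / (ln 2)^2 bits for n items and false positive rate p. The sketch below assumes that formula and the rounded item count from above. The helper names are hypothetical, and a concrete library may round or pad differently.

```python
import math


def bloom_size_bytes(num_items: int, fp_rate: float) -> int:
    """Textbook bitmap size in bytes: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -num_items * math.log(fp_rate) / (math.log(2) ** 2)
    return math.ceil(bits / 8)


def optimal_hash_count(fp_rate: float) -> int:
    """Textbook number of hash functions: k = -log2(p), rounded."""
    return max(1, round(-math.log2(fp_rate)))


n = 1_190_000_000  # ca. 1.19 billion addresses (rounded figure from the text)

initial_gib = bloom_size_bytes(n, 1e-11) / 2**30  # initial false positive rate
later_gib = bloom_size_bytes(n, 1e-13) / 2**30    # rate reduced by a factor of 100
hashes = optimal_hash_count(1e-11)
```

Under these assumptions, the initial filter comes out at roughly 7.3 GiB, in line with the size quoted above if the figure is read in binary units. Lowering the false positive rate by a factor of 100 grows the filter to roughly 8.6 GiB, because the size only scales with the logarithm of the rate. That makes a much stricter filter comparatively cheap.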

Going forward, we would like to extend our search to some other selected coins, but are still looking for recently updated, comprehensive data collections that are publicly available. If you’re aware of public and well-maintained address/pubkey/pubkey-hash collections for Ethereum and other popular coins, we would love to hear from you directly!

Canary Wallet Observations

Very early into the bx vulnerability discovery, one of our team members deliberately moved small amounts of Bitcoin onto wallet private keys generated via the known-vulnerable bx seed -b 256 | bx mnemonic-new command. At this point, we already understood the main weakness and could deliberately generate specific weak keys, but did not yet have custom tooling to search through the vulnerable range. Setting up some “canary” wallets with a few dollars in Bitcoin each was therefore a cheap and simple way to gather data on the behavior of attackers.

One of our questions was: are attackers now actively watching the vulnerable range for new deposits, and quickly acting upon them? At least for the bx BIP39 range with 24 mnemonic words and the paths we used, this was not the case initially. By the time of publication of this blog post, however, all four sub-wallets have been emptied:

PRNG ID	derivation path	address	original deposit	theft transaction	theft date
`0x000001f4`	`m/44'/0'/0'/0/0`	13KqxkrmsPKy8gyYwochCQTuPHC7Lp8bFU	$5	ff8c6822..acfa9c3a	2023-08-23 01:23
`0x000001f4`	`m/0/0`	1NxkqwmsQMTqv4SrggPv4vGHDzJKR52S2f	$5	256b6b98..e3f3a851	2023-08-27 12:20
`0xffffffff`	`m/44'/0'/0'/0/0`	1HQR3nKaDahAFrPHMoDVdWiMNFGFb7cHA5	$5	48354a8b..5b18a3d5	2023-09-30 09:07
`0xffffffff`	`m/0/0`	16pQhPkBa5puwEzudZVyKtsrugLtA87cy	$1	8d09a736..ed1f03cd	2023-10-01 22:16

Considering the date of deposit after the main 2023-07-12 theft, low per-wallet funds and theft dates, the thieves sweeping the funds are likely not related to the main attacker. It’s still interesting to see that even a weak wallet with as little as $1 in BTC gets emptied sooner or later. The sharks are clearly in the water now 🦈.

Note that the m/0/0 derivation path we used is an older pattern, and rare - we haven’t found other bx-generated Bitcoin wallets in this range. Attackers may have looked into some of these unusual paths more exhaustively just for these particular wallet PRNG IDs after discovering some usage via the more common M44 P2PKH standard path pattern.
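For readers less familiar with the path notation, the following Python sketch shows how the two derivation paths from the table translate into BIP32 child indexes. The parser is an illustrative helper, not part of our tooling. It relies only on the BIP32 convention that an apostrophe (or `h`) marks a hardened index, which is encoded by adding 2^31.

```python
HARDENED = 0x80000000  # BIP32: child indexes >= 2^31 are hardened


def parse_path(path):
    """Convert a path like m/44'/0'/0'/0/0 into a list of BIP32 child indexes."""
    parts = path.split("/")
    if parts[0] not in ("m", "M"):
        raise ValueError("path must start with 'm'")
    indexes = []
    for part in parts[1:]:
        hardened = part.endswith(("'", "h"))
        number = int(part.rstrip("'h"))
        if not 0 <= number < HARDENED:
            raise ValueError("index out of range: " + part)
        indexes.append(number + HARDENED if hardened else number)
    return indexes


bip44_style = parse_path("m/44'/0'/0'/0/0")  # purpose', coin type', account', change, index
legacy_style = parse_path("m/0/0")           # two non-hardened levels only
```

The BIP44-style path walks five levels, three of them hardened, while m/0/0 stops after two non-hardened levels. An attacker has to derive and check each candidate path separately for every weak seed, so less common paths add to the search cost.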

Summary & Outlook

In this post, we introduced a combination of data structure and data set that we successfully used to look up large numbers of addresses. Additionally, we listed some previously internal information about deliberately created weak wallets and related theft patterns.

We still have a long backlog of research topics to present here. We’ll try to get the next post ready before the holidays 🎁

Check out our RSS feed if you want to get notified by your favorite reader application.
