— rust, 1brc, performance — 12 min read
Heyo folks!
This blog talks about my journey of tackling the 1 Billion Row Challenge (1BRC) and the insights I gained along the way.
It all began when Shraddha introduced me to this challenge through her implementation in Go. (Seriously, if you haven't read her article yet, go check it out!) Her work sparked my curiosity: How would this challenge play out in Rust? More importantly, how would the mental model shift when switching languages?
I started out with a humble 5-minute execution time and was able to bring it down to 9 seconds \o/. That's right - we're talking about a speedup of over 33x! But the numbers only tell part of the story. The real adventure was in the learning, the optimizations, and the "aha!" moments along the way.
Input: A 14 GB text file containing 1 billion rows of temperature
measurements from various weather stations. Each line has the format
<string: station name>;<double: measurement>
Output: An alphabetically sorted list of cities with their temperature statistics, formatted as:
{city1=<min>/<mean>/<max>, city2=<min>/<mean>/<max>, ...}
The sheer volume of data - one billion rows - makes this task non-trivial. Processing such a large dataset quickly requires careful consideration of various constraints. Plus, let's face it - there's something alluring about conquering a "billion" of anything!
While adhering to the official rules, I set some personal challenges:
Links:
Tools:
Before diving into the challenge, it's crucial to break down the problem into manageable sub-tasks. At a high level, we're facing three main challenges:
- Reading a 14 GB file efficiently.
- Parsing each line and aggregating the min/mean/max statistics per station.
- Sorting the stations alphabetically and formatting the output.
Commit Link: Commit
The first attempt at this challenge was done in the naivest way possible. Here's how it worked:
- Read the whole file into a single String.
- Split each line on ';' into a Vec, allocating owned Strings for the parts.
- Aggregate the stats in a HashMap, then copy them into a BTreeMap for sorted output.

```rust
struct StationValues {
    min: f32,
    max: f32,
    mean: f32,
    count: u32,
}

// Parse a line to extract the station name and the value
fn read_line(data: String) -> (String, f32) {
    let parts: Vec<&str> = data.split(';').collect();
    let station_name = parts[0].to_string();
    let value = parts[1].parse::<f32>().expect("Failed to parse value");
    (station_name, value)
}

// Calculate the station values
fn calculate_station_values(data: String) -> HashMap<String, StationValues> {
    let mut result: HashMap<String, StationValues> = HashMap::new();
    for line in data.lines() {
        let line = line.trim();
        let (station_name, value) = read_line(line.to_string());
        result
            .entry(station_name)
            .and_modify(|e| { ... })
            .or_insert(StationValues { ... });
    }

    result
}
```
Flamegraph: Flamegraph (Download and Open in a new tab to see the interactive version)
The flamegraph reveals that:
- A large chunk of the execution time (about 45%) is spent in the read_line function.
- Within it, the collect(), parse(), and to_string() calls stand out.

The flamegraph shows that the collect() calls end up invoking malloc. This means that for each of our billion lines, we're allocating memory on the heap, processing it, and immediately discarding it. This unnecessary allocation and deallocation looks like a performance bottleneck.
The reason I used a vector was to easily access the station name and its value by index. However, we can achieve the same result more efficiently: instead of collect()ing into a vector, we can use the iterator returned by split() directly and call next() to access the elements, avoiding the allocation altogether. This realisation sets the stage for our next optimization attempt.

I had a couple of "face palm" moments when trying to calculate the execution time for this solution:
- Forgetting the --release flag: without it, the binary is built without compiler optimizations, which completely skews the measured time.
- A println! in the hot path: I had left a println! statement inside the read_line function, which processes the billion lines. println! makes a system call. When placed in a loop iterating a billion times, this results in a billion system calls, dramatically slowing down execution. Always be cautious about I/O operations in hot paths.

Commit Link: Commit
In this iteration, I focused on removing unnecessary allocations during line parsing. Replacing the vector-based parsing with a more efficient iterator approach using next() shaved off 12 seconds from our execution time.
```rust
fn read_line(data: String) -> (String, f32) {
    let mut parts = data.split(';');
    let station_name = parts.next().expect("error");
    let value = parts.next().expect("error").parse::<f32>().expect("error");
    (station_name.to_owned(), value)
}
```
Flamegraph: Flamegraph (Download and Open in a new tab to see the interactive version)
The flamegraph verifies that read_line now consumes only 35% of the execution time, down from 45% in the previous iteration.
```rust
fn calculate_station_values(data: String) -> HashMap<String, StationValues> {
    for line in data.lines() {
        let (station_name, value) = read_line(line.to_string()); // ← Problem ??
        // ... process data ...
    }
}

fn main() {
    let data = std::fs::read_to_string(args.file).expect("error"); // ← Problem ??
    let result = calculate_station_values(data);
    // ... output results ...
}
```
New Observations and Hypotheses:
- read_to_string takes 1.05% of the execution time.
- to_string accounts for 8% of the time.

Potential Issues:
- read_to_string allocates memory equivalent to the file size (14 GB). This large allocation is inefficient and may not be feasible on systems with limited RAM.
- The lines() method returns string slices (&str), but we often need owned Strings for manipulation. This necessitates to_string() calls, leading to more allocations.

To address these issues, we should consider:
- Reading the file in smaller chunks instead of all at once.
- Obtaining owned Strings directly, avoiding to_string() conversions.

BufReader seems like a promising solution to both these problems. It allows for efficient, chunked reading and can be configured to work with owned Strings directly.
Commit Link: Commit
In this iteration, we replaced read_to_string with BufReader, resulting in a reduction of almost 100 seconds in execution time. This change validates our hypothesis about BufReader's efficiency for large files. Paraphrasing the reddit comment below:
System calls involve doing a bunch of work to prepare to hand over control to the operating system so that it can do the read operation. That work is the same regardless of the amount of data that is being read. read_to_string is very efficient in terms of the number of system calls but it comes with the downside of a lot of unnecessary memory use and necessitates redundant allocation of owned String for each line, only to discard it later.
BufReader is intended to be a middle ground without excessive memory usage or excessive system calls. BufReader makes it so that the system calls always read a lot at a time even when you only ask for a little bit of data, so that when you ask for the next little bit of data it doesn't need to go through the work all over again.
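To make this concrete, here's a rough sketch of what this iteration looks like, with a hard-coded file name and the aggregation body elided (the real code takes the path from the CLI arguments):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() {
    // BufReader pulls large blocks from the file behind the scenes,
    // while lines() hands us one owned String per line.
    let file = File::open("measurements.txt").expect("Failed to open file");
    let reader = BufReader::new(file);

    for line in reader.lines() {
        let line = line.expect("Failed to read line");
        // parse `line` and update the per-station stats ...
        let _ = line;
    }
}
```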
Lessons
- read_to_string is very convenient for smaller files, or when ownership of the entire data isn't required.

Flamegraph: Flamegraph (Download and Open in a new tab to see the interactive version)
Let's take a diversion and look at how the numbers are being processed. The flame graph reveals that about 13% of the execution time is spent converting strings to floats. I recall reading a post about how Rust's float parsing is slow, so maybe we should try a faster float-parsing crate.
Commit Link: Commit
Flamegraph: Flamegraph (Download and Open in a new tab to see the interactive version)
Using the fast-float crate to parse the strings as floats has helped reduce the execution time by 4 seconds. Not a huge improvement, but I'll take what I can get :)
Interestingly, reading through the fast-float GitHub repo, I see that the techniques used in this repository have now been merged into the standard library's parse method. Then why did we get the extra 4-second reduction? This comment clears the air:
Why do real-case benchmarks favour core, while the single float tests favour fast-float-rust? It might be due to the very aggressive inlining in fast-float-rust.
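The change itself amounts to swapping the parser call; roughly something like this (the surrounding code stays the same):

```rust
// Before: the standard library's parser.
fn parse_std(s: &str) -> f32 {
    s.parse::<f32>().expect("Failed to parse value")
}

// After: the fast-float crate. Same call shape, but with the aggressively
// inlined parser mentioned in the comment above.
fn parse_fast(s: &str) -> f32 {
    fast_float::parse(s).expect("Failed to parse value")
}
```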
Commit Link: Commit
This improvement was down to sheer luck and some RTFM skills. While reading about Rust float parsing I came across this Hashing page, which led me to the manual for the Rust HashMap. It mentions that:
The default hashing algorithm is currently SipHash 1-3, though this is subject to change at any point in the future. While its performance is very competitive for medium sized keys, other hashing algorithms will outperform it for small keys such as integers as well as large keys such as long strings, though those algorithms will typically not protect against attacks such as HashDoS.
The SipHash 1-3 algorithm is high quality (it provides high protection against collisions) but is relatively slow, particularly for short keys such as integers. The Rust Performance Book suggested using FxHashMap from the rustc_hash crate, and I ended up using it.
This shaved 11 seconds off the execution time. It looks like Rust prefers security over performance by default.
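The swap itself is a one-line change; roughly like this (re-using the StationValues struct from the first iteration, with the entry API untouched):

```rust
use rustc_hash::FxHashMap; // from the rustc-hash crate

struct StationValues {
    min: f32,
    max: f32,
    mean: f32,
    count: u32,
}

fn main() {
    // FxHashMap is std's HashMap with the (non-DoS-resistant) Fx hasher plugged in,
    // so only the construction changes; lookups and the entry API stay the same.
    let mut result: FxHashMap<String, StationValues> = FxHashMap::default();
    result
        .entry("Hamburg".to_string())
        .and_modify(|e| e.count += 1)
        .or_insert(StationValues { min: 19.8, max: 19.8, mean: 19.8, count: 1 });
}
```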
Flamegraph: Flamegraph (Download and Open in a new tab to see the interactive version)
Inspired by the past approaches, I went looking for any remaining redundant memory allocations. Reading lines from the BufReader in the calculate_station_values method was contributing some: the lines() method on the BufReader allocates a new String per line, which is wasteful as we drop these strings immediately after use.
The efficient way to read the lines would be to maintain one single String buffer and use that to store each line we process. This way we only allocate memory for one String buffer.
Commit Link: Commit
Flamegraph: Flamegraph (Download and Open in a new tab to see the interactive version)
The read_line method on the BufReader takes all bytes until a newline (the 0xA byte) is reached and appends them to the provided String buffer. This helped us reduce the time by 18 seconds. The caveat with this solution is that we have to manually manage the buffer, but we aren't looking for ergonomics here.
```rust
fn calculate_station_values(reader: &mut BufReader<File>) -> FxHashMap<String, StationValues> {
    let mut buf = String::new();

    while let Ok(bytes_read) = reader.read_line(&mut buf) {
        if bytes_read == 0 {
            break;
        }
        let line = buf.trim();
        let (station_name, value) = read_line(line);
        // ....
        buf.clear();
        // ....
    }
    // ....
}
```
Dealing with buffers made me realise that we have been playing with Strings all along. Strings in Rust are UTF-8 encoded, which means we incur a validation overhead every time we read into a String. Since we know that the file we are dealing with is valid UTF-8, I think we can eliminate this check entirely. The idea: instead of storing the data in a String, we work with bytes directly!
Commit Link: Commit
BufReader has a read_until method which takes a delimiter byte (b'\n') and a &mut Vec<u8> buffer. The reader reads through the file and keeps pushing data into the buffer as long as we keep calling the method. We also change the type of the station name in our hashmap to Vec<u8> instead of String. These two changes effectively remove String from the code.
These changes have given us some significant gains: we reduced our execution time by almost 22 seconds.
```rust
fn calculate_station_values(reader: &mut BufReader<File>) -> FxHashMap<Vec<u8>, StationValues> {
    let mut result: FxHashMap<Vec<u8>, StationValues> = FxHashMap::default();
    let mut buf = Vec::new();

    while let Ok(bytes_read) = reader.read_until(b'\n', &mut buf) {
        if bytes_read == 0 {
            break;
        }
        // remove the newline character
        buf.truncate(bytes_read - 1);

        // read_line allocates a new vector to store station_name
        let (station_name, value) = read_line(&buf);

        buf.clear();
    }
    // ...
}

fn read_line(data: &Vec<u8>) -> (Vec<u8>, f32) {
    let mut parts = data.rsplit(|&c| c == b';');
    // ....
    (station_name.to_vec(), value)
}
```
Flamegraph: Flamegraph (Download and Open in a new tab to see the interactive version)
The flamegraph shows that we traded String allocation for Vec allocation (notice the to_vec in read_line). We did receive a performance boost because we are no longer validating UTF-8 encoding. The interesting question to ask is: can we avoid any allocation altogether and directly use the bytes?
Commit Link: Commit
The memmap2 crate provides a nice interface to the file and gives us a &[u8] reference to work with. In this step, we replaced streaming access to the file (BufReader) with random access (mmap). This way we transfer the responsibility to the OS and let it deal with paging.
The memory map helped reduce the execution time by 5 seconds. mmap has also helped avoid any extra allocations.
```rust
fn calculate_station_values(data: &[u8]) -> FxHashMap<&[u8], StationValues> {
    // ....
}

fn read_line(data: &[u8]) -> (&[u8], f32) {
    let mut parts = data.rsplit(|&c| c == b';');
    // ...
}

fn main() {
    let file = std::fs::File::open(&args.file).expect("Failed to open file");
    let mmap = unsafe { Mmap::map(&file).expect("Failed to map file") };
    let data = &*mmap;
    let result = calculate_station_values(data);
    // ....
}
```
Flamegraph: Flamegraph (Download and Open in a new tab to see the interactive version)
The split function that we use in calculate_station_values and read_line seems to take 13% and 11% of the execution time respectively. These areas are our next optimization targets.
I am unhappy with the use of unsafe code here, but let's plough through and hope inspiration strikes later.
Commit Link: Commit
In this step, we ventured into SIMD (Single Instruction, Multiple Data) territory to optimise our byte searching. Instead of using split() to get an iterator over lines, we loop through the bytes and use the heavily optimised memchr crate to search for the separator and newline bytes. This helped us shave off 7 seconds.
```rust
// Eg: data looks like "Hamburg;19.8\n"
fn calculate_station_values(data: &[u8]) -> FxHashMap<&[u8], StationValues> {
    let mut result: FxHashMap<&[u8], StationValues> = FxHashMap::default();
    let mut buffer = data;
    loop {
        match memchr(b';', buffer) {
            None => {
                break;
            }
            Some(separator) => {
                let end = memchr(b'\n', &buffer[separator..]).unwrap();
                let name = &buffer[..separator];
                let value = &buffer[separator + 1..separator + end];
                let value: f32 = fast_float::parse(value).expect("Failed to parse value");

                // ....
                buffer = &buffer[separator + end + 1..];
            }
        }
    }
    result
}
```
Flamegraph: Flamegraph (Download and Open in a new tab to see the interactive version)
Flamegraph analysis shows no more redundant memory allocations, suggesting we've tackled the low-hanging fruit in terms of memory efficiency.
Next Steps
- Parallelise the processing across multiple threads.
- Replace the mmap code to improve safety.

These final optimizations should help us squeeze out the last bits of performance!
Commit Link: Commit
Woooot! The execution time has now come down to 12 seconds. This solution is almost 6x faster. We implemented two major changes that improved our performance.
We replaced memory mapping with streaming access, which allowed us to eliminate unsafe code while maintaining efficient byte-level access.
```rust
loop {
    let bytes_read = file.read(&mut buf[..]).expect("Failed to read file");
    // println!("bytes_read {:?}", bytes_read);
    if bytes_read == 0 {
        break;
    }
    // .....
}
```
This approach leverages the read method from the Read trait implemented by File, allowing us to read directly into a &mut [u8] buffer. It avoids redundant allocations for each line while still providing byte-level access to the file content.
We implemented a single producer, multiple consumer architecture using the crossbeam_channel crate.
We have a single Producer (sender) responsible for reading the file and generating a data chunk. Each chunk is a subset of complete lines from the file, with a maximum size of 128 KiB.
We then create Consumers (receivers), equal to the number of CPUs the system has, each maintaining its own map of calculated values for the stations they receive.
At the end we combine results from all consumers to calculate the final values.
We use two types of buffers to ensure we always process complete lines: a Read Buffer that each file read fills with raw bytes, and an Unprocessed Buffer that holds the trailing partial line left over after the last newline.
Here's how it works: after each read, the producer appends everything up to the last newline onto the leftover bytes in the Unprocessed Buffer, sends an owned copy of that chunk to a consumer, and keeps the remaining partial line around for the next read (see the sketch below).
This strategy ensures that only complete lines are sent to receiver threads, regardless of how file reads align with line endings.
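Here's a rough sketch of the producer/consumer layout, using crossbeam_channel's bounded channel and std's scoped threads. The channel capacity, the chunk-cutting logic, and the merge step left as comments are my own simplifications, not the exact code from the commit:

```rust
use crossbeam_channel::bounded;
use std::io::Read;
use std::thread;

const CHUNK_SIZE: usize = 128 * 1024;

fn process(file: &mut std::fs::File, workers: usize) {
    let (tx, rx) = bounded::<Vec<u8>>(workers);

    thread::scope(|s| {
        // Consumers: each keeps its own per-station map and aggregates the
        // chunks (complete lines only) it receives.
        for _ in 0..workers {
            let rx = rx.clone();
            s.spawn(move || {
                for chunk in rx {
                    // parse the lines in `chunk` and update a thread-local map ...
                    let _ = chunk;
                }
            });
        }

        // Producer: read the file, cut each read at the last newline so only
        // complete lines are sent, and carry the trailing partial line over.
        let mut buf = vec![0u8; CHUNK_SIZE];
        let mut unprocessed: Vec<u8> = Vec::new();
        loop {
            let n = file.read(&mut buf).expect("Failed to read file");
            if n == 0 {
                break;
            }
            // Lines are tiny compared to 128 KiB, so a newline is always present.
            let last_newline = buf[..n].iter().rposition(|&b| b == b'\n').unwrap();

            // This owned copy + append is the cost that later shows up as ~10%
            // of the execution time in the flamegraph.
            let mut chunk = unprocessed.clone();
            chunk.extend_from_slice(&buf[..=last_newline]);
            unprocessed = buf[last_newline + 1..n].to_vec();

            tx.send(chunk).expect("Failed to send chunk");
        }
        drop(tx); // close the channel so the consumers' loops end
    });

    // In the real code each consumer hands back its map here and the maps
    // are merged into the final, sorted result.
}
```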
Flamegraph: Flamegraph (Download and Open in a new tab to see the interactive version)
The start
bar in the flame graph reveals that to_owned
and append
operations
are taking about 10% of the execution time. This is because we need to send an
owned copy of the unprocessed buffer to the receiver thread. If we can eliminate
the "unprocessed buffer" and only use the "Read Buffer" to send chunks to
receiver threads, we can potentially fix this.
Commit Link: Commit
Yaaay! We have finally reached a single-digit execution time. We have eliminated the separate "Unprocessed Buffer" and now use only the "Read Buffer" for all operations: any leftover partial line is carried over within the Read Buffer itself and completed by the next read.
Assumption and Edge Case:
We make the assumption that if unprocessed data exists, the next read will contain a newline character. This is reasonable given that a single line (a station name plus a measurement) is tiny compared to the size of each read.
If no newline is found after a read with unprocessed data, we panic, considering it an indication of file format issues.
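For illustration, here's a rough sketch of the general "carry the partial line inside the read buffer" technique. The buffer size, helper name, and exact panic conditions are mine, not lifted from the commit, and the real code still hands owned chunks to the consumer threads; this only shows the buffer management:

```rust
use std::io::Read;

const BUF_SIZE: usize = 128 * 1024;

fn for_each_chunk(file: &mut std::fs::File, mut handle_chunk: impl FnMut(&[u8])) {
    let mut buf = vec![0u8; BUF_SIZE];
    let mut leftover = 0; // unprocessed bytes currently sitting at the start of `buf`
    loop {
        let n = file.read(&mut buf[leftover..]).expect("Failed to read file");
        if n == 0 {
            assert!(leftover == 0, "file ended in the middle of a line");
            break;
        }
        let filled = leftover + n;
        // Everything up to and including the last newline is a run of complete lines.
        let last_newline = buf[..filled]
            .iter()
            .rposition(|&b| b == b'\n')
            .expect("no newline found after a read with unprocessed data");
        handle_chunk(&buf[..=last_newline]);
        // Slide the trailing partial line to the front of the same buffer;
        // the next read fills in the rest and completes it.
        buf.copy_within(last_newline + 1..filled, 0);
        leftover = filled - last_newline - 1;
    }
}
```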
While there might be room for further micro-optimizations, I think I am finally happy with this flamegraph ^_^
I'm thrilled to have met all the goals I set for myself. This journey has been a fantastic learning experience, pushing the boundaries of what I thought was possible with Rust performance optimization.
- Always compile with the --release flag.
- Avoid println! in critical paths; use logging crates for debugging.
- Be wary of FromIterator::collect(); it triggers new allocations.
- Watch out for unnecessary to_owned() and clone() calls.
- Use byte slices ([u8]) instead of Strings when UTF-8 validation isn't needed.

When I embarked on this challenge, I was eager to see how Go and Rust would stack up against each other. But you know what? I quickly realised I was asking the wrong question. It's not about which language is "better" - they're both unique beasts with their own quirks and charms.
What really fascinated me was how these languages shape one's thinking. Take Go, for instance. With its built-in garbage collector and concurrency support, I saw that the approaches naturally gravitated towards concurrent solutions. Memory management? Barely crossed their minds.
Rust, on the other hand, sent me down a completely different path. Without a garbage collector or green threads to fall back on, my first instinct was to scrutinise every allocation. "How can I make this more efficient?" I asked myself, only considering concurrency as a final optimization step.
This contrast was enlightening. It's not about which language is "better," but how each influences problem-solving. Our approaches change based on the tools at hand. This realisation extends beyond coding. It reminds me that the world isn't binary. There's rarely a universal "right" way - context and circumstances matter.
Wow, what a ride! I can't thank Gunnar Morling enough for crafting this challenge. Seeing my code's execution time plummet from over 5 minutes to just 9 seconds was incredibly satisfying.
I owe a big thanks to Shraddha for suggesting I keep a journal. Writing down my thought process made me question my decisions and understand the 'why' behind each solution. Without her nudge, this blog might never have materialised.
I've also gained a newfound appreciation for thorough documentation. To all the crate authors out there who take the time to write detailed docs: you're lifesavers! I've learned (sometimes the hard way) that the answer to my problem is often right there in the documentation. Note to self: RTFM!
This challenge has reignited my passion for systems programming. I think I've caught the performance bug, and I'm excited to see where it takes me next!
If you've tackled this challenge too, I'd love to hear about your approach. Drop a comment, shoot me a tweet, or send an email. Let's geek out together!