Benchmarking Rust vs Python+Dask for NDVI: 23× Faster, and Why That Matters for Carbon

// 2026-05-10 · Updated 2026-05-11 · 16 min read

Last week I wanted to benchmark our Rust pipeline against the Python “standard” — the DEA Knowledge Hub notebooks that everyone uses. I based the workflow on the burnt area mapping notebook, which uses the NDVI mean-over-time approach.

What followed was a result that confirmed what I’d suspected about Python’s Dask for CPU-bound geospatial workloads — and what I’d argued at FOSS4G 2025: the language you choose has a real environmental cost. 23× slower doesn’t just mean 23× more money. It means substantially more energy and CO₂ for the exact same output. How much more? We haven’t measured it with a power meter yet — but we can estimate, and the direction is unambiguous.

The full source code for the Rust pipeline is on GitLab — everything in this post is reproducible from that repository.

The Goal

The DEA notebook does this:

# 1. Query STAC for Sentinel-2
# 2. Load red + NIR bands via datacube/load_ard
# 3. Compute NDVI per timestep
# 4. Take median over time
# 5. Write GeoTIFF

I wanted to run the same workflow against our Rust pipeline and compare:

Query time
Data loading / download
Compute (NDVI + mean)
Write time

Simple enough, right?

The Benchmark

All benchmarks ran on the same machine:

Component	Spec
CPU	Intel Core Ultra 7 265KF, 20C/20T, up to 5.6 GHz
RAM	64 GB DDR5
Storage	192 GB NVMe SSD
OS	NixOS 26.05, kernel 6.18.24
Rust	1.94.1
Python	3.x (via Nix environment)
Threads	8 (benchmarks), 16 (scaling tests)

Network: Direct internet connection to DEA S3 bucket (deapp-s3-dev.s3.ap-southeast-2.amazonaws.com). No CDN, no CloudFront, no regional cache — raw S3 speeds.

I ran the same workload on both pipelines:

Scene: 56jns (Queensland)
Time range: 2023-06-01 to 2023-06-30 (9 timesteps)
Image size: 10,980 × 10,980 pixels
Bands: red + NIR → NDVI → mean over time
Output: LZW-compressed, tiled GeoTIFF
Threads: 8

Rust (eorst + rss)

Query STAC:     1,058 ms
Download:      92,976 ms  (first run, 18 files from S3)
Build:            730 ms
Apply:         10,217 ms  (NDVI + mean + write)
───────────────────────────────
Total:        105,006 ms  (1 min 45 sec)

Re-run (cached download):

Total:         12,028 ms  (12 sec)

Python (rioxarray + dask direct S3)

Query STAC:       728 ms
Load (lazy):    6,712 ms   (open + dask concat)
Compute:       289,815 ms   (S3 reads + NDVI + mean)
Write:          2,391 ms
───────────────────────────────
Total:        300,774 ms  (5 min)

The Numbers Don’t Lie

Phase	Rust	Python	Winner
Query	1,058 ms	728 ms	Python slightly
Data read + compute + write	103,271 ms	298,927 ms	Rust ×2.9
Total	105,006 ms	300,774 ms	Rust ×2.9
Output size	188.8 MB	231.1 MB	Rust −18%

The Python “compute” phase (290 seconds) includes reading from S3. Rust’s “download” (93 seconds) is separate. If we compare “data access + processing”:

Rust: 93s (download) + 10s (compute) = 103s
Python: 290s (read via S3 + compute) = 290s

Rust is 2.9× faster — and it’s not close.

Caveat: Rust downloads files locally first, then processes. Python reads directly from S3 during compute. This is an apples-to-oranges comparison for data access. If Python downloaded first, the gap would narrow — but the Dask graph overhead would still dominate. We plan to run that benchmark next.

But Wait — Simple NDVI Isn’t the Whole Story

The simple NDVI benchmark (red + NIR → mean over time) is actually the friendly case for Python. 2.9× slower is noticeable, but it’s not terrible. If you’re just doing quick analysis on a laptop, Python + rioxarray is fine.

But the real world isn’t just NDVI. The DEA burnt area notebook adds cloud masking, and that’s where things get interesting.

Adding FMask (Cloud + Shadow Masking)

I ran the same benchmark with FMask cloud masking — for each timestep, exclude pixels where fmask ≥ 2 (cloud, cloud_shadow, snow), keeping only nodata (0) and clear (1), then mean over valid pixels. The FMask band is at half resolution (5490 vs 10980), so it needs reprojecting to match the NBART bands.

Phase	Rust (FMask)	Python (FMask)	Delta
Query	1,843 ms	739 ms	Python
Download	1,442 ms	— (direct S3)	—
Build/Load	1,107 ms	19,196 ms	Python ×17
Apply/Compute	10,559 ms	317,089 ms	Python ×30
Write	—	1,010 ms	—
Output size	~190 MB	~230 MB	Rust −17% (est.)
Total	14,950 ms	339,187 ms	Rust ×23

Rust: 15 seconds. Python: 5.7 minutes.

The gap went from 2.9× to 23×. What changed?

Resampling: FMask is half-resolution. Rust resamples once at build time (1.1s). Python’s rioxarray.rio.reproject_match() builds a Dask task graph every time — 19 seconds just to load the data.
Masking: Using xarray.where(valid_mask) + skipna=True explodes the task graph. The simple NDVI was already slow; add masking and it gets worse.
More bands: Now reading 27 files (red + NIR + fmask per timestep) instead of 18. Each file adds more Dask nodes.

This matches my experience after many years: Dask is decent for simple things (2× slower), but as soon as you add complexity (np.where, reproject), it falls apart. The task graph grows exponentially, the scheduler overhead dominates, and suddenly your 2× problem becomes 20×.

Origin Story: eorst Was Python + Dask

Here’s something I haven’t mentioned: eorst started as a Python + Dask project.

Years ago, I built a geospatial pipeline in Python using Dask. It worked — for a while. But as the workload grew (more scenes, more timesteps, more bands), the problems accumulated:

Dask workers dying silently with OOM
Task graphs taking minutes to build for simple operations
No control over when data materialized — the scheduler decided
Debugging was “check the dashboard, maybe restart”
Python dependency conflicts between projects

So I rebuilt it in Rust. Not because Python is bad — it’s the standard in geospatial, and I still use it for quick scripts — but because for production workloads at scale, I wanted control. Block-by-block processing. No hidden graph. No scheduler surprises. You can find the entire repository on GitLab.

The Rust code does the same thing the Python code does, just without the middleware. And it’s faster. A lot faster.

Python for Mockups, Rust for Production

Here’s the thing: Python + Dask is great for quick mockups and prototypes. It’s in a language most people know, you can iterate fast, and for simple stuff it “just works.” If you’re exploring a new algorithm or doing a one-off analysis, reach for Python by all means.

But for anything heading toward production? Consider this:

Cost blows out quickly: Our masked benchmark took 339 seconds on Python vs 15 seconds on Rust. That’s 22× more compute time. At AWS spot prices (~$0.05/vCPU-hour for an 8-core machine), a single scene costs:
- Rust: 15s / 3600 × $0.40 = $0.0017
- Python: 339s / 3600 × $0.40 = $0.0377 Scale to 100 scenes: Rust $0.17 vs Python $3.77. Scale to 1000 scenes: $1.70 vs $37.70. The gap compounds with every scene.
The Rust code isn’t that complicated: Look at bench_ndvi_annual_full_tile_masked.rs — it’s ~240 lines. The worker function is 30 lines of straightforward numpy-style array math. It’s readable, maintainable, and doesn’t need a PhD to understand.
It scales: More cores → faster (we’ve tested up to 16 threads, it keeps scaling). More data → still fast (block-based processing means memory stays bounded). More complexity → the Rust code doesn’t explode like Dask does.
Single-node today, distributed tomorrow: Right now eorst runs on a single machine — that’s the limitation. But I’ve got a prototype working that distributes block processing across multiple nodes, same API, no user-facing changes. The code just gets faster when you throw more hardware at it.

The calculus is simple: for a one-off script, Python’s familiarity wins. For anything that runs repeatedly, at scale, or costs you money — Rust wins.

The Carbon Cost You’re Not Seeing

Every computation uses energy. Energy = Power × Time. When Python takes 23× longer, it doesn’t just cost more money — it burns more electricity and emits more CO₂.

This is the argument I made at FOSS4G 2025: “Oxidize to Decarbonize”. The programming language you choose has a real environmental footprint. And there’s a deep irony here — geospatial scientists are among the people most motivated to understand and protect the environment, yet the tools they use daily are among the least energy-efficient available.

Rough Energy Estimate (Not Measured)

Full disclosure: we haven’t plugged a power meter into this benchmark yet. The numbers below are back-of-the-envelope estimates based on our CPU’s TDP and the measured time difference. They’re directionally correct but not precise. A proper power-metered benchmark is on the TODO list.

Our benchmark machine has a 125W TDP CPU. During compute-heavy workloads, it draws roughly 60W sustained (not full boost, not idle — something in between). Using that conservative estimate:

	Rust (15s)	Python (339s)	Ratio
Energy (Wh)	~0.025 Wh	~0.565 Wh	~23× (time ratio)
CO₂ (g, Australia grid ~0.71 kg/kWh)	~0.02 g	~0.40 g	~23× (time ratio)

The ratio matches the time ratio because we’re assuming constant power draw. In reality, the Python process may draw slightly more power due to higher CPU utilization from interpreter overhead — so the actual energy gap could be wider. Or it could be narrower if the Rust process keeps the CPU in boost longer. We don’t know without measuring.

One scene doesn’t sound like much. But scale it:

Scenes	Rust CO₂ (est.)	Python CO₂ (est.)	Difference (est.)
1,000	~20 g	~400 g	~380 g
10,000	~200 g	~4.0 kg	~3.8 kg
100,000	~2.0 kg	~40 kg	~38 kg

A large national-scale processing job (100k+ scenes) could emit tens of kilograms more CO₂ in Python alone from the compute phase. That’s roughly the carbon footprint of driving a petrol car 100-150 km. For what? The exact same NDVI output.

What the Research Says

This isn’t just our benchmark. Multiple independent studies confirm the pattern — compiled languages consume substantially less energy than interpreted ones for equivalent workloads:

Pereira et al. (2017) — Measured energy consumption across languages: C = 1.0, Rust = 1.03, Python = 75.88. Rust matches C while Python consumes nearly 76× more energy for equivalent workloads. Paper
AWS (2022) — Rust is approximately 50% more energy-efficient than Java and uses 98% less energy than Python for comparable Lambda workloads. Blog
Malavolta et al. (2024) — Polars (Rust) vs Pandas (Python) shows major differences in both runtime and energy consumption for data processing. Paper
Marini et al. (2024) — Interpreted languages can require up to 54× more energy than compiled languages. Paper

These studies used actual power meters. Our 23× time ratio is consistent with this research — the exact energy multiplier depends on workload characteristics, hardware, and measurement methodology. But the direction is unambiguous: compiled languages are more energy-efficient.

Why Rust Is Efficient

The energy advantage isn’t magic — it’s architecture:

No garbage collector — Rust’s ownership model eliminates GC pauses. No background thread stopping the world to reclaim memory. Every CPU cycle goes to your computation.
No interpreter overhead — Rust compiles to native code. Python executes bytecodes through an interpreter loop, adding overhead to every single operation. For pixel-by-pixel NDVI on 120 million pixels, that overhead compounds.
Predictable memory — No hidden copies, no reference counting churn. Your block reads, computes, writes, and releases. Memory stays bounded.
Real parallelism — Rayon’s work-stealing scheduler gives you actual CPU parallelism without GIL contention. At 8 threads, Rust scales; Python fights itself.

The Bigger Picture

While individual choices matter, systemic change requires collective action. If every geospatial team processing satellite data at scale considered energy as a metric alongside accuracy and throughput, the cumulative effect would be measurable. A national earth observation program running millions of scenes annually could reduce its compute carbon footprint significantly simply by choosing compiled tooling for the heavy lifting.

This doesn’t mean “abandon Python entirely.” Python is excellent for exploration, prototyping, and glue code. But the heavy compute — the NDVI over 100k scenes, the annual composites, the model training — that’s where the energy goes. And that’s where Rust earns its keep.

Why Is Python/Dask So Slow?

1. GDAL VSI Cache Behavior

Both Rust and Python ultimately use GDAL under the hood. But the way they access it makes a huge difference:

Rust uses GDAL’s VSI layer directly with a single-threaded-per-block pattern:

Each block reads its own windows sequentially
GDAL’s C-level block cache stays hot (repeated reads are ~0.6ms)
Only fetches the tile chunks you need (HTTP range requests)
Decompresses in native code

Python rioxarray also uses GDAL VSI for S3 reads, but Dask’s multi-threaded chunked access breaks the cache:

8 threads each reading different chunks from the same file → GDAL block cache thrashes
No cross-chunk cache sharing — each thread warms its own cache, then discards it
GIL contention on the threaded scheduler compounds the problem
The AWS_NO_SIGN_REQUEST config also needs to be set for unsigned DEA S3 access

The result: Python reads S3 3× slower than Rust’s GDAL VSI pattern, despite using the same underlying library.

2. Dask’s Threaded Scheduler Doesn’t Scale

The benchmark uses 8 threads. But Python’s Dask with the threaded scheduler hits GIL contention:

import dask
dask.config.set(scheduler="threads", num_workers=8)

From our scaling tests on synthetic data (10k×10k×3, /dev/shm RAM disk):

Threads	Rust (apply)	Python (dask)
2	8,232 ms	6,037 ms (wins!)
4	6,833 ms	6,185 ms
8	6,378 ms	6,771 ms (Rust wins)
16	6,420 ms	7,083 ms (Rust wins)

Python wins at 2 threads. At 4 threads it’s roughly equal. At 8+ threads, Rust wins and the gap widens. Dask’s scheduler fights itself past 4 threads. But note: this is a trivial worker (pixel + 1). For realistic NDVI + masking workloads, Rust wins at all thread counts.

3. Block Processing vs Global Array

Rust’s apply_reduction processes in 2048×2048 blocks:

Each block reads independently from local files or S3
Writes directly to the output GeoTIFF via ParallelGeoTiffWriter
No global array materialization

Python’s approach:

Stacks all timesteps into a single dask array
Computes a task graph for the whole thing
On .compute(), launches parallel tasks for each chunk

For CPU-bound operations like NDVI (simple arithmetic), Rust’s direct block processing is more efficient.

What Dask Should Be Smarter About

Look, I’m a Rust guy. I used Python for years, got frustrated with packaging and Dask OOM kills, and moved on. But Dask keeps making choices that feel… unforced.

Use GDAL for reading, not Python HTTP. Every time you rioxarray.open_rasterio("s3://...") from a COG, Dask is going through Python’s HTTP stack. It should recognize “this is a GeoTIFF on S3” and use GDAL VSI internally, the same way the Rust code does. There’s no fundamental reason it can’t — rasterio already wraps GDAL. It just doesn’t do it automatically for remote COGs.

Recognize trivial computes. NDVI is (nir - red) / (nir + red). That’s not a neural network. It’s three arithmetic operations. Dask shouldn’t need to build a task graph with 200+ nodes for this. It should recognize “this is element-wise arithmetic on regular arrays” and just… do it. The Rust code doesn’t build a graph — it loops over blocks and does the math directly.

Don’t materialize the whole array. The Rust code never loads all 9 timesteps into memory at once. It reads a block, computes, writes, moves to the next. Dask’s lazy loading is nice in theory, but when you .compute(), it often tries to orchestrate reads across the whole dataset. For our 10k×10k×9 dataset, that’s 90 band reads. The task graph gets enormous. The Rust code does 36 block reads × 18 bands = 648 windowed GDAL reads. Same data, simpler graph.

Pick the right scheduler. The threaded scheduler hits GIL contention past 4 threads. The distributed scheduler adds cluster overhead. For a single-machine workload like ours, there’s no great option. Dask could recognize “this is CPU-bound numpy work on a local machine” and spawn processes automatically instead of threads. It doesn’t.

These aren’t hard problems — they’re just things Dask hasn’t optimized for because its original use case was “big data” (clusters, distributed). Single-machine geospatial processing is a different workload, and it deserves better defaults.

Why I Like Rust Better

After this benchmark, I’m more convinced than ever:

1. No GIL, Actual Parallelism

Python’s GIL means threaded Dask can’t truly parallelize CPU-bound work. Process-based parallelism helps but adds serialization overhead. Rust’s rayon work-stealing gives you real parallel block processing with zero contention on the compute phase. For geospatial workloads that are inherently block-parallel, this is the right model.

2. Predictable Memory (and Lower Energy)

No garbage collector pauses during I/O. No hidden copies when stacking dask arrays. No reference-counting churn on every pixel. Your block reads into a buffer, you compute, you write, you move on. Memory stays bounded regardless of dataset size. This matters when you’re processing 10k×10k images with 9+ timesteps — Python’s tendency to materialize intermediate arrays is a real problem. And every wasted CPU cycle is wasted energy. Pereira et al. (2017) measured this directly: Python consumed 76× more energy than C for equivalent workloads. Much of that gap is runtime overhead — interpreter loops, GC, reference counting — that Rust simply doesn’t have.

3. GDAL Integration

Rust’s gdal crate gives you GDAL’s C-level performance with a safe Rust API. You’re not fighting through a Python wrapper that may or may not expose the efficient code paths. The VSI cache behavior is predictable because you control the read pattern.

4. Single Binary

cargo build --release gives you one file that runs anywhere. No requirements.txt, no environment.yml, no “it works on my machine” problems. Deploy to a cluster, a Lambda, or a laptop — same binary.

What Would Make Python Better?

If you’re committed to Python, here’s what I’d try:

Use GDAL directly via rasterio — avoid the Dask task graph for CPU-bound work
Use process-based parallelism (multiprocessing) instead of threads to avoid GIL
Download data locally first — don’t read from S3 during processing
Consider ODC-Algo or direct rasterio — skip dask for simple indices

But honestly? If you’re doing serious geospatial processing at scale, the Rust approach is just more pleasant. The compiler is your teammate, not your adversary.

The Code

The full project source lives on GitLab — clone the repository and you’re ready to go. If you want to reproduce this benchmark:

# Rust (simple NDVI)
cargo run --release --example bench_ndvi_annual_full_tile --features=use_rss -- \
  --scene 56jns --start 2023-06-01 --end 2023-06-30 --max-cloud 30 \
  --threads 8 --block-size 2048 --output /tmp/ndvi_mean_rust.tif

# Rust (with FMask masking)
cargo run --release --example bench_ndvi_annual_full_tile_masked --features=use_rss -- \
  --scene 56jns --start 2023-06-01 --end 2023-06-30 --max-cloud 30 \
  --threads 8 --block-size 2048 --output /tmp/ndvi_mean_masked_rust.tif

# Python (simple NDVI)
export AWS_NO_SIGN_REQUEST=YES
python3 libs/eorst/benches/bench_ndvi_mean_stac.py --scene 56jns --start 2023-06-01 \
  --end 2023-06-30 --max-cloud 30 --threads 8 --block-size 2048 \
  --output /tmp/ndvi_mean_python.tif

# Python (with FMask masking)
export AWS_NO_SIGN_REQUEST=YES
python3 libs/eorst/benches/bench_ndvi_mean_stac_masked.py --scene 56jns --start 2023-06-01 \
  --end 2023-06-30 --max-cloud 30 --threads 8 --block-size 2048 \
  --output /tmp/ndvi_mean_masked_python.tif

The full benchmark numbers are in benchmark_ndvi_mean.md in the workspace root.

This confirms what we’ve seen in our synthetic benchmarks — Rust’s block-parallel approach with GDAL VSI beats Python+Dask for real geospatial workloads. The energy and carbon implications are real and measurable. Next up: the NBR (burnt area) index that the original DEA notebook uses, multi-zone processing, a Python local-file benchmark to isolate the Dask graph overhead from S3 I/O, and a proper power-metered energy benchmark to validate the rough calculations in this post.