Chunking & Storage
cfdb stores all data — both coordinates and data variables — as compressed chunks. This page explains how chunking works and how to choose good chunk shapes.
What is a Chunk?
A chunk is a fixed-size rectangular block of the full array. For example, a variable with shape (1000, 2000) and chunk shape (100, 200) is stored as 100 separate chunks (10 along the first axis, 10 along the second).
Each chunk is independently compressed and stored as a single Booklet key-value entry.
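The chunk count in the example above can be checked with a few lines of Python. This is a minimal sketch, not cfdb code: `chunk_grid` is a hypothetical helper, and the ceiling division assumes that a dimension which is not an exact multiple of the chunk size still needs one final chunk to cover the remainder.

```python
import math


def chunk_grid(shape, chunk_shape):
    """Return the number of chunks along each axis (hypothetical helper)."""
    # Ceiling division: a partial trailing block still counts as one chunk.
    return tuple(-(-s // c) for s, c in zip(shape, chunk_shape))


grid = chunk_grid((1000, 2000), (100, 200))
n_chunks = math.prod(grid)
print(grid, n_chunks)  # (10, 10) 100
```

Each of those 100 blocks would be compressed and stored as its own entry.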
Chunk Key Format
Chunk keys consist of the variable name, a "!" separator, and the chunk's comma-separated position.
For example, chunk (200, 400) of variable temperature is stored with key temperature!200,400.
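The key in this example can be reproduced with a short sketch. It assumes that the numbers in the key are the chunk's starting indices along each axis, which is consistent with the (100, 200) chunk shape used earlier but is not confirmed here. Both helpers are hypothetical and are not part of the cfdb API.

```python
def chunk_start_for(index, chunk_shape):
    """Return the start indices of the chunk containing `index` (assumed layout)."""
    return tuple((i // c) * c for i, c in zip(index, chunk_shape))


def chunk_key(var_name, chunk_start):
    """Build a key of the form name!i,j,... (hypothetical helper)."""
    return f"{var_name}!{','.join(str(i) for i in chunk_start)}"


# Element (250, 430) falls in the chunk that starts at (200, 400).
start = chunk_start_for((250, 430), (100, 200))
key = chunk_key("temperature", start)
print(key)  # temperature!200,400
```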
Compression
Every chunk is compressed before storage. The algorithm is set at dataset creation:
| Algorithm | Library | Characteristics |
|---|---|---|
| zstd | zstandard | Best compression ratio at reasonable speed (default) |
| lz4 | lz4 | Fastest compression/decompression |
Compression level defaults to 1 for both algorithms. Higher levels improve ratio but slow down writes.
Automatic Chunk Shape
When chunk_shape=None is passed during variable creation, cfdb uses rechunkit.guess_chunk_shape() to estimate an appropriate chunk shape based on:
- The variable's total shape
- The dtype's element size
- A target chunk byte size
The algorithm prefers composite numbers for chunk dimensions. This matters because rechunking between two chunk shapes is most efficient when the least common multiple (LCM) of corresponding dimensions is small. Composite numbers often share factors, so their LCMs tend to be smaller than those of distinct primes, whose LCM is simply their product.
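The effect of composite versus prime chunk dimensions on the LCM is easy to see numerically. The sketch below is an illustration only and does not reproduce the algorithm in rechunkit. `lcm_shape` is a hypothetical helper that computes the per-axis LCM of a source and a target chunk shape.

```python
import math


def lcm_shape(source_chunks, target_chunks):
    """Per-axis least common multiple of two chunk shapes (hypothetical helper)."""
    return tuple(s * t // math.gcd(s, t) for s, t in zip(source_chunks, target_chunks))


# Composite dimensions share factors, so the LCM stays small.
print(lcm_shape((100, 200), (50, 400)))  # (100, 400)

# Distinct prime dimensions share nothing, so the LCM is their product.
print(lcm_shape((101, 199), (53, 397)))  # (5353, 79003)
```

The per-axis LCM is the smallest block that lines up with both chunk grids at once, so a small LCM means less data has to be held together while converting from one chunk shape to the other.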
The trade-off is that a larger chunk yields a higher compression ratio but slows down reads, because a large amount of data must be decompressed even for a small slicing request.
The default chunk byte size is a maximum of ~2 MB. Both compression algorithms used in cfdb tend to max out the compression ratio between 1 and 2 MB of raw data. A chunk byte size greater than 2 MB would not significantly improve compression and would slow down reads. If anything, reduce the default chunk byte size rather than increase it.
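To check a candidate chunk shape against this limit, multiply the number of elements by the dtype's element size. The sketch below assumes "~2 MB" means 2 * 1024 * 1024 bytes. Both `TARGET_BYTES` and `chunk_nbytes` are hypothetical names and not part of cfdb.

```python
import math

TARGET_BYTES = 2 * 1024 * 1024  # assumed reading of the "~2 MB" maximum


def chunk_nbytes(chunk_shape, itemsize):
    """Uncompressed size in bytes of one chunk (hypothetical helper)."""
    return math.prod(chunk_shape) * itemsize


print(chunk_nbytes((100, 200), 4))  # 80000 (float32, well under the target)
print(chunk_nbytes((512, 512), 8))  # 2097152 (float64, exactly at the target)
print(chunk_nbytes((1000, 1000), 8) <= TARGET_BYTES)  # False
```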
Choosing Chunk Shapes
The optimal chunk shape depends on your access pattern:
| Access Pattern | Ideal Chunk Shape |
|---|---|
| Read full rows | (1, N) — thin along rows, wide along columns |
| Read full columns | (N, 1) — wide along rows, thin along columns |
| Read spatial blocks | (M, M) — square chunks |
| Time series at one point | (1, 1, T) — thin spatially, long temporally |
| Spatial snapshot at one time | (Y, X, 1) — wide spatially, thin temporally |
In practice, the auto-estimated chunk shape is a reasonable starting point. Use the Rechunker when you need a different access pattern.
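The table can be made concrete by counting how many chunks a read has to touch. The following sketch assumes a variable with dimensions ordered (y, x, time) and shape (100, 100, 1000). `chunks_touched` is a hypothetical helper that takes a non-empty (start, stop) range per axis.

```python
def chunks_touched(selection, chunk_shape):
    """Count the chunks overlapping a selection of (start, stop) ranges per axis."""
    count = 1
    for (start, stop), c in zip(selection, chunk_shape):
        first = start // c        # index of the first chunk hit on this axis
        last = (stop - 1) // c    # index of the last chunk hit on this axis
        count *= last - first + 1
    return count


time_series = ((10, 11), (20, 21), (0, 1000))  # one point, all time steps
snapshot = ((0, 100), (0, 100), (5, 6))        # full grid, one time step

print(chunks_touched(time_series, (1, 1, 1000)))   # 1
print(chunks_touched(time_series, (100, 100, 1)))  # 1000
print(chunks_touched(snapshot, (1, 1, 1000)))      # 10000
print(chunks_touched(snapshot, (100, 100, 1)))     # 1
```

A chunk shape that suits one access pattern can be several orders of magnitude worse for the other, which is why rechunking is worth the cost when the dominant access pattern changes.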
Coordinate Chunk Storage
Coordinates are also stored as chunks, but a coordinate always holds its full data in memory. This is because coordinate data is typically small (1-D arrays) and needed frequently for index lookups.
Data Variable Chunk Storage
Data variables never hold full data in memory. Every read goes through the chunk store. This keeps memory usage predictable even for very large datasets.
Chunk Alignment and Origins
Coordinates can have a non-zero origin when data is prepended. The origin tracks the starting position of the coordinate in the global index space. This allows prepending data without rewriting existing chunks.
For example, if a coordinate originally starts at index 0 and you prepend 100 values, the origin becomes -100 and existing chunks keep their original keys.
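The idea behind the origin can be sketched in a few lines. This is an illustration of the concept only, not cfdb's implementation. The chunk size of 50, the coordinate length of 300, and the `chunk_starts` helper are all hypothetical, and labelling the prepended chunks with negative start indices is an assumption.

```python
def chunk_starts(origin, length, chunk_size):
    """List chunk start indices in the global index space (hypothetical helper)."""
    first = (origin // chunk_size) * chunk_size  # floor also works for negative origins
    return list(range(first, origin + length, chunk_size))


# Original coordinate: 300 values starting at global index 0.
before = chunk_starts(origin=0, length=300, chunk_size=50)

# After prepending 100 values the origin moves to -100 and the length grows to 400.
after = chunk_starts(origin=-100, length=400, chunk_size=50)

print(before)  # [0, 50, 100, 150, 200, 250]
print(after)   # [-100, -50, 0, 50, 100, 150, 200, 250]
```

Under these assumptions the chunks that existed before the prepend keep the same start indices, so only the two new chunks need to be written.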