Preprocessing Tools

rechunkit provides several functions to help plan a rechunking operation before running it. These let you estimate chunk shapes, calculate memory requirements, and predict I/O costs.

Guessing a Chunk Shape

If you don't already have a target chunk shape, guess_chunk_shape picks one that fits within a byte budget:

from rechunkit import guess_chunk_shape

shape = (1000, 2000, 500)
itemsize = 4  # float32

chunk_shape = guess_chunk_shape(shape, itemsize, target_chunk_size=2**21)  # 2 MiB budget

The function sets each dimension to the largest highly composite number that keeps the total chunk size within the target. It works iteratively, halving the largest dimension each round until the chunk fits the byte budget.
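
The halving loop on its own can be sketched in a few lines. This is a simplified illustration, not rechunkit's implementation: the helper name `halve_until_fits` is hypothetical, and the step that snaps each dimension to a highly composite number is deliberately left out.

```python
import math

def halve_until_fits(shape, itemsize, target_chunk_size):
    """Hypothetical helper: halve the largest dimension until the chunk fits the budget."""
    chunk = list(shape)
    while math.prod(chunk) * itemsize > target_chunk_size:
        # Index of the largest dimension (the first one wins on ties).
        i = max(range(len(chunk)), key=lambda k: chunk[k])
        if chunk[i] == 1:
            break  # every dimension is already 1; nothing left to shrink
        chunk[i] //= 2
    return tuple(chunk)

chunk = halve_until_fits((1000, 2000, 500), itemsize=4, target_chunk_size=2**21)
print(chunk, math.prod(chunk) * 4)  # chunk shape and its size in bytes
```

The result always fits the budget but its dimensions are generally not highly composite, which is the part guess_chunk_shape adds on top.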

Tip

Using highly composite numbers for chunk dimensions tends to produce smaller least common multiples (LCMs) with other chunk shapes, which reduces the number of redundant source reads during rechunking.
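
A quick way to see the effect is to compare LCMs for neighbouring chunk lengths. The numbers below are illustrative only; `math.lcm` requires Python 3.9 or later.

```python
import math

other_chunk = 8  # chunk length used by the other chunk shape along this axis

# 12 is highly composite; 11 and 13 are prime.
lcms = {n: math.lcm(n, other_chunk) for n in (11, 12, 13)}
print(lcms)  # {11: 88, 12: 24, 13: 104}
```

The chunk length 12 shares factors with 8, so the combined read block is far smaller than for its prime neighbours.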

Ideal Read Chunk Shape

The ideal intermediate read shape is the element-wise LCM of the source and target chunk shapes:

from rechunkit import calc_ideal_read_chunk_shape, calc_ideal_read_chunk_mem

source_chunks = (6, 4)
target_chunks = (4, 6)

ideal_shape = calc_ideal_read_chunk_shape(source_chunks, target_chunks)  # (12, 12)
ideal_mem = calc_ideal_read_chunk_mem(ideal_shape, itemsize=4)           # 576 bytes

If you can afford ideal_mem bytes of buffer, every source chunk will be read exactly once.
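
The same numbers can be reproduced with the standard library, which makes the arithmetic explicit. The helper names `ideal_read_shape` and `read_shape_bytes` are hypothetical stand-ins, not part of rechunkit.

```python
import math

def ideal_read_shape(source_chunks, target_chunks):
    """Element-wise LCM of two chunk shapes (hypothetical helper)."""
    return tuple(math.lcm(s, t) for s, t in zip(source_chunks, target_chunks))

def read_shape_bytes(read_shape, itemsize):
    """Bytes needed to buffer one block of the given shape."""
    return math.prod(read_shape) * itemsize

block = ideal_read_shape((6, 4), (4, 6))
print(block, read_shape_bytes(block, itemsize=4))  # (12, 12) 576
```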

Memory-Constrained Read Shape

When the ideal shape doesn't fit in memory, calc_source_read_chunk_shape computes a reduced read shape that fits within max_mem:

from rechunkit import calc_source_read_chunk_shape

read_shape = calc_source_read_chunk_shape(
    source_chunks, target_chunks, itemsize=4, max_mem=200
)

The algorithm finds a multiple of the source chunk shape that fits in memory while preserving the aspect ratio of the ideal shape as closely as possible.
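
To get a feel for the search space, the sketch below lists every multiple of the source chunk shape, up to the ideal shape, that fits in the budget. It is an illustration under stated assumptions: `fitting_multiples` is a hypothetical helper, and it does not reproduce how calc_source_read_chunk_shape chooses among the candidates.

```python
import math
from itertools import product

def fitting_multiples(source_chunks, target_chunks, itemsize, max_mem):
    """List multiples of the source chunk shape that fit in max_mem bytes."""
    ideal = tuple(math.lcm(s, t) for s, t in zip(source_chunks, target_chunks))
    # Per dimension: source chunk length, twice that, ... up to the ideal length.
    per_dim = [range(s, top + 1, s) for s, top in zip(source_chunks, ideal)]
    return [c for c in product(*per_dim) if math.prod(c) * itemsize <= max_mem]

print(fitting_multiples((6, 4), (4, 6), itemsize=4, max_mem=200))
# [(6, 4), (6, 8), (12, 4)]
```

With a 200-byte budget only three candidates survive, all well short of the 576-byte ideal block.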

Counting Chunks

calc_n_chunks returns the total number of chunks for a given shape and chunk shape:

from rechunkit import calc_n_chunks

n_source = calc_n_chunks((100, 100), (6, 4))   # ceil(100/6) * ceil(100/4) = 17 * 25 = 425
n_target = calc_n_chunks((100, 100), (4, 6))   # ceil(100/4) * ceil(100/6) = 25 * 17 = 425
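
The count is the product of the per-dimension ceiling divisions, with partial chunks at the edges counted as whole chunks. A minimal pure-Python equivalent, using the hypothetical name `n_chunks`:

```python
import math

def n_chunks(shape, chunk_shape):
    """Product over dimensions of ceil(length / chunk_length)."""
    # -(-n // c) is integer ceiling division.
    return math.prod(-(-n // c) for n, c in zip(shape, chunk_shape))

print(n_chunks((100, 100), (6, 4)))  # 425
print(n_chunks((100, 100), (4, 6)))  # 425
```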

Predicting Read Counts

Two functions let you compare I/O strategies:

Brute-force reads

calc_n_reads_simple counts the source chunk reads needed when every target chunk independently reads all of the source chunks it overlaps, with no optimization:

from rechunkit import calc_n_reads_simple

n_simple = calc_n_reads_simple((31, 31, 31), (5, 2, 4), (4, 5, 3))  # 3952
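
The figure can be checked by hand, because overlaps factor per dimension: a target chunk overlaps a source chunk only if their ranges overlap along every axis. The sketch below is an independent count written for illustration (the helper names are hypothetical), assuming the argument order is shape, source chunk shape, target chunk shape.

```python
import math

def overlaps_1d(length, source_chunk, target_chunk):
    """Count (target chunk, source chunk) pairs that overlap along one axis."""
    total = 0
    for t_start in range(0, length, target_chunk):
        t_stop = min(t_start + target_chunk, length)
        first = t_start // source_chunk        # first source chunk touched
        last = (t_stop - 1) // source_chunk    # last source chunk touched
        total += last - first + 1
    return total

def n_reads_simple(shape, source_chunks, target_chunks):
    """Brute-force read count: product of the per-axis overlap counts."""
    return math.prod(
        overlaps_1d(n, s, t) for n, s, t in zip(shape, source_chunks, target_chunks)
    )

print(n_reads_simple((31, 31, 31), (5, 2, 4), (4, 5, 3)))  # 13 * 19 * 16 = 3952
```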

Optimized reads

calc_n_reads_rechunker predicts the number of reads and writes the optimized algorithm performs for a given memory budget (max_mem):

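import numpy as np
from rechunkit import calc_n_reads_rechunker

# Assumed: a 4-byte dtype, matching the float32 itemsize used earlier on this page.
dtype = np.dtype("float32")

n_reads, n_writes = calc_n_reads_rechunker(
    (31, 31, 31), dtype.itemsize, (5, 2, 4), (4, 5, 3), max_mem=2000
)
# n_reads=2044, n_writes=616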

A larger max_mem means fewer reads. See How It Works for details.

chunk_range Utility

chunk_range is a multi-dimensional equivalent of Python's range, yielding tuples of slices:

from rechunkit import chunk_range

for chunk in chunk_range((0, 0), (10, 10), (4, 6)):
    print(chunk)
# (slice(0, 4), slice(0, 6))
# (slice(0, 4), slice(6, 10))
# (slice(4, 8), slice(0, 6))
# (slice(4, 8), slice(6, 10))
# (slice(8, 10), slice(0, 6))
# (slice(8, 10), slice(6, 10))

This is useful for iterating over chunk positions when writing to storage backends.
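
For readers who want to see the iteration pattern without the library, the same sequence of slices can be produced with `itertools.product`. The generator `chunk_slices` below is a hypothetical stand-in written for illustration, not rechunkit's implementation; it assumes the arguments are start, stop, and chunk shape, as in the example above.

```python
from itertools import product

def chunk_slices(start, stop, chunk_shape):
    """Yield tuples of slices tiling the region [start, stop) in chunk-sized steps."""
    per_dim = [
        # The last slice along an axis is clipped to the stop coordinate.
        [slice(lo, min(lo + step, hi)) for lo in range(first, hi, step)]
        for first, hi, step in zip(start, stop, chunk_shape)
    ]
    yield from product(*per_dim)

# Toy "storage backend": a dict keyed by each chunk's starting coordinates.
store = {}
for chunk in chunk_slices((0, 0), (10, 10), (4, 6)):
    key = tuple(s.start for s in chunk)
    store[key] = chunk

print(list(store))  # [(0, 0), (0, 6), (4, 0), (4, 6), (8, 0), (8, 6)]
```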