DataVariable

Data variables store N-dimensional data referenced by coordinates. Unlike coordinates, they do not keep their data resident in memory: all access goes through the chunk store.

Class Hierarchy

Variable (base)
  └── DataVariableView (sliced view, supports read + write)
        └── DataVariable (full variable)

  • DataVariableView — a sliced subset, returned by indexing a DataVariable
  • DataVariable — the full variable, returned by ds['var_name'] or by variable-creation methods

Usage

with cfdb.open_dataset(file_path) as ds:
    temp = ds['temperature']
    print(temp.shape)
    print(temp.coord_names)

Properties

| Property | Type | Description |
| --- | --- | --- |
| name | str | Variable name |
| shape | tuple | Shape derived from coordinate sizes |
| chunk_shape | tuple | Chunk shape |
| dtype | DataType | cfdb data type |
| coord_names | tuple of str | Names of linked coordinates |
| coords | tuple | Coordinate objects for this variable |
| ndims | int | Number of dimensions |
| attrs | Attributes | Variable attributes |
| units | str or None | Physical units |
| writable | bool | Whether the dataset is writable |
| data | np.ndarray | Full array (reads all chunks; use with care) |
| values | np.ndarray | Alias for data |
| loc | LocationIndexer | Location-based indexer |

Reading

Indexing

temp[0, 0]          # single value
temp[10:20, :]      # slice
temp[5, 0:100]      # mixed

iter_chunks(chunk_shape=None, max_mem=2**27, decoded=True, include_data=True)

Iterate through chunks of the variable. Yields (slices, data) tuples, or slices alone when include_data=False.

# Storage chunks
for chunk_slices, data in temp.iter_chunks():
    print(chunk_slices, data.shape)

# Rechunked iteration
for chunk_slices, data in temp.iter_chunks({'latitude': 50}):
    print(chunk_slices, data.shape)

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| chunk_shape | dict or None | {coord_name: int} for target chunk sizes; None uses storage chunks |
| max_mem | int | Max memory for the rechunker buffer (default 2**27 = 128 MB) |
| decoded | bool | If False, yield encoded data; only applies in storage-chunk mode |
| include_data | bool | If False, yield only chunk position slices without loading data (default True) |
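
The include_data flag makes it possible to scan chunk layout without touching the store. A minimal sketch (count_chunks is a hypothetical helper, not part of cfdb; the slices-only yield with include_data=False follows the table above):

```python
def count_chunks(var):
    # With include_data=False, iter_chunks yields only the chunk
    # position slices, so no chunk data is read from the store.
    return sum(1 for _ in var.iter_chunks(include_data=False))
```

Calling count_chunks(ds['temperature']) inside an open_dataset block would then report how many storage chunks the variable spans.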

get_chunk(sel=None, missing_none=False)

Read data from one chunk.
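
A hedged sketch of get_chunk usage, assuming sel takes index slices identifying the chunk and that missing_none=True returns None for chunks that were never written (first_chunk is a hypothetical helper, not part of cfdb):

```python
def first_chunk(var):
    # Read the chunk covering the origin position. The slice-based
    # `sel` and the None return for missing chunks are assumptions.
    return var.get_chunk((slice(0, 1), slice(0, 1)), missing_none=True)

# with cfdb.open_dataset(file_path) as ds:
#     chunk = first_chunk(ds['temperature'])
#     if chunk is not None:
#         print(chunk.shape)
```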

items(decoded=True)

Iterate through all individual positions yielding (index, value) tuples.
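
items() enables position-level scans. As an example, a hypothetical helper that finds the maximum value and its index; note that this visits every position, so it is slow on large variables:

```python
def argmax_position(var):
    # Scan every (index, value) pair yielded by items() and keep
    # the largest value seen so far.
    best_idx, best_val = None, None
    for idx, val in var.items():
        if best_val is None or val > best_val:
            best_idx, best_val = idx, val
    return best_idx, best_val
```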

Writing

Indexing Assignment

temp[:] = full_array
temp[0:10, :] = partial_data
temp[5, 100] = 42.0

set(sel, data, decoded=True)

Set data at index positions. The decoded parameter controls whether input data is in decoded or encoded form.
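
set() is the functional counterpart of indexing assignment. A sketch, assuming sel takes the same tuple-of-slices form as indexing assignment (write_block is a hypothetical helper, not part of cfdb):

```python
import numpy as np

def write_block(var, row_start, row_stop, block):
    # Equivalent in spirit to: var[row_start:row_stop, :] = block
    var.set((slice(row_start, row_stop), slice(None)), np.asarray(block))
```

With decoded=False, the same call would be assumed to accept data already in the variable's encoded (stored) form, skipping the encode step.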

GroupBy

groupby(coord_names, max_mem=2**29)

Group by one or more coordinates. Returns a generator of (slices, data) tuples.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| coord_names | str, list of str, or dict | Coordinates to group by; dict values can be int (chunk size) or str (time period) |
| max_mem | int | Max memory for the rechunker buffer (default 2**29 = 512 MB) |

Period strings: 'D' (day), '7D' (7 days), 'M' (month), 'Y' (year), '6h' (6 hours), etc. Valid units: Y, M, W, D, h, m, s, ms, us, ns. Only valid on datetime coordinates.

# Group by individual coordinate values
for slices, data in temp.groupby('latitude'):
    print(slices, data.shape)

for slices, data in temp.groupby(('latitude', 'time')):
    print(slices, data.shape)

# Group by time period
for slices, data in temp.groupby({'time': 'D'}):
    print(slices, data.shape)

for slices, data in temp.groupby({'time': 'M'}):
    print(slices, data.shape)

# Mixed: period on time, chunk size on spatial
for slices, data in temp.groupby({'time': 'D', 'latitude': 50}):
    print(slices, data.shape)

Parallel Map

map(func, chunk_shape=None, n_workers=None, max_mem=2**27)

Apply a function to each chunk in parallel using multiprocessing. Yields (target_chunk, result) tuples as workers complete.

The user function receives (target_chunk, data) — the same values yielded by iter_chunks(). It must be a top-level picklable function (not a lambda or closure).

def compute_mean(target_chunk, data):
    return data.mean()

with cfdb.open_dataset(file_path) as ds:
    means = [result for _, result in ds['temperature'].map(compute_mean, n_workers=4)]

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| func | callable | func(target_chunk, data) -> result or None; return None to skip |
| chunk_shape | dict or None | {coord_name: int} for rechunked iteration; None uses storage chunks with the efficient booklet.map path |
| n_workers | int or None | Number of worker processes; defaults to os.cpu_count() |
| max_mem | int | Max memory for the rechunker buffer; only used when chunk_shape is set |

With rechunked chunks:

with cfdb.open_dataset(file_path, flag='w') as ds:
    temp = ds['temperature']
    for target_chunk, result in temp.map(scale_kelvin, chunk_shape={'latitude': 50}, n_workers=4):
        temp[target_chunk] = result

Works on views too:

with cfdb.open_dataset(file_path) as ds:
    view = ds['temperature'][0:50, :]
    for target_chunk, result in view.map(transform, n_workers=4):
        ...

Interpolation

interp(x=None, y=None, z=None, iter_dim=None, xy=None)

Create an interpolation object for spatial interpolation. Returns GridInterp for grid datasets or PointInterp for ts_ortho datasets. Requires geointerp and a CRS.

gi = temp.interp()
for time_val, grid in gi.to_grid(grid_res=0.01):
    print(grid.shape)

See GridInterp / PointInterp for details.

Rechunking

rechunker()

Return a Rechunker for on-the-fly rechunking:

rechunker = temp.rechunker()
for slices, data in rechunker.rechunk((50, 50)):
    print(data.shape)

See Rechunker for details.

Other Methods

load()

For EDataset: pre-fetch chunks from S3. No-op for local datasets.

update_units(units)

Update the units value. Only on writable datasets.
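
For example, a hypothetical helper that backfills missing units using the units property together with update_units (ensure_units is not part of cfdb, and the dataset is assumed to have been opened with flag='w'):

```python
def ensure_units(var, units):
    # Set units only if none are recorded; requires a writable dataset.
    if var.units is None:
        var.update_units(units)
    return var.units
```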