DataVariable

Data variables store N-dimensional data referenced by coordinates. Unlike coordinates, they do not keep their data resident in memory: all access goes through the chunk store.

Class Hierarchy

Variable (base)
  └── DataVariableView (sliced view, supports read + write)
        └── DataVariable (full variable)

  • DataVariableView — a sliced subset, returned by indexing a DataVariable
  • DataVariable — the full variable, returned by ds['var_name'] or by variable-creation methods

Usage

with cfdb.open_dataset(file_path) as ds:
    temp = ds['temperature']
    print(temp.shape)
    print(temp.coord_names)

Properties

| Property | Type | Description |
| --- | --- | --- |
| name | str | Variable name |
| shape | tuple | Shape derived from coordinate sizes |
| chunk_shape | tuple | Chunk shape |
| dtype | DataType | cfdb data type |
| coord_names | tuple of str | Names of linked coordinates |
| coords | tuple | Coordinate objects for this variable |
| ndims | int | Number of dimensions |
| attrs | Attributes | Variable attributes |
| units | str or None | Physical units |
| writable | bool | Whether the dataset is writable |
| data | np.ndarray | Full array (reads all chunks; use with care) |
| values | np.ndarray | Alias for data |
| loc | LocationIndexer | Location-based indexer |

Reading

Indexing

temp[0, 0]          # single value
temp[10:20, :]      # slice
temp[5, 0:100]      # mixed

iter_chunks(chunk_shape=None, max_mem=2**27, decoded=True, include_data=True)

Iterate through chunks of the variable. Yields (slices, data) tuples, or slices alone when include_data=False.

# Storage chunks
for chunk_slices, data in temp.iter_chunks():
    print(chunk_slices, data.shape)

# Rechunked iteration
for chunk_slices, data in temp.iter_chunks({'latitude': 50}):
    print(chunk_slices, data.shape)

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| chunk_shape | dict or None | {coord_name: int} for target chunk sizes; None uses storage chunks |
| max_mem | int | Max memory for the rechunker buffer (default 2**27 = 128 MB) |
| decoded | bool | If False, yield encoded data; only applies in storage-chunk mode |
| include_data | bool | If False, yield only chunk position slices without loading data (default True) |
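
The include_data flag makes it possible to scan chunk layout without touching the store. A minimal sketch (count_chunks is a hypothetical helper, not part of cfdb; the slices-only yield with include_data=False follows the table above):

```python
def count_chunks(var):
    # With include_data=False, iter_chunks yields only the chunk
    # position slices, so no chunk data is read from the store.
    return sum(1 for _ in var.iter_chunks(include_data=False))
```

Calling count_chunks(ds['temperature']) inside an open_dataset block would then report how many storage chunks the variable spans.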

get_chunk(sel=None, missing_none=False)

Read data from one chunk.
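
A hedged sketch of get_chunk usage, assuming sel takes index slices identifying the chunk and that missing_none=True returns None for chunks that were never written (first_chunk is a hypothetical helper, not part of cfdb):

```python
def first_chunk(var):
    # Read the chunk covering the origin position. The slice-based
    # `sel` and the None return for missing chunks are assumptions.
    return var.get_chunk((slice(0, 1), slice(0, 1)), missing_none=True)

# with cfdb.open_dataset(file_path) as ds:
#     chunk = first_chunk(ds['temperature'])
#     if chunk is not None:
#         print(chunk.shape)
```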

items(decoded=True)

Iterate through all individual positions yielding (index, value) tuples.
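
items() enables position-level scans. As an example, a hypothetical helper that finds the maximum value and its index; note that this visits every position, so it is slow on large variables:

```python
def argmax_position(var):
    # Scan every (index, value) pair yielded by items() and keep
    # the largest value seen so far.
    best_idx, best_val = None, None
    for idx, val in var.items():
        if best_val is None or val > best_val:
            best_idx, best_val = idx, val
    return best_idx, best_val
```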

Writing

Indexing Assignment

temp[:] = full_array
temp[0:10, :] = partial_data
temp[5, 100] = 42.0

set(sel, data, decoded=True)

Set data at index positions. The decoded parameter controls whether input data is in decoded or encoded form.
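
set() is the functional counterpart of indexing assignment. A sketch, assuming sel takes the same tuple-of-slices form as indexing assignment (write_block is a hypothetical helper, not part of cfdb):

```python
import numpy as np

def write_block(var, row_start, row_stop, block):
    # Equivalent in spirit to: var[row_start:row_stop, :] = block
    var.set((slice(row_start, row_stop), slice(None)), np.asarray(block))
```

With decoded=False, the same call would be assumed to accept data already in the variable's encoded (stored) form, skipping the encode step.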

GroupBy

groupby(coord_names, max_mem=2**29)

Group by one or more coordinates. Returns a generator of (slices, data) tuples.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| coord_names | str, list of str, or dict | Coordinates to group by; dict values can be int (chunk size) or str (time period) |
| max_mem | int | Max memory for the rechunker buffer (default 2**29 = 512 MB) |

Period strings: 'D' (day), '7D' (7 days), 'M' (month), 'Y' (year), '6h' (6 hours), etc. Valid units: Y, M, W, D, h, m, s, ms, us, ns. Only valid on datetime coordinates.

# Group by individual coordinate values
for slices, data in temp.groupby('latitude'):
    print(slices, data.shape)

for slices, data in temp.groupby(('latitude', 'time')):
    print(slices, data.shape)

# Group by time period
for slices, data in temp.groupby({'time': 'D'}):
    print(slices, data.shape)

for slices, data in temp.groupby({'time': 'M'}):
    print(slices, data.shape)

# Mixed: period on time, chunk size on spatial
for slices, data in temp.groupby({'time': 'D', 'latitude': 50}):
    print(slices, data.shape)

Parallel Map

map(func, chunk_shape=None, n_workers=None, max_mem=2**27)

Apply a function to each chunk in parallel using multiprocessing. Yields (target_chunk, result) tuples as workers complete.

The user function receives (target_chunk, data) — the same values yielded by iter_chunks(). It must be a top-level picklable function (not a lambda or closure).

def compute_mean(target_chunk, data):
    return data.mean()

with cfdb.open_dataset(file_path) as ds:
    means = [result for _, result in ds['temperature'].map(compute_mean, n_workers=4)]

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| func | callable | func(target_chunk, data) -> result or None; return None to skip |
| chunk_shape | dict or None | {coord_name: int} for rechunked iteration; None uses storage chunks with the efficient booklet.map path |
| n_workers | int or None | Number of worker processes; defaults to os.cpu_count() |
| max_mem | int | Max memory for the rechunker buffer; only used when chunk_shape is set |

With rechunked chunks:

with cfdb.open_dataset(file_path, flag='w') as ds:
    temp = ds['temperature']
    for target_chunk, result in temp.map(scale_kelvin, chunk_shape={'latitude': 50}, n_workers=4):
        temp[target_chunk] = result

Works on views too:

with cfdb.open_dataset(file_path) as ds:
    view = ds['temperature'][0:50, :]
    for target_chunk, result in view.map(transform, n_workers=4):
        ...

Interpolation

interp(x=None, y=None, z=None, iter_dim=None, xy=None)

Create an interpolation object for spatial interpolation. Returns GridInterp for grid datasets or PointInterp for ts_ortho datasets. Requires geointerp and a CRS.

gi = temp.interp()
for time_val, grid in gi.to_grid(grid_res=0.01):
    print(grid.shape)

See GridInterp / PointInterp for details.

Rechunking

rechunker()

Return a Rechunker for on-the-fly rechunking:

rechunker = temp.rechunker()
for slices, data in rechunker.rechunk((50, 50)):
    print(data.shape)

See Rechunker for details.

Other Methods

load()

For EDataset: pre-fetch chunks from S3. No-op for local datasets.

update_units(units)

Update the units value. Only on writable datasets.
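
For example, a hypothetical helper that backfills missing units using the units property together with update_units (ensure_units is not part of cfdb, and the dataset is assumed to have been opened with flag='w'):

```python
def ensure_units(var, units):
    # Set units only if none are recorded; requires a writable dataset.
    if var.units is None:
        var.update_units(units)
    return var.units
```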