DataVariable¶
Data variables store N-dimensional data referenced by coordinates. Unlike coordinates, they never hold full data in memory — all access goes through the chunk store.
Class Hierarchy¶
Variable (base)
└── DataVariableView (sliced view, supports read + write)
    └── DataVariable (full variable)
- DataVariable — the full variable, returned by ds['var_name'] or creation methods
- DataVariableView — a sliced subset, returned by indexing
Usage¶
with cfdb.open_dataset(file_path) as ds:
    temp = ds['temperature']
    print(temp.shape)
    print(temp.coord_names)
Properties¶
| Property | Type | Description |
|---|---|---|
| name | str | Variable name |
| shape | tuple | Shape derived from coordinate sizes |
| chunk_shape | tuple | Chunk shape |
| dtype | DataType | cfdb data type |
| coord_names | tuple of str | Names of linked coordinates |
| coords | tuple | Coordinate objects for this variable |
| ndims | int | Number of dimensions |
| attrs | Attributes | Variable attributes |
| units | str or None | Physical units |
| writable | bool | Whether the dataset is writable |
| data | np.ndarray | Full array (reads all chunks; use with care) |
| values | np.ndarray | Alias for data |
| loc | LocationIndexer | Location-based indexer |
Reading¶
Indexing¶
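Because all access goes through the chunk store, an indexed read only touches the chunks that intersect the requested selection. As an illustrative sketch of that bookkeeping in plain Python (not cfdb's actual internals), the storage chunks touched by a 1-D slice can be computed as:

```python
def chunks_for_slice(start, stop, chunk_size):
    """Return the indices of the storage chunks a 1-D slice [start:stop) touches."""
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return list(range(first, last + 1))

# A slice 25:75 over a coordinate chunked in blocks of 30 touches chunks 0, 1, and 2
print(chunks_for_slice(25, 75, 30))  # [0, 1, 2]
```

Only those chunks are read from the store; the rest of the variable stays on disk.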
iter_chunks(chunk_shape=None, max_mem=2**27, decoded=True, include_data=True)¶
Iterate through chunks of the variable. Always yields (slices, data) tuples.
# Storage chunks
for chunk_slices, data in temp.iter_chunks():
    print(chunk_slices, data.shape)

# Rechunked iteration
for chunk_slices, data in temp.iter_chunks({'latitude': 50}):
    print(chunk_slices, data.shape)
Parameters:
| Parameter | Type | Description |
|---|---|---|
| chunk_shape | dict or None | {coord_name: int} for target chunk sizes. None uses storage chunks. |
| max_mem | int | Max memory for the rechunker buffer (default 2**27 bytes, 128 MB). |
| decoded | bool | If False, yield encoded data. Only applies in storage-chunk mode. |
| include_data | bool | If False, yield only chunk position slices without loading data. Default True. |
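The yielded (slices, data) pairs tile the full array. A minimal NumPy analogue of the iteration pattern (illustrative only, not the cfdb implementation):

```python
import itertools
import numpy as np

def iter_array_chunks(arr, chunk_shape):
    """Yield (slices, data) tuples tiling arr, mimicking iter_chunks' output."""
    starts = [range(0, size, c) for size, c in zip(arr.shape, chunk_shape)]
    for origin in itertools.product(*starts):
        slices = tuple(slice(o, min(o + c, size))
                       for o, c, size in zip(origin, chunk_shape, arr.shape))
        yield slices, arr[slices]

arr = np.arange(12).reshape(3, 4)
chunks = list(iter_array_chunks(arr, (2, 2)))  # 4 chunks; edge chunks are smaller
```

Note that edge chunks are truncated to the array bounds, so the pieces always reassemble to the original shape.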
get_chunk(sel=None, missing_none=False)¶
Read data from one chunk.
items(decoded=True)¶
Iterate through all individual positions yielding (index, value) tuples.
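This is analogous to NumPy's np.ndenumerate, which also yields (index, value) pairs:

```python
import numpy as np

arr = np.array([[1.5, 2.5],
                [3.5, 4.5]])
# Each element is an ((i, j), value) pair, in row-major order
pairs = list(np.ndenumerate(arr))
```

For large variables, position-by-position iteration is slow; chunk-wise access via iter_chunks is generally the better fit for bulk reads.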
Writing¶
Indexing Assignment¶
set(sel, data, decoded=True)¶
Set data at index positions. The decoded parameter controls whether input data is in decoded or encoded form.
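The decoded/encoded distinction commonly corresponds to a scale-and-offset packing. A hypothetical sketch of the idea (the SCALE, OFFSET, and int16 packing here are illustrative assumptions, not cfdb's actual codec):

```python
import numpy as np

SCALE, OFFSET = 0.01, 273.15  # hypothetical packing parameters

def encode(decoded):
    """Pack physical values into a compact on-disk integer representation."""
    return np.round((decoded - OFFSET) / SCALE).astype(np.int16)

def decode(encoded):
    """Unpack stored integers back into physical values."""
    return encoded.astype(np.float64) * SCALE + OFFSET

temps = np.array([273.15, 293.65])
packed = encode(temps)      # stored form, i.e. what decoded=False refers to
restored = decode(packed)   # physical form, i.e. what decoded=True refers to
```

With decoded=True (the default) you pass physical values and the variable encodes them; with decoded=False you supply data already in the stored form.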
GroupBy¶
groupby(coord_names, max_mem=2**29)¶
Group by one or more coordinates. Returns a generator of (slices, data) tuples.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| coord_names | str, list of str, or dict | Coordinates to group by. Dict values can be int (chunk size) or str (time period). |
| max_mem | int | Max memory for the rechunker buffer (default 2**29 bytes, 512 MB). |
Period strings: 'D' (day), '7D' (7 days), 'M' (month), 'Y' (year), '6h' (6 hours), etc. Valid units: Y, M, W, D, h, m, s, ms, us, ns. Only valid on datetime coordinates.
# Group by individual coordinate values
for slices, data in temp.groupby('latitude'):
    print(slices, data.shape)

for slices, data in temp.groupby(('latitude', 'time')):
    print(slices, data.shape)

# Group by time period
for slices, data in temp.groupby({'time': 'D'}):
    print(slices, data.shape)

for slices, data in temp.groupby({'time': 'M'}):
    print(slices, data.shape)

# Mixed: period on time, chunk size on spatial
for slices, data in temp.groupby({'time': 'D', 'latitude': 50}):
    print(slices, data.shape)
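Conceptually, grouping by a period amounts to truncating the datetime coordinate to that period and slicing contiguous runs of equal values. A plain-NumPy sketch of the 'D' (day) case (illustrative, not cfdb's implementation):

```python
import numpy as np

times = np.arange('2020-01-01', '2020-01-04', dtype='datetime64[h]')  # 72 hourly steps
days = times.astype('datetime64[D]')  # truncate to day, like the 'D' period string

# Positions where the day value changes delimit the groups
change = np.nonzero(days[1:] != days[:-1])[0] + 1
starts = np.concatenate(([0], change))
stops = np.concatenate((change, [len(times)]))
group_slices = [slice(int(a), int(b)) for a, b in zip(starts, stops)]  # one slice per day
```

The same truncation idea extends to the other period units ('M', 'Y', '6h', and so on), which is why period grouping only applies to datetime coordinates.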
Parallel Map¶
map(func, chunk_shape=None, n_workers=None, max_mem=2**27)¶
Apply a function to each chunk in parallel using multiprocessing. Yields (target_chunk, result) tuples as workers complete.
The user function receives (target_chunk, data) — the same values yielded by iter_chunks(). It must be a top-level picklable function (not a lambda or closure).
def compute_mean(target_chunk, data):
    return data.mean()

with cfdb.open_dataset(file_path) as ds:
    means = [result for _, result in ds['temperature'].map(compute_mean, n_workers=4)]
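The top-level requirement exists because multiprocessing pickles the function in order to ship it to worker processes. A quick stdlib check of what is shippable:

```python
import pickle

# Module-level and builtin functions pickle by reference (their importable
# name), so workers can look them up on the other side
assert pickle.loads(pickle.dumps(len)) is len

try:
    pickle.dumps(lambda x: x * 2)  # lambdas have no importable name
    lambda_picklable = True
except Exception:
    lambda_picklable = False
```

Lambdas and closures fail this check, which is why they cannot be passed to map().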
Parameters:
| Parameter | Type | Description |
|---|---|---|
| func | callable | func(target_chunk, data) -> result or None. Return None to skip. |
| chunk_shape | dict or None | {coord_name: int} for rechunked iteration. None uses storage chunks with the efficient booklet.map path. |
| n_workers | int or None | Number of worker processes. Defaults to os.cpu_count(). |
| max_mem | int | Max memory for the rechunker buffer. Only used when chunk_shape is set. |
With rechunked chunks:
# scale_kelvin must be a top-level picklable function taking (target_chunk, data)
with cfdb.open_dataset(file_path, flag='w') as ds:
    temp = ds['temperature']
    for target_chunk, result in temp.map(scale_kelvin, chunk_shape={'latitude': 50}, n_workers=4):
        temp[target_chunk] = result
Works on views too:
with cfdb.open_dataset(file_path) as ds:
    view = ds['temperature'][0:50, :]
    for target_chunk, result in view.map(transform, n_workers=4):
        ...
Interpolation¶
interp(x=None, y=None, z=None, iter_dim=None, xy=None)¶
Create an interpolation object for spatial interpolation. Returns GridInterp for grid datasets or PointInterp for ts_ortho datasets. Requires geointerp and a CRS.
See GridInterp / PointInterp for details.
Rechunking¶
rechunker()¶
Return a Rechunker object for on-the-fly rechunking.
See Rechunker for details.
Other Methods¶
load()¶
For EDataset: pre-fetch chunks from S3. No-op for local datasets.
update_units(units)¶
Update the units value. Only on writable datasets.