Data Variables

Data variables store N-dimensional data referenced by coordinates. Unlike coordinates, data variables never hold their full data in memory; data is always accessed chunk by chunk.

Creating Data Variables

Data variables require existing coordinates:

import cfdb
import numpy as np

with cfdb.open_dataset(file_path, flag='w') as ds:
    data_var = ds.create.data_var.generic(
        'temperature',
        ('latitude', 'time'),
        dtype='float32',
    )

Template Methods

Like coordinates, common data variable types have template methods:

with cfdb.open_dataset(file_path, flag='w') as ds:
    temp = ds.create.data_var.air_temperature(('latitude', 'longitude', 'time'))

Templates set standard names, dtypes, and attributes. Any parameter accepted by generic() can be overridden via **kwargs.

Generic Creation Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| name | str | Unique variable name |
| coords | tuple of str | Coordinate names defining the dimensions |
| dtype | str, np.dtype, or DataType | Data type |
| chunk_shape | tuple of int or None | Chunk shape (auto-estimated if None) |

Creating from Existing

The like() method creates a new variable that copies the structure (coordinates, dtype, and chunk shape) of an existing variable:

with cfdb.open_dataset(file_path, flag='w') as ds:
    new_var = ds.create.data_var.like('temperature_copy', ds['temperature'])

Writing Data

Direct Assignment

The simplest way to write a full array:

data = np.random.rand(200, 200).astype('float32')

with cfdb.open_dataset(file_path, flag='w') as ds:
    ds['temperature'][:] = data

Assignment uses numpy basic indexing (integers and slices):

with cfdb.open_dataset(file_path, flag='w') as ds:
    temp = ds['temperature']
    temp[0:10, :] = data[0:10, :]   # slice assignment
    temp[5, 100] = 42.0             # scalar assignment

Note

Advanced indexing (fancy indexing with integer or boolean arrays) is not currently supported; it may be added in a future release.
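
This restriction mirrors numpy's own distinction between basic and advanced indexing. A plain-numpy illustration of what falls into each category (no cfdb involved):

```python
import numpy as np

data = np.arange(12, dtype='float32').reshape(3, 4)

# Basic indexing uses integers and slices; this is the form
# supported for assignment on cfdb variables
data[0:2, :] = 0.0     # slice assignment
data[2, 3] = 42.0      # scalar assignment

# Advanced (fancy) indexing uses integer or boolean arrays;
# this is the form that is not supported on variables
rows = np.array([0, 2])
subset = data[rows, :]  # selects rows 0 and 2 into a new array
```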

For large datasets, iterate over chunk positions to control memory usage:

with cfdb.open_dataset(file_path, flag='w') as ds:
    temp = ds['temperature']
    for chunk_slices in temp.iter_chunks(include_data=False):
        temp[chunk_slices] = data[chunk_slices]

This is the recommended approach when your source data is larger than memory or comes in pieces.
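
To judge whether a chunk fits your memory budget, the arithmetic is just the element count times the item size. A quick sketch with plain numpy (an illustration, not a cfdb API):

```python
import numpy as np

chunk_shape = (100, 100)
dtype = np.dtype('float32')

# memory held by one decompressed chunk: elements * bytes per element
chunk_bytes = int(np.prod(chunk_shape)) * dtype.itemsize
print(chunk_bytes)  # 40000 bytes, i.e. ~39 KiB per chunk
```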

Reading Data

Full Array

For small variables, read everything at once:

with cfdb.open_dataset(file_path) as ds:
    all_data = ds['temperature'].values

For large datasets, iterate over chunks:

with cfdb.open_dataset(file_path) as ds:
    temp = ds['temperature']
    for chunk_slices, chunk_data in temp.iter_chunks():
        print(chunk_slices, chunk_data.shape)

The chunk_slices tuple contains slice objects that can be used directly as a numpy index.
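
To make that concrete, here is a plain-numpy sketch of how such slice tuples tile an array. make_chunk_slices is a hypothetical helper for illustration, not part of cfdb:

```python
import itertools
import numpy as np

def make_chunk_slices(shape, chunk_shape):
    """Yield one tuple of slices per chunk, covering an array of
    the given shape (hypothetical stand-in for iter_chunks)."""
    ranges = [range(0, n, c) for n, c in zip(shape, chunk_shape)]
    for starts in itertools.product(*ranges):
        yield tuple(
            slice(s, min(s + c, n))
            for s, c, n in zip(starts, chunk_shape, shape)
        )

data = np.zeros((5, 7))
chunks = list(make_chunk_slices(data.shape, (2, 3)))

# each tuple can be used directly as a numpy index;
# edge chunks are smaller when the shape is not an exact multiple
first = data[chunks[0]]   # shape (2, 3)
last = data[chunks[-1]]   # shape (1, 1)
```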

You can also iterate with a different chunk shape by passing a dict of {coord_name: int}:

with cfdb.open_dataset(file_path) as ds:
    temp = ds['temperature']
    for chunk_slices, chunk_data in temp.iter_chunks({'latitude': 50}):
        print(chunk_slices, chunk_data.shape)

For position-only iteration (no data loading), pass include_data=False:

with cfdb.open_dataset(file_path) as ds:
    temp = ds['temperature']
    for chunk_slices in temp.iter_chunks(include_data=False):
        print(chunk_slices)

GroupBy

Group by one or more coordinate dimensions. This rechunks the data so each yielded array covers a single position along the grouped dimension(s) and the full extent of all other dimensions:

with cfdb.open_dataset(file_path) as ds:
    for slices, data in ds['temperature'].groupby('latitude'):
        print(slices, data.shape)

Group by multiple coordinates:

with cfdb.open_dataset(file_path) as ds:
    for slices, data in ds['temperature'].groupby(('latitude', 'time')):
        print(slices, data.shape)

Time Period GroupBy

Pass a dict with period strings to group by time periods. This works on any datetime coordinate:

with cfdb.open_dataset(file_path) as ds:
    temp = ds['temperature']

    # Daily groups (hourly data → 24 time steps per group)
    for slices, data in temp.groupby({'time': 'D'}):
        print(slices, data.shape)

    # Monthly groups (variable size — Jan=31, Feb=28/29, etc.)
    for slices, data in temp.groupby({'time': 'M'}):
        print(slices, data.shape)

    # Yearly groups
    for slices, data in temp.groupby({'time': 'Y'}):
        print(slices, data.shape)

    # Every 6 hours
    for slices, data in temp.groupby({'time': '6h'}):
        print(slices, data.shape)

Supported period units: Y (year), M (month), W (week), D (day), h (hour), m (minute), s (second), ms, us, ns. Prefix with a count for multiples, e.g. '7D', '3M', '6h'.
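
Whether a period corresponds to a fixed number of time steps is plain timedelta arithmetic, shown here with numpy (an illustration of the arithmetic, not cfdb internals):

```python
import numpy as np

step = np.timedelta64(1, 'h')   # hourly source data

# Fixed-size groups: a day is always 24 hourly steps
steps_per_day = np.timedelta64(1, 'D') // step
print(steps_per_day)  # 24

# '6h' groups on hourly data hold 6 steps each
steps_per_6h = np.timedelta64(6, 'h') // step
print(steps_per_6h)  # 6

# Months and years have no fixed length in hours, so those
# groups vary in size (31 days in Jan, 28 or 29 in Feb, ...)
```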

Dict values can also be integers (chunk sizes), which can be mixed with period strings on different coordinates:

# Group by day on time, chunk size 50 on latitude
for slices, data in temp.groupby({'time': 'D', 'latitude': 50}):
    print(slices, data.shape)

Performance: When the period maps to a fixed number of time steps (e.g. daily on hourly data = 24 steps) and all groups are the same size, cfdb uses the efficient rechunker path. For irregular periods like monthly or yearly, it falls back to a slice-based iteration that reads each group separately.

The max_mem parameter controls the memory budget for the rechunking operation (default 128 MB).

Parallel Map

The map() method applies a function to each chunk in parallel using multiprocessing. It yields (target_chunk, result) tuples as workers complete. The function receives exactly what iter_chunks() yields: a target_chunk tuple of slices and a data numpy array.

The function must be a top-level picklable function (not a lambda or closure).
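
The reason is that worker processes receive the function via pickle, which stores top-level functions by module and name. A quick check, independent of cfdb:

```python
import pickle

def scale_kelvin(target_chunk, data):
    """Top-level function: picklable, safe to hand to map()."""
    return data + 273.15

# A top-level function pickles by reference and round-trips cleanly
restored = pickle.loads(pickle.dumps(scale_kelvin))

# A lambda has no importable name, so pickling it fails
try:
    pickle.dumps(lambda tc, d: d + 273.15)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_picklable = False
```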

By default (no chunk_shape), map() uses the efficient booklet.map path where workers decompress and compute directly. Pass a chunk_shape dict to use a pool-based approach with rechunked chunks instead.

Transform and Write Back

def scale_kelvin(target_chunk, data):
    return data + 273.15

with cfdb.open_dataset(file_path, flag='w') as ds:
    temp = ds['temperature']
    for target_chunk, result in temp.map(scale_kelvin, n_workers=4):
        temp[target_chunk] = result

Aggregate

def chunk_stats(target_chunk, data):
    return {'mean': float(data.mean()), 'std': float(data.std())}

with cfdb.open_dataset(file_path) as ds:
    stats = [result for _, result in ds['temperature'].map(chunk_stats)]

Skip Chunks

Return None from your function to skip a chunk — it will not appear in the output:

def only_positive_mean(target_chunk, data):
    m = data.mean()
    if m > 0:
        return m
    return None

with cfdb.open_dataset(file_path) as ds:
    positive_means = [result for _, result in ds['temperature'].map(only_positive_mean)]

Map on a View

map() works on sliced views, processing only the selected chunks:

with cfdb.open_dataset(file_path) as ds:
    view = ds['temperature'][0:50, :]
    for target_chunk, result in view.map(scale_kelvin, n_workers=4):
        print(target_chunk, result.shape)

Dataset-Level Iteration

The methods above operate on a single data variable. The dataset also provides iter_chunks, groupby, and map that iterate over multiple data variables in lockstep. All data variables must share the same coordinates.
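
The lockstep behaviour can be pictured with plain numpy: because the variables share coordinates, a single set of slices indexes all of them (a sketch of the idea, not cfdb internals):

```python
import numpy as np

# Two variables on the same (latitude, longitude) grid
temperature = np.random.rand(100, 50).astype('float32')
wind_speed = np.random.rand(100, 50).astype('float32')

chunks = []
for i in range(0, 100, 25):          # latitude chunks of 25
    target_chunk = {'latitude': slice(i, i + 25)}
    # the same slice indexes every variable, keeping them aligned
    var_data = {
        'temperature': temperature[target_chunk['latitude']],
        'wind_speed': wind_speed[target_chunk['latitude']],
    }
    chunks.append((target_chunk, var_data))
```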

iter_chunks

Yields (target_chunk, var_data) where target_chunk is a {coord_name: slice} dict and var_data is a {var_name: ndarray} dict:

with cfdb.open_dataset(file_path) as ds:
    for target_chunk, var_data in ds.iter_chunks({'latitude': 25, 'longitude': 25}):
        temp_data = var_data['temperature']
        wind_data = var_data['wind_speed']
        print(target_chunk, temp_data.shape)

Use data_vars to limit which variables are included:

for target_chunk, var_data in ds.iter_chunks({'latitude': 50}, data_vars=['temperature']):
    print(var_data.keys())  # dict_keys(['temperature'])

Pass include_data=False for position-only iteration (no data loading):

with cfdb.open_dataset(file_path) as ds:
    for chunk in ds.iter_chunks({'latitude': 25, 'longitude': 25}, include_data=False):
        print(chunk)  # {'latitude': slice(0, 25), 'longitude': slice(0, 25)}

groupby

Group by one or more coordinates across all data variables. Supports the same period string syntax as the variable-level groupby:

with cfdb.open_dataset(file_path) as ds:
    # Group by single coordinate values
    for target_chunk, var_data in ds.groupby('latitude'):
        print(target_chunk, {k: v.shape for k, v in var_data.items()})

    # Group by time period
    for target_chunk, var_data in ds.groupby({'time': 'M'}, data_vars=['temperature']):
        print(target_chunk, {k: v.shape for k, v in var_data.items()})

map

Apply a function to aligned chunks of multiple variables in parallel. The function receives (target_chunk, var_data) — same as iter_chunks:

def sum_two_vars(target_chunk, var_data):
    return var_data['temperature'] + var_data['wind_speed']

with cfdb.open_dataset(file_path) as ds:
    for target_chunk, result in ds.map(sum_two_vars, {'latitude': 25}, n_workers=4):
        print(target_chunk, result.shape)

Properties

with cfdb.open_dataset(file_path) as ds:
    temp = ds['temperature']

    print(temp.name)          # variable name
    print(temp.shape)         # shape tuple
    print(temp.chunk_shape)   # chunk shape tuple
    print(temp.coord_names)   # coordinate names
    print(temp.dtype)         # cfdb DataType
    print(temp.ndims)         # number of dimensions
    print(temp.attrs)         # variable attributes

Interpolation

Data variables support spatial interpolation via the interp() method. See Interpolation for details.