Data Variables

Data variables store N-dimensional data referenced by coordinates. Unlike coordinates, data variables never hold their full data in memory; data is always accessed chunk by chunk.

Creating Data Variables

Data variables require existing coordinates:

import cfdb
import numpy as np

with cfdb.open_dataset(file_path, flag='w') as ds:
    data_var = ds.create.data_var.generic(
        'temperature',
        ('latitude', 'time'),
        dtype='float32',
    )

Template Methods

Like coordinates, common data variable types have template methods:

with cfdb.open_dataset(file_path, flag='w') as ds:
    temp = ds.create.data_var.air_temperature(('latitude', 'longitude', 'time'))

Templates set standard names, dtypes, and attributes. Any parameter accepted by generic() can be overridden via **kwargs.

Generic Creation Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| name | str | Unique variable name |
| coords | tuple of str | Coordinate names defining the dimensions |
| dtype | str, np.dtype, or DataType | Data type |
| chunk_shape | tuple of int or None | Chunk shape (auto-estimated if None) |

Creating from Existing

The like() method creates a new variable that copies the structure (coordinates, dtype, and chunk shape) of an existing variable:

with cfdb.open_dataset(file_path, flag='w') as ds:
    new_var = ds.create.data_var.like('temperature_copy', ds['temperature'])

Writing Data

Direct Assignment

The simplest way to write a full array:

data = np.random.rand(200, 200).astype('float32')

with cfdb.open_dataset(file_path, flag='w') as ds:
    ds['temperature'][:] = data

Assignment uses numpy basic indexing (integers and slices):

with cfdb.open_dataset(file_path, flag='w') as ds:
    temp = ds['temperature']
    temp[0:10, :] = data[0:10, :]   # slice assignment
    temp[5, 100] = 42.0             # scalar assignment

Note

Advanced indexing (fancy indexing with integer or boolean arrays) is not currently supported; it may be added in a future release.
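
This restriction mirrors numpy's own distinction between basic and advanced indexing. A plain-numpy illustration of what falls into each category (no cfdb involved):

```python
import numpy as np

data = np.arange(12, dtype='float32').reshape(3, 4)

# Basic indexing uses integers and slices; this is the form
# supported for assignment on cfdb variables
data[0:2, :] = 0.0     # slice assignment
data[2, 3] = 42.0      # scalar assignment

# Advanced (fancy) indexing uses integer or boolean arrays;
# this is the form that is not supported on variables
rows = np.array([0, 2])
subset = data[rows, :]  # selects rows 0 and 2 into a new array
```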

For large datasets, iterate over chunk positions to control memory usage:

with cfdb.open_dataset(file_path, flag='w') as ds:
    temp = ds['temperature']
    for chunk_slices in temp.iter_chunks(include_data=False):
        temp[chunk_slices] = data[chunk_slices]

This is the recommended approach when your source data is larger than memory or comes in pieces.
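
To judge whether a chunk fits your memory budget, the arithmetic is just the element count times the item size. A quick sketch with plain numpy (an illustration, not a cfdb API):

```python
import numpy as np

chunk_shape = (100, 100)
dtype = np.dtype('float32')

# memory held by one decompressed chunk: elements * bytes per element
chunk_bytes = int(np.prod(chunk_shape)) * dtype.itemsize
print(chunk_bytes)  # 40000 bytes, i.e. ~39 KiB per chunk
```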

Reading Data

Full Array

For small variables, read everything at once:

with cfdb.open_dataset(file_path) as ds:
    all_data = ds['temperature'].values

For large datasets, iterate over chunks:

with cfdb.open_dataset(file_path) as ds:
    temp = ds['temperature']
    for chunk_slices, chunk_data in temp.iter_chunks():
        print(chunk_slices, chunk_data.shape)

The chunk_slices tuple contains slice objects that can be used directly as a numpy index.
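
To make that concrete, here is a plain-numpy sketch of how such slice tuples tile an array. make_chunk_slices is a hypothetical helper for illustration, not part of cfdb:

```python
import itertools
import numpy as np

def make_chunk_slices(shape, chunk_shape):
    """Yield one tuple of slices per chunk, covering an array of
    the given shape (hypothetical stand-in for iter_chunks)."""
    ranges = [range(0, n, c) for n, c in zip(shape, chunk_shape)]
    for starts in itertools.product(*ranges):
        yield tuple(
            slice(s, min(s + c, n))
            for s, c, n in zip(starts, chunk_shape, shape)
        )

data = np.zeros((5, 7))
chunks = list(make_chunk_slices(data.shape, (2, 3)))

# each tuple can be used directly as a numpy index;
# edge chunks are smaller when the shape is not an exact multiple
first = data[chunks[0]]   # shape (2, 3)
last = data[chunks[-1]]   # shape (1, 1)
```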

You can also iterate with a different chunk shape by passing a dict of {coord_name: int}:

with cfdb.open_dataset(file_path) as ds:
    temp = ds['temperature']
    for chunk_slices, chunk_data in temp.iter_chunks({'latitude': 50}):
        print(chunk_slices, chunk_data.shape)

For position-only iteration (no data loading), pass include_data=False:

with cfdb.open_dataset(file_path) as ds:
    temp = ds['temperature']
    for chunk_slices in temp.iter_chunks(include_data=False):
        print(chunk_slices)

GroupBy

Group by one or more coordinate dimensions. This rechunks the data so each yielded array covers a single position along the grouped dimension(s) and the full extent of all other dimensions:

with cfdb.open_dataset(file_path) as ds:
    for slices, data in ds['temperature'].groupby('latitude'):
        print(slices, data.shape)

Group by multiple coordinates:

with cfdb.open_dataset(file_path) as ds:
    for slices, data in ds['temperature'].groupby(('latitude', 'time')):
        print(slices, data.shape)

Time Period GroupBy

Pass a dict with period strings to group by time periods. This works on any datetime coordinate:

with cfdb.open_dataset(file_path) as ds:
    temp = ds['temperature']

    # Daily groups (hourly data → 24 time steps per group)
    for slices, data in temp.groupby({'time': 'D'}):
        print(slices, data.shape)

    # Monthly groups (variable size — Jan=31, Feb=28/29, etc.)
    for slices, data in temp.groupby({'time': 'M'}):
        print(slices, data.shape)

    # Yearly groups
    for slices, data in temp.groupby({'time': 'Y'}):
        print(slices, data.shape)

    # Every 6 hours
    for slices, data in temp.groupby({'time': '6h'}):
        print(slices, data.shape)

Supported period units: Y (year), M (month), W (week), D (day), h (hour), m (minute), s (second), ms, us, ns. Prefix with a count for multiples, e.g. '7D', '3M', '6h'.
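
Whether a period corresponds to a fixed number of time steps is plain timedelta arithmetic, shown here with numpy (an illustration of the arithmetic, not cfdb internals):

```python
import numpy as np

step = np.timedelta64(1, 'h')   # hourly source data

# Fixed-size groups: a day is always 24 hourly steps
steps_per_day = np.timedelta64(1, 'D') // step
print(steps_per_day)  # 24

# '6h' groups on hourly data hold 6 steps each
steps_per_6h = np.timedelta64(6, 'h') // step
print(steps_per_6h)  # 6

# Months and years have no fixed length in hours, so those
# groups vary in size (31 days in Jan, 28 or 29 in Feb, ...)
```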

Dict values can also be integers (chunk sizes), which can be mixed with period strings on different coordinates:

# Group by day on time, chunk size 50 on latitude
for slices, data in temp.groupby({'time': 'D', 'latitude': 50}):
    print(slices, data.shape)

Performance: When the period maps to a fixed number of time steps (e.g. daily on hourly data = 24 steps) and all groups are the same size, cfdb uses the efficient rechunker path. For irregular periods like monthly or yearly, it falls back to a slice-based iteration that reads each group separately.

The max_mem parameter controls the memory budget for the rechunking operation (default 128 MB).

Parallel Map

The map() method applies a function to each chunk in parallel using multiprocessing. It yields (target_chunk, result) tuples as workers complete. The function receives exactly what iter_chunks() yields: a target_chunk tuple of slices and a data numpy array.

The function must be a top-level picklable function (not a lambda or closure).
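
The reason is that worker processes receive the function via pickle, which stores top-level functions by module and name. A quick check, independent of cfdb:

```python
import pickle

def scale_kelvin(target_chunk, data):
    """Top-level function: picklable, safe to hand to map()."""
    return data + 273.15

# A top-level function pickles by reference and round-trips cleanly
restored = pickle.loads(pickle.dumps(scale_kelvin))

# A lambda has no importable name, so pickling it fails
try:
    pickle.dumps(lambda tc, d: d + 273.15)
    lambda_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_picklable = False
```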

By default (no chunk_shape), map() uses the efficient booklet.map path where workers decompress and compute directly. Pass a chunk_shape dict to use a pool-based approach with rechunked chunks instead.

Transform and Write Back

def scale_kelvin(target_chunk, data):
    return data + 273.15

with cfdb.open_dataset(file_path, flag='w') as ds:
    temp = ds['temperature']
    for target_chunk, result in temp.map(scale_kelvin, n_workers=4):
        temp[target_chunk] = result

Aggregate

def chunk_stats(target_chunk, data):
    return {'mean': float(data.mean()), 'std': float(data.std())}

with cfdb.open_dataset(file_path) as ds:
    stats = [result for _, result in ds['temperature'].map(chunk_stats)]

Skip Chunks

Return None from your function to skip a chunk — it will not appear in the output:

def only_positive_mean(target_chunk, data):
    m = data.mean()
    if m > 0:
        return m
    return None

with cfdb.open_dataset(file_path) as ds:
    positive_means = [result for _, result in ds['temperature'].map(only_positive_mean)]

Map on a View

map() works on sliced views, processing only the selected chunks:

with cfdb.open_dataset(file_path) as ds:
    view = ds['temperature'][0:50, :]
    for target_chunk, result in view.map(scale_kelvin, n_workers=4):
        print(target_chunk, result.shape)

Dataset-Level Iteration

The methods above operate on a single data variable. The dataset also provides iter_chunks, groupby, and map that iterate over multiple data variables in lockstep. All data variables must share the same coordinates.
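
The lockstep behaviour can be pictured with plain numpy: because the variables share coordinates, a single set of slices indexes all of them (a sketch of the idea, not cfdb internals):

```python
import numpy as np

# Two variables on the same (latitude, longitude) grid
temperature = np.random.rand(100, 50).astype('float32')
wind_speed = np.random.rand(100, 50).astype('float32')

chunks = []
for i in range(0, 100, 25):          # latitude chunks of 25
    target_chunk = {'latitude': slice(i, i + 25)}
    # the same slice indexes every variable, keeping them aligned
    var_data = {
        'temperature': temperature[target_chunk['latitude']],
        'wind_speed': wind_speed[target_chunk['latitude']],
    }
    chunks.append((target_chunk, var_data))
```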

iter_chunks

Yields (target_chunk, var_data) where target_chunk is a {coord_name: slice} dict and var_data is a {var_name: ndarray} dict:

with cfdb.open_dataset(file_path) as ds:
    for target_chunk, var_data in ds.iter_chunks({'latitude': 25, 'longitude': 25}):
        temp_data = var_data['temperature']
        wind_data = var_data['wind_speed']
        print(target_chunk, temp_data.shape)

Use data_vars to limit which variables are included:

for target_chunk, var_data in ds.iter_chunks({'latitude': 50}, data_vars=['temperature']):
    print(var_data.keys())  # dict_keys(['temperature'])

Pass include_data=False for position-only iteration (no data loading):

with cfdb.open_dataset(file_path) as ds:
    for chunk in ds.iter_chunks({'latitude': 25, 'longitude': 25}, include_data=False):
        print(chunk)  # {'latitude': slice(0, 25), 'longitude': slice(0, 25)}

groupby

Group by one or more coordinates across all data variables. Supports the same period string syntax as the variable-level groupby:

with cfdb.open_dataset(file_path) as ds:
    # Group by single coordinate values
    for target_chunk, var_data in ds.groupby('latitude'):
        print(target_chunk, {k: v.shape for k, v in var_data.items()})

    # Group by time period
    for target_chunk, var_data in ds.groupby({'time': 'M'}, data_vars=['temperature']):
        print(target_chunk, {k: v.shape for k, v in var_data.items()})

map

Apply a function to aligned chunks of multiple variables in parallel. The function receives (target_chunk, var_data) — same as iter_chunks:

def sum_two_vars(target_chunk, var_data):
    return var_data['temperature'] + var_data['wind_speed']

with cfdb.open_dataset(file_path) as ds:
    for target_chunk, result in ds.map(sum_two_vars, {'latitude': 25}, n_workers=4):
        print(target_chunk, result.shape)

Properties

with cfdb.open_dataset(file_path) as ds:
    temp = ds['temperature']

    print(temp.name)          # variable name
    print(temp.shape)         # shape tuple
    print(temp.chunk_shape)   # chunk shape tuple
    print(temp.coord_names)   # coordinate names
    print(temp.dtype)         # cfdb DataType
    print(temp.ndims)         # number of dimensions
    print(temp.attrs)         # variable attributes

Interpolation

Data variables support spatial interpolation via the interp() method. See Interpolation for details.