
cfdb

CF conventions multi-dimensional array storage on top of Booklet

cfdb is a pure Python database for managing labeled multi-dimensional arrays following the CF conventions. It is an alternative to netCDF4/xarray, built on Booklet for local file storage and EBooklet for S3 sync.

Key Features

  • CF conventions — coordinates, data variables, and attributes following the CF standard
  • Chunk-based storage — efficient compression with zstd or lz4, chunk-level read/write
  • Thread-safe and multiprocess-safe — thread locks and file locks for concurrent access
  • Rechunking — on-the-fly rechunking via rechunkit for flexible data access patterns
  • Parallel map — apply a function to chunks in parallel using multiprocessing
  • Grid interpolation — regridding, point sampling, NaN filling, and level regridding via geointerp
  • S3 remote sync — EDataset links a local file with an S3 remote via EBooklet
  • NetCDF4 interop — export to netCDF4 with h5netcdf and ingest existing netCDF4 files with cfdb-ingest
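
Chunk-level access means every read or write maps onto a grid of chunk slices. The following standalone sketch (plain NumPy and itertools, no cfdb — the chunk shape is an illustrative choice, not cfdb's default) shows the slice grid that chunked storage iterates over:

```python
import itertools
import numpy as np

def chunk_slices(shape, chunk_shape):
    """Yield slice tuples covering an array of `shape` in blocks of `chunk_shape`."""
    starts = [range(0, s, c) for s, c in zip(shape, chunk_shape)]
    for origin in itertools.product(*starts):
        yield tuple(
            slice(o, min(o + c, s))
            for o, c, s in zip(origin, chunk_shape, shape)
        )

# A (181, 361) grid split into (90, 90) chunks -> 3 x 5 = 15 chunks,
# with smaller partial chunks along the trailing edges.
slices = list(chunk_slices((181, 361), (90, 90)))
```

Each slice tuple addresses exactly one stored chunk, which is why chunk-aligned reads and writes avoid decompressing neighbouring data.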

Quick Example

import cfdb
import numpy as np

file_path = 'example.cfdb'

with cfdb.open_dataset(file_path, flag='n') as ds:
    # Create coordinates
    lat = ds.create.coord.lat(data=np.linspace(-90, 90, 181, dtype='float32'))
    lon = ds.create.coord.lon(data=np.linspace(-180, 180, 361, dtype='float32'))

    # Create a data variable
    temp = ds.create.data_var.generic(
        'temperature', ('latitude', 'longitude'), dtype='float32'
    )

    # Write data
    temp[:] = np.random.rand(181, 361).astype('float32') * 40 - 10

# Read it back
with cfdb.open_dataset(file_path) as ds:
    for chunk_slices, data in ds['temperature'].iter_chunks():
        print(chunk_slices, data.shape)
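
Iterating chunks also supports streaming reductions: statistics can be accumulated chunk by chunk without loading the whole variable into memory. A minimal sketch of the pattern in plain NumPy, where a random array stands in for the temperature variable and 90-row blocks play the role of cfdb's chunks:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((181, 361), dtype='float32') * 40 - 10  # stand-in for ds['temperature']

# Accumulate sum and count per chunk, as one would inside the
# iter_chunks() loop, instead of materialising the full array.
total = 0.0
count = 0
for i in range(0, data.shape[0], 90):
    block = data[i:i + 90]                    # one "chunk" of rows
    total += float(block.sum(dtype='float64'))
    count += block.size

mean = total / count
```

Summing each block in float64 keeps the running total accurate even though the stored data is float32.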

Next Steps