# Architecture
This page explains the internal design of cfdb for users who want to understand how data is stored and managed.
## System Overview
```
open_dataset() / open_edataset()
        │
        ▼
Dataset / EDataset
  │
  ├── SysMeta (system metadata in Booklet metadata field)
  │     ├── dataset_type, compression, crs
  │     └── variables: dict of CoordinateVariable / DataVariable
  │
  ├── Creator (coord, data_var, crs sub-objects)
  │
  ├── Attributes (JSON dict stored as Booklet key)
  │
  ├── DatasetRechunker (synchronized multi-variable rechunkit wrapper)
  │
  └── Variable objects (Coordinate / DataVariable)
        ├── DataType (encoding/decoding)
        ├── Compressor (zstd/lz4)
        └── Rechunker (single-variable rechunkit wrapper)
```
## Public API
### Datasets
```mermaid
classDiagram
    direction TB
    class cfdb {
        <<module>>
        open_dataset(file_path, flag, dataset_type, compression, compression_level) Dataset
        open_edataset(remote_conn, file_path, flag, dataset_type, compression, ...) EDataset
        cfdb_to_netcdf4(cfdb_path, nc_path, compression, sel, sel_loc, ...)
        combine(cfdb_paths, output_path, ...)
        merge_into(source_path, dest_path, ...)
        guess_chunk_shape(shape, itemsize, target_size) tuple
        compute_scale_and_offset(min_value, max_value, precision) tuple
    }
    class dtypes {
        <<module>>
        dtype(name, precision, min_value, max_value, ...) DataType
    }
    class tools {
        <<module>>
        netcdf4_to_cfdb(nc_path, cfdb_path, sel, sel_loc, max_mem, ...)
        cfdb_to_netcdf4(cfdb_path, nc_path, compression, sel, sel_loc, ...)
    }
    class Dataset {
        +file_path : Path
        +compression : str
        +compression_level : int
        +writable : bool
        +is_open : bool
        +var_names : tuple
        +coord_names : tuple
        +data_var_names : tuple
        +coords : tuple~Coordinate~
        +data_vars : tuple~DataVariable~
        +attrs : Attributes
        +crs : pyproj.CRS
        +create : Creator
        +get(var_name) Coordinate | DataVariable
        +select(sel) DatasetView
        +select_loc(sel) DatasetView
        +iter_chunks(chunk_shape, data_vars, max_mem, include_data) Generator
        +groupby(coord_names, data_vars, max_mem) Generator
        +map(func, chunk_shape, data_vars, max_mem, n_workers) Generator
        +rechunker(data_vars) DatasetRechunker
        +copy(file_path, include_data_vars, exclude_data_vars) Dataset
        +to_netcdf4(file_path, compression, include_data_vars, exclude_data_vars)
        +prune(timestamp) int
        +close()
    }
    class EDataset {
        +changes() Change
        +delete_remote()
        +copy_remote(remote_conn)
    }
    class DatasetView {
        +var_names : tuple
        +coord_names : tuple
        +data_var_names : tuple
        +coords : tuple
        +data_vars : tuple
        +attrs : Attributes
        +crs : pyproj.CRS
        +get(var_name) CoordinateView | DataVariableView
        +select(sel) DatasetView
        +select_loc(sel) DatasetView
    }
    class Creator {
        +coord : Coord
        +data_var : DataVar
        +crs : CRS
    }
    class Coord {
        +generic(name, data, dtype, chunk_shape, step, axis) Coordinate
        +like(name, coord, copy_data) Coordinate
        +lat(data, step, ...) Coordinate
        +lon(data, step, ...) Coordinate
        +time(data, step, ...) Coordinate
        +height(...) Coordinate
        +altitude(...) Coordinate
        +depth(...) Coordinate
        +point(...) Coordinate
        +x(data, step, ...) Coordinate
        +y(data, step, ...) Coordinate
        +z(data, step, ...) Coordinate
    }
    class DataVar {
        +generic(name, coords, dtype, chunk_shape) DataVariable
        +like(name, data_var) DataVariable
        +air_temperature(coords) DataVariable
        +precipitation(coords) DataVariable
    }
    class CRS {
        +from_user_input(crs, x_coord, y_coord, xy_coord) pyproj.CRS
    }
    cfdb --> Dataset : open_dataset()
    cfdb --> EDataset : open_edataset()
    EDataset --|> Dataset
    Dataset --> DatasetView : select() / select_loc()
    Dataset *-- Creator : create
    Creator *-- Coord : coord
    Creator *-- DataVar : data_var
    Creator *-- CRS : crs
```
### Variables
```mermaid
classDiagram
    direction TB
    class Variable {
        <<abstract>>
        +name : str
        +shape : tuple
        +chunk_shape : tuple
        +dtype : DataType
        +ndims : int
        +coord_names : tuple
        +attrs : Attributes
        +writable : bool
        +is_open : bool
        +units : str
        +loc : LocationIndexer
        +data : ndarray
        +values : ndarray
    }
    class CoordinateView {
        +iter_chunks(include_data, decoded) Generator
        +items(decoded) Generator
        +get_chunk(sel, missing_none) ndarray
    }
    class Coordinate {
        +step : number
        +origin : int
        +axis : str
        +auto_increment : bool
        +append(data)
        +prepend(data)
        +truncate(start, stop)
        +update_step(step)
        +update_axis(axis)
        +get_coord_origins() tuple
        +load()
    }
    class DataVariableView {
        +coords : tuple~Coordinate~
        +set(sel, data, decoded)
        +iter_chunks(chunk_shape, max_mem, decoded, include_data) Generator
        +items(decoded) Generator
        +get_chunk(sel, missing_none) ndarray
        +groupby(coord_names, max_mem) Generator
        +map(func, chunk_shape, max_mem, n_workers) Generator
        +interp(x, y, z, xy) GridInterp | PointInterp
    }
    class DataVariable {
        +rechunker() Rechunker
        +load()
    }
    class Attributes {
        +data : dict
        +writable : bool
        +get(key) value
        +set(key, value)
        +keys() Iterator
        +values() Iterator
        +items() Iterator
        +pop(key, default) value
        +update(other)
        +clear()
    }
    class Rechunker {
        +guess_chunk_shape(target_size) tuple
        +calc_ideal_read_chunk_shape(target_shape) tuple
        +calc_source_read_chunk_shape(target_shape, max_mem) tuple
        +calc_n_chunks() int
        +calc_n_reads_rechunker(target_shape, max_mem) tuple
        +rechunk(target_shape, max_mem) Generator
    }
    class DatasetRechunker {
        +calc_ideal_read_chunk_mem(chunk_shape) int
        +calc_n_reads_rechunker(chunk_shape, max_mem) tuple
        +rechunk(chunk_shape, max_mem) Generator
    }
    class LocationIndexer {
        +__getitem__(sel) View
    }
    Variable <|-- CoordinateView
    CoordinateView <|-- Coordinate
    Variable <|-- DataVariableView
    DataVariableView <|-- DataVariable
    Variable *-- LocationIndexer : loc
    Variable *-- Attributes : attrs
    DataVariable --> Rechunker : rechunker()
```
### Data Types
```mermaid
classDiagram
    direction TB
    class DataType {
        <<abstract>>
        +name : str
        +kind : str
        +itemsize : int
        +dtype_decoded : numpy.dtype
        +dtype_encoded : numpy.dtype
        +precision : int
        +fillvalue : int
        +offset : number
        +to_dict() dict
    }
    class Float {
        +encode(data) ndarray
        +decode(data) ndarray
        +dumps(data) bytes
        +loads(data_bytes, chunk_shape) ndarray
    }
    class Integer {
        +encode(data) ndarray
        +decode(data) ndarray
        +dumps(data) bytes
        +loads(data_bytes, chunk_shape) ndarray
    }
    class DateTime {
        +encode(data) ndarray
        +decode(data) ndarray
        +dumps(data) bytes
        +loads(data_bytes, chunk_shape) ndarray
    }
    class Bool {
        +dumps(data) bytes
        +loads(data_bytes, chunk_shape) ndarray
    }
    class String {
        +dumps(data) bytes
        +loads(data_bytes) ndarray
    }
    class Geometry {
        <<abstract>>
        +encode(data) list
        +decode(data) ndarray
        +dumps(data) bytes
        +loads(data_bytes) ndarray
    }
    class Point
    class LineString
    class Polygon
    DataType <|-- Float
    DataType <|-- Integer
    DataType <|-- DateTime
    DataType <|-- Bool
    DataType <|-- String
    DataType <|-- Geometry
    Geometry <|-- Point
    Geometry <|-- LineString
    Geometry <|-- Polygon
```
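The `compute_scale_and_offset(min_value, max_value, precision)` helper in the public API points at the classic scale-and-offset packing used by the numeric types: on `encode`, floats are shifted by an offset, multiplied by a scale, and rounded to integers; `decode` inverts this. A minimal pure-Python sketch of the idea, not cfdb's actual implementation (the function bodies here are assumptions):

```python
def compute_scale_and_offset(min_value, max_value, precision):
    # Sketch: derive a multiplicative scale from the requested decimal
    # precision, and use the minimum value as the offset so that
    # encoded integers are non-negative.
    scale = 10 ** precision
    offset = min_value
    return scale, offset

def encode(values, scale, offset):
    # Shift by the offset, stretch by the scale, round to integers.
    return [round((v - offset) * scale) for v in values]

def decode(encoded, scale, offset):
    # Invert the packing: divide by the scale and add the offset back.
    return [i / scale + offset for i in encoded]

scale, offset = compute_scale_and_offset(-40.0, 60.0, 2)
temps = [-12.34, 0.0, 25.67]
restored = decode(encode(temps, scale, offset), scale, offset)
# restored agrees with temps to within the requested precision (10**-2)
```

The trade-off is the usual one for packed data: smaller encoded integers at the cost of quantizing values to the chosen precision.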
## Booklet Storage
cfdb uses Booklet as a key-value store. Booklet is a persistent dict-like database stored in a single file, with support for thread locks and file locks.
All data in a cfdb file lives in one Booklet file:
- System metadata — stored in Booklet's metadata field (a single JSON blob)
- Data chunks — stored as Booklet key-value pairs
- Attributes — stored as separate Booklet keys (`_{var_name}.attrs`)
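Conceptually, the whole file is one flat mapping. A toy sketch using a plain `dict` as a stand-in for the Booklet file — the keys and values below are illustrative, not cfdb's exact on-disk format:

```python
import json

# Stand-in for one Booklet file: a flat key-value mapping plus a
# dedicated metadata slot held separately from the ordinary keys.
store = {}
metadata = json.dumps({
    "dataset_type": "generic",
    "compression": "zstd",
    "variables": {"temperature": {"shape": [200, 400], "chunk_shape": [100, 200]}},
})

# Data chunks live under per-chunk keys...
store["temperature!0,0"] = b"<compressed chunk bytes>"
store["temperature!100,200"] = b"<compressed chunk bytes>"
# ...and attributes under a separate per-variable key.
store["_temperature.attrs"] = json.dumps({"units": "degC"}).encode()
```

Keeping everything in one file means a single set of locks protects metadata, chunks, and attributes together.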
### Metadata Lifecycle
- On open, `SysMeta` is deserialized from the Booklet metadata via `msgspec.convert()`
- During the session, `SysMeta` is modified in memory (adding variables, changing shapes, etc.)
- On close, a `weakref.finalize` callback serializes `SysMeta` back to Booklet metadata
This means metadata changes are batched and written on close, not on every operation.
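This batching behaviour can be sketched with a toy class. Here `json` stands in for `msgspec`, a plain `dict` stands in for the Booklet file, and the `ToyDataset` name and structure are hypothetical:

```python
import json
import weakref

class ToyDataset:
    """Sketch of the lifecycle: metadata is deserialized on open,
    mutated in memory, and flushed exactly once on close (or GC)."""

    def __init__(self, backing):
        self.backing = backing                                     # "Booklet file"
        self.sys_meta = json.loads(backing.get("metadata", "{}"))  # on open
        # weakref.finalize runs the flush even if close() is never called.
        # Note the callback takes no reference to self (see the section
        # on reference cycles below).
        self._finalizer = weakref.finalize(
            self, ToyDataset._flush, backing, self.sys_meta
        )

    @staticmethod
    def _flush(backing, sys_meta):
        backing["metadata"] = json.dumps(sys_meta)                 # on close

    def close(self):
        self._finalizer()   # idempotent: the callback runs at most once

backing = {}
ds = ToyDataset(backing)
ds.sys_meta["variables"] = ["temperature"]   # mutated in memory only
assert "metadata" not in backing             # nothing written yet
ds.close()                                   # batched write happens here
```

The corollary for real cfdb files is the same as for this toy: metadata edits made during a session are invisible on disk until the dataset is closed.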
### Chunk Storage
Data chunks are stored with keys of the form `{var_name}!{pos,pos,...}`, where the positions are the starting indices of the chunk along each dimension. For example, a 2-D variable `temperature` with a chunk starting at position (100, 200) has the key `temperature!100,200`.
This key format is generated by `utils.make_var_chunk_key()`.
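A hypothetical reconstruction of that helper, plus an illustrative inverse — the real `utils.make_var_chunk_key()` may differ in details such as zero-padding:

```python
def make_var_chunk_key(var_name, chunk_start):
    # Sketch of the key scheme described above: variable name, a "!"
    # separator, then comma-joined chunk start indices.
    return f"{var_name}!{','.join(str(i) for i in chunk_start)}"

def parse_var_chunk_key(key):
    # Illustrative inverse: recover the variable name and the chunk's
    # starting indices from a key.
    var_name, _, pos = key.partition("!")
    return var_name, tuple(int(i) for i in pos.split(","))

key = make_var_chunk_key("temperature", (100, 200))   # "temperature!100,200"
```

Flat keys like this let a whole variable's chunks be enumerated with a simple prefix scan over the store.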
## Variable Hierarchy
```
Variable (base)
  ├── CoordinateView → Coordinate
  │       - Holds all data in memory
  │       - Supports append/prepend
  │       - .data returns full array
  │
  └── DataVariableView → DataVariable
          - Never holds full data in memory
          - Supports __setitem__ for writing
          - .data reads all chunks (expensive)
```
The `View` variants represent subsets created by indexing or `select()`.
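The practical difference for data variables is the access pattern: `iter_chunks()` streams one chunk at a time, while `.data` materialises everything. A toy illustration of that contract (not cfdb's actual classes):

```python
class ToyDataVariable:
    """Sketch of the chunked-access contract: iter_chunks() yields one
    chunk at a time, while .data reads and concatenates every chunk."""

    def __init__(self, chunks):
        # Mapping of 1-D chunk start index -> list of values (toy storage).
        self._chunks = chunks

    def iter_chunks(self):
        # Stream (start, values) pairs; memory use stays at one chunk.
        for start in sorted(self._chunks):
            yield start, self._chunks[start]

    @property
    def data(self):
        # Convenient but expensive: touches every chunk in the store.
        out = []
        for _, values in self.iter_chunks():
            out.extend(values)
        return out

var = ToyDataVariable({0: [1.0, 2.0], 2: [3.0, 4.0]})
```

For large variables, prefer the streaming path; `.data` is best reserved for coordinates and small selections.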
## Thread and Multiprocess Safety
- Thread safety: Booklet uses thread locks for concurrent read/write access
- Multiprocess safety: File locks prevent corruption from multiple processes
- S3 safety (`EDataset`): Object locking on the remote ensures consistency
## Error Handling
When an error occurs, cfdb attempts to:
- Close the Booklet file properly
- Remove file/object locks
Changes that were not synced are lost. The `weakref.finalize` mechanism ensures cleanup runs even on unexpected exits.
## Reference Cycles and `weakref.finalize`
`Dataset` uses `weakref.finalize` to flush metadata on close/GC. Any class that holds a strong reference back to the `Dataset` creates a reference cycle that can prevent the finalizer from running. Python 3.12+ is strict about this — on earlier versions the GC would break the cycle, but on 3.12+ the finalizer may never fire, causing file locks to persist and tests to hang silently.
Rule: Classes that receive a `Dataset` (or `Variable`) reference and are stored as attributes on that same object must use `weakref.proxy(dataset)` instead of a direct reference. This currently applies to:
- `Creator`, `Coord`, `DataVar`, `CRS` (in `creation.py`) — stored on `Dataset.create`
- `LocationIndexer` (in `indexers.py`) — stored on `Variable.loc`
If you add a new class that is both (a) stored as an attribute on a `Dataset`/`Variable` and (b) holds a reference back to it, use `weakref.proxy`.
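A runnable illustration of the rule with toy classes (the real `Creator` and `Dataset` differ): holding `weakref.proxy(dataset)` keeps the sub-object usable without forming a strong cycle, so the finalizer fires as soon as the last outside reference is dropped.

```python
import weakref

class Creator:
    """Toy sub-object that needs to call back into its dataset.
    Storing weakref.proxy(dataset) rather than dataset itself means
    this attribute does not keep the Dataset alive or form a cycle."""

    def __init__(self, dataset):
        self._dataset = weakref.proxy(dataset)   # NOT a strong reference

    def file_path(self):
        return self._dataset.file_path           # valid while dataset lives

class Dataset:
    def __init__(self, file_path, flushed):
        self.file_path = file_path
        self.create = Creator(self)              # no strong cycle created
        # Callback holds no reference to self, only to flushed/file_path.
        weakref.finalize(self, flushed.append, file_path)

flushed = []
ds = Dataset("data.cfdb", flushed)
assert ds.create.file_path() == "data.cfdb"
# In CPython, dropping the last strong reference runs the finalizer
# immediately because nothing else keeps the object alive.
del ds
```

If `Creator` instead stored `dataset` directly, `ds` and `ds.create` would form a cycle that only the cycle collector could reclaim, and on Python 3.12+ the finalizer might never run.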
## Dependencies
| Package | Role |
|---|---|
| booklet | Local key-value file storage |
| ebooklet | S3 remote sync (optional) |
| numpy | Array operations |
| msgspec | Fast serialization (metadata, strings, geometry) |
| zstandard | Zstd compression |
| lz4 | LZ4 compression |
| rechunkit | Rechunking algorithms |
| shapely | Geometry types (WKT conversion) |
| pyproj | CRS handling |
| cfdb-models | Shared data model types |
| cfdb-vars | Variable definitions and templates |
| geointerp | Grid interpolation and CRS transformation |
| h5netcdf | NetCDF4 I/O (optional) |
| xarray | Xarray backend integration (optional) |