# Combining and Merging Datasets
cfdb provides two distinct functions for combining multiple datasets:
- combine: An out-of-place merge that creates a brand new dataset from the union of all inputs.
- merge_into: An in-place merge that destructively modifies an existing target dataset.
## combine (Out-of-Place)
The combine function takes multiple input datasets, computes the union of their coordinates, and writes all data into a new output file. This is useful for joining datasets that cover different spatial regions or creating a clean dataset from many smaller files.
```python
import cfdb

# Creates a new 'combined.cfdb' file
result = cfdb.combine(
    ['region_north.cfdb', 'region_south.cfdb'],
    'combined.cfdb',
)
print(result)
result.close()
```
The function accepts file paths or open Dataset objects.
### Subsetting with sel
You can apply a location-based selection to each input dataset before combining. This filters the data so only the selected region ends up in the output:
```python
result = cfdb.combine(
    ['full_globe_a.cfdb', 'full_globe_b.cfdb'],
    'europe_combined.cfdb',
    sel={
        'latitude': slice(35.0, 72.0),
        'longitude': slice(-25.0, 45.0),
    },
)
result.close()
```
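Conceptually, each slice in sel acts as a range mask on a coordinate: only values inside the bounds survive into the output. The following plain-Python sketch illustrates that semantics only (it is not cfdb's implementation, and it assumes label-based slices are inclusive on both ends, as in xarray-style selection):

```python
def sel_filter(coords, bounds):
    """Keep coordinate values within [start, stop],
    mimicking the semantics of a sel slice (illustrative only)."""
    start, stop = bounds
    return [c for c in coords if start <= c <= stop]

latitudes = [-10.0, 20.0, 35.0, 50.0, 72.0, 80.0]
# slice(35.0, 72.0) keeps only the European latitude band
print(sel_filter(latitudes, (35.0, 72.0)))  # [35.0, 50.0, 72.0]
```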
### Compression and Variables
By default, compression settings are inherited from the first input dataset, but they can be overridden. You can also explicitly filter which data variables to include:
```python
result = cfdb.combine(
    datasets, 'out.cfdb',
    compression='zstd',
    compression_level=1,
    include_data_vars=['temperature'],
    # exclude_data_vars=['humidity'],
)
```
## merge_into (In-Place)
For large, continuously updated databases (e.g., adding yesterday's weather data to a 100GB climate cache), recreating the entire file with combine is prohibitively slow and requires double the disk space.
The merge_into function solves this by destructively writing new data directly into an existing dataset file. Because of the way cfdb stores coordinate metadata, appending or prepending data along a coordinate (like time) is extremely fast (O(new_data)), completely avoiding the need to rewrite the existing chunks.
```python
import cfdb

# Destructively modifies 'existing_target.cfdb' in-place
result = cfdb.merge_into(
    ['new_weather_data.cfdb'],
    'existing_target.cfdb',
    allow_expansion=['time'],
)
result.close()
```
### Expansion Constraints
To protect against accidental database corruption and massive performance penalties, merge_into enforces strict rules about coordinate expansion:
- Insertions are blocked: You cannot insert new coordinate values into the middle of an existing dataset's coordinate range. You can only strictly append (after the max value) or prepend (before the min value).
- Expansion guardrails: The allow_expansion parameter controls which coordinates are allowed to grow.
  - allow_expansion=True: Any coordinate can grow via append/prepend.
  - allow_expansion=['time']: Recommended. Only the time coordinate is allowed to grow. If the incoming dataset has a slightly different spatial bounding box (e.g., an extra latitude row), merge_into will raise an error and abort instead of silently attempting an expensive in-place spatial expansion.
  - allow_expansion=False: No coordinates can grow; incoming data must perfectly match or be a subset of the target's existing coordinates.
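The append/prepend-only rule can be sketched in pure Python (an illustration of the constraint, not cfdb's internal code): relative to the target's sorted coordinate values, incoming values must all land after the max, all before the min, or already exist in the target.

```python
def classify_expansion(existing, incoming):
    """Classify incoming coordinate values relative to an existing
    sorted coordinate: 'append', 'prepend', 'within', or 'insert'.
    Illustrative sketch, not cfdb's internal logic."""
    lo, hi = existing[0], existing[-1]
    present = set(existing)
    new = [v for v in incoming if v not in present]
    if not new:
        return 'within'    # all values already present: no expansion
    if all(v > hi for v in new):
        return 'append'    # strictly after the max: fast path
    if all(v < lo for v in new):
        return 'prepend'   # strictly before the min: fast path
    return 'insert'        # mid-range insertion: blocked

times = [1, 2, 3, 4]
print(classify_expansion(times, [5, 6]))  # 'append'
print(classify_expansion(times, [0]))     # 'prepend'
print(classify_expansion(times, [2, 3]))  # 'within'
print(classify_expansion(times, [2.5]))   # 'insert' (merge_into would raise)
```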
## Handling Overlaps
Both combine and merge_into support an overlap parameter that controls what happens when multiple datasets contain data for the exact same coordinate values:
```python
# Last dataset wins (default, most performant — overwrites existing data)
cfdb.combine(datasets, 'out.cfdb', overlap='last')
cfdb.merge_into(datasets, 'target.cfdb', overlap='last')

# First dataset wins (skips writing if data already exists in the target)
cfdb.combine(datasets, 'out.cfdb', overlap='first')

# Raise an error if any overlap is detected
cfdb.combine(datasets, 'out.cfdb', overlap='error')
```
> **Note:** Overlap handling applies to data variables, not coordinates. Coordinate values are always merged as a sorted union regardless of the overlap setting.
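The three overlap modes can be pictured with dictionaries keyed by coordinate value. This is a sketch of the resolution semantics only, not the on-disk mechanics:

```python
def resolve_overlap(datasets, overlap='last'):
    """Merge dicts keyed by coordinate value, applying the overlap
    policy ('last', 'first', or 'error'). Illustrative only."""
    out = {}
    for ds in datasets:
        for key, value in ds.items():
            if key in out:
                if overlap == 'error':
                    raise ValueError(f'overlap at coordinate {key}')
                if overlap == 'first':
                    continue       # keep the value already written
            out[key] = value       # 'last' (default) overwrites
    return out

a = {'2024-01-01': 10.0, '2024-01-02': 11.0}
b = {'2024-01-02': 99.0, '2024-01-03': 12.0}
print(resolve_overlap([a, b], 'last'))   # 2024-01-02 -> 99.0
print(resolve_overlap([a, b], 'first'))  # 2024-01-02 -> 11.0
```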
## Requirements
For both functions, all input datasets must:
- Have the same dataset_type (e.g., all grid or all ts_ortho)
- Have compatible coordinate dtypes for any shared coordinate names
- Have compatible data variable dtypes and dimensions for any shared variable names
- Have matching CRS if any datasets define one
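Checking these requirements up front can save a merge that would otherwise fail partway through. The sketch below runs the same four rules over hypothetical metadata dicts; cfdb performs its own validation, so this only illustrates what the rules mean:

```python
def check_compatible(meta_a, meta_b):
    """Return a list of incompatibilities between two dataset metadata
    summaries (hypothetical dict layout, for illustration only)."""
    problems = []
    # 1. Same dataset_type (e.g., both 'grid')
    if meta_a['dataset_type'] != meta_b['dataset_type']:
        problems.append('dataset_type mismatch')
    # 2. Shared coordinate names must have matching dtypes
    for name, dtype in meta_a['coords'].items():
        if name in meta_b['coords'] and meta_b['coords'][name] != dtype:
            problems.append(f'coord dtype mismatch: {name}')
    # 3. Shared data variable names must have matching dtypes
    for name, dtype in meta_a['data_vars'].items():
        if name in meta_b['data_vars'] and meta_b['data_vars'][name] != dtype:
            problems.append(f'data var dtype mismatch: {name}')
    # 4. If both define a CRS, it must match
    if meta_a.get('crs') and meta_b.get('crs') and meta_a['crs'] != meta_b['crs']:
        problems.append('CRS mismatch')
    return problems

grid_a = {'dataset_type': 'grid',
          'coords': {'time': 'datetime64[s]', 'latitude': 'float32'},
          'data_vars': {'temperature': 'float32'},
          'crs': 'EPSG:4326'}
grid_b = {'dataset_type': 'grid',
          'coords': {'time': 'datetime64[s]', 'latitude': 'float64'},
          'data_vars': {'temperature': 'float32'},
          'crs': 'EPSG:4326'}
print(check_compatible(grid_a, grid_b))  # ['coord dtype mismatch: latitude']
```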