fsspeckit.core.merge¶
merge
¶
Backend-neutral merge layer for parquet dataset operations.
This module provides shared functionality for merge operations used by both DuckDB and PyArrow merge implementations.
Key responsibilities: 1. Merge validation and normalization 2. Strategy semantics and definitions 3. Key validation and schema compatibility checking 4. Shared statistics calculation 5. NULL-key detection and edge case handling
Classes¶
fsspeckit.core.merge.MergePlan
dataclass
¶
MergePlan(
strategy: MergeStrategy,
key_columns: list[str],
source_count: int,
target_exists: bool,
dedup_order_by: list[str] | None = None,
key_columns_valid: bool = True,
schema_compatible: bool = True,
null_keys_detected: bool = False,
allow_target_empty: bool = True,
allow_source_empty: bool = True,
)
Plan for executing a merge operation.
fsspeckit.core.merge.MergeStats
dataclass
¶
MergeStats(
strategy: MergeStrategy,
source_count: int,
target_count_before: int,
target_count_after: int,
inserted: int,
updated: int,
deleted: int,
total_processed: int = 0,
)
Canonical statistics structure for merge operations.
Functions¶
fsspeckit.core.merge.MergeStats.to_dict
¶
Convert to dictionary format for backward compatibility.
Source code in src/fsspeckit/core/merge.py
fsspeckit.core.merge.MergeStrategy
¶
Bases: Enum
Supported merge strategies with consistent semantics across backends.
Attributes¶
fsspeckit.core.merge.MergeStrategy.DEDUPLICATE
class-attribute
instance-attribute
¶
Remove duplicates from source, then upsert.
fsspeckit.core.merge.MergeStrategy.FULL_MERGE
class-attribute
instance-attribute
¶
Insert, update, and delete (full sync with source).
fsspeckit.core.merge.MergeStrategy.INSERT
class-attribute
instance-attribute
¶
Insert only new records, ignore existing records.
fsspeckit.core.merge.MergeStrategy.UPDATE
class-attribute
instance-attribute
¶
Update only existing records, ignore new records.
fsspeckit.core.merge.MergeStrategy.UPSERT
class-attribute
instance-attribute
¶
Insert new records, update existing records.
Functions¶
fsspeckit.core.merge.calculate_merge_stats
¶
calculate_merge_stats(
strategy: MergeStrategy,
source_count: int,
target_count_before: int,
target_count_after: int,
) -> MergeStats
Calculate merge operation statistics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
strategy
|
MergeStrategy
|
Merge strategy that was used. |
required |
source_count
|
int
|
Number of rows in source data. |
required |
target_count_before
|
int
|
Number of rows in target before merge. |
required |
target_count_after
|
int
|
Number of rows in target after merge. |
required |
Returns:
| Type | Description |
|---|---|
MergeStats
|
MergeStats with calculated statistics. |
Source code in src/fsspeckit/core/merge.py
fsspeckit.core.merge.check_null_keys
¶
Check for NULL values in key columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
source_table
|
Table
|
Source data table. |
required |
target_table
|
Table | None
|
Target data table, None if target doesn't exist. |
required |
key_columns
|
list[str]
|
List of key column names. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If NULL values found in key columns. |
Source code in src/fsspeckit/core/merge.py
fsspeckit.core.merge.get_canonical_merge_strategies
¶
Get the list of canonical merge strategy names.
Returns:
| Type | Description |
|---|---|
list[str]
|
List of strategy names in canonical order. |
fsspeckit.core.merge.normalize_key_columns
¶
Normalize key columns to a consistent list format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
key_columns
|
list[str] | str
|
Key column(s) as string or list of strings. |
required |
Returns:
| Type | Description |
|---|---|
list[str]
|
List of key column names. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If key_columns is empty or contains invalid values. |
Source code in src/fsspeckit/core/merge.py
fsspeckit.core.merge.validate_merge_inputs
¶
validate_merge_inputs(
source_schema: Schema,
target_schema: Schema | None,
key_columns: list[str],
strategy: MergeStrategy,
) -> MergePlan
Validate merge inputs and create a merge plan.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
source_schema
|
Schema
|
Schema of the source data. |
required |
target_schema
|
Schema | None
|
Schema of the target dataset, None if target doesn't exist. |
required |
key_columns
|
list[str]
|
List of key column names for matching records. |
required |
strategy
|
MergeStrategy
|
Merge strategy to use. |
required |
Returns:
| Type | Description |
|---|---|
MergePlan
|
MergePlan with validation results and execution details. |
Raises:
| Type | Description |
|---|---|
ValueError
|
If validation fails with specific error messages. |
Source code in src/fsspeckit/core/merge.py
158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 | |
fsspeckit.core.merge.validate_strategy_compatibility
¶
validate_strategy_compatibility(
strategy: MergeStrategy,
source_count: int,
target_exists: bool,
) -> None
Validate that the chosen strategy is compatible with the data state.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
strategy
|
MergeStrategy
|
Merge strategy to validate. |
required |
source_count
|
int
|
Number of rows in source data. |
required |
target_exists
|
bool
|
Whether target dataset exists. |
required |
Raises:
| Type | Description |
|---|---|
ValueError
|
If strategy is incompatible with the data state. |