Architecture Overview¶
fsspeckit extends fsspec with enhanced filesystem utilities and storage option configurations for working with various data formats and storage backends. This document provides a technical reference for understanding the system's design and implementation patterns.
Executive Overview¶
Purpose and Value Proposition¶
fsspeckit provides enhanced data processing capabilities through a modular, domain-driven architecture that focuses on filesystem operations, storage configuration, and cross-framework SQL filter translation. The system enables users to work with multiple storage backends and data processing frameworks through unified APIs.
Core Architectural Principles¶
- Domain-Driven Design: Clear separation of concerns through domain-specific packages
- Backend Neutrality: Consistent interfaces across different storage providers
- Practical Utilities: Focus on implemented features rather than theoretical capabilities
- Backwards Compatibility: Migration path for existing users
- Type Safety: Strong typing and validation throughout the codebase
Target Use Cases¶
- Multi-Cloud Data Access: Unified access to AWS S3, Azure Blob, Google Cloud Storage
- Dataset Operations: High-performance dataset operations with DuckDB and PyArrow
- Git Integration: Filesystem access to GitHub and GitLab repositories
- SQL Filter Translation: Cross-framework SQL expression conversion
- Storage Configuration: Environment-based storage option management
Backwards Compatibility¶
- Utils Façade: The fsspeckit.utils package serves as a backwards-compatible façade that re-exports from domain packages (datasets, sql, common).
Supported Imports¶
The following imports are supported for backwards compatibility:
- setup_logging - from fsspeckit.common.logging
- run_parallel - from fsspeckit.common.misc
- get_partitions_from_path - from fsspeckit.common.misc
- to_pyarrow_table - from fsspeckit.common.types
- dict_to_dataframe - from fsspeckit.common.types
- opt_dtype_pl - from fsspeckit.common.polars
- opt_dtype_pa - from fsspeckit.common.types
- cast_schema - from fsspeckit.common.types
- convert_large_types_to_normal - from fsspeckit.common.types
- pl - from fsspeckit.common.polars
- sync_dir - from fsspeckit.common.misc
- sync_files - from fsspeckit.common.misc
- DuckDBParquetHandler - from fsspeckit.datasets
- Progress - from fsspeckit.utils.misc (shim for rich.progress.Progress)
Migration Path¶
- Existing Code: All existing fsspeckit.utils imports continue to work unchanged
- New Development: New code should import directly from domain packages for better discoverability
- Deprecated Paths: Deeper import paths like fsspeckit.utils.misc.Progress are deprecated but remain functional for at least one major version
Deprecation Notices¶
- The fsspeckit.utils module is deprecated and exists only for backwards compatibility
- New implementation code should not live in fsspeckit.utils
- Use domain-specific imports (fsspeckit.datasets, fsspeckit.sql, fsspeckit.common) for new development
Architectural Decision Records (ADRs)¶
ADR-001: Domain Package Architecture¶
Decision: Organize fsspeckit into domain-specific packages (core, storage_options, datasets, sql, common) rather than a monolithic structure.
Rationale:
- Separation of Concerns: Each domain has distinct responsibilities and user patterns
- Discoverability: Users can easily find relevant functionality without searching large modules
- Testing: Isolated testing for each domain with clear boundaries
- Maintenance: Changes to one domain don't impact others
Migration Path: Existing imports through fsspeckit.utils continue working while new code uses domain-specific imports.
ADR-002: Backend-Neutral Planning Layer¶
Decision: Centralize merge and maintenance planning logic in the core package with backend-specific delegates.
Rationale:
- Consistency: All backends use identical merge semantics and validation
- Maintainability: Single source of truth for business logic
- Performance: Shared optimization strategies across implementations
- Testing: Consistent behavior validation across all backends
Implementation: Both DuckDB and PyArrow backends delegate to core.merge and core.maintenance for planning, validation, and statistics calculation.
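The delegation described in ADR-002 can be pictured with the sketch below. Every name in it (MergePlan, plan_merge, the two backend classes, the strategy set) is hypothetical and not taken from fsspeckit; it only illustrates the shape of the pattern, in which planning and validation live in one shared function and each backend merely executes the resulting plan.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol

# Hypothetical set of strategies; the real library may support others.
VALID_STRATEGIES = {"insert", "update", "upsert"}


@dataclass(frozen=True)
class MergePlan:
    """Backend-independent description of a merge."""
    strategy: str
    key_columns: tuple[str, ...]


def plan_merge(strategy: str, key_columns: list[str], schema: list[str]) -> MergePlan:
    """Shared planning step: validate once, for every backend."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown merge strategy: {strategy!r}")
    missing = [col for col in key_columns if col not in schema]
    if missing:
        raise ValueError(f"key columns not in schema: {missing}")
    return MergePlan(strategy, tuple(key_columns))


class Backend(Protocol):
    def execute(self, plan: MergePlan) -> str: ...


class DuckDBLikeBackend:
    def execute(self, plan: MergePlan) -> str:
        # A real backend would run SQL here; the sketch only reports the plan.
        return f"duckdb:{plan.strategy}:{','.join(plan.key_columns)}"


class PyArrowLikeBackend:
    def execute(self, plan: MergePlan) -> str:
        # A real backend would operate on Arrow tables here.
        return f"pyarrow:{plan.strategy}:{','.join(plan.key_columns)}"


def merge(backend: Backend, strategy: str, key_columns: list[str], schema: list[str]) -> str:
    """Entry point: plan centrally, then delegate execution to the backend."""
    plan = plan_merge(strategy, key_columns, schema)
    return backend.execute(plan)
```

Because both backends receive the same validated plan, an invalid strategy or key column fails identically regardless of which backend was chosen.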
ADR-003: Storage Options Factory Pattern¶
Decision: Implement factory pattern for storage configuration with environment-based setup.
Rationale:
- Portability: Code works across different cloud providers without changes
- Configuration: Environment-based configuration for production deployments
- Flexibility: Users can override defaults for specific requirements
Implementation Pattern:
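A minimal sketch of an environment-driven factory follows. The classes S3Options and GcsOptions and the function options_from_env are hypothetical stand-ins, not fsspeckit's actual classes or signatures. The environment variable names are the conventional AWS and Google Cloud ones; whether fsspeckit reads exactly these variables is an assumption.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class S3Options:
    key: str | None = None
    secret: str | None = None
    region: str | None = None

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "S3Options":
        # Conventional AWS variable names; unset variables become None.
        return cls(
            key=env.get("AWS_ACCESS_KEY_ID"),
            secret=env.get("AWS_SECRET_ACCESS_KEY"),
            region=env.get("AWS_DEFAULT_REGION"),
        )


@dataclass
class GcsOptions:
    token: str | None = None
    project: str | None = None

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "GcsOptions":
        return cls(
            token=env.get("GOOGLE_APPLICATION_CREDENTIALS"),
            project=env.get("GOOGLE_CLOUD_PROJECT"),
        )


# Protocol name -> options class. New providers register here.
_FACTORIES = {"s3": S3Options, "gs": GcsOptions, "gcs": GcsOptions}


def options_from_env(protocol: str, env: Mapping[str, str] | None = None, **overrides):
    """Build provider options from the environment, then apply explicit overrides."""
    try:
        cls = _FACTORIES[protocol]
    except KeyError:
        raise ValueError(f"unsupported protocol: {protocol!r}") from None
    opts = cls.from_env(os.environ if env is None else env)
    for name, value in overrides.items():
        if not hasattr(opts, name):
            raise ValueError(f"unknown option for {protocol!r}: {name!r}")
        setattr(opts, name, value)
    return opts
```

Calling code names only the protocol, which is what keeps it portable across providers; keyword overrides cover the cases where the environment defaults are not appropriate.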
Core Architecture Deep Dive¶
Domain Package Breakdown¶
fsspeckit.core - Foundation Layer¶
The core package provides fundamental filesystem APIs and path safety utilities:
Key Components:
- AbstractFileSystem (core/ext.py): Extended base class with enhanced functionality
- DirFileSystem: Path-safe filesystem wrapper
- filesystem() function: Enhanced filesystem creation with URI inference
Integration Patterns:
- Protocol detection and inference from URIs
- Smart path normalization and validation
- Directory confinement for security
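The three integration patterns above can be illustrated with standard-library code. The helper names infer_protocol and confine are hypothetical and do not reflect fsspeckit's internals; the sketch assumes POSIX-style paths and treats a single-letter URI scheme as a Windows drive letter.

```python
import posixpath
from urllib.parse import urlsplit


def infer_protocol(uri: str, default: str = "file") -> str:
    """Guess the filesystem protocol from a URI, falling back to local files."""
    scheme = urlsplit(uri).scheme
    # An empty scheme means a plain path; one letter is a drive such as "C:".
    if len(scheme) <= 1:
        return default
    return scheme.lower()


def confine(base: str, path: str) -> str:
    """Normalize path relative to base and reject anything that escapes base."""
    base = posixpath.normpath(base)
    # join() discards base when path is absolute, so absolute paths are
    # checked against base like any other candidate.
    candidate = posixpath.normpath(posixpath.join(base, path))
    prefix = base.rstrip("/") + "/"
    if candidate != base and not candidate.startswith(prefix):
        raise ValueError(f"path escapes base directory: {path!r}")
    return candidate
```

For example, confine("/data/tenant", "a/../b.parquet") resolves inside the base directory, while a path such as "../other/x" is rejected before any filesystem call is made.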
fsspeckit.storage_options - Configuration Layer¶
Manages storage configurations for cloud and Git providers:
Factory Pattern Implementation:
Provider Implementations:
- AwsStorageOptions: AWS S3 configuration with region, credentials, and endpoint settings
- GcsStorageOptions: Google Cloud Storage setup
- AzureStorageOptions: Azure Blob Storage configuration
- GitHubStorageOptions: GitHub repository access with token authentication
- GitLabStorageOptions: GitLab repository configuration
Key Features:
- YAML serialization for persistent configuration
- Environment variable auto-configuration
- Protocol inference from URIs
- Unified interface across all providers
fsspeckit.datasets - Data Processing Layer¶
High-performance dataset operations for large-scale data processing:
DuckDB Implementation:
PyArrow Implementation:
Backend Integration:
- Shared merge logic from core.merge
- Common maintenance operations from core.maintenance
- Consistent statistics and validation across backends
fsspeckit.sql - Query Translation Layer¶
SQL-to-filter translation for cross-framework compatibility:
Core Functions:
Integration Points:
- Cross-framework SQL expression translation
- Schema-aware filter generation
- Unified SQL parsing using sqlglot
- Table name extraction for validation
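To make the idea of schema-aware filter translation concrete, here is a deliberately small sketch. It is not fsspeckit's implementation: the library is described as parsing with sqlglot, whereas this toy uses a regular expression so that it stays self-contained, and the function name sql_to_filters is hypothetical. It handles only simple comparisons joined by AND and emits (column, operator, value) tuples, the style that PyArrow's Parquet readers accept for their filters argument.

```python
from __future__ import annotations

import re

# One comparison: column, operator, then a quoted string or a number.
_CLAUSE = re.compile(
    r"^\s*(?P<col>[A-Za-z_][A-Za-z0-9_]*)\s*"
    r"(?P<op><=|>=|!=|<>|=|<|>)\s*"
    r"(?P<val>'[^']*'|-?\d+(?:\.\d+)?)\s*$"
)
# SQL spellings that differ from the tuple-filter spellings.
_OPS = {"=": "==", "<>": "!="}


def sql_to_filters(where: str, schema: set[str]) -> list[tuple]:
    """Translate 'a >= 1 AND b = 'x'' into [(a, >=, 1), (b, ==, x)] tuples."""
    filters = []
    # Limitation: a string literal containing ' AND ' would be split wrongly.
    for clause in re.split(r"\s+AND\s+", where, flags=re.IGNORECASE):
        match = _CLAUSE.match(clause)
        if match is None:
            raise ValueError(f"unsupported clause: {clause!r}")
        col, op, raw = match["col"], match["op"], match["val"]
        if col not in schema:  # the schema-aware part
            raise ValueError(f"unknown column: {col!r}")
        if raw.startswith("'"):
            value = raw[1:-1]
        elif "." in raw:
            value = float(raw)
        else:
            value = int(raw)
        filters.append((col, _OPS.get(op, op), value))
    return filters
```

A real parser is needed for OR, parentheses, IN lists, and functions, which is the reason for relying on a SQL parsing library rather than patterns like these.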
fsspeckit.common - Shared Utilities Layer¶
Cross-cutting utilities used across all domains:
Parallel Processing:
Type Conversion:
File Operations:
fsspeckit.utils - Backwards Compatibility Façade¶
Re-exports selected helpers from domain packages for backwards compatibility:
Migration Strategy:
- Immediate compatibility with existing code
- Gradual migration to domain-specific imports
- Deprecation warnings for discouraged patterns
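One common way to build such a façade is to wrap each re-exported helper so that calling it through the old path emits a DeprecationWarning. The sketch below shows that technique in isolation; deprecated_reexport, get_partitions, and the dotted paths in the message are hypothetical, and fsspeckit may implement its shims differently.

```python
from __future__ import annotations

import functools
import warnings


def deprecated_reexport(func, old_path: str, new_path: str):
    """Return func wrapped so that each call warns about the legacy import path."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_path} is deprecated; import from {new_path} instead",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller, not the shim
        )
        return func(*args, **kwargs)
    return wrapper


def get_partitions(path: str) -> dict[str, str]:
    """Stand-in for a domain-package helper: parse key=value path segments."""
    return dict(part.split("=", 1) for part in path.split("/") if "=" in part)


# What the façade module would expose under the old name.
legacy_get_partitions = deprecated_reexport(
    get_partitions, "legacy.get_partitions", "domain.get_partitions"
)
```

Existing callers keep working and receive the same results, while the warning points them at the domain-specific import.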
Integration Patterns¶
Cross-Domain Communication¶
Import Patterns:
Configuration Flow:
Error Handling Architecture¶
Consistent Exception Types:
- ValueError for configuration and validation errors
- FileNotFoundError for missing resources
- PermissionError for access control issues
- Custom exceptions for domain-specific errors
Security Architecture¶
fsspeckit implements security best practices through the fsspeckit.common.security module, providing utilities to prevent common vulnerabilities in data processing workflows.
Core Security Helpers:
- Path Validation: Prevent path traversal attacks and ensure operations stay within allowed directories
  - validate_path(): Validates filesystem paths and enforces base directory confinement
  - Integration with DirFileSystem for path-safe operations
- Credential Scrubbing: Protect sensitive information in logs and error messages
  - scrub_credentials(): Removes credential-like values from strings
  - scrub_exception(): Safely formats exceptions without exposing secrets
  - safe_format_error(): Creates secure error messages for production logging
- Compression Safety: Prevent codec injection attacks
  - validate_compression_codec(): Ensures only safe codecs (snappy, gzip, lz4, zstd, brotli) are used
- Column Validation: Prevent column injection in SQL-like operations
  - validate_columns(): Validates that requested columns exist in the schema
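The sketch below shows roughly what helpers of this kind do. The functions redact_secrets, check_codec, and check_columns are hypothetical illustrations and are not the signatures of the fsspeckit.common.security functions listed above; the redaction pattern is an assumed heuristic, and the codec allow-list is the one named in this section.

```python
import re

# Heuristic: any key whose name contains a sensitive word, followed by = or :.
_SECRET = re.compile(
    r"(?i)([\w-]*(?:password|secret|token|key)[\w-]*)(\s*[=:]\s*)([^\s,;&]+)"
)

SAFE_CODECS = frozenset({"snappy", "gzip", "lz4", "zstd", "brotli"})


def redact_secrets(text: str) -> str:
    """Replace credential-like values with *** while keeping the key names."""
    return _SECRET.sub(lambda m: f"{m.group(1)}{m.group(2)}***", text)


def check_codec(codec: str) -> str:
    """Accept only codecs on the allow-list; return the normalized name."""
    normalized = codec.strip().lower()
    if normalized not in SAFE_CODECS:
        raise ValueError(f"unsupported compression codec: {codec!r}")
    return normalized


def check_columns(requested, schema) -> list:
    """Reject any requested column that the schema does not contain."""
    missing = [col for col in requested if col not in schema]
    if missing:
        raise ValueError(f"columns not in schema: {missing}")
    return list(requested)
```

An allow-list is used for codecs and columns because user-supplied names often end up interpolated into queries or writer options, where an unexpected value can change the meaning of the statement.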
Production Security Patterns:
The security helpers are integrated throughout fsspeckit's architecture:
Security in Production:
For production deployments, the architecture emphasizes:
- Credential scrubbing in all error paths
- Path validation for all filesystem operations
- Safe error formatting for observability
- Integration with centralized logging systems
- Multi-tenant isolation through DirFileSystem
These security measures are particularly important for:
- Multi-cloud deployments with sensitive credentials
- Multi-tenant environments requiring strict isolation
- Compliance requirements (SOC2, PCI-DSS, etc.)
- Centralized logging and monitoring systems
Data Flow Patterns
Typical Data Processing Pipeline:
Cross-Storage Operations:
Performance and Scalability Architecture¶
Caching Strategy¶
Filesystem Level Caching:
- Support for fsspec's built-in caching mechanisms
- Optional directory structure preservation
- Configurable cache size and location
Parallel Processing Architecture¶
Worker Pool Management:
Resource Optimization:
- Automatic worker count detection based on CPU cores
- Memory-aware chunking for large datasets
- Progress tracking and error handling
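A worker pool with these properties can be sketched with concurrent.futures. The function run_in_pool and its parameters are hypothetical; fsspeckit's own run_parallel may differ in signature and behavior. The sketch derives the worker count from the CPU count, reports progress through an optional callback, and collects per-item errors instead of aborting the whole run.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_pool(func, items, max_workers=None, on_progress=None):
    """Apply func to every item on a thread pool, preserving input order."""
    items = list(items)
    # Default worker count: one per CPU core, but never more than there are items.
    workers = max_workers or min(len(items), os.cpu_count() or 1) or 1
    results = [None] * len(items)
    errors = {}  # index of failed item -> exception raised
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(func, item): i for i, item in enumerate(items)}
        for done, future in enumerate(as_completed(futures), start=1):
            index = futures[future]
            try:
                results[index] = future.result()
            except Exception as exc:
                errors[index] = exc
            if on_progress is not None:
                on_progress(done, len(items))
    return results, errors
```

Threads suit I/O-bound work such as reading many remote files; CPU-bound transformations would usually call for processes instead, which this sketch does not cover.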
Memory Management¶
Efficient Data Processing:
- Streaming operations for large files
- Chunked processing with configurable batch sizes
- Type conversion for PyArrow compatibility
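Chunked processing with a configurable batch size comes down to consuming an iterator a fixed number of items at a time, so that only one batch is held in memory. The helper iter_batches below is a hypothetical, generic illustration and is not an fsspeckit function.

```python
from __future__ import annotations

from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield successive lists of at most batch_size items from rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        # islice pulls only the next batch_size items from the source.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Usage: aggregate a stream without materializing it as a whole.
total = sum(sum(batch) for batch in iter_batches(range(1000), 128))
```

The batch size trades memory for per-batch overhead: larger batches mean fewer iterations but a higher peak footprint.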
Extension Points and Customization¶
Adding New Storage Providers¶
Custom Storage Options:
Custom Processing Backends¶
Extending Dataset Operations:
Migration Guide¶
For details on historical changes between versions, consult the project changelog and release notes.
Quick Reference¶
Step 1: Update Imports
Step 2: Update Configuration
Step 3: Update Filesystem Creation
Future Features (Not Yet Implemented)¶
The following features are planned but not yet implemented:
- Performance Tracking: Built-in performance monitoring and metrics collection
- Plugin Registry: Dynamic plugin discovery and registration system
- Circuit Breaker Patterns: Advanced resilience patterns for distributed systems
- Delta Lake Integration: Delta Lake write helpers and compatibility
- Advanced Monitoring: Comprehensive observability and health checking
Conclusion¶
The fsspeckit architecture provides a practical foundation for data processing across multiple storage backends and processing frameworks. The domain-driven design ensures clear separation of concerns while maintaining consistent interfaces and behavior across all components.
The modular architecture enables extension and customization while maintaining backwards compatibility for existing users. Parallel execution, chunked processing, and filesystem-level caching, combined with cross-framework SQL filter translation, make fsspeckit suited to data processing workflows that span multiple storage backends.
For specific implementation details and code examples, refer to the individual domain package documentation.