API Guide¶
This guide provides a capability-oriented overview of fsspeckit's public API. For detailed method signatures and parameters, see the generated API documentation.
Core Capabilities¶
Filesystem Factory¶
Create configured filesystems with protocol inference and path safety.
Capability: Configure storage
Functions: filesystem()
API Reference: fsspeckit.core.filesystem
How-to Guides: Work with Filesystems, Configure Cloud Storage
Storage Options¶
Structured configuration for cloud and Git providers.
Capability: Configure storage
Classes: AwsStorageOptions, GcsStorageOptions, AzureStorageOptions, GitHubStorageOptions, GitLabStorageOptions
API Reference: fsspeckit.storage_options
How-to Guides: Configure Cloud Storage
Data Processing Capabilities¶
Dataset Operations¶
High-performance dataset operations with DuckDB and PyArrow.
Capability: Process datasets
Classes: DuckDBParquetHandler
Functions: optimize_parquet_dataset_pyarrow, compact_parquet_dataset_pyarrow
API Reference: fsspeckit.datasets
How-to Guides: Read and Write Datasets
Extended I/O¶
Enhanced file reading and writing capabilities.
Capability: Read/write files
Methods: read_json(), read_csv(), read_parquet(), write_json(), write_csv(), write_parquet()
API Reference: fsspeckit.core.ext
How-to Guides: Read and Write Datasets
SQL Filter Translation¶
Convert SQL WHERE clauses to framework-specific expressions.
Capability: Filter data with SQL
Functions: sql2pyarrow_filter(), sql2polars_filter()
API Reference: fsspeckit.sql.filters
How-to Guides: Use SQL Filters
Utility Capabilities¶
Parallel Processing¶
Execute functions across multiple inputs with progress tracking.
Capability: Process data in parallel
Function: run_parallel()
API Reference: fsspeckit.common.misc
How-to Guides: Optimize Performance
File Synchronization¶
Synchronize files and directories between storage backends.
Capability: Sync files
Functions: sync_files(), sync_dir()
API Reference: fsspeckit.common.misc
How-to Guides: Sync and Manage Files
Type Conversion¶
Convert between different data formats and optimize data types.
Capability: Convert and optimize data
Functions: dict_to_dataframe(), to_pyarrow_table(), convert_large_types_to_normal()
API Reference: fsspeckit.common.types
How-to Guides: Read and Write Datasets
Domain Package Organization¶
fsspeckit is organized into domain-specific packages for better discoverability:
Core Package (fsspeckit.core)¶
Foundation layer providing filesystem APIs and path safety.
- Filesystem Creation: Enhanced filesystem() function with protocol inference
- Path Safety: DirFileSystem wrapper for secure directory confinement
- Extended I/O: Rich file reading/writing methods
- Base Classes: Enhanced filesystem base classes
Storage Options (fsspeckit.storage_options)¶
Configuration layer for cloud and Git providers.
- Provider Classes: Structured configuration for AWS, GCP, Azure, GitHub, GitLab
- Factory Functions: Environment-based and URI-based configuration
- Conversion Methods: Serialize to/from YAML, environment variables
Datasets (fsspeckit.datasets)¶
Data processing layer for large-scale operations.
- DuckDB Handler: High-performance parquet operations with SQL integration
- PyArrow Helpers: Dataset optimization, compaction, and merging
- Schema Management: Type conversion and schema evolution
SQL (fsspeckit.sql)¶
Query translation layer for cross-framework compatibility.
- Filter Translation: SQL to PyArrow and Polars expressions
- Schema Awareness: Type-aware filter generation
- Cross-Framework: Consistent querying across data backends
Common (fsspeckit.common)¶
Shared utilities layer used across all domains.
- Parallel Processing: Concurrent execution with progress tracking
- Type Conversion: Format conversion and optimization
- File Operations: Synchronization and path utilities
- Security: Path validation and credential scrubbing
Usage Patterns¶
Basic Workflow¶
Advanced Workflow¶
Migration from Utils¶
The fsspeckit.utils module provides backwards compatibility. For new code, use domain packages:
| Legacy Import | Domain Package | Recommended Import |
|---|---|---|
| from fsspeckit.utils import run_parallel | Common | from fsspeckit.common.misc import run_parallel |
| from fsspeckit.utils import DuckDBParquetHandler | Datasets | from fsspeckit.datasets import DuckDBParquetHandler |
| from fsspeckit.utils import sql2pyarrow_filter | SQL | from fsspeckit.sql.filters import sql2pyarrow_filter |
| from fsspeckit.utils import AwsStorageOptions | Storage Options | from fsspeckit.storage_options import AwsStorageOptions |
| from fsspeckit.utils import dict_to_dataframe | Common | from fsspeckit.common.types import dict_to_dataframe |
For information on migration from older versions, refer to the project release notes.
Error Handling¶
fsspeckit uses consistent exception types:
- ValueError: Configuration and validation errors
- FileNotFoundError: Missing resources
- PermissionError: Access control issues
- ImportError: Missing optional dependencies
Error Handling Pattern¶
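One way to act on the four exception types listed above is a small wrapper that maps each of them to a distinct message. This is a minimal sketch using only the standard library; `run_guarded` is a hypothetical helper name, not part of fsspeckit, and the message wording is an assumption.

```python
def run_guarded(operation):
    """Call operation() and return (result, error_message).

    Hypothetical helper for illustration only; it is not an fsspeckit API.
    Exactly one element of the returned pair is None.
    """
    try:
        return operation(), None
    except ValueError as exc:
        # Configuration and validation errors
        return None, f"configuration or validation error: {exc}"
    except FileNotFoundError as exc:
        # Missing resources
        return None, f"missing resource: {exc}"
    except PermissionError as exc:
        # Access control issues
        return None, f"access denied: {exc}"
    except ImportError as exc:
        # Missing optional dependencies
        return None, f"missing optional dependency: {exc}"


# Example: a path that does not exist is reported, not raised.
result, error = run_guarded(lambda: open("/definitely/not/here.txt").read())
```

Any other exception type still propagates, so unexpected failures are not silently swallowed.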
Optional Dependencies¶
fsspeckit uses lazy imports for optional dependencies:
| Feature | Required Package | Install Command |
|---|---|---|
| Dataset operations | duckdb>=0.9.0 | pip install duckdb |
| PyArrow operations | pyarrow>=10.0.0 | pip install pyarrow |
| Polars support | polars>=0.19.0 | pip install polars |
| SQL filtering | sqlglot>=20.0.0 | pip install sqlglot |
| Fast JSON | orjson>=3.8.0 | pip install orjson |
Dependency Management¶
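The lazy-import idea can be sketched with the standard library's `importlib`: the optional package is imported only when a feature first needs it, and a missing package produces an ImportError that names the install command. The helper names `require` and `fast_dumps` are hypothetical and do not reflect how fsspeckit implements this internally.

```python
import importlib
import json


def require(module_name: str, feature: str):
    """Import an optional module on demand, or raise a helpful ImportError.

    Hypothetical helper for illustration; not an fsspeckit API.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional package '{module_name}'. "
            f"Install it with: pip install {module_name}"
        ) from exc


def fast_dumps(obj) -> bytes:
    """Serialize to JSON bytes, preferring orjson and falling back to json."""
    try:
        orjson = require("orjson", "Fast JSON")
    except ImportError:
        # orjson is not installed: use the standard library instead.
        return json.dumps(obj).encode("utf-8")
    return orjson.dumps(obj)
```

Code that never touches an optional feature never pays the import cost, and code that does gets an actionable error instead of a bare `ModuleNotFoundError`.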
Performance Considerations¶
Caching¶
Enable caching for remote filesystems:
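How fsspeckit exposes caching is not shown here; fsspec itself ships caching filesystems such as `simplecache` and `filecache`. The underlying idea, keeping a local copy so repeated reads skip the remote call, can be sketched with the standard library. `cached_fetch` and its parameters are hypothetical names used for illustration only.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_fetch(key: str, fetch: Callable[[str], bytes], cache_dir: Path) -> bytes:
    """Return the bytes for key, calling fetch(key) only on a cache miss.

    Illustrative sketch; not an fsspeckit or fsspec API.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so any URL or path maps to a safe local file name.
    local = cache_dir / hashlib.sha256(key.encode("utf-8")).hexdigest()
    if local.exists():
        return local.read_bytes()
    data = fetch(key)
    local.write_bytes(data)
    return data
```

This sketch never invalidates entries, so it only suits data that does not change under the same key.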
Parallel Processing¶
Use parallel execution for I/O bound operations:
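The signature of run_parallel() is not reproduced here, so the following sketch uses the standard library's `ThreadPoolExecutor` to show the same general technique. `map_parallel` is a hypothetical name, and the default worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def map_parallel(func, items, max_workers: int = 8):
    """Apply func to each item on a thread pool; results keep input order.

    Illustrative stand-in, not the fsspeckit run_parallel() implementation.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the inputs,
        # regardless of which task finishes first.
        return list(pool.map(func, items))


sizes = map_parallel(len, ["a.parquet", "bb.parquet", "ccc.parquet"])
```

Threads help when each call spends most of its time waiting on network or disk; CPU-bound work generally needs processes instead.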
Batch Operations¶
Process large datasets in batches:
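Batching can be sketched as a generator that yields fixed-size chunks, so only one chunk is held in memory at a time. `iter_batches` is a hypothetical helper written with the standard library, not an fsspeckit function.

```python
from itertools import islice


def iter_batches(items, batch_size: int):
    """Yield lists of at most batch_size items from any iterable.

    Illustrative sketch; not an fsspeckit API.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


for batch in iter_batches(range(7), 3):
    pass  # process one batch at a time here
```

The last batch may be shorter than `batch_size`; downstream code should not assume uniform length.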
Security Features¶
Path Safety¶
Filesystems are wrapped in DirFileSystem by default for security:
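Directory confinement means every path is resolved relative to a base directory and rejected if it escapes. fsspec provides a `DirFileSystem` class in `fsspec.implementations.dirfs`; the sketch below shows only the underlying check with the standard library. `confine` is a hypothetical name, and the choice of ValueError for an escaping path is an assumption.

```python
import os


def confine(base_dir: str, user_path: str) -> str:
    """Resolve user_path under base_dir, rejecting paths that escape it.

    Illustrative sketch of directory confinement; not the DirFileSystem code.
    """
    base = os.path.realpath(base_dir)
    # Joining with an absolute user_path discards base, which the check below catches.
    candidate = os.path.realpath(os.path.join(base, user_path))
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

Resolving with `realpath` before comparing also catches escapes through `..` segments and symbolic links.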
Credential Scrubbing¶
Prevent credential leakage in logs:
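Scrubbing can be approximated by masking the value in any `key=value` or `key: value` pair whose key looks sensitive. This is a minimal sketch; `scrub_credentials` here is a hypothetical function, and the key list and mask are assumptions rather than fsspeckit's actual rules.

```python
import re

# Keys treated as sensitive in this sketch (an assumed, non-exhaustive list).
_SECRET_PATTERN = re.compile(
    r"\b(password|secret|token|access_key|secret_key|api_key)\b"
    r"(\s*[=:]\s*)"
    r"([^\s,;&]+)",
    re.IGNORECASE,
)


def scrub_credentials(message: str) -> str:
    """Replace the value of sensitive key/value pairs with '***'."""
    return _SECRET_PATTERN.sub(r"\1\2***", message)


safe = scrub_credentials("connect failed: secret_key=abc123, region=eu")
```

A pattern like this catches common cases only; secrets embedded in URLs or free text need additional rules.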
Input Validation¶
Validate user inputs for security:
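A common validation step is to accept only names that match a strict pattern before they reach a query or a path. The sketch below checks a column-style identifier and raises ValueError, in line with the exception types listed earlier. `validate_identifier` and the allowed character set are hypothetical choices for illustration.

```python
import re

# Letters, digits, and underscores only; must not start with a digit (assumed rule).
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_identifier(name: str) -> str:
    """Return name unchanged if it is a safe identifier, else raise ValueError."""
    if not isinstance(name, str) or not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"invalid identifier: {name!r}")
    return name


column = validate_identifier("order_date")
```

Rejecting unexpected input outright is usually safer than trying to escape or repair it.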
API Reference Links¶
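For detailed method signatures and parameters, see the generated API documentation for the modules named in the API Reference entries above: fsspeckit.core.filesystem, fsspeckit.core.ext, fsspeckit.storage_options, fsspeckit.datasets, fsspeckit.sql.filters, fsspeckit.common.misc, and fsspeckit.common.types.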
Related Documentation¶
- How-to Guides - Task-oriented recipes
- Tutorials - Step-by-step learning
- Explanation - Conceptual understanding