API Guide

This guide provides a capability-oriented overview of fsspeckit's public API. For detailed method signatures and parameters, see the generated API documentation.

Core Capabilities

Filesystem Factory

Create configured filesystems with protocol inference and path safety.

Capability: Configure storage
Functions: filesystem()
API Reference: fsspeckit.core.filesystem
How-to Guides: Work with Filesystems, Configure Cloud Storage
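
A minimal sketch of the factory in use. It only combines calls that appear later in this guide (storage_options_from_env, to_dict, the storage_options and dirfs keywords); the paths and protocol are placeholders.

from fsspeckit.core.filesystem import filesystem
from fsspeckit.storage_options import storage_options_from_env

# Remote filesystem configured from S3 environment variables (see Storage Options)
options = storage_options_from_env("s3")
fs = filesystem("s3", storage_options=options.to_dict())

# Local filesystem confined to a directory via the DirFileSystem wrapper
local_fs = filesystem("/data/", dirfs=True)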

Storage Options

Structured configuration for cloud and Git providers.

Capability: Configure storage
Classes: AwsStorageOptions, GcsStorageOptions, AzureStorageOptions, GitHubStorageOptions, GitLabStorageOptions
API Reference: fsspeckit.storage_options
How-to Guides: Configure Cloud Storage
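
A short sketch of a provider class in use, based on the conversion calls shown under Usage Patterns below; constructor parameters beyond region are omitted here.

from fsspeckit.storage_options import AwsStorageOptions

# Structured provider configuration; to_filesystem() and to_dict()
# are the conversion methods shown in the usage patterns below.
options = AwsStorageOptions(region="us-east-1")
fs = options.to_filesystem()
storage_kwargs = options.to_dict()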

Data Processing Capabilities

Dataset Operations

High-performance dataset operations with DuckDB and PyArrow.

Capability: Process datasets
Classes: DuckDBParquetHandler
Functions: optimize_parquet_dataset_pyarrow, compact_parquet_dataset_pyarrow
API Reference: fsspeckit.datasets
How-to Guides: Read and Write Datasets

Extended I/O

Enhanced file reading and writing capabilities.

Capability: Read/write files
Methods: read_json(), read_csv(), read_parquet(), write_json(), write_csv(), write_parquet()
API Reference: fsspeckit.core.ext
How-to Guides: Read and Write Datasets
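
A hedged sketch of the extended I/O methods. The read methods are assumed to take a path (as in the batch example under Performance Considerations); the write-method argument order shown here is an assumption, and all paths are placeholders.

from fsspeckit.core.filesystem import filesystem

fs = filesystem("/data/", dirfs=True)

# Read methods take a path
table = fs.read_parquet("curated/events.parquet")
rows = fs.read_csv("raw/events.csv")

# Write methods are assumed here to take data followed by a destination path
fs.write_json(rows, "exports/events.json")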

SQL Filter Translation

Convert SQL WHERE clauses to framework-specific expressions.

Capability: Filter data with SQL
Functions: sql2pyarrow_filter(), sql2polars_filter()
API Reference: fsspeckit.sql.filters
How-to Guides: Use SQL Filters
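
A self-contained sketch of the translation call shown under Advanced Workflow, with an explicit PyArrow schema; how the resulting expression is applied afterwards (for example, as a pyarrow.dataset filter) is an assumption.

import pyarrow as pa
from fsspeckit.sql.filters import sql2pyarrow_filter

schema = pa.schema([("value", pa.int64()), ("category", pa.string())])

# Translate a SQL WHERE clause into a PyArrow expression
expr = sql2pyarrow_filter("value > 100 AND category = 'A'", schema)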

Utility Capabilities

Parallel Processing

Execute functions across multiple inputs with progress tracking.

Capability: Process data in parallel
Function: run_parallel()
API Reference: fsspeckit.common.misc
How-to Guides: Optimize Performance
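
A self-contained sketch. The call shape (function, iterable of inputs, max_workers) follows the usage patterns later in this guide; the worker function and file names are hypothetical placeholders.

from fsspeckit.common.misc import run_parallel

def count_lines(path: str) -> int:
    # Hypothetical worker: any single-argument callable works the same way
    with open(path) as f:
        return sum(1 for _ in f)

results = run_parallel(count_lines, ["a.txt", "b.txt", "c.txt"], max_workers=4)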

File Synchronization

Synchronize files and directories between storage backends.

Capability: Sync files
Functions: sync_files(), sync_dir()
API Reference: fsspeckit.common.misc
How-to Guides: Sync and Manage Files
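
A hypothetical sketch only: sync_dir() is listed above, but its parameters are not shown in this guide, so the source/destination arguments below are assumptions; check fsspeckit.common.misc for the actual signature.

from fsspeckit.common.misc import sync_dir
from fsspeckit.core.filesystem import filesystem
from fsspeckit.storage_options import storage_options_from_env

options = storage_options_from_env("s3")
src_fs = filesystem("s3", storage_options=options.to_dict())
dst_fs = filesystem("/backup/", dirfs=True)

# Assumed call shape: source filesystem and path, then destination filesystem and path
sync_dir(src_fs, "bucket/data/", dst_fs, "data/")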

Type Conversion

Convert between different data formats and optimize data types.

Capability: Convert and optimize data
Functions: dict_to_dataframe(), to_pyarrow_table(), convert_large_types_to_normal()
API Reference: fsspeckit.common.types
How-to Guides: Read and Write Datasets
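
A short sketch; dict_to_dataframe() and to_pyarrow_table() are assumed here to accept a column-oriented dict and a dataframe-like object respectively, which this guide does not confirm.

from fsspeckit.common.types import dict_to_dataframe, to_pyarrow_table

records = {"id": [1, 2, 3], "value": [10.5, 20.1, 30.7]}

df = dict_to_dataframe(records)   # assumed: column-oriented dict in, dataframe out
table = to_pyarrow_table(df)      # assumed: dataframe-like in, pyarrow.Table out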

Domain Package Organization

fsspeckit is organized into domain-specific packages for better discoverability:

Core Package (fsspeckit.core)

Foundation layer providing filesystem APIs and path safety.

  • Filesystem Creation: Enhanced filesystem() function with protocol inference
  • Path Safety: DirFileSystem wrapper for secure directory confinement
  • Extended I/O: Rich file reading/writing methods
  • Base Classes: Enhanced filesystem base classes

Storage Options (fsspeckit.storage_options)

Configuration layer for cloud and Git providers.

  • Provider Classes: Structured configuration for AWS, GCP, Azure, GitHub, GitLab
  • Factory Functions: Environment-based and URI-based configuration
  • Conversion Methods: Serialize to/from YAML, environment variables

Datasets (fsspeckit.datasets)

Data processing layer for large-scale operations.

  • DuckDB Handler: High-performance parquet operations with SQL integration
  • PyArrow Helpers: Dataset optimization, compaction, and merging
  • Schema Management: Type conversion and schema evolution

SQL (fsspeckit.sql)

Query translation layer for cross-framework compatibility.

  • Filter Translation: SQL to PyArrow and Polars expressions
  • Schema Awareness: Type-aware filter generation
  • Cross-Framework: Consistent querying across data backends

Common (fsspeckit.common)

Shared utilities layer used across all domains.

  • Parallel Processing: Concurrent execution with progress tracking
  • Type Conversion: Format conversion and optimization
  • File Operations: Synchronization and path utilities
  • Security: Path validation and credential scrubbing

Usage Patterns

Basic Workflow

# 1. Configure storage
from fsspeckit.storage_options import storage_options_from_env
from fsspeckit.core.filesystem import filesystem

options = storage_options_from_env("s3")
fs = filesystem("s3", storage_options=options.to_dict())

# 2. Process data
from fsspeckit.datasets import DuckDBParquetHandler

handler = DuckDBParquetHandler(storage_options=options.to_dict())
result = handler.execute_sql("SELECT * FROM parquet_scan('data/') WHERE category = 'A'")

# 3. Optimize performance
from fsspeckit.common.misc import run_parallel

# process_file is any user-defined function applied to each item in file_list
results = run_parallel(process_file, file_list, max_workers=4)

Advanced Workflow

# 1. Multi-cloud configuration
from fsspeckit.storage_options import AwsStorageOptions, GcsStorageOptions

aws_fs = AwsStorageOptions(region="us-east-1").to_filesystem()
gcs_fs = GcsStorageOptions(project="my-project").to_filesystem()

# 2. Cross-framework filtering
from fsspeckit.sql.filters import sql2pyarrow_filter, sql2polars_filter

pyarrow_filter = sql2pyarrow_filter("value > 100", schema)
polars_filter = sql2polars_filter("value > 100", schema)

# 3. Dataset optimization
from fsspeckit.datasets.pyarrow import optimize_parquet_dataset_pyarrow

optimize_parquet_dataset_pyarrow(
    dataset_path="s3://bucket/data/",
    z_order_columns=["category", "timestamp"],
    target_file_size="256MB"
)

Migration from Utils

The fsspeckit.utils module remains available for backward compatibility. New code should import from the domain packages:

  • from fsspeckit.utils import run_parallel → from fsspeckit.common.misc import run_parallel (Common)
  • from fsspeckit.utils import DuckDBParquetHandler → from fsspeckit.datasets import DuckDBParquetHandler (Datasets)
  • from fsspeckit.utils import sql2pyarrow_filter → from fsspeckit.sql.filters import sql2pyarrow_filter (SQL)
  • from fsspeckit.utils import AwsStorageOptions → from fsspeckit.storage_options import AwsStorageOptions (Storage Options)
  • from fsspeckit.utils import dict_to_dataframe → from fsspeckit.common.types import dict_to_dataframe (Common)

For information on migration from older versions, refer to the project release notes.

Error Handling

fsspeckit uses consistent exception types:

  • ValueError: Configuration and validation errors
  • FileNotFoundError: Missing resources
  • PermissionError: Access control issues
  • ImportError: Missing optional dependencies

Error Handling Pattern

from fsspeckit.storage_options import AwsStorageOptions
from fsspeckit.core.filesystem import filesystem

try:
    # Configure storage
    options = AwsStorageOptions(region="us-east-1", ...)
    fs = options.to_filesystem()

    # Use filesystem
    files = fs.ls("s3://bucket/")

except ValueError as e:
    # Configuration/validation errors
    print(f"Configuration error: {e}")
except FileNotFoundError as e:
    # Missing resources
    print(f"Resource not found: {e}")
except PermissionError as e:
    # Access control issues
    print(f"Access denied: {e}")
except ImportError as e:
    # Missing optional dependencies
    print(f"Missing dependency: {e}")

Optional Dependencies

fsspeckit uses lazy imports for optional dependencies:

  • Dataset operations: duckdb>=0.9.0 (pip install duckdb)
  • PyArrow operations: pyarrow>=10.0.0 (pip install pyarrow)
  • Polars support: polars>=0.19.0 (pip install polars)
  • SQL filtering: sqlglot>=20.0.0 (pip install sqlglot)
  • Fast JSON: orjson>=3.8.0 (pip install orjson)

Dependency Management

# Imports work even without dependencies
from fsspeckit.datasets import DuckDBParquetHandler
from fsspeckit.sql.filters import sql2pyarrow_filter

# Dependencies are required when actually using features
try:
    handler = DuckDBParquetHandler()
    handler.write_parquet_dataset(data, "path/")
except ImportError as e:
    print(f"Missing dependency: {e}. Install with: pip install duckdb")

Performance Considerations

Caching

Enable caching for remote filesystems:

fs = filesystem("s3://bucket/", cached=True, cache_storage="/fast/cache")

Parallel Processing

Use parallel execution for I/O bound operations:

from fsspeckit.common.misc import run_parallel

results = run_parallel(process_func, data_list, max_workers=8)

Batch Operations

Process large datasets in batches:

for batch in fs.read_parquet("data/*.parquet", batch_size="100MB"):
    process_batch(batch)

Security Features

Path Safety

Filesystems are wrapped in DirFileSystem by default for security:

# All operations confined to specified directory
fs = filesystem("/data/", dirfs=True)

Credential Scrubbing

Prevent credential leakage in logs:

from fsspeckit.common.security import scrub_credentials

error_msg = "Failed: access_key=AKIAIOSFODNN7EXAMPLE"
safe_msg = scrub_credentials(error_msg)
# Output: "Failed: access_key=[REDACTED]"

Input Validation

Validate user inputs for security:

from fsspeckit.common.security import validate_path, validate_columns

safe_path = validate_path(user_path, base_dir="/data/allowed")
safe_columns = validate_columns(user_columns, valid_columns=schema_columns)

For detailed method signatures and parameters, see the generated API documentation.