Architecture Overview¶
fsspeckit is designed to extend and enhance the capabilities of fsspec, providing a robust and flexible framework for interacting with various filesystems and data formats. Its architecture is modular, built around core components that abstract away complexities and offer specialized functionalities.
Extending fsspec¶
At its core, fsspeckit builds upon the fsspec (Filesystem Spec) library, which provides a unified Pythonic interface to various storage backends. fsspeckit extends this functionality by:
- Simplifying Storage Configuration: It offers
StorageOptionsclasses for various cloud providers (AWS S3, Google Cloud Storage, Azure Storage) and Git platforms (GitHub, GitLab), allowing for easier and more consistent configuration of filesystem access. - Enhancing I/O Operations: It provides extended read/write capabilities for common data formats like JSON, CSV, and Parquet, with integrations for high-performance libraries like Polars and PyArrow.
- Improving Caching: The library includes an enhanced caching mechanism that preserves directory structures and offers better monitoring.
Core Components¶
The fsspeckit library is organized into several key modules:
core¶
This module contains the fundamental extensions to fsspec. It includes the filesystem function, which acts as a central factory for creating fsspec compatible filesystem objects, potentially with enhanced features like caching and extended I/O. The DirFileSystem class is also part of this module, providing specialized handling for directory-based filesystems.
storage_options¶
This module is dedicated to managing storage configurations for different backends. It defines various StorageOptions classes (e.g., AwsStorageOptions, GcsStorageOptions, AzureStorageOptions, GitHubStorageOptions, GitLabStorageOptions) that encapsulate the necessary parameters for connecting to specific storage services. It also includes utility functions for inferring protocols from URIs and merging storage options.
utils¶
The utils module provides a collection of general-purpose utility functions that support various operations within fsspeckit. These include:
- Parallel Processing: Functions like
run_parallelfor executing tasks concurrently. - Type Conversion: Utilities such as
dict_to_dataframeandto_pyarrow_tablefor data manipulation. - Logging: A setup for consistent logging across the library.
- PyArrow and Polars Integration: A lot of utility functions, e.g. for optimizing data types and schemas when working with PyArrow tables and Polars DataFrames.
Diagrams¶
graph TD
A[fsspeckit] --> B(Core Module)
A --> C(Storage Options Module)
A --> D(Utils Module)
B --> E[Extends fsspec]
C --> F{Cloud Providers}
C --> G{Git Platforms}
D --> H[Parallel Processing]
D --> I[Type Conversion]
D --> J[Logging]
D --> K[PyArrow/Polars Integration]
F --> L(AWS S3)
F --> M(Google Cloud Storage)
F --> N(Azure Storage)
G --> O(GitHub)
G --> P(GitLab)