Advanced Usage¶
fsspec-utils extends the capabilities of fsspec to provide a more robust and feature-rich experience for handling diverse file systems and data formats. This section delves into advanced features, configurations, and performance tips to help you get the most out of the library.
Unified Filesystem Creation with filesystem¶
The fsspec_utils.core.filesystem function offers a centralized and enhanced way to instantiate fsspec filesystem objects. It supports:
- Intelligent Caching: Automatically wraps filesystems with MonitoredSimpleCacheFileSystem for improved performance and verbose logging of cache operations.
- Structured Storage Options: Integrates seamlessly with fsspec_utils.storage_options classes, allowing for type-safe and organized configuration of cloud and Git-based storage.
- Protocol Inference: Can infer the filesystem protocol directly from a URI or path, reducing boilerplate.
Example: Cached S3 Filesystem with Structured Options
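A minimal sketch of this pattern is shown below. The AwsStorageOptions field names and the storage_options keyword are assumptions used for illustration; the cached, cache_storage, and verbose parameters are described in the caching section below.

```python
from fsspec_utils.core import filesystem
from fsspec_utils.storage_options import AwsStorageOptions

# Structured, type-safe S3 configuration (field names are illustrative;
# check the AwsStorageOptions signature in your installed version).
s3_options = AwsStorageOptions(
    protocol="s3",
    access_key_id="YOUR_ACCESS_KEY",
    secret_access_key="YOUR_SECRET_KEY",
    region="us-east-1",
)

# Create an S3 filesystem wrapped in MonitoredSimpleCacheFileSystem.
# The storage_options keyword is assumed; cached, cache_storage, and
# verbose are documented below.
fs = filesystem(
    "s3",
    storage_options=s3_options,
    cached=True,
    cache_storage="/tmp/fsspec_cache",
    verbose=True,
)

# Repeated reads of the same object are served from the local cache.
with fs.open("my-bucket/data/events.json", "rb") as f:
    payload = f.read()
```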
Detailed Caching for Improved Performance¶
fsspec-utils provides an enhanced caching mechanism that improves performance for repeated file operations; it is especially useful for remote filesystems, where network latency can significantly impact performance.
This example demonstrates how caching improves read performance. The first read populates the cache, while subsequent reads retrieve data directly from the cache, significantly reducing access time. It also shows that data can still be retrieved from the cache even if the original source becomes unavailable.
The filesystem() function provides several parameters for configuring caching:
- cached: When set to True, enables caching for all read operations
- cache_storage: Specifies the directory where cached files will be stored
- verbose: When set to True, provides detailed logging about cache operations
Step-by-step walkthrough:
- First read (populating cache): When reading a file for the first time, the data is retrieved from the source (disk, network, etc.) and stored in the cache directory. This takes longer than subsequent reads because it involves both reading from the source and writing to the cache.
- Second read (using cache): When the same file is read again, the data is retrieved directly from the cache instead of the source. This is significantly faster because it avoids network latency or disk I/O.
- Demonstrating cache effectiveness: Even after the original file is removed, the cached version can still be accessed. This demonstrates that the cache acts as a persistent copy of the data, independent of the source file.
- Performance comparison: The timing results clearly show the performance benefits of caching, with subsequent reads being orders of magnitude faster than the initial read.
This caching mechanism is particularly valuable when working with:
- Remote filesystems (S3, GCS, Azure) where network latency is a bottleneck
- Frequently accessed files that don't change often
- Applications that read the same data multiple times
- Environments with unreliable network connections
Setup and First Read (Populating Cache)¶
In this step, we create a sample JSON file and initialize the fsspec-utils filesystem with caching enabled. The first read operation retrieves data from the source and populates the cache.
Setup steps:
- Create a temporary directory for our example
- Create sample data file
- Configure filesystem with caching
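A minimal, self-contained sketch of this setup, using the local file protocol so it runs without any remote credentials (with a remote filesystem the timing difference is far more pronounced):

```python
import json
import os
import tempfile
import time

from fsspec_utils.core import filesystem

# 1. Temporary directory for the example
tmp_dir = tempfile.mkdtemp()
data_path = os.path.join(tmp_dir, "data.json")

# 2. Sample data file
with open(data_path, "w") as f:
    json.dump({"id": list(range(1000)), "value": ["x"] * 1000}, f)

# 3. Filesystem with caching enabled
cache_dir = os.path.join(tmp_dir, "cache")
fs = filesystem("file", cached=True, cache_storage=cache_dir, verbose=True)

# First read: retrieves from the source and populates the cache
start = time.perf_counter()
with fs.open(data_path, "rb") as f:
    data = f.read()
first_read = time.perf_counter() - start
print(f"First read (cache miss): {first_read:.4f}s, {len(data)} bytes")
```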
Second Read (Using Cache)¶
Now, let's read the same file again to see the performance improvement from using the cache.
The second read retrieves data directly from the cache, which is significantly faster than reading from the source again.
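Continuing the sketch above:

```python
# Second read: served from the cache directory instead of the source
start = time.perf_counter()
with fs.open(data_path, "rb") as f:
    data_cached = f.read()
second_read = time.perf_counter() - start

print(f"Second read (cache hit): {second_read:.4f}s")
print(f"Speedup vs. first read: {first_read / max(second_read, 1e-9):.1f}x")
```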
Reading from Cache after Deletion¶
To demonstrate that the cache is persistent, we'll remove the original file and try to read it again.
This step proves that the cache acts as a persistent copy of the data, allowing access even if the original source is unavailable.
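Continuing the sketch (as the walkthrough above describes, the cached copy is served without consulting the now-missing source):

```python
# Remove the original source file
os.remove(data_path)

# The cached copy is still readable even though the source is gone
with fs.open(data_path, "rb") as f:
    data_after_delete = f.read()

print(f"Read {len(data_after_delete)} bytes from cache after source deletion")
assert data_after_delete == data
```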
Custom Filesystem Implementations¶
fsspec-utils provides specialized filesystem implementations for unique use cases:
GitLab Filesystem (GitLabFileSystem)¶
Access files directly from GitLab repositories. This is particularly useful for configuration files, datasets, or code stored in private or public GitLab instances.
Example: Reading from a GitLab Repository
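A hedged sketch of the pattern; the import path and constructor parameters (project_name, token, branch) are assumptions, so check the GitLabFileSystem signature in your installed version:

```python
# Import path may differ in your installed version of fsspec-utils.
from fsspec_utils.core.filesystems import GitLabFileSystem

# Constructor parameter names below are assumptions; a private token is
# typically required for non-public projects.
fs = GitLabFileSystem(
    project_name="my-group/my-project",
    token="glpat-xxxxxxxxxxxxxxxxxxxx",
    branch="main",
)

# Read a configuration file straight from the repository
with fs.open("configs/settings.yml", "r") as f:
    print(f.read())

# List files in a repository directory
print(fs.ls("data"))
```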
Advanced Data Reading and Writing (read_files, write_files)¶
The fsspec_utils.core.ext module (exposed via AbstractFileSystem extensions) provides powerful functions for reading and writing various data formats (JSON, CSV, Parquet) with advanced features like:
- Batch Processing: Efficiently handle large datasets by processing files in configurable batches.
- Parallel Processing: Leverage multi-threading to speed up file I/O operations.
- Schema Unification & Optimization: Automatically unifies schemas when concatenating multiple files and optimizes data types for memory efficiency (e.g., using Polars' opt_dtypes or PyArrow's schema casting).
- File Path Tracking: Optionally include the source file path as a column in the resulting DataFrame/Table.
Universal read_files¶
The read_files function acts as a universal reader, delegating to format-specific readers (JSON, CSV, Parquet) while maintaining consistent options.
Example: Reading CSVs in Batches with Parallelism
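A minimal sketch combining batch_size and use_threads with the CSV reader that read_files delegates to (the paths are illustrative):

```python
from fsspec_utils.core import filesystem

fs = filesystem("file")

# Read all CSVs under data/ ten files at a time, using threads for
# concurrent I/O within each batch.
for batch in fs.read_csv("data/**/*.csv", batch_size=10, use_threads=True):
    # Each batch is a Polars DataFrame combining up to ten files
    print(batch.shape)
```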
Reading and Processing Multiple Files (PyArrow Tables, Batch Processing)¶
fsspec-utils simplifies reading multiple files of various formats (Parquet, CSV, JSON) from a folder into a single PyArrow Table or Polars DataFrame.
Reading multiple files into a single table is a powerful feature that allows you to efficiently process data distributed across multiple files. This is particularly useful when dealing with large datasets that are split into smaller files for better organization or parallel processing.
Key concepts demonstrated:
- Glob patterns: The **/*.parquet, **/*.csv, and **/*.json patterns are used to select files recursively from the directory and its subdirectories. The ** pattern matches any directories, allowing the function to find files in nested directories.
- Concat parameter: The concat=True parameter tells the function to combine data from multiple files into a single table or DataFrame. When set to False, the function returns a list of individual tables/DataFrames.
- Format flexibility: The same interface can be used to read different file formats (Parquet, CSV, JSON), making it easy to work with heterogeneous data sources.
Step-by-step explanation:
- Creating sample data: We create two subdirectories and populate them with sample data in three different formats (Parquet, CSV, JSON). Each format contains the same structured data, just in a different serialization format.
- Reading Parquet files: Using fs.read_parquet("**/*.parquet", concat=True), we read all Parquet files recursively and combine them into a single PyArrow Table. Parquet is a columnar storage format that is highly efficient for analytical workloads.
- Reading CSV files: Using fs.read_csv("**/*.csv", concat=True), we read all CSV files and combine them into a Polars DataFrame, which we then convert to a PyArrow Table for consistency.
- Reading JSON files: Using fs.read_json("**/*.json", as_dataframe=True, concat=True), we read all JSON files and combine them into a Polars DataFrame, then convert it to a PyArrow Table.
- Verification: Finally, we verify that all three tables have the same number of rows, confirming that the data was correctly read and combined across all files and formats.
The flexibility of fsspec-utils allows you to use the same approach with different data sources, including remote filesystems like S3, GCS, or Azure Blob Storage, simply by changing the filesystem path.
Setup¶
First, we'll create a temporary directory with sample data in different formats.
Setup steps:
- Create a temporary directory for our example
- Create sample data in subdirectories
This step sets up the environment by creating a temporary directory and populating it with sample data files.
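A sketch of this setup; it assumes that passing a plain directory path to filesystem() scopes the resulting filesystem to that directory, so the relative glob patterns used below resolve against it:

```python
import json
import os
import tempfile

import polars as pl

from fsspec_utils.core import filesystem

# 1. Temporary directory with two subdirectories
tmp_dir = tempfile.mkdtemp()

# 2. The same sample data written as Parquet, CSV, and JSON in each subdirectory
for i, sub in enumerate(("part1", "part2")):
    sub_dir = os.path.join(tmp_dir, sub)
    os.makedirs(sub_dir, exist_ok=True)
    df = pl.DataFrame({"id": [i * 3 + 1, i * 3 + 2, i * 3 + 3], "name": ["a", "b", "c"]})
    df.write_parquet(os.path.join(sub_dir, "data.parquet"))
    df.write_csv(os.path.join(sub_dir, "data.csv"))
    with open(os.path.join(sub_dir, "data.json"), "w") as f:
        json.dump(df.to_dicts(), f)

# 3. Filesystem for the temporary directory (protocol and base path are
#    assumed to be inferred from the plain path)
fs = filesystem(tmp_dir)
```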
Reading Parquet Files¶
Now, let's read all the Parquet files from the directory and its subdirectories into a single PyArrow Table.
Reading Parquet files:
- Read Parquet files using glob pattern
- Display table information and sample data
We use the read_parquet method with the glob pattern **/*.parquet to find all Parquet files recursively. The concat=True parameter combines them into a single table.
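Continuing the sketch above:

```python
# Read every Parquet file under the directory tree into one PyArrow Table
parquet_table = fs.read_parquet("**/*.parquet", concat=True)
print(parquet_table.schema)
print(f"Parquet rows: {parquet_table.num_rows}")
```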
Reading CSV Files¶
Next, we'll read all the CSV files into a Polars DataFrame and then convert it to a PyArrow Table.
Reading CSV files:
- Read CSV files using glob pattern
- Display DataFrame information and sample data
- Convert to PyArrow Table for consistency
Similarly, we use read_csv with the same glob pattern to read all CSV files.
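Continuing the sketch:

```python
# Read every CSV file into a single Polars DataFrame
csv_df = fs.read_csv("**/*.csv", concat=True)
print(csv_df.head())

# Convert to a PyArrow Table for consistency with the Parquet result
csv_table = csv_df.to_arrow()
```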
Reading JSON Files¶
Finally, let's read all the JSON files.
Reading JSON files:
- Read JSON files using glob pattern
- Display DataFrame information and sample data
- Convert to PyArrow Table for consistency
The read_json method is used to read all JSON files. We set as_dataframe=True to get a Polars DataFrame.
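Continuing the sketch:

```python
# Read every JSON file into a single Polars DataFrame
json_df = fs.read_json("**/*.json", as_dataframe=True, concat=True)
print(json_df.head())

# Convert to a PyArrow Table for consistency
json_table = json_df.to_arrow()
```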
Verification¶
Let's verify that all the tables have the same number of rows.
This final step confirms that our data reading and concatenation were successful.
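A minimal check, reusing the tables from the sketches above:

```python
# All three tables should contain the same number of rows
assert parquet_table.num_rows == csv_table.num_rows == json_table.num_rows
print(f"All formats agree: {parquet_table.num_rows} rows each")
```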
This example shows how to read various file formats from a directory, including subdirectories, into a unified PyArrow Table or Polars DataFrame. It highlights the flexibility of fsspec-utils in handling different data sources and formats.
fsspec-utils enables efficient batch processing of large datasets by reading files in smaller, manageable chunks. This is particularly useful for memory-constrained environments or when processing streaming data.
Batch processing is a technique for handling large datasets by dividing them into smaller, manageable chunks. This is particularly important for:
- Memory-constrained environments: When working with datasets that are too large to fit in memory, batch processing allows you to process the data incrementally.
- Streaming data: When data is continuously generated (e.g., from IoT devices or real-time applications), batch processing enables you to process data as it arrives.
- Distributed processing: In distributed computing environments, batch processing allows different nodes to work on different chunks of data simultaneously.
The batch_size parameter controls how many files or records are processed together in each batch. A smaller batch size reduces memory usage but may increase processing overhead, while a larger batch size improves throughput but requires more memory.
Step-by-step walkthrough:
- Creating sample batched data: We generate sample data and distribute it across multiple files in each format (Parquet, CSV, JSON). Each file contains a subset of the total data, simulating a real-world scenario where data is split across multiple files.
- Reading Parquet files in batches: Using fs.read_parquet(parquet_path, batch_size=2), we read all Parquet files in batches of 2 files at a time. Each iteration of the loop processes a batch of files, and the batch variable contains the combined data from those files.
- Reading CSV files in batches: Similarly, we use fs.read_csv(csv_path, batch_size=2) to read CSV files in batches. The result is a Polars DataFrame for each batch, which we can process individually.
- Reading JSON files in batches: Finally, we use fs.read_json(json_path, batch_size=2) to read JSON files in batches. The JSON data is automatically converted to Polars DataFrames for easy processing.
As with Parquet and CSV, the read_json method accepts batch_size=2 to process JSON files in batches.
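A sketch of the batched loops, reusing the filesystem and sample files created earlier in this section; len() is used for row counts so the loop works whether a batch arrives as a PyArrow Table or a Polars DataFrame:

```python
# Read two files per batch; each iteration yields the combined data for one batch
for batch in fs.read_parquet("**/*.parquet", batch_size=2):
    print(f"Parquet batch: {len(batch)} rows")

for batch in fs.read_csv("**/*.csv", batch_size=2):
    print(f"CSV batch: {len(batch)} rows")

for batch in fs.read_json("**/*.json", batch_size=2):
    print(f"JSON batch: {len(batch)} rows")
```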
This example illustrates how to read Parquet, CSV, and JSON files in batches using the batch_size parameter. This approach allows processing of large datasets without loading the entire dataset into memory at once.
Advanced Parquet Handling and Delta Lake Integration¶
fsspec-utils enhances Parquet operations with deep integration with PyArrow and Pydala, enabling efficient dataset management, partitioning, and Delta Lake capabilities.
- pyarrow_dataset: Create PyArrow datasets for optimized querying, partitioning, and predicate pushdown.
- pyarrow_parquet_dataset: Specialized for Parquet, handling _metadata files for overall dataset schemas.
- pydala_dataset: Integrates with pydala for advanced features like Delta Lake operations (upserts, schema evolution).
Example: Writing to a PyArrow Dataset with Partitioning
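The pyarrow_dataset helper itself is not shown here; the sketch below uses PyArrow's dataset API directly with an fsspec-utils filesystem, which the helper described above is assumed to build on:

```python
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds

from fsspec_utils.core import filesystem

fs = filesystem("file")
base_dir = tempfile.mkdtemp()

table = pa.table({
    "year": [2023, 2023, 2024, 2024],
    "region": ["eu", "us", "eu", "us"],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Write a hive-partitioned Parquet dataset (partitioned by year/region)
ds.write_dataset(
    table,
    base_dir=base_dir,
    format="parquet",
    partitioning=["year", "region"],
    partitioning_flavor="hive",
    filesystem=fs,
    existing_data_behavior="overwrite_or_ignore",
)

# Re-open the dataset and read only one partition (partition pruning / predicate pushdown)
dataset = ds.dataset(base_dir, format="parquet", partitioning="hive", filesystem=fs)
subset = dataset.to_table(filter=(ds.field("year") == 2024) & (ds.field("region") == "eu"))
print(subset.to_pydict())
```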
Example: Delta Lake Operations with Pydala Dataset
fsspec-utils facilitates integration with Delta Lake by providing StorageOptions that can be used to configure deltalake's DeltaTable for various storage backends.
This example demonstrates how to use LocalStorageOptions with deltalake's DeltaTable. It shows how to initialize a DeltaTable instance by passing the fsspec-utils storage options, enabling seamless interaction with Delta Lake tables across different storage types.
Step-by-step walkthrough:
- Create a temporary directory for our example
- Create a simple Polars DataFrame
- Write initial data to create the Delta table
- Create a LocalStorageOptions object for the temporary directory
- Create a DeltaTable instance, passing storage options
- Note: deltalake expects storage_options as a dict, which to_object_store_kwargs provides
- Read data from the DeltaTable
- Clean up the temporary directory
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It provides a reliable, scalable, and performant way to work with data lakes, combining the benefits of data lakes (low cost, flexibility) with data warehouses (reliability, performance).
Key features of Delta Lake:
- ACID transactions: Ensures data integrity even with concurrent operations
- Time travel: Allows querying data as it existed at any point in time
- Schema enforcement: Maintains data consistency with schema validation
- Scalable metadata: Handles billions of files efficiently
- Unified analytics: Supports both batch and streaming workloads
Integrating fsspec-utils with Delta Lake:
The fsspec-utils StorageOptions classes can be used to configure deltalake's DeltaTable for various storage backends. This integration allows you to:
- Use consistent configuration patterns across different storage systems
- Leverage the benefits of fsspec's unified filesystem interface
- Seamlessly switch between local and cloud storage without changing your Delta Lake code
The to_object_store_kwargs() method converts fsspec-utils storage options into the dictionary format that deltalake expects for its storage_options parameter. This is necessary because deltalake requires storage options as a dictionary, while fsspec-utils provides them as structured objects.
Step-by-step walkthrough:
- Creating a temporary directory: We create a temporary directory to store our Delta table, ensuring the example is self-contained and doesn't leave artifacts on your system.
- Creating sample data: We create a simple Polars DataFrame with sample data that will be written to our Delta table.
- Writing to Delta table: Using the write_delta method, we convert our DataFrame into a Delta table. This creates the necessary Delta Lake metadata alongside the data files.
- Configuring storage options: We create a LocalStorageOptions object that points to our temporary directory. This object contains all the information needed to access the Delta table.
- Initializing DeltaTable: We create a DeltaTable instance by passing the table path and the storage options converted to a dictionary via to_object_store_kwargs(). This allows deltalake to locate and access the Delta table files.
- Verifying the DeltaTable: We check the version and files of our Delta table to confirm it was created correctly. Delta tables maintain version history, allowing you to track changes over time.
- Reading data: Finally, we read the data from our Delta table back into a PyArrow Table, demonstrating that we can successfully interact with the Delta Lake table using the fsspec-utils configuration.
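A runnable sketch of these steps (the LocalStorageOptions constructor argument name is an assumption; everything else follows the walkthrough above):

```python
import tempfile

import polars as pl
from deltalake import DeltaTable

from fsspec_utils.storage_options import LocalStorageOptions

# 1. Temporary directory for the Delta table
table_path = tempfile.mkdtemp()

# 2. Simple Polars DataFrame
df = pl.DataFrame({"id": [1, 2, 3], "name": ["alice", "bob", "carol"]})

# 3. Write the initial data, creating the Delta table and its metadata
df.write_delta(table_path)

# 4. Structured storage options pointing at the temporary directory
#    (the "path" parameter name is an assumption; check your version)
options = LocalStorageOptions(path=table_path)

# 5. deltalake expects storage_options as a plain dict,
#    which to_object_store_kwargs() provides
dt = DeltaTable(table_path, storage_options=options.to_object_store_kwargs())

# 6. Verify version and files
print(f"Delta table version: {dt.version()}")
print(f"Files: {dt.files()}")

# 7. Read the data back as a PyArrow Table
print(dt.to_pyarrow_table())
```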
This integration is particularly valuable when working with Delta Lake in cloud environments, as it allows you to use the same configuration approach for local development and production deployments across different cloud providers.
Storage Options Management¶
fsspec-utils provides a robust system for managing storage configurations, simplifying credential handling and environment setup.
Loading from Environment Variables¶
Instead of hardcoding credentials, you can load storage options directly from environment variables.
Example: Loading AWS S3 Configuration from Environment
Set the relevant AWS environment variables before running your script, then load them in Python:
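A sketch of the pattern; the environment variable names and the from_env() constructor are assumptions modeled on common AWS conventions, so check the AwsStorageOptions API in your installed version:

```python
import os

from fsspec_utils.storage_options import AwsStorageOptions

# Variables commonly used by the AWS ecosystem; shown here only for illustration
os.environ.setdefault("AWS_ACCESS_KEY_ID", "YOUR_ACCESS_KEY")
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "YOUR_SECRET_KEY")
os.environ.setdefault("AWS_DEFAULT_REGION", "us-east-1")

# from_env() is assumed to be the constructor that reads these variables
s3_options = AwsStorageOptions.from_env()

fs = s3_options.to_filesystem()
print(type(fs).__name__)
```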
Merging Storage Options¶
Combine multiple storage option configurations, useful for layering default settings with user-specific overrides.
Example: Merging S3 Options
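The library's own merge helper is not shown here; as a minimal sketch, and assuming AwsStorageOptions accepts keyword arguments like these, defaults can be layered with user overrides via plain dict merging:

```python
from fsspec_utils.storage_options import AwsStorageOptions

# Constructor field names are illustrative assumptions; fsspec-utils may also
# ship a dedicated merge helper, in which case prefer that.
defaults = {"region": "us-east-1", "endpoint_url": "https://s3.amazonaws.com"}
user_overrides = {
    "region": "eu-west-1",
    "access_key_id": "USER_KEY",
    "secret_access_key": "USER_SECRET",
}

# Later values win, so user settings override the defaults
merged = {**defaults, **user_overrides}
s3_options = AwsStorageOptions(**merged)
print(s3_options)
```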
Note on GitHub Examples¶
For a comprehensive collection of executable examples demonstrating various functionalities and advanced patterns of fsspec-utils, including those discussed in this document, please refer to the examples directory on GitHub. Each example is designed to be runnable and provides detailed insights into practical usage.
Performance Tips¶
- Caching: Always consider using cached=True with the filesystem function, especially for remote filesystems, to minimize repeated downloads.
- Parallel Reading: For multiple files, set use_threads=True in read_json, read_csv, and read_parquet to leverage concurrent I/O.
- Batch Processing: When dealing with a very large number of files or extremely large individual files, use the batch_size parameter in reading functions to process data in chunks, reducing memory footprint.
- opt_dtypes: Utilize opt_dtypes=True in reading functions when converting to Polars or PyArrow to automatically optimize column data types, leading to more efficient memory usage and faster subsequent operations.
- Parquet Datasets: For large, partitioned Parquet datasets, use pyarrow_dataset or pydala_dataset. These leverage PyArrow's dataset API for efficient metadata handling, partition pruning, and predicate pushdown, reading only the necessary data.
- Compression: When writing Parquet files, choose an appropriate compression codec (e.g., zstd, snappy) to reduce file size and improve I/O performance. zstd often provides a good balance of compression ratio and speed.
Flexible Storage Configuration¶
fsspec-utils simplifies configuring connections to various storage systems, including local filesystems, AWS S3, Azure Storage, and Google Cloud Storage, using StorageOptions classes. These options can then be converted into fsspec filesystems.
Local Storage Example¶
This example demonstrates how to initialize LocalStorageOptions and use it to interact with the local filesystem.
Step-by-step walkthrough:
- Create a temporary directory for our test
- Create a test file and write content to it
- List files in the directory to verify our file was created
- Read the content back to verify it was written correctly
- Clean up the temporary directory
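A sketch of these steps (LocalStorageOptions is constructed without arguments here; your version may accept a base path):

```python
import tempfile

from fsspec_utils.storage_options import LocalStorageOptions

# 1. Temporary directory for the test
tmp_dir = tempfile.mkdtemp()

# 2. Local storage options converted into an fsspec filesystem
local_options = LocalStorageOptions()
fs = local_options.to_filesystem()

# 3. Write a test file
test_file = f"{tmp_dir}/hello.txt"
with fs.open(test_file, "w") as f:
    f.write("Hello from fsspec-utils!")

# 4. List the directory and read the content back
print(fs.ls(tmp_dir))
with fs.open(test_file, "r") as f:
    print(f.read())

# 5. Clean up
fs.rm(tmp_dir, recursive=True)
```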
StorageOptions classes simplify configuration for different storage systems and provide a consistent interface for creating fsspec filesystem objects.
Conceptual AWS S3 Configuration¶
This example demonstrates the configuration pattern for AwsStorageOptions. It is expected to fail when attempting to connect to actual cloud services because it uses dummy credentials.
Note: The to_filesystem() method converts StorageOptions into fsspec-compatible objects, allowing seamless integration with any fsspec-compatible library.
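A hedged sketch of the pattern; the field names are assumptions modeled on common S3 settings, and the values are placeholders that will not authenticate against a real service:

```python
from fsspec_utils.storage_options import AwsStorageOptions

print("=== Conceptual AwsStorageOptions Example (dummy credentials) ===\n")

# Field names are assumptions; the dummy values below will not authenticate.
aws_options = AwsStorageOptions(
    protocol="s3",
    access_key_id="DUMMY_ACCESS_KEY",
    secret_access_key="DUMMY_SECRET_KEY",
    region="us-east-1",
    endpoint_url="http://localhost:9000",  # e.g. a local MinIO-style endpoint
)

s3_fs = aws_options.to_filesystem()
print(f"Created fsspec filesystem for S3: {type(s3_fs).__name__}")
# Any real operation (e.g. s3_fs.ls("my-bucket")) is expected to fail
# because the credentials and endpoint are placeholders.
```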
Conceptual Azure Configuration¶
This example shows how to configure AzureStorageOptions. It is expected to fail when attempting to connect to actual cloud services because it uses dummy credentials.
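A hedged sketch; the field names are assumptions modeled on common Azure Blob settings, and the values are placeholders:

```python
from fsspec_utils.storage_options import AzureStorageOptions

print("=== Conceptual AzureStorageOptions Example (dummy credentials) ===\n")

# Field names are assumptions; the placeholder values will not authenticate.
azure_options = AzureStorageOptions(
    protocol="az",
    account_name="dummyaccount",
    account_key="ZHVtbXlrZXk=",
)

az_fs = azure_options.to_filesystem()
print(f"Created fsspec filesystem for Azure: {type(az_fs).__name__}")
# Real operations (e.g. az_fs.ls("my-container")) are expected to fail
# with these dummy values.
```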
Conceptual GCS Configuration¶
This example shows how to configure GcsStorageOptions. It is expected to fail when attempting to connect to actual cloud services because it uses dummy credentials.
StorageOptions classes provide a simplified, consistent interface for configuring connections to various storage systems. They abstract away the complexity of different storage backends and provide a unified way to create fsspec filesystem objects.
The to_filesystem() method converts these options into fsspec-compatible objects, enabling seamless integration with any fsspec-compatible library or tool.
Important Note: The AWS, Azure, and GCS examples use dummy credentials and are for illustrative purposes only. These examples are expected to fail when attempting to connect to actual cloud services because:
- The endpoint URLs are not real service endpoints
- The credentials are placeholder values that don't correspond to actual accounts
- The connection strings and tokens are examples, not valid credentials
This approach allows you to understand the configuration pattern without needing actual cloud credentials. When using these examples in production, you would replace the dummy values with your real credentials and service endpoints.
The conceptual GCS configuration example referenced above:

```python
from fsspec_utils.storage_options import GcsStorageOptions

print("=== Conceptual GcsStorageOptions Example (using a dummy token path) ===\n")

# Configure GCS access with a dummy service-account token path
gcs_options = GcsStorageOptions(
    protocol="gs",
    project="demo-project",
    token="path/to/dummy-service-account.json",
)

# Convert the structured options into an fsspec filesystem object
gcs_fs = gcs_options.to_filesystem()
print(f"Created fsspec filesystem for GCS: {type(gcs_fs).__name__}")
print("GCS storage example completed.\n")
```