# Examples Guide
This page provides an overview of the available examples in the fsspeckit repository. Each example is designed to be runnable and demonstrates real-world usage patterns.
## Available Examples
### Storage Configuration

**Location:** `examples/storage_options/`

Demonstrates how to create and use storage option objects for different cloud providers.

Topics covered:

- Creating storage options for different providers
- Converting to fsspec filesystems
- Environment variable loading
- YAML configuration
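fsspeckit's option classes (e.g. `AwsStorageOptions.from_env()`, listed in the Quick Reference below) handle this for you. As a rough, stdlib-only sketch of what environment-variable loading involves — the function name and dict keys here are illustrative, not fsspeckit's actual API:

```python
import os

# Demo credentials so the sketch is self-contained; in practice these
# come from your shell environment or a .env file.
os.environ["AWS_ACCESS_KEY_ID"] = "demo-key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "demo-secret"

def aws_options_from_env() -> dict:
    """Hypothetical stand-in for AwsStorageOptions.from_env():
    collect AWS settings from the environment into a plain dict."""
    return {
        "key": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret": os.environ.get("AWS_SECRET_ACCESS_KEY"),
        "endpoint_url": os.environ.get("AWS_ENDPOINT_URL"),  # e.g. for R2/MinIO
    }

opts = aws_options_from_env()
print(opts["key"])
```

The real classes add provider-specific validation and conversion to fsspec filesystems; see the example directory for the full workflow.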
### Directory Filesystem (DirFileSystem)

**Location:** `examples/dir_file_system/`

Shows how to use DirFileSystem to treat a directory as the root of its own filesystem.

Topics covered:

- DirFileSystem creation and usage
- Path handling with directory boundaries
- Combining with storage options
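`DirFileSystem` is the fsspec class of the same name; a minimal sketch using plain fsspec (assuming only fsspec is installed) shows the core idea — all paths become relative to the chosen root:

```python
import os
import tempfile
from fsspec.implementations.dirfs import DirFileSystem
from fsspec.implementations.local import LocalFileSystem

# Set up a scratch directory with one file in it.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "data"))
with open(os.path.join(root, "data", "a.txt"), "w") as f:
    f.write("hello")

# Paths on the wrapped filesystem are now relative to `root`,
# and access cannot escape the directory boundary.
dirfs = DirFileSystem(path=root, fs=LocalFileSystem())
print(dirfs.ls("data"))
print(dirfs.cat("data/a.txt"))
```

The same wrapper works over remote filesystems, which is where combining it with storage options becomes useful.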
### Caching

**Location:** `examples/caching/`

Demonstrates how to improve performance using the enhanced caching mechanism.

Topics covered:

- Cache configuration and parameters
- Performance monitoring
- Cache persistence
- Handling cache invalidation
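fsspeckit's enhanced cache has its own configuration, covered in the example itself. To illustrate the underlying idea with plain fsspec's `simplecache` wrapper, which stores a local copy of remote files on first read:

```python
import tempfile
import fsspec

# Put a file on an in-memory filesystem to act as the "remote" store.
mem = fsspec.filesystem("memory")
with mem.open("/remote/data.txt", "wb") as f:
    f.write(b"cached content")

# Wrap it in a local cache layer: the first read fetches from the
# target filesystem, later reads can be served from cache_storage.
cache_dir = tempfile.mkdtemp()
fs = fsspec.filesystem(
    "simplecache",
    target_protocol="memory",
    cache_storage=cache_dir,
)
print(fs.cat("/remote/data.txt"))
```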
### Batch Processing

**Location:** `examples/batch_processing/`

Shows how to process large numbers of files in batches.

Topics covered:

- Batch reading of multiple files
- Memory-efficient processing
- Batch aggregation
- Progress tracking
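Independent of the library, the core pattern is to consume a file list in fixed-size chunks so only one batch's worth of data is in memory at a time. A stdlib-only sketch:

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(paths: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield successive lists of at most `size` items."""
    it = iter(paths)
    while batch := list(islice(it, size)):
        yield batch

files = [f"part-{i:04d}.csv" for i in range(10)]
for batch in batched(files, 4):
    # In a real pipeline, each batch would be read, processed, and
    # aggregated here before moving on to the next one.
    print(len(batch))
```

(Python 3.12+ ships an equivalent `itertools.batched`.) fsspeckit's batch readers wrap this pattern, as in `fs.read_csv(..., batch_size=N)` from the Quick Reference below.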
### Reading Folders

**Location:** `examples/read_folder/`

Demonstrates reading multiple files in various formats from a directory.

Topics covered:

- Glob patterns for file discovery
- Format-specific readers
- Schema unification
- Recursive directory traversal
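A stdlib sketch of the discovery step: a recursive glob finds files of one format at any depth, after which a format-specific reader would be dispatched on the suffix:

```python
import tempfile
from pathlib import Path

# Create a small directory tree with mixed formats.
root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / "a.csv").write_text("x\n1\n")
(root / "sub" / "b.csv").write_text("x\n2\n")
(root / "notes.txt").write_text("ignore me\n")

# `rglob` traverses recursively; non-matching formats are skipped.
csv_files = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.csv"))
print(csv_files)
```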
### S3/R2/MinIO with PyArrow Datasets

**Location:** `examples/s3_pyarrow_dataset/`

Shows how to work with partitioned datasets on object storage.

Topics covered:

- Cloud object storage configuration
- PyArrow dataset operations
- Partitioned dataset reading
- Metadata handling
- Predicate pushdown
### Delta Lake Integration

**Location:** `examples/deltalake_delta_table/`

Demonstrates integration with Delta Lake.

Topics covered:

- Creating Delta tables
- Storage options integration
- Reading Delta metadata
- Version tracking
### PyDala Dataset

**Location:** `examples/__pydala_dataset/`

Shows how to work with PyDala datasets.

Topics covered:

- PyDala dataset format
- Format conversions
- Dataset metadata
## Running Examples

### Prerequisites
Install fsspeckit with all optional dependencies:
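This page does not list the full set of extras; the command below matches the extras named in the Troubleshooting section and may need adjusting for your providers:

```shell
pip install "fsspeckit[aws,gcp,azure]"
```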
For Delta Lake examples:
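The Delta Lake examples additionally need the `deltalake` package, which provides the `DeltaTable` class used in the Quick Reference below:

```shell
pip install deltalake
```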
### Execution
Most examples can be run directly:
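For instance, following the naming convention described under Example Naming Conventions (the exact file name inside each directory may differ):

```shell
cd examples/storage_options
python storage_options_example.py
```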
### Jupyter Notebooks

Examples are available both as `.py` scripts and as `.ipynb` Jupyter notebooks for interactive exploration:
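For instance (assuming Jupyter is installed; the notebook name follows the convention below and may differ):

```shell
jupyter lab examples/caching/caching_example.ipynb
```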
## Example Naming Conventions

- `*_example.py` - Standard Python script version
- `*_example.ipynb` - Jupyter notebook version
- `*_example_mamo.py` - Alternative implementation (if available)
## Contributing Examples

To contribute a new example:

1. Create a new subdirectory under `examples/` with a descriptive name
2. Add both `.py` and `.ipynb` versions
3. Include sample data generation if needed
4. Add docstrings explaining the example
5. Update this guide with the new example
## Quick Reference

| Use Case | Example | Key Methods |
|---|---|---|
| Cloud storage access | `storage_options/` | `AwsStorageOptions.from_env()` |
| Local directory handling | `dir_file_system/` | `filesystem(..., dirfs=True)` |
| Performance optimization | `caching/` | `filesystem(..., cached=True)` |
| Large data processing | `batch_processing/` | `fs.read_csv(..., batch_size=N)` |
| Multi-format reading | `read_folder/` | `fs.read_files(..., format='auto')` |
| Object storage datasets | `s3_pyarrow_dataset/` | `fs.pyarrow_dataset(...)` |
| Data lake integration | `deltalake_delta_table/` | `DeltaTable(..., storage_options=...)` |
## Troubleshooting Examples

- **Missing dependencies:** install with `pip install "fsspeckit[aws,gcp,azure]"`.
- **Cloud credentials not found:** set the relevant environment variables, or edit the examples to supply credentials directly.
- **Out of memory:** reduce the batch size in the batch processing examples.
- **Network errors:** check connectivity to the cloud services used by the cloud storage examples.
For more help, see the Advanced Usage guide or check individual example source code.