Getting Started
This tutorial will walk you through your first steps with fsspeckit. You'll learn how to install the library, work with local and cloud storage, and perform basic dataset operations.
Prerequisites
- Python 3.11 or higher
- Basic familiarity with Python and data concepts
Installation
First, install fsspeckit with the dependencies you need:
| # Basic installation
pip install fsspeckit
# With cloud storage support
pip install "fsspeckit[aws,gcp,azure]"
# With all optional dependencies for data processing
pip install "fsspeckit[aws,gcp,azure]" duckdb pyarrow polars sqlglot
|
For detailed installation instructions, see the Installation Guide.
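To confirm the installation, you can import the package and print its version. This is a minimal check; the __version__ attribute is assumed here and may not be present in every build, so getattr is used defensively:
| import fsspeckit
# __version__ is assumed to exist at the package level; a successful import
# alone already confirms the installation
print(getattr(fsspeckit, "__version__", "installed"))
|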
Your First Local Filesystem
Let's start by creating a local filesystem and performing basic operations:
| from fsspeckit.core.filesystem import filesystem
import os
# Create a local filesystem
# Note: filesystem() wraps the filesystem in DirFileSystem by default (dirfs=True)
# for path safety, confining all operations to the specified directory
fs = filesystem("file")
# Define a directory path
local_dir = "./my_data/"
os.makedirs(local_dir, exist_ok=True)
# Create and write a file
with fs.open(f"{local_dir}example.txt", "w") as f:
f.write("Hello, fsspeckit!")
# Read the file
with fs.open(f"{local_dir}example.txt", "r") as f:
content = f.read()
print(f"Content: {content}")
# List files in directory
files = fs.ls(local_dir)
print(f"Files: {files}")
|
Path Safety: The filesystem() function wraps filesystems in DirFileSystem by default (dirfs=True), which confines all operations to the specified directory path. This prevents accidental access to paths outside the intended directory.
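To see that scoping in action, here is a minimal sketch. It assumes filesystem() also accepts a local directory path and roots the DirFileSystem there, mirroring the URI-based examples in the Protocol Inference section below; check the API reference if your version differs:
| from fsspeckit.core.filesystem import filesystem
# Assumption: passing a local directory roots the DirFileSystem at ./my_data
fs = filesystem("./my_data")
# Relative paths resolve against the base directory, so this writes
# ./my_data/example.txt rather than ./example.txt
with fs.open("example.txt", "w") as f:
    f.write("scoped to my_data")
print(fs.ls("."))  # lists the contents of ./my_data, not the working directory
|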
Working with Cloud Storage
Now let's configure cloud storage. We'll use environment variables for credentials:
| from fsspeckit.storage_options import storage_options_from_env
from fsspeckit.core.filesystem import filesystem
# Set credentials here for demonstration only; in practice, export them in your shell or deployment environment
import os
os.environ["AWS_ACCESS_KEY_ID"] = "your_access_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_secret_key"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
# Load AWS options from environment
aws_options = storage_options_from_env("s3")
fs = filesystem("s3", storage_options=aws_options.to_dict())
print(f"Created S3 filesystem in region: {aws_options.region}")
|
You can also configure storage manually:
| from fsspeckit.storage_options import AwsStorageOptions
# Configure AWS S3
aws_options = AwsStorageOptions(
    region="us-east-1",
    access_key_id="YOUR_ACCESS_KEY",
    secret_access_key="YOUR_SECRET_KEY"
)
# Create filesystem
aws_fs = aws_options.to_filesystem()
|
Protocol Inference
The filesystem() function can automatically detect protocols from URIs:
| # Auto-detect protocols
s3_fs = filesystem("s3://bucket/path") # S3
gcs_fs = filesystem("gs://bucket/path") # Google Cloud Storage
az_fs = filesystem("az://container/path") # Azure Blob Storage
github_fs = filesystem("github://owner/repo") # GitHub
# All work with the same interface
for name, fs in [("S3", s3_fs), ("GCS", gcs_fs)]:
try:
files = fs.ls("/")
print(f"{name} files: {len(files)}")
except Exception as e:
print(f"{name} error: {e}")
|
Your First Dataset Operation
Let's perform a basic dataset operation using the DuckDB Parquet Handler:
| from fsspeckit.datasets import DuckDBParquetHandler
import polars as pl
# Initialize handler with storage options
storage_options = {"key": "value", "secret": "secret"}
handler = DuckDBParquetHandler(storage_options=storage_options)
# Create sample data
data = pl.DataFrame({
    "id": [1, 2, 3, 4],
    "category": ["A", "B", "A", "B"],
    "value": [10.5, 20.3, 15.7, 25.1]
})
# Write dataset
handler.write_parquet_dataset(data, "s3://bucket/my-dataset/")
# Execute SQL queries
result = handler.execute_sql("""
    SELECT category, COUNT(*) as count, AVG(value) as avg_value
    FROM parquet_scan('s3://bucket/my-dataset/')
    GROUP BY category
""")
print(result)
|
Domain Package Structure
fsspeckit is organized into domain-specific packages. Import from the appropriate package for your use case:
| # Filesystem creation and core functionality
from fsspeckit.core.filesystem import filesystem
# Storage configuration
from fsspeckit.storage_options import AwsStorageOptions, storage_options_from_env
# Dataset operations
from fsspeckit.datasets import DuckDBParquetHandler
# SQL filter translation
from fsspeckit.sql.filters import sql2pyarrow_filter, sql2polars_filter
# Common utilities
from fsspeckit.common.misc import run_parallel
from fsspeckit.common.types import dict_to_dataframe
# Backwards compatibility (legacy)
from fsspeckit.utils import DuckDBParquetHandler # Still works
|
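The SQL filter helpers imported above are not exercised elsewhere in this tutorial, so here is a rough sketch of the idea: translating a SQL WHERE clause into framework-native filter expressions. The call shapes are assumptions (the PyArrow variant is assumed to need a schema to resolve column types); check the API reference for the exact signatures:
| import pyarrow as pa
from fsspeckit.sql.filters import sql2pyarrow_filter, sql2polars_filter

# Assumed call shapes: a SQL predicate string in, a framework-native
# filter expression out
schema = pa.schema([("category", pa.string()), ("value", pa.float64())])
arrow_expr = sql2pyarrow_filter("category = 'A' AND value > 12", schema)
polars_expr = sql2polars_filter("category = 'A' AND value > 12")

print(arrow_expr)   # intended for pyarrow dataset/table filtering
print(polars_expr)  # intended for polars DataFrame.filter()
|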
Next Steps
Congratulations! You've completed the basic fsspeckit tutorial. Here are some recommended next steps:
Explore More Features
Common Use Cases
- Cloud Data Processing: Use storage_options_from_env() for production deployments
- Dataset Operations: Use DuckDBParquetHandler for large-scale Parquet operations
- SQL Filtering: Use sql2pyarrow_filter() and sql2polars_filter() for cross-framework compatibility
- Safe Operations: Use DirFileSystem for security-critical applications
- Performance: Use run_parallel() for concurrent file processing (see the sketch below)
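A rough sketch of the run_parallel() idea, assuming it accepts a callable plus an iterable of inputs and returns the collected results (check the API reference for the exact signature and keyword arguments):
| from fsspeckit.core.filesystem import filesystem
from fsspeckit.common.misc import run_parallel

fs = filesystem("file")

def file_size(path: str) -> int:
    # size() is a standard fsspec filesystem method
    return fs.size(path)

paths = fs.ls("./my_data/")
# Assumed call shape: run_parallel(func, iterable) -> list of results
sizes = run_parallel(file_size, paths)
print(dict(zip(paths, sizes)))
|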
Production Tips
- Use Domain Packages: Import from fsspeckit.datasets, fsspeckit.storage_options, etc. instead of the legacy utils package
- Environment Configuration: Load credentials from environment variables in production
- Error Handling: Always wrap remote filesystem operations in try-except blocks (see the sketch below)
- Type Safety: Use structured StorageOptions classes instead of raw dictionaries
- Testing: Use LocalStorageOptions and DirFileSystem for isolated test environments
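For the error-handling tip, a minimal pattern using the placeholder bucket from the earlier examples:
| from fsspeckit.core.filesystem import filesystem

# "bucket/my-dataset" is a placeholder, as in the earlier examples
fs = filesystem("s3://bucket/my-dataset")

# Remote calls can fail for transient reasons (network, permissions,
# missing paths), so keep the try-except close to the operation
try:
    files = fs.ls("/")
except FileNotFoundError:
    files = []
except Exception as e:
    raise RuntimeError(f"Listing the remote dataset failed: {e}") from e

print(f"Found {len(files)} files")
|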
For more detailed information, explore the other sections of the documentation.