Getting Started
This tutorial will walk you through your first steps with fsspeckit. You'll learn how to install the library, work with local and cloud storage, and perform basic dataset operations.
Prerequisites
- Python 3.11 or higher
- Basic familiarity with Python and data concepts
Installation
First, install fsspeckit with the dependencies you need:
| # Basic installation
pip install fsspeckit
# With cloud storage support
pip install "fsspeckit[aws,gcp,azure]"
# With all optional dependencies for data processing
pip install "fsspeckit[aws,gcp,azure]" duckdb pyarrow polars sqlglot
|
For detailed installation instructions, see the Installation Guide.
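To confirm the installation, you can import the package and print its version. This is a minimal check; the __version__ attribute is assumed here and may not be present in every build, so getattr is used defensively:
| import fsspeckit
# __version__ is assumed to exist at the package level; a successful import
# alone already confirms the installation
print(getattr(fsspeckit, "__version__", "installed"))
|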
Your First Local Filesystem
Let's start by creating a local filesystem and performing basic operations:
| from fsspeckit.core.filesystem import filesystem
import os
# Create a local filesystem
# Note: filesystem() wraps the filesystem in DirFileSystem by default (dirfs=True)
# for path safety, confining all operations to the specified directory
fs = filesystem("file")
# Define a directory path
local_dir = "./my_data/"
os.makedirs(local_dir, exist_ok=True)
# Create and write a file
with fs.open(f"{local_dir}example.txt", "w") as f:
f.write("Hello, fsspeckit!")
# Read the file
with fs.open(f"{local_dir}example.txt", "r") as f:
content = f.read()
print(f"Content: {content}")
# List files in directory
files = fs.ls(local_dir)
print(f"Files: {files}")
|
Path Safety: The filesystem() function wraps filesystems in DirFileSystem by default (dirfs=True), which confines all operations to the specified directory path. This prevents accidental access to paths outside the intended directory.
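To see that scoping in action, here is a minimal sketch. It assumes filesystem() also accepts a local directory path and roots the DirFileSystem there, mirroring the URI-based examples in the Protocol Inference section below; check the API reference if your version differs:
| from fsspeckit.core.filesystem import filesystem
# Assumption: passing a local directory roots the DirFileSystem at ./my_data
fs = filesystem("./my_data")
# Relative paths resolve against the base directory, so this writes
# ./my_data/example.txt rather than ./example.txt
with fs.open("example.txt", "w") as f:
    f.write("scoped to my_data")
print(fs.ls("."))  # lists the contents of ./my_data, not the working directory
|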
Working with Cloud Storage
Now let's configure cloud storage. We'll use environment variables for credentials:
| from fsspeckit.storage_options import storage_options_from_env
from fsspeckit.core.filesystem import filesystem
# Set credentials here for demonstration only; in practice, export them in your shell or deployment environment
import os
os.environ["AWS_ACCESS_KEY_ID"] = "your_access_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_secret_key"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
# Load AWS options from environment
aws_options = storage_options_from_env("s3")
fs = filesystem("s3", storage_options=aws_options.to_dict())
print(f"Created S3 filesystem in region: {aws_options.region}")
|
You can also configure storage manually:
| from fsspeckit.storage_options import AwsStorageOptions
# Configure AWS S3
aws_options = AwsStorageOptions(
    region="us-east-1",
    access_key_id="YOUR_ACCESS_KEY",
    secret_access_key="YOUR_SECRET_KEY"
)
# Create filesystem
aws_fs = aws_options.to_filesystem()
|
Protocol Inference
The filesystem() function can automatically detect protocols from URIs:
| # Auto-detect protocols
s3_fs = filesystem("s3://bucket/path") # S3
gcs_fs = filesystem("gs://bucket/path") # Google Cloud Storage
az_fs = filesystem("az://container/path") # Azure Blob Storage
github_fs = filesystem("github://owner/repo") # GitHub
# All work with the same interface
for name, fs in [("S3", s3_fs), ("GCS", gcs_fs)]:
try:
files = fs.ls("/")
print(f"{name} files: {len(files)}")
except Exception as e:
print(f"{name} error: {e}")
|
Your First Dataset Operation
Let's perform a basic dataset operation using the DuckDB Parquet Handler:
| from fsspeckit.datasets import DuckDBParquetHandler
import polars as pl
# Initialize handler with storage options
storage_options = {"key": "value", "secret": "secret"}
handler = DuckDBParquetHandler(storage_options=storage_options)
# Create sample data
data = pl.DataFrame({
    "id": [1, 2, 3, 4],
    "category": ["A", "B", "A", "B"],
    "value": [10.5, 20.3, 15.7, 25.1]
})
# Write dataset
handler.write_parquet_dataset(data, "s3://bucket/my-dataset/")
# Execute SQL queries
result = handler.execute_sql("""
    SELECT category, COUNT(*) as count, AVG(value) as avg_value
    FROM parquet_scan('s3://bucket/my-dataset/')
    GROUP BY category
""")
print(result)
|
Domain Package Structure
fsspeckit is organized into domain-specific packages. Import from the appropriate package for your use case:
| # Filesystem creation and core functionality
from fsspeckit.core.filesystem import filesystem
# Storage configuration
from fsspeckit.storage_options import AwsStorageOptions, storage_options_from_env
# Dataset operations
from fsspeckit.datasets import DuckDBParquetHandler
# SQL filter translation
from fsspeckit.sql.filters import sql2pyarrow_filter, sql2polars_filter
# Common utilities
from fsspeckit.common.misc import run_parallel
from fsspeckit.common.types import dict_to_dataframe
# Backwards compatibility (legacy)
from fsspeckit.utils import DuckDBParquetHandler # Still works
|
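The SQL filter helpers imported above are not exercised elsewhere in this tutorial, so here is a rough sketch of the idea: translating a SQL WHERE clause into framework-native filter expressions. The call shapes are assumptions (the PyArrow variant is assumed to need a schema to resolve column types); check the API reference for the exact signatures:
| import pyarrow as pa
from fsspeckit.sql.filters import sql2pyarrow_filter, sql2polars_filter

# Assumed call shapes: a SQL predicate string in, a framework-native
# filter expression out
schema = pa.schema([("category", pa.string()), ("value", pa.float64())])
arrow_expr = sql2pyarrow_filter("category = 'A' AND value > 12", schema)
polars_expr = sql2polars_filter("category = 'A' AND value > 12")

print(arrow_expr)   # intended for pyarrow dataset/table filtering
print(polars_expr)  # intended for polars DataFrame.filter()
|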
Next Steps
Congratulations! You've completed the basic fsspeckit tutorial. Here are some recommended next steps:
Explore More Features
Common Use Cases
- Cloud Data Processing: Use storage_options_from_env() for production deployments
- Dataset Operations: Use DuckDBParquetHandler for large-scale Parquet operations
- SQL Filtering: Use sql2pyarrow_filter() and sql2polars_filter() for cross-framework compatibility
- Safe Operations: Use DirFileSystem for security-critical applications
- Performance: Use run_parallel() for concurrent file processing (see the sketch below)
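A rough sketch of the run_parallel() idea, assuming it accepts a callable plus an iterable of inputs and returns the collected results (check the API reference for the exact signature and keyword arguments):
| from fsspeckit.core.filesystem import filesystem
from fsspeckit.common.misc import run_parallel

fs = filesystem("file")

def file_size(path: str) -> int:
    # size() is a standard fsspec filesystem method
    return fs.size(path)

paths = fs.ls("./my_data/")
# Assumed call shape: run_parallel(func, iterable) -> list of results
sizes = run_parallel(file_size, paths)
print(dict(zip(paths, sizes)))
|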
Production Tips
- Use Domain Packages: Import from fsspeckit.datasets, fsspeckit.storage_options, etc. instead of the legacy utils package
- Environment Configuration: Load credentials from environment variables in production
- Error Handling: Always wrap remote filesystem operations in try-except blocks (see the sketch below)
- Type Safety: Use structured StorageOptions classes instead of raw dictionaries
- Testing: Use LocalStorageOptions and DirFileSystem for isolated test environments
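For the error-handling tip, a minimal pattern using the placeholder bucket from the earlier examples:
| from fsspeckit.core.filesystem import filesystem

# "bucket/my-dataset" is a placeholder, as in the earlier examples
fs = filesystem("s3://bucket/my-dataset")

# Remote calls can fail for transient reasons (network, permissions,
# missing paths), so keep the try-except close to the operation
try:
    files = fs.ls("/")
except FileNotFoundError:
    files = []
except Exception as e:
    raise RuntimeError(f"Listing the remote dataset failed: {e}") from e

print(f"Found {len(files)} files")
|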
For more detailed information, explore the other sections of the documentation.