User Guide¶
This guide explains how to structure Duckalog configuration files, common configuration patterns, secret management, and how to troubleshoot issues using the new config-driven connection management system.
New Primary Workflow: duckalog run¶
Duckalog provides a unified run command for building catalogs and running queries with intelligent connection management and session state restoration.
Key Benefits¶
- Single Command: One command to connect, build, and query
- Smart Connection Management: Automatic connection pooling and reuse
- Session State Restoration: Pragmas, settings, and attachments are automatically restored
- Incremental Updates: Only missing views are created for faster builds
- Lazy Connections: Database connections are established only when needed
Basic Usage¶
# New primary workflow - single command
duckalog run config.yaml
# Interactive mode - opens SQL shell with catalog ready
duckalog run config.yaml --interactive
# Direct query execution
duckalog run config.yaml --query "SELECT COUNT(*) FROM users"
# Force rebuild of all views
duckalog run config.yaml --force-rebuild
Environment Variables and .env File Support¶
Duckalog provides comprehensive support for environment variables with automatic .env file loading, making configuration management simple and secure.
Automatic .env File Loading¶
Duckalog automatically discovers and loads .env files in your project hierarchy:
- Hierarchical Discovery: Searches for
.envfiles starting from your configuration file directory and moving upward - Zero Configuration: Works automatically without additional setup
- Security First: Sensitive data is never logged, and files are handled securely
- Graceful Handling: Missing or malformed
.envfiles don't break configuration loading
How .env Files Work¶
Create a .env file in your project directory:
# .env file
DATABASE_URL=postgres://user:pass@localhost:5432/mydb
API_KEY=secret123
FEATURE_ENABLED=true
Your configuration automatically has access to these variables:
# catalog.yaml - .env variables are automatically available
version: 1
duckdb:
database: "${env:DATABASE_URL}.duckdb"
pragmas:
- "SET api_key='${env:API_KEY}'"
- "SET feature_enabled='${env:FEATURE_ENABLED:false}'"
.env File Discovery¶
Duckalog searches for .env files in this order:
- Configuration file directory (highest priority)
- Parent directories (up to 10 levels)
- Current working directory (for remote configurations)
Example hierarchy:
project/
├── .env # Loaded (closest to config)
├── subdir/
│ ├── config.yaml # Uses parent .env
│ └── data/
└── another.env # Loaded if no closer .env found
Environment Variable Precedence¶
Variables are available in this priority order:
1. System environment variables (highest priority)
2. .env file variables (loaded automatically)
3. Default values in ${env:VAR:default} syntax
# System environment (highest priority)
export DATABASE_URL="system_db"
# .env file (medium priority)
DATABASE_URL="file_db"
# Configuration (gets "system_db" due to precedence)
# database: "${env:DATABASE_URL}" → "system_db"
Secret Management¶
Duckalog provides comprehensive secret management through the canonical SecretConfig model, ensuring secure credential handling without storing sensitive data in configuration files.
Environment Variable Integration¶
All sensitive configuration uses environment variables for security:
# Basic environment variable usage
duckdb:
database: catalog.duckdb
pragmas:
- "SET s3_access_key_id='${env:AWS_ACCESS_KEY_ID}'"
- "SET s3_secret_access_key='${env:AWS_SECRET_ACCESS_KEY}'"
- "SET azure_connection_string='${env:AZURE_CONNECTION_STRING}'"
attachments:
postgres:
- host: "${env:PG_HOST}"
user: "${env:PG_USER}"
password: "${env:PG_PASSWORD}"
Canonical Secret Configuration¶
Define secrets within the duckdb section and reference them implicitly across views:
duckdb:
database: catalog.duckdb
install_extensions:
- httpfs
secrets:
- name: s3_prod
type: s3
key_id: "${env:AWS_ACCESS_KEY_ID}"
secret: "${env:AWS_SECRET_ACCESS_KEY}"
region: "us-east-1"
endpoint: "https://s3.amazonaws.com" # Protocol automatically stripped for DuckDB compatibility
persistent: true # NEW: Persist secrets for session restoration
- name: azure_storage
type: azure
connection_string: "${env:AZURE_STORAGE_CONNECTION_STRING}"
account_name: "${env:AZURE_STORAGE_ACCOUNT}"
persistent: false # Default: temporary secrets
- name: gcs_service
type: gcs
service_account_key: "${env:GCS_SERVICE_ACCOUNT_JSON}"
persistent: true # Persist across session restarts
- name: http_api
type: http
bearer_token: "${env:API_BEARER_TOKEN}"
header: "Authorization"
persistent: false # Temporary token for security
views:
- name: production_data
source: parquet
uri: "s3://prod-bucket/data/*.parquet"
description: "Production data using S3 secret"
- name: azure_logs
source: parquet
uri: "abfs://logcontainer@storageaccount.dfs.core.windows.net/logs/*.parquet"
description: "Azure logs using storage secret"
- name: reference_tables
source: iceberg
catalog: main_catalog
table: analytics.reference_data
description: "Reference data using GCS secret"
Secret Types and DuckDB Integration¶
Each secret type maps to appropriate DuckDB CREATE SECRET statements:
S3 Secrets:
CREATE SECRET s3_prod (
TYPE S3,
KEY_ID 'AKIA...',
SECRET 'secret...',
REGION 'us-east-1',
ENDPOINT 'https://s3.amazonaws.com'
);
Azure Secrets:
PostgreSQL Secrets:
CREATE SECRET postgres_creds (
TYPE POSTGRES,
CONNECTION_STRING 'postgresql://user:pass@host:5432/db'
);
New: Secret Persistence¶
The new connection management system supports persistent secrets that are stored securely and restored across sessions:
duckdb:
secrets:
- name: production_s3
type: s3
key_id: "${env:AWS_ACCESS_KEY_ID}"
secret: "${env:AWS_SECRET_ACCESS_KEY}"
region: "us-west-2"
persistent: true # Secret persisted for session restoration
Benefits of Persistent Secrets: - Session Restoration: Secrets automatically available when reconnecting - Improved Performance: Avoid recreating secrets on each connection - Security: Encrypted storage with access controls - Convenience: No need to re-enter credentials for interactive sessions
Security Best Practices¶
- Never commit secrets: Use
.envfiles or environment variables - Rotate credentials: Regularly update access keys and tokens
- Use principle of least privilege: Grant minimal required permissions
- Audit access: Monitor which secrets are used where
- NEW - Secret Persistence: Use
persistent: truefor long-lived secrets,falsefor temporary ones - NEW - Session Management: Be aware that persistent secrets survive session restarts
Related: - Architecture - Secret Management - How-to - Secrets Persistence
Remote Configuration¶
Duckalog supports loading configurations from remote sources using a unified filesystem interface:
Loading Remote Configurations¶
# Load a remote configuration
duckalog run s3://my-bucket/catalog.yaml \
--fs-key "${AWS_ACCESS_KEY_ID}" \
--fs-secret "${AWS_SECRET_ACCESS_KEY}" \
--aws-profile my-profile
# GCS configuration
duckalog run gs://my-bucket/catalog.yaml \
--gcs-credentials-file /path/to/service-account.json
# Azure Blob Storage
duckalog run abfs://account@container/catalog.yaml \
--azure-connection-string "${AZURE_CONNECTION_STRING}"
# GitHub repository
duckalog run github://user/repo/catalog.yaml \
--fs-token "${GITHUB_TOKEN}"
# SFTP server
duckalog run sftp://server/path/catalog.yaml \
--sftp-host server.com \
--sftp-key-file ~/.ssh/id_rsa
# HTTP/HTTPS
duckalog run https://example.com/catalog.yaml \
--fs-token "${API_TOKEN}"
# Interactive remote sessions
duckalog run s3://my-bucket/catalog.yaml --interactive
# Query remote configurations directly
duckalog run s3://my-bucket/catalog.yaml --query "SELECT * FROM analytics_table"
Shared Filesystem Architecture¶
All remote access options are centralized through the CLI's shared filesystem handler:
- Protocol Detection: Automatic protocol inference from options
- Credential Management: Secure handling of authentication details
- Context Management: Filesystem objects shared across commands
- Error Handling: Descriptive error messages for connection issues
Configuration structure¶
At a high level, a config looks like this:
version: 1
duckdb:
database: catalog.duckdb
pragmas:
- "SET memory_limit='1GB'"
# DuckDB database with secrets
duckdb:
database: unified_catalog.duckdb
install_extensions:
- httpfs
secrets:
- name: s3_access
type: s3
key_id: "${env:AWS_ACCESS_KEY_ID}"
secret: "${env:AWS_SECRET_ACCESS_KEY}"
region: "us-east-1"
- name: postgres_creds
type: postgres
connection_string: "${env:POSTGRES_CONNECTION_STRING}"
attachments:
duckdb:
- alias: refdata
path: ./refdata.duckdb
read_only: true
sqlite:
- alias: legacy
path: ./legacy.db
postgres:
- alias: dw
host: "${env:PG_HOST}"
port: 5432
database: dw
user: "${env:PG_USER}"
password: "${env:PG_PASSWORD}"
iceberg_catalogs:
- name: main_ic
catalog_type: rest
uri: "https://iceberg-catalog.internal"
warehouse: "s3://my-warehouse/"
options:
token: "${env:ICEBERG_TOKEN}"
views:
- name: users
source: parquet
uri: "s3://my-bucket/data/users/*.parquet"
description: "User data from S3 using configured secret"
- name: events_delta
source: delta
uri: "s3://my-bucket/delta/events"
- name: ic_orders
source: iceberg
catalog: main_ic
table: analytics.orders
- name: ref_countries
source: duckdb
database: refdata
table: reference.countries
- name: vip_users
sql: |
SELECT *
FROM users
WHERE is_vip = TRUE
AND segment = 'premium'
Path Resolution¶
Duckalog automatically resolves relative file paths to absolute paths relative to the configuration file location. This ensures consistent behavior across different working directories while maintaining security.
How Path Resolution Works¶
- Automatic Detection: Duckalog detects whether paths are relative or absolute
- Relative Path Resolution: Resolves paths like
data/file.parquetagainst the config file's directory - Security Validation: Blocks dangerous directory traversal while allowing reasonable parent directory access
- Cross-Platform Support: Works correctly on Windows, macOS, and Linux
Configuration¶
Path resolution is enabled by default when loading configurations:
from duckalog import load_config
# Path resolution enabled (default)
config = load_config("catalog.yaml")
# Explicitly enable/disable resolution
config = load_config("catalog.yaml", resolve_paths=True)
config = load_config("catalog.yaml", resolve_paths=False)
Path Resolution Examples¶
Basic Relative Paths¶
Project Structure:
analytics/
├── catalog.yaml
├── data/
│ ├── events.parquet
│ └── users.parquet
└── databases/
└── reference.duckdb
Configuration:
version: 1
duckdb:
database: catalog.duckdb
attachments:
duckdb:
- alias: refdata
path: ./databases/reference.duckdb # Resolved to absolute path
read_only: true
views:
- name: events
source: parquet
uri: data/events.parquet # Resolved relative to catalog.yaml
- name: users
source: parquet
uri: ./data/users.parquet # Explicit relative path
Parent Directory Access¶
Project Structure:
company/
├── shared/
│ └── reference_data/
│ └── customers.parquet
└── analytics/
└── catalog.yaml
└── data/
└── events.parquet
Configuration (company/analytics/catalog.yaml):
version: 1
views:
- name: events
source: parquet
uri: ./data/events.parquet # Resolved to /company/analytics/data/events.parquet
- name: customers
source: parquet
uri: ../shared/reference_data/customers.parquet # Resolved to /company/shared/reference_data/customers.parquet
Mixed Path Types¶
version: 1
views:
# Relative path - resolved automatically
- name: local_data
source: parquet
uri: ./data/local.parquet
# Absolute path - unchanged
- name: absolute_data
source: parquet
uri: /absolute/path/data.parquet
# Remote URI - unchanged
- name: remote_data
source: parquet
uri: s3://my-bucket/data/remote.parquet
Security Features¶
Path resolution includes comprehensive security validation:
Allowed Patterns¶
# ✅ ALLOWED - Same directory
uri: data/file.parquet
# ✅ ALLOWED - Parent directory access (reasonable levels)
uri: ../shared/data.parquet
uri: ../../project/common.parquet
# ✅ ALLOWED - Subdirectories
uri: ./subdir/nested/file.parquet
Blocked Patterns¶
# ❌ BLOCKED - Excessive parent directory traversal
uri: ../../../../etc/passwd
# ❌ BLOCKED - System directory access
uri: ../etc/config.parquet
uri: ../../usr/local/data.parquet
Troubleshooting Path Resolution¶
Common Issues¶
Path resolution failed:
Solutions:
- Reduce the number of parent directory traversals (../)
- Avoid system directories (/etc/, /usr/, etc.)
- Use relative paths within reasonable bounds
File not found after resolution:
Solutions: - Verify the resolved path points to an existing file - Check file permissions - Ensure the file is not a directory
Debugging¶
from duckalog import generate_sql
# Generate SQL to see resolved paths
sql = generate_sql("catalog.yaml")
print(sql) # Shows absolute paths after resolution
For complete details, see the Path Resolution Guide.
Environment variables¶
Any string may contain ${env:VAR_NAME} placeholders. Duckalog resolves these
using the process environment before validation. If a variable is missing, a
ConfigError is raised.
Example:
duckdb:
pragmas:
- "SET s3_access_key_id='${env:AWS_ACCESS_KEY_ID}'"
- "SET s3_secret_access_key='${env:AWS_SECRET_ACCESS_KEY}'"
Attachments¶
Attachments let you expose tables from other databases inside DuckDB.
- DuckDB attachments: attach additional
.duckdbfiles. - SQLite attachments: attach local SQLite databases.
- Postgres attachments: connect to external Postgres instances.
Views that use attached databases set source to duckdb, sqlite, or
postgres and provide database (attachment alias) and table.
Iceberg catalogs¶
Iceberg catalogs are configured under iceberg_catalogs. Iceberg views can
either:
- Use a
uridirectly, or - Refer to a
catalog+tablecombination.
Duckalog validates that any catalog used by a view is defined in
iceberg_catalogs.
Common Configuration Patterns¶
Parquet-only configuration¶
For simple cases where you only need to create views over Parquet files:
version: 1
duckdb:
database: catalog.duckdb
pragmas:
- "SET memory_limit='1GB'"
- "SET s3_region='us-west-2'"
views:
- name: sales_data
source: parquet
uri: "s3://data-bucket/sales/*.parquet"
- name: customer_data
source: parquet
uri: "s3://data-bucket/customers/*.parquet"
- name: sales_by_customer
sql: |
SELECT
c.customer_id,
c.name,
c.region,
SUM(s.amount) as total_sales,
COUNT(s.order_id) as order_count
FROM customer_data c
JOIN sales_data s ON c.customer_id = s.customer_id
GROUP BY c.customer_id, c.name, c.region
Attachments-only configuration¶
For joining data from existing databases:
version: 1
duckdb:
database: unified_catalog.duckdb
attachments:
duckdb:
- alias: analytics
path: ./analytics_db.duckdb
read_only: true
- alias: warehouse
path: ./warehouse_db.duckdb
read_only: true
sqlite:
- alias: legacy
path: ./legacy_system.db
views:
- name: user_profiles
source: duckdb
database: analytics
table: users
- name: user_orders
source: duckdb
database: warehouse
table: orders
- name: legacy_customers
source: sqlite
database: legacy
table: customers
- name: complete_customer_view
sql: |
SELECT
p.user_id,
p.name,
p.email,
o.total_orders,
o.total_spent,
l.rating as legacy_rating
FROM user_profiles p
JOIN user_orders o ON p.user_id = o.user_id
LEFT JOIN legacy_customers l ON p.user_id = l.id
Iceberg-only configuration¶
For working exclusively with Iceberg tables:
version: 1
duckdb:
database: iceberg_catalog.duckdb
iceberg_catalogs:
- name: prod_catalog
catalog_type: rest
uri: "https://iceberg-catalog.company.com"
warehouse: "s3://data-warehouse/production/"
options:
token: "${env:ICEBERG_PROD_TOKEN}"
- name: staging_catalog
catalog_type: rest
uri: "https://iceberg-staging.company.com"
warehouse: "s3://data-warehouse/staging/"
options:
token: "${env:ICEBERG_STAGING_TOKEN}"
views:
- name: production_customers
source: iceberg
catalog: prod_catalog
table: analytics.customers
- name: staging_customers
source: iceberg
catalog: staging_catalog
table: analytics.customers
- name: customer_comparison
sql: |
SELECT
COALESCE(p.id, s.id) as customer_id,
p.name as prod_name,
s.name as staging_name,
p.updated_at as prod_updated,
s.updated_at as staging_updated
FROM production_customers p
FULL OUTER JOIN staging_customers s ON p.id = s.id
Multi-source configuration¶
This example combines multiple data sources in a single catalog:
version: 1
duckdb:
database: unified_analytics.duckdb
pragmas:
- "SET memory_limit='4GB'"
- "SET threads=4"
- "SET s3_region='us-east-1'"
attachments:
duckdb:
- alias: reference
path: ./reference_data.duckdb
read_only: true
iceberg_catalogs:
- name: data_lake
catalog_type: rest
uri: "https://iceberg.data-lake.internal"
warehouse: "s3://enterprise-data-lake/"
options:
token: "${env:ICEBERG_TOKEN}"
views:
# Reference data from attached database
- name: user_segments
source: duckdb
database: reference
table: user_segments
# Raw events from S3
- name: raw_events
source: parquet
uri: "s3://events-bucket/raw/*.parquet"
# Processed events from Iceberg
- name: processed_events
source: iceberg
catalog: data_lake
table: analytics.processed_events
# Unified analytics view
- name: analytics_events
sql: |
SELECT
e.event_id,
e.timestamp,
e.user_id,
e.event_type,
e.properties,
us.segment_name as user_segment,
us.tier as user_tier
FROM raw_events e
LEFT JOIN user_segments us ON e.user_id = us.user_id
WHERE e.timestamp >= CURRENT_DATE - INTERVAL 30 DAYS
Error Handling¶
Duckalog provides a comprehensive exception hierarchy to help you handle errors gracefully in your applications.
Exception Hierarchy¶
All Duckalog exceptions inherit from DuckalogError, making it easy to catch all library errors:
from duckalog import DuckalogError, load_config, build_catalog
try:
config = load_config("catalog.yaml")
build_catalog(config)
except DuckalogError as e:
# Handle any Duckalog-specific error
print(f"Duckalog error: {e}")
Specific Exception Types¶
For more targeted error handling, catch specific exception types:
from duckalog import (
ConfigError, # Configuration issues
EngineError, # Database/build failures
PathResolutionError, # Path resolution problems
RemoteConfigError, # Remote config loading failures
SQLFileError, # SQL file processing issues
)
try:
config = load_config("catalog.yaml")
except ConfigError as e:
print(f"Configuration error: {e}")
except PathResolutionError as e:
print(f"Path resolution failed: {e}")
print(f"Original path: {e.original_path}")
print(f"Resolved path: {e.resolved_path}")
except DuckalogError as e:
print(f"Other Duckalog error: {e}")
Best Practices¶
- Catch specific exceptions first: Handle the most specific errors first, then catch more general ones
- Use exception chaining: Duckalog preserves original exceptions to help with debugging
- Log with context: Include relevant information when logging errors
- Validate early: Use
validate_config()to catch configuration errors before building
import logging
from duckalog import ConfigError, EngineError, validate_config, build_catalog
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
def create_catalog(config_path):
try:
# Validate configuration first
validate_config(config_path)
logger.info(f"Configuration {config_path} is valid")
# Build catalog
build_catalog(config_path)
logger.info("Catalog built successfully")
except ConfigError as e:
logger.error(f"Configuration error in {config_path}: {e}")
# Handle configuration issues (missing files, invalid syntax, etc.)
except EngineError as e:
logger.error(f"Engine error building catalog: {e}")
# Handle database issues (connection failures, SQL errors, etc.)
except Exception as e:
logger.error(f"Unexpected error: {e}")
raise # Re-raise unexpected errors
Troubleshooting¶
Common errors¶
1. Invalid configuration syntax¶
Error message: ConfigError: Invalid configuration file
Solutions:
- Validate YAML syntax first with yamllint or an online YAML validator
- Ensure proper indentation (YAML is sensitive to spaces)
- Check for unescaped special characters in strings
2. Missing environment variables¶
Error message: ConfigError: Environment variable 'AWS_ACCESS_KEY_ID' not found
Solutions:
- Check that all env:VAR_NAME variables are set
- Use duckalog validate config.yaml to check environment variables without building
- Consider using a .env file with export commands or a tool like direnv
3. Connection failures¶
Error message: DuckDB Error: Invalid Input Error: Failed to open file
Solutions: - Check file paths in attachments section - Ensure credentials for cloud storage are correct - Verify network connectivity and firewall rules - For local files, check file permissions
4. SQL errors in views¶
Error message: Parser Error: syntax error at or near "JOIN"
Solutions:
- Validate SQL syntax with duckalog generate-sql config.yaml before building
- Check that all referenced views and tables exist
- Verify column names match across join conditions
- Use proper DuckDB SQL syntax (similar to PostgreSQL)
5. Iceberg catalog issues¶
Error message: Invalid Input Error: Can't load extension iceberg
Solutions:
- Ensure DuckDB version supports Iceberg extensions
- Install required extensions: duckdb.install_extension("iceberg")
- Check catalog configuration and credentials
- Verify Iceberg catalog server is accessible
Debugging tips¶
- Validate before building: Always run
duckalog validate config.yamlfirst - Generate SQL first: Use
duckalog generate-sql config.yamlto inspect generated SQL - Start simple: Begin with a single view and gradually add complexity
- Use environment check: Create a simple script to verify required environment variables
- Check logs: Run with verbose logging to see detailed error information
Best Practices¶
- Separate configurations: Use different configs for different environments
- Version control: Keep your configs in Git alongside your code
- Modular views: Break complex SQL into multiple simpler views when possible
- Naming conventions: Use consistent naming for views and attachments
- Security: Never commit secrets to version control - use environment variables
Next steps¶
- Use
duckalog generate-sqlto inspect the SQL that will be executed. - Use the API reference for details on each public function and model.
- Check out the examples in the documentation for more advanced use cases.