Getting Started with Duckalog¶

Learn Duckalog from scratch with no prior experience. This comprehensive tutorial takes you from basic concepts to building a complete analytics catalog.

Learning Objectives¶

After completing this tutorial, you will be able to:

Understand Duckalog fundamentals - Configuration, views, and data sources
Use new connection management - duckalog run command with intelligent session management
Create basic catalogs - Single Parquet file configurations
Add SQL transformations - Business logic and data filtering
Implement multi-source joins - Combine data from multiple sources
Use config imports - Organize configuration for maintainability
Build and query catalogs - End-to-end workflow with new primary workflow

Understanding the `duckalog run` Workflow¶

The duckalog run command is the primary way to build and interact with catalogs. It handles config loading, view creation, and querying in a single step with intelligent connection management.

# Build and query in one command
duckalog run config.yaml --query "SELECT..."    # Single query
duckalog run config.yaml --interactive          # Interactive shell

Key Benefits¶

Single Command: Build and query in one step
Session State: Pragmas, settings, and attachments are automatically restored
Incremental Updates: Only missing views are created for faster builds
Connection Reuse: Intelligent connection pooling and management
Lazy Loading: Database connections established only when needed

All examples in this tutorial use the duckalog run workflow.

Prerequisites¶

Python 3.12+ installed and accessible
Duckalog package installed:
```
pip install duckalog
```
Basic command line familiarity
Text editor for configuration files
No prior Duckalog experience required

Step 1: Single Parquet File Catalog¶

Create your first Duckalog configuration with a single Parquet data source.

1.1 Create Sample Data¶

First, let's create some sample data to work with:

# Create project directory
mkdir duckalog-tutorial
cd duckalog-tutorial

# Create data directory
mkdir data

# Create sample Parquet data
python3 -c "
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Sample users data
users_data = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'username': ['alice', 'bob', 'charlie', 'diana', 'eve'],
    'email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'diana@example.com', 'eve@example.com'],
    'created_at': ['2023-01-01', '2023-01-15', '2023-02-01', '2023-02-15', '2023-03-01'],
    'status': ['active', 'active', 'active', 'inactive', 'active']
})

# Sample orders data
orders_data = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106],
    'user_id': [1, 2, 1, 3, 2, 4],
    'amount': [29.99, 49.99, 19.99, 89.99, 39.99, 59.99],
    'order_date': ['2023-01-02', '2023-01-16', '2023-01-03', '2023-02-02', '2023-02-16', '2023-03-02'],
    'status': ['completed', 'completed', 'completed', 'completed', 'completed', 'completed']
})

# Save as Parquet files
pq.write_table(users_data, 'data/users.parquet')
pq.write_table(orders_data, 'data/orders.parquet')

print('Sample data created successfully!')
"

1.2 Create Basic Configuration¶

Create your first Duckalog configuration file:

# step1_basic.yaml
version: 1

duckdb:
  database: tutorial.duckdb
  # Basic performance settings
  pragmas:
    - "SET memory_limit='1GB'"
    - "SET threads=2"

views:
  - name: users
    source: parquet
    uri: "data/users.parquet"
    description: "User data from sample dataset"

1.3 Build and Test¶

Build your first catalog and verify it works using the run command:

# NEW: Build and connect in one command
duckalog run step1_basic.yaml --interactive

# In the interactive shell, run these commands:
SHOW TABLES;
SELECT * FROM users LIMIT 5;
DESCRIBE users;
exit

# OR: Direct query mode
duckalog run step1_basic.yaml --query "SELECT * FROM users LIMIT 5"

duckalog run step1_basic.yaml
duckdb tutorial.duckdb -c "
SHOW TABLES;
SELECT * FROM users LIMIT 5;
DESCRIBE users;
"

Expected Output: - Database file: tutorial.duckdb - View: users with 5 records - Successful query execution

1.4 Understanding What Happened¶

Let's break down what the new run command did:

Read Configuration: Parsed step1_basic.yaml
Smart Connection: Connected to/created tutorial.duckdb with intelligent connection management
Applied Pragmas: Set memory limit and threads (automatically stored for session restoration)
Created View: Executed CREATE VIEW users AS SELECT * FROM 'data/users.parquet' (incremental update)
Session Management: Stored session state (pragmas, settings) for future connections
Validated: Ensured view was created successfully and connection is ready

Key Benefits: - Single Command: Build and query in one step - Session State: Pragmas and settings are automatically restored on reconnection - Incremental Updates: Only missing views are created for faster subsequent runs - Connection Reuse: Database connections are managed efficiently

Step 2: Environment Variables and .env Files¶

Learn how to use environment variables and .env files for secure, portable configuration.

2.1 Why Use Environment Variables?¶

Environment variables help you: - Keep secrets secure - No hardcoded passwords in configuration files - Support multiple environments - Different values for dev, staging, prod - Improve portability - Same config works everywhere with different variables - Follow security best practices - Separate configuration from code

2.2 Creating Your First .env File¶

Create a .env file in your project directory:

# Create .env file
cat > .env << EOF
# Database configuration
DB_HOST=localhost
DB_PORT=5432
DB_NAME=tutorial
DB_USER=postgres
DB_PASSWORD=your_password

# Application settings
MEMORY_LIMIT=1GB
THREAD_COUNT=2
ENVIRONMENT=development

# File paths
DATA_PATH=./data
CATALOG_NAME=tutorial
EOF

echo "✅ Created .env file with environment variables"

2.3 Using .env Variables in Configuration¶

Create a new configuration that uses the .env variables:

# step2_env_vars.yaml
version: 1

duckdb:
  database: "${env:CATALOG_NAME:tutorial}.duckdb"
  pragmas:
    - "SET memory_limit='${env:MEMORY_LIMIT:512MB}'"
    - "SET threads='${env:THREAD_COUNT:2}'"
    - "SET environment='${env:ENVIRONMENT:development}'"

views:
  - name: users
    source: parquet
    uri: "${env:DATA_PATH:./data}/users.parquet"
    description: "User data from ${env:DATA_PATH:./data}/users.parquet"

Note the syntax: - ${env:VARIABLE_NAME} - Use environment variable - ${env:VARIABLE_NAME:default_value} - Use variable with default value

2.4 Building with Environment Variables¶

# NEW: Build and connect with .env variables (automatically loaded)
duckalog run step2_env_vars.yaml --verbose --query "SELECT * FROM users LIMIT 3"

# Check the verbose output for .env loading:
# Loading .env files {'config_path': 'step2_env_vars.yaml', 'file_count': 1}
# Loaded .env file {'file_path': '/path/to/.env', 'var_count': 8}
# Completed .env file loading {'total_files': 1}

# Interactive mode with environment variables
duckalog run step2_env_vars.yaml --interactive
# In shell: SHOW TABLES; SELECT * FROM users LIMIT 3; exit

duckalog run step2_env_vars.yaml --verbose
duckdb tutorial.duckdb -c "
SHOW TABLES;
SELECT * FROM users LIMIT 3;
"

2.5 Testing Variable Resolution¶

Create a simple test to verify your environment variables:

# test_env.yaml
version: 1

test_section:
  db_host: "${env:DB_HOST:not_set}"
  memory_limit: "${env:MEMORY_LIMIT:not_set}"
  data_path: "${env:DATA_PATH:not_set}"
  missing_var: "${env:DOES_NOT_EXIST:default_value}"

# Test environment variable resolution
duckalog validate test_env.yaml

# Set a missing variable and test again
export DOES_NOT_EXIST="now_it_works"
duckalog validate test_env.yaml

2.6 Environment-Specific Configuration¶

Create different .env files for different environments:

# .env.development (for local development)
cat > .env.development << EOF
ENVIRONMENT=development
MEMORY_LIMIT=512MB
THREAD_COUNT=2
DATA_PATH=./data
EOF

# .env.production (for production deployment)
cat > .env.production << EOF
ENVIRONMENT=production
MEMORY_LIMIT=4GB
THREAD_COUNT=8
DATA_PATH=/data/production
EOF

# Use specific environment
source .env.development
duckalog run step2_env_vars.yaml  # Uses development values

source .env.production
duckalog run step2_env_vars.yaml  # Uses production values

# Test specific environments with queries
duckalog run step2_env_vars.yaml --query "SELECT current_setting('memory_limit')"

2.7 Understanding .env File Discovery¶

Duckalog automatically discovers .env files in this order:

Configuration file directory (highest priority)
Parent directories (up to 10 levels)
Current working directory

# Test .env discovery
mkdir -p subdir/configs
cp .env subdir/configs/

# Both configurations will find and use the .env file:
duckalog run step2_env_vars.yaml                    # Uses ./.env
duckalog run subdir/configs/step2_env_vars.yaml     # Uses subdir/configs/.env

2.8 Security Best Practices¶

# 1. Never commit .env files to version control
echo ".env*" >> .gitignore

# 2. Use secure file permissions
chmod 600 .env

# 3. Use different .env files for different environments
ls -la .env*

# 4. Validate your .env files don't contain secrets
grep -iE "(password|secret|key)" .env || echo "✅ No obvious secrets found"

2.9 Understanding Environment Variable Concepts¶

Key concepts you learned:

Automatic .env Loading: Duckalog discovers and loads .env files automatically
Variable Interpolation: ${env:VAR_NAME} syntax for environment variables
Default Values: ${env:VAR:default} provides fallback values
Hierarchical Discovery: Searches parent directories for .env files
Security: Keep .env files out of version control
Environment Isolation: Different .env files for different environments

Step 3: Adding SQL Transformations¶

Add business logic and data transformations using SQL views.

3.1 Create Transformed Views¶

Extend your configuration with SQL transformations:

# step3_transformations.yaml
version: 1

duckdb:
  database: "${env:CATALOG_NAME:tutorial}.duckdb"
  pragmas:
    - "SET memory_limit='${env:MEMORY_LIMIT:1GB}'"
    - "SET threads='${env:THREAD_COUNT:2}'"

views:
  - name: users
    source: parquet
    uri: "${env:DATA_PATH:./data}/users.parquet"
    description: "Raw user data"

  - name: active_users
    sql: |
      SELECT 
          user_id,
          username,
          email,
          created_at
      FROM users 
      WHERE status = 'active'
    description: "Active users only"

  - name: user_registrations_by_month
    sql: |
      SELECT 
          DATE_TRUNC('month', created_at) as registration_month,
          COUNT(*) as new_users
      FROM users 
      GROUP BY DATE_TRUNC('month', created_at)
      ORDER BY registration_month
    description: "Monthly user registration counts"

3.2 Build and Test Transformations¶

# NEW: Build and test with transformations
duckalog run step3_transformations.yaml --query "
-- Test active_users view
SELECT * FROM active_users;
"

# Interactive mode to explore all transformations
duckalog run step3_transformations.yaml --interactive
# In shell:
# SELECT * FROM active_users;
# SELECT * FROM user_registrations_by_month;
# SELECT COUNT(*) as total_users, SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END) as active_users FROM users;

duckalog run step3_transformations.yaml
duckdb tutorial.duckdb -c "
SELECT * FROM active_users;
SELECT * FROM user_registrations_by_month;
"

3.3 Understanding SQL Views¶

Key concepts you learned:

SQL Views: Virtual tables based on SELECT queries
Data Filtering: Using WHERE clauses to subset data
Aggregations: Using GROUP BY for summary statistics
Date Functions: DATE_TRUNC for time-based analysis
View Dependencies: Views can reference other views

Step 4: Multi-Source Joins¶

Combine data from multiple sources using joins and attachments.

4.1 Create Additional Data Source¶

Add an orders data source to join with users:

# step3_joins.yaml
version: 1

duckdb:
  database: tutorial.duckdb
  pragmas:
    - "SET memory_limit='1GB'"
    - "SET threads=2"

views:
  - name: users
    source: parquet
    uri: "data/users.parquet"

  - name: orders
    source: parquet
    uri: "data/orders.parquet"

  - name: user_orders
    sql: |
      SELECT 
          u.user_id,
          u.username,
          u.email,
          o.order_id,
          o.amount,
          o.order_date,
          o.status
      FROM users u
      LEFT JOIN orders o ON u.user_id = o.user_id
    description: "Users with their order information"

  - name: user_summary
    sql: |
      SELECT 
          u.user_id,
          u.username,
          u.email,
          COUNT(o.order_id) as order_count,
          COALESCE(SUM(o.amount), 0) as total_spent,
          MAX(o.order_date) as last_order_date
      FROM users u
      LEFT JOIN orders o ON u.user_id = o.user_id
      GROUP BY u.user_id, u.username, u.email
    description: "User order summary statistics"

4.2 Build and Test Joins¶

# Build with joins
duckalog run step3_joins.yaml

# Test the joined views
duckdb tutorial.duckdb -c "
-- Test user_orders join
SELECT * FROM user_orders LIMIT 5;

-- Test user summary
SELECT * FROM user_summary LIMIT 5;

-- Verify join logic
SELECT 
    COUNT(*) as total_users,
    COUNT(CASE WHEN o.order_id IS NOT NULL THEN 1 END) as users_with_orders
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id;
"

4.3 Understanding Join Concepts¶

Key concepts you learned:

LEFT JOINs: Include all records from left table, matching from right
Aggregations with JOINs: Group after joining for accurate summaries
NULL handling: Using COALESCE for missing values
Performance: Views can reference other views efficiently

Step 5: Using Config Imports for Modularity¶

Organize your configuration using imports for better maintainability.

5.1 Create Modular Configuration¶

Split your configuration into logical modules:

# config/base.yaml
version: 1

duckdb:
  database: tutorial.duckdb
  pragmas:
    - "SET memory_limit='1GB'"
    - "SET threads=2"

# config/views/users.yaml
version: 1

views:
  - name: users
    source: parquet
    uri: "../data/users.parquet"
    description: "Raw user data"

# config/views/orders.yaml
version: 1

views:
  - name: orders
    source: parquet
    uri: "../data/orders.parquet"
    description: "Order data"

# config/transforms/analytics.yaml
version: 1

views:
  - name: active_users
    sql: |
      SELECT 
          user_id,
          username,
          email,
          created_at
      FROM users 
      WHERE status = 'active'

  - name: user_orders
    sql: |
      SELECT 
          u.user_id,
          u.username,
          u.email,
          o.order_id,
          o.amount,
          o.order_date,
          o.status
      FROM users u
      LEFT JOIN orders o ON u.user_id = o.user_id

  - name: user_summary
    sql: |
      SELECT 
          u.user_id,
          u.username,
          u.email,
          COUNT(o.order_id) as order_count,
          COALESCE(SUM(o.amount), 0) as total_spent,
          MAX(o.order_date) as last_order_date
      FROM users u
      LEFT JOIN orders o ON u.user_id = o.user_id
      GROUP BY u.user_id, u.username, u.email

5.2 Create Main Configuration¶

Create a main configuration that imports the modules:

# step4_modular.yaml
version: 1
imports:
  - ./base.yaml
  - ./views/users.yaml
  - ./views/orders.yaml
  - ./transforms/analytics.yaml

5.3 Build and Test Modular Configuration¶

# NEW: Build and test with imports
duckalog run step4_modular.yaml --query "
-- List all views
SHOW TABLES;
"

# Interactive mode to explore modular views
duckalog run step4_modular.yaml --interactive
# In shell:
# SHOW TABLES;
# SELECT * FROM users LIMIT 3;
# SELECT * FROM orders LIMIT 3;
# SELECT * FROM active_users LIMIT 3;
# SELECT * FROM user_orders LIMIT 3;
# SELECT * FROM user_summary LIMIT 5;

# Quick test of imported views
duckalog run step4_modular.yaml --query "SELECT COUNT(*) FROM users"
duckalog run step4_modular.yaml --query "SELECT COUNT(*) FROM orders"

duckalog run step4_modular.yaml
duckdb tutorial.duckdb -c "SHOW TABLES; SELECT * FROM users LIMIT 3;"

5.4 Understanding Import Benefits¶

Advantages of using imports:

Organization: Group related configurations together
Reusability: Share base settings across projects
Team Collaboration: Different teams can own different modules
Maintainability: Easier to update specific functionality
Testing: Test individual modules in isolation

Check Your Understanding¶

Test your knowledge with these exercises:

Exercise 1: Basic Configuration¶

Create a configuration that: - Uses a Parquet file named products.parquet - Creates a view called active_products - Filters for products with status = 'active'

Click to see solution

# exercise1_solution.yaml
version: 1

duckdb:
  database: tutorial.duckdb

views:
  - name: active_products
    source: parquet
    uri: "data/products.parquet"
    sql: |
      SELECT * FROM products 
      WHERE status = 'active'

Exercise 2: SQL Transformation¶

Create a view that: - Uses the active_products view - Calculates total inventory value - Groups by product category

Click to see solution

# exercise2_solution.yaml
version: 1

duckdb:
  database: tutorial.duckdb

views:
  - name: active_products
    source: parquet
    uri: "data/products.parquet"
    sql: |
      SELECT * FROM products 
      WHERE status = 'active'

  - name: inventory_by_category
    sql: |
      SELECT 
          category,
          COUNT(*) as product_count,
          SUM(price * quantity) as total_value
      FROM active_products
      GROUP BY category

Exercise 3: Multi-Source Configuration¶

Create a configuration with imports that: - Has separate base, views, and analytics modules - Joins customer data with order data - Creates a customer lifetime value view

Click to see solution

# base.yaml
version: 1
duckdb:
  database: tutorial.duckdb
  pragmas:
    - "SET memory_limit='1GB'"
    - "SET threads=2"

# views/customers.yaml
version: 1
views:
  - name: customers
    source: parquet
    uri: "../data/customers.parquet"

# views/orders.yaml
version: 1
views:
  - name: orders
    source: parquet
    uri: "../data/orders.parquet"

# analytics.yaml
version: 1
views:
  - name: customer_lifetime_value
    sql: |
      SELECT 
          c.customer_id,
          c.name,
          COUNT(o.order_id) as total_orders,
          SUM(o.amount) as lifetime_value
      FROM customers c
      LEFT JOIN orders o ON c.customer_id = o.customer_id
      GROUP BY c.customer_id, c.name

# main.yaml
version: 1
imports:
  - ./base.yaml
  - ./views/customers.yaml
  - ./views/orders.yaml
  - ./analytics.yaml

Troubleshooting¶

Common Issues and Solutions¶

Build Fails¶

Issue: ConfigError: Field required

Solution: Ensure you have all required fields:

version: 1  # Required
duckdb:         # Required
  database: tutorial.duckdb
views: []         # Required (can be empty)

Data File Not Found¶

Issue: PathResolutionError: Failed to resolve import path

Solution: Check file paths are correct:

# Verify data files exist
ls -la data/

# Use absolute paths if needed
views:
  - name: users
    source: parquet
    uri: "/full/path/to/data/users.parquet"

SQL Syntax Errors¶

Issue: EngineError: Failed to create view

Solution: Test SQL in interactive mode first:

# Test SQL interactively
duckalog run step3_transformations.yaml --interactive
# Then test your SQL in the shell

duckdb tutorial.duckdb -c "YOUR SQL HERE"

Connection Issues¶

Issue: Connection failures or session not restored

Solution: Use verbose logging to diagnose:

# Check connection details with verbose output
duckalog run config.yaml --verbose --query "SELECT 1"

# Force rebuild if views are missing
duckalog run config.yaml --force-rebuild

Next Steps¶

After completing this tutorial:

Explore New Connection Management:
Session state restoration and persistent secrets
Incremental view updates for faster builds
Connection pooling and intelligent management
duckalog run command features
Explore Advanced Features:
Database attachments for multi-database scenarios
Semantic models for business-friendly metadata
Environment variables for secure configuration
Work Through Examples:
Multi-Source Analytics Example
Environment Variables Example
Config Imports Example
Learn Best Practices:
Performance Tuning Guide
Environment Management Guide
Troubleshooting Guide
Secrets Persistence Guide
Try the Dashboard:
Dashboard Tutorial
Interactive data exploration
Real-time query execution
Migration Tips:
Use duckalog run instead of build + query workflow
Leverage context managers for Python API (with connect_to_catalog(...))
Consider using persistent secrets for long-running applications

Congratulations!¶

You have successfully: - ✅ Created your first Duckalog configuration - ✅ Implemented SQL transformations - ✅ Built multi-source data joins - ✅ Organized configuration with imports - ✅ Understood core Duckalog concepts

You're now ready to tackle more advanced Duckalog scenarios and build sophisticated data catalogs for your analytics needs!

Need Help?¶

If you encountered issues: - Review the Troubleshooting Guide - Check the API Reference - Explore Examples for similar patterns - Ask questions in GitHub discussions

Ready to continue your Duckalog journey? Try the Dashboard Tutorial next!

Getting Started with Duckalog¶

Learning Objectives¶

Understanding the duckalog run Workflow¶

Key Benefits¶

Prerequisites¶

Step 1: Single Parquet File Catalog¶

1.1 Create Sample Data¶

1.2 Create Basic Configuration¶

1.3 Build and Test¶

1.4 Understanding What Happened¶

Step 2: Environment Variables and .env Files¶

2.1 Why Use Environment Variables?¶

2.2 Creating Your First .env File¶

2.3 Using .env Variables in Configuration¶

2.4 Building with Environment Variables¶

2.5 Testing Variable Resolution¶

2.6 Environment-Specific Configuration¶

2.7 Understanding .env File Discovery¶

2.8 Security Best Practices¶

2.9 Understanding Environment Variable Concepts¶

Step 3: Adding SQL Transformations¶

3.1 Create Transformed Views¶

3.2 Build and Test Transformations¶

3.3 Understanding SQL Views¶

Step 4: Multi-Source Joins¶

4.1 Create Additional Data Source¶

4.2 Build and Test Joins¶

4.3 Understanding Join Concepts¶

Step 5: Using Config Imports for Modularity¶

5.1 Create Modular Configuration¶

5.2 Create Main Configuration¶

5.3 Build and Test Modular Configuration¶

5.4 Understanding Import Benefits¶

Check Your Understanding¶

Exercise 1: Basic Configuration¶

Exercise 2: SQL Transformation¶

Exercise 3: Multi-Source Configuration¶

Troubleshooting¶

Common Issues and Solutions¶

Build Fails¶

Data File Not Found¶

SQL Syntax Errors¶

Connection Issues¶

Next Steps¶

Congratulations!¶

Need Help?¶

Understanding the `duckalog run` Workflow¶