Path Resolution Examples¶

This section provides practical examples of using Duckalog's path resolution feature in various scenarios.

Example 1: Basic Project Structure¶

Project Layout¶

data-analytics/
├── config/
│   └── catalog.yaml
├── data/
│   ├── raw/
│   │   ├── events.parquet
│   │   └── users.parquet
│   ├── processed/
│   │   ├── daily_metrics.parquet
│   │   └── user_segments.parquet
│   └── external/
│       └── reference_data.parquet
└── databases/
    └── reference.duckdb

Configuration (`config/catalog.yaml`)¶

version: 1

duckdb:
  database: catalog.duckdb
  pragmas:
    - "SET memory_limit='2GB'"

attachments:
  duckdb:
    - alias: refdb
      path: ../databases/reference.duckdb
      read_only: true

views:
  # Raw data files
  - name: raw_events
    source: parquet
    uri: ../data/raw/events.parquet
    description: "Raw event data from production"

  - name: raw_users
    source: parquet
    uri: ../data/raw/users.parquet
    description: "Raw user data"

  # Processed data files
  - name: daily_metrics
    source: parquet
    uri: ../data/processed/daily_metrics.parquet
    description: "Daily aggregated metrics"

  - name: user_segments
    source: parquet
    uri: ../data/processed/user_segments.parquet
    description: "User segmentation data"

  # External reference data
  - name: external_reference
    source: parquet
    uri: ../data/external/reference_data.parquet
    description: "External reference data"

  # Data from attached database
  - name: reference_customers
    source: duckdb
    database: refdb
    table: customers
    description: "Customer reference data"

  # Combined analytics view
  - name: analytics_dashboard
    sql: |
      SELECT 
        u.user_id,
        u.name,
        u.email,
        us.segment_name,
        dm.event_date,
        dm.total_events,
        dm.unique_users
      FROM raw_users u
      LEFT JOIN user_segments us ON u.user_id = us.user_id
      LEFT JOIN daily_metrics dm ON u.segment_id = dm.segment_id
      WHERE dm.event_date >= CURRENT_DATE - INTERVAL 30 DAYS

Loading with Path Resolution¶

from duckalog import load_config, generate_sql

# Load configuration with automatic path resolution
config = load_config("config/catalog.yaml")

print(f"Loaded config with {len(config.views)} views")

# Verify paths were resolved
for view in config.views:
    if hasattr(view, 'uri') and view.uri:
        print(f"View: {view.name}")
        print(f"  URI: {view.uri}")
        print(f"  Is absolute: {view.uri.startswith('/')}")

Example 2: Multi-Environment Configuration¶

Development vs Production¶

Development Structure:

dev-project/
├── config/
│   ├── catalog.yaml
│   └── catalog-dev.yaml
├── data/
│   ├── sample/
│   │   └── events.parquet
│   └── testing/
│       └── users.parquet
└── databases/
    └── dev_reference.duckdb

Production Structure:

prod-project/
├── config/
│   ├── catalog.yaml
│   └── catalog-prod.yaml
├── data/
│   ├── live/
│   │   └── events.parquet
│   └── archive/
│       └── users.parquet
└── databases/
    └── prod_reference.duckdb

Development Configuration (`dev-project/config/catalog.yaml`)¶

version: 1

duckdb:
  database: dev_catalog.duckdb
  pragmas:
    - "SET memory_limit='1GB'"

attachments:
  duckdb:
    - alias: refdb
      path: ../databases/dev_reference.duckdb
      read_only: true

views:
  - name: dev_events
    source: parquet
    uri: ../data/sample/events.parquet
    description: "Development sample events"

  - name: dev_users
    source: parquet
    uri: ../data/testing/users.parquet
    description: "Development test users"

Production Configuration (`prod-project/config/catalog.yaml`)¶

version: 1

duckdb:
  database: prod_catalog.duckdb
  pragmas:
    - "SET memory_limit='8GB'"
    - "SET threads=8"

attachments:
  duckdb:
    - alias: refdb
      path: ../databases/prod_reference.duckdb
      read_only: true

views:
  - name: prod_events
    source: parquet
    uri: ../data/live/events.parquet
    description: "Production live events"

  - name: prod_users
    source: parquet
    uri: ../data/archive/users.parquet
    description: "Production archived users"

Environment-Aware Loading¶

import os
from duckalog import load_config

def load_environment_config(base_path: str, env: str = "dev"):
    """Load configuration for specific environment"""
    config_path = os.path.join(base_path, f"{env}-project/config/catalog.yaml")
    return load_config(config_path)

# Load development configuration
dev_config = load_environment_config("/path/to/projects", "dev")
print(f"Development config: {dev_config.views[0].uri}")

# Load production configuration  
prod_config = load_environment_config("/path/to/projects", "prod")
print(f"Production config: {prod_config.views[0].uri}")

Example 3: Shared Enterprise Resources¶

Enterprise Structure¶

company/
├── shared/
│   ├── reference_data/
│   │   ├── geography.parquet
│   │   ├── currency_rates.parquet
│   │   └── company_calendar.parquet
│   ├── databases/
│   │   ├── enterprise_master.duckdb
│   │   └── analytics_cache.duckdb
│   └── schemas/
│       └── standard_views.sql
├── analytics/
│   ├── config/
│   │   └── catalog.yaml
│   ├── data/
│   │   ├── department_a/
│   │   │   └── metrics.parquet
│   │   └── department_b/
│   │       └── reports.parquet
│   └── sql/
│       └── custom_views.sql
└── finance/
    ├── config/
    │   └── catalog.yaml
    ├── data/
    │   ├── transactions/
    │   │   └── daily.parquet
    │   └── budgets/
    │       └── quarterly.parquet
    └── sql/
        └── financial_views.sql

Analytics Configuration (`company/analytics/config/catalog.yaml`)¶

version: 1

duckdb:
  database: analytics_catalog.duckdb
  pragmas:
    - "SET memory_limit='4GB'"
    - "SET threads=4"

# Attach shared enterprise databases
attachments:
  duckdb:
    - alias: enterprise_master
      path: ../../shared/databases/enterprise_master.duckdb
      read_only: true
    - alias: analytics_cache
      path: ../../shared/databases/analytics_cache.duckdb
      read_only: true

views:
  # Shared enterprise reference data
  - name: geography
    source: parquet
    uri: ../../shared/reference_data/geography.parquet
    description: "Geographic reference data"

  - name: currency_rates
    source: parquet
    uri: ../../shared/reference_data/currency_rates.parquet
    description: "Currency exchange rates"

  - name: company_calendar
    source: parquet
    uri: ../../shared/reference_data/company_calendar.parquet
    description: "Company calendar with holidays"

  # Local department data
  - name: dept_a_metrics
    source: parquet
    uri: ../data/department_a/metrics.parquet
    description: "Department A metrics"

  - name: dept_b_reports
    source: parquet
    uri: ../data/department_b/reports.parquet
    description: "Department B reports"

  # Views using enterprise database
  - name: enterprise_customers
    source: duckdb
    database: enterprise_master
    table: customers
    description: "Enterprise customer master"

  # Combined analytics views
  - name: regional_performance
    sql: |
      SELECT 
        g.region_name,
        g.country_code,
        dm.metric_type,
        dm.metric_value,
        dm.reporting_date
      FROM geography g
      JOIN dept_a_metrics dm ON g.region_id = dm.region_id
      WHERE dm.reporting_date >= CURRENT_DATE - INTERVAL 90 DAYS

  - name: budget_vs_actual
    sql: |
      SELECT 
        cc.fiscal_year,
        cc.fiscal_quarter,
        b.budget_amount,
        SUM(dm.metric_value) as actual_amount,
        (SUM(dm.metric_value) - b.budget_amount) as variance
      FROM company_calendar cc
      JOIN dept_b_reports dm ON cc.date_key = dm.reporting_date
      LEFT JOIN (
        SELECT * FROM enterprise_master.finance_budgets
      ) b ON cc.fiscal_quarter = b.quarter AND cc.fiscal_year = b.year
      GROUP BY cc.fiscal_year, cc.fiscal_quarter, b.budget_amount

Finance Configuration (`company/finance/config/catalog.yaml`)¶

version: 1

duckdb:
  database: finance_catalog.duckdb
  pragmas:
    - "SET memory_limit='6GB'"
    - "SET threads=6"

# Attach shared enterprise databases (same as analytics)
attachments:
  duckdb:
    - alias: enterprise_master
      path: ../../shared/databases/enterprise_master.duckdb
      read_only: true

views:
  # Shared enterprise reference data
  - name: currency_rates
    source: parquet
    uri: ../../shared/reference_data/currency_rates.parquet
    description: "Currency exchange rates"

  # Finance-specific data
  - name: daily_transactions
    source: parquet
    uri: ../data/transactions/daily.parquet
    description: "Daily transaction data"

  - name: quarterly_budgets
    source: parquet
    uri: ../data/budgets/quarterly.parquet
    description: "Quarterly budget data"

  # Financial analytics
  - name: transaction_analysis
    sql: |
      SELECT 
        dt.transaction_date,
        dt.customer_id,
        dt.amount,
        dt.currency,
        cr.usd_rate,
        (dt.amount * cr.usd_rate) as amount_usd,
        em.customer_segment
      FROM daily_transactions dt
      JOIN currency_rates cr ON dt.currency = cr.currency_code 
        AND dt.transaction_date = cr.effective_date
      LEFT JOIN enterprise_master.customers em ON dt.customer_id = em.customer_id
      WHERE dt.transaction_date >= CURRENT_DATE - INTERVAL 30 DAYS

Example 4: Cloud and Local Hybrid Setup¶

Hybrid Project Structure¶

hybrid-analytics/
├── config/
│   └── catalog.yaml
├── local-data/
│   ├── sensitive/
│   │   └── pii_data.parquet
│   └── cache/
│       └── local_cache.duckdb
├── databases/
│   └── reference.duckdb
└── sql/
    └── views/

Hybrid Configuration (`config/catalog.yaml`)¶

version: 1

duckdb:
  database: hybrid_catalog.duckdb
  pragmas:
    - "SET memory_limit='4GB'"
    - "SET s3_region='us-west-2'"
    - "SET s3_access_key_id='${env:AWS_ACCESS_KEY_ID}'"
    - "SET s3_secret_access_key='${env:AWS_SECRET_ACCESS_KEY}'"

attachments:
  duckdb:
    - alias: refdb
      path: ../databases/reference.duckdb
      read_only: true
    - alias: cache
      path: ../local-data/cache/local_cache.duckdb
      read_only: false  # Allow writing to cache

views:
  # Cloud data (remote URIs - not resolved)
  - name: cloud_events
    source: parquet
    uri: s3://data-lake/raw/events/*.parquet
    description: "Cloud-based raw events"

  - name: cloud_users  
    source: parquet
    uri: s3://data-lake/processed/users.parquet
    description: "Cloud-based processed users"

  # Local sensitive data (resolved relative paths)
  - name: sensitive_pii
    source: parquet
    uri: ../local-data/sensitive/pii_data.parquet
    description: "Local PII data (not stored in cloud)"

  # Local reference data
  - name: local_reference
    source: duckdb
    database: refdb
    table: reference_data
    description: "Local reference database"

  # Cache data (read-write attachment)
  - name: cached_results
    source: duckdb
    database: cache
    table: precomputed_metrics
    description: "Locally cached computation results"

  # Hybrid views combining cloud and local data
  - name: enhanced_events
    sql: |
      SELECT 
        ce.event_id,
        ce.timestamp,
        ce.event_type,
        ce.properties,
        sp.customer_name,
        sp.email,
        sp.segment,
        lr.internal_category
      FROM cloud_events ce
      LEFT JOIN sensitive_pii sp ON ce.user_id = sp.user_id
      LEFT JOIN local_reference lr ON ce.event_type = lr.event_category
      WHERE ce.timestamp >= CURRENT_DATE - INTERVAL 7 DAYS

  # View that updates local cache
  - name: update_cache
    sql: |
      -- Update local cache with latest aggregations
      INSERT OR REPLACE INTO cache.precomputed_metrics
      SELECT 
        DATE(timestamp) as event_date,
        event_type,
        COUNT(*) as event_count,
        COUNT(DISTINCT user_id) as unique_users,
        CURRENT_TIMESTAMP as cache_updated
      FROM cloud_events 
      WHERE timestamp >= CURRENT_DATE - INTERVAL 1 DAY
      GROUP BY DATE(timestamp), event_type

Example 5: Security Demonstration¶

Security Test Configuration¶

version: 1

duckdb:
  database: security_test.duckdb

views:
  # ✅ ALLOWED - Safe relative paths
  - name: safe_local_data
    source: parquet
    uri: ./data/local.parquet
    description: "Safe local data"

  - name: safe_parent_data
    source: parquet
    uri: ../shared/data.parquet
    description: "Safe parent directory data"

  - name: safe_grandparent_data
    source: parquet
    uri: ../../project/common.parquet
    description: "Safe grandparent directory data"

  # ✅ ALLOWED - Remote URIs (bypass local file security)
  - name: remote_s3_data
    source: parquet
    uri: s3://secure-bucket/data.parquet
    description: "Remote S3 data"

  - name: remote_http_data
    source: parquet
    uri: https://api.example.com/data.parquet
    description: "Remote HTTP data"

  # ❌ BLOCKED - Security violations (these would cause errors)
  # Uncommenting these would cause ConfigError during loading

  # - name: system_files
  #   source: parquet
  #   uri: ../../../etc/passwd
  #   description: "Blocked - system file access"

  # - name: system_config
  #   source: parquet
  #   uri: ../usr/local/config.parquet
  #   description: "Blocked - system directory access"

  # - name: excessive_traversal
  #   source: parquet
  #   uri: ../../../../../../../root/.ssh/id_rsa
  #   description: "Blocked - excessive directory traversal"

Security Validation Code¶

from duckalog import load_config, ConfigError
from duckalog.path_resolution import validate_path_security, is_relative_path
from pathlib import Path

def test_security_examples():
    """Demonstrate security validation"""

    config_dir = Path("/project/config")

    # Test safe paths
    safe_paths = [
        "./data/local.parquet",
        "../shared/data.parquet", 
        "../../project/common.parquet"
    ]

    print("Testing safe paths:")
    for path in safe_paths:
        is_rel = is_relative_path(path)
        is_safe = validate_path_security(path, config_dir)
        print(f"  {path}: relative={is_rel}, safe={is_safe}")

    # Test dangerous paths
    dangerous_paths = [
        "../../../etc/passwd",
        "../usr/local/config.parquet",
        "../../../../root/.ssh/id_rsa"
    ]

    print("\nTesting dangerous paths:")
    for path in dangerous_paths:
        is_rel = is_relative_path(path)
        is_safe = validate_path_security(path, config_dir)
        print(f"  {path}: relative={is_rel}, safe={is_safe}")

# Run security tests
test_security_examples()

# Test configuration loading with security validation
try:
    config = load_config("security_test_catalog.yaml")
    print("Configuration loaded successfully")
except ConfigError as e:
    print(f"Configuration error: {e}")

Example 6: Programmatic Path Resolution¶

Direct Path Resolution Usage¶

from duckalog.path_resolution import (
    resolve_relative_path,
    validate_path_security,
    detect_path_type,
    normalize_path_for_sql
)
from pathlib import Path

def demonstrate_path_resolution():
    """Demonstrate path resolution functionality"""

    config_dir = Path("/project/analytics/config")

    # Test different path types
    test_paths = [
        "data/events.parquet",
        "../shared/reference.parquet",
        "/absolute/path/data.parquet",
        "s3://bucket/data.parquet",
        "../../../etc/passwd",  # This should fail security validation
    ]

    print("Path Resolution Examples:")
    print("=" * 50)

    for path in test_paths:
        print(f"\nOriginal path: {path}")

        # Detect path type
        path_type = detect_path_type(path)
        print(f"  Path type: {path_type}")

        # Check if relative
        is_rel = is_relative_path(path)
        print(f"  Is relative: {is_rel}")

        # Validate security
        try:
            is_safe = validate_path_security(path, config_dir)
            print(f"  Security validation: {'PASS' if is_safe else 'BLOCKED'}")
        except Exception as e:
            print(f"  Security validation: ERROR - {e}")

        # Resolve if relative and safe
        if is_rel:
            try:
                resolved = resolve_relative_path(path, config_dir)
                print(f"  Resolved path: {resolved}")

                # Normalize for SQL
                sql_path = normalize_path_for_sql(resolved)
                print(f"  SQL format: {sql_path}")

            except Exception as e:
                print(f"  Resolution failed: {e}")
        else:
            print(f"  Path unchanged: {path}")

# Run demonstration
demonstrate_path_resolution()

# Example: Building dynamic configuration
def build_dynamic_config(config_file_path: str, data_patterns: dict):
    """Build configuration with dynamic path resolution"""

    config_dir = Path(config_file_path).parent
    resolved_views = []

    for view_name, path_pattern in data_patterns.items():
        try:
            # Resolve the path pattern
            resolved_path = resolve_relative_path(path_pattern, config_dir)

            # Create view with resolved path
            view = {
                "name": view_name,
                "source": "parquet",
                "uri": resolved_path,
                "description": f"Auto-resolved view for {view_name}"
            }
            resolved_views.append(view)

        except Exception as e:
            print(f"Failed to resolve {view_name}: {e}")

    return {
        "version": 1,
        "duckdb": {"database": "dynamic_catalog.duckdb"},
        "views": resolved_views
    }

# Usage example
data_sources = {
    "raw_events": "../data/events/raw/*.parquet",
    "processed_users": "../data/users/processed.parquet", 
    "reference_data": "./reference/common.parquet"
}

dynamic_config = build_dynamic_config("/project/config/catalog.yaml", data_sources)
print("\nDynamic configuration:")
print(dynamic_config)

Example 7: Migration from Absolute to Relative Paths¶

Before: Absolute Paths¶

# old-catalog.yaml
version: 1

duckdb:
  database: /Users/john/projects/analytics/catalog.duckdb

views:
  - name: events
    source: parquet
    uri: /Users/john/projects/analytics/data/events.parquet

  - name: users
    source: parquet
    uri: /Users/john/projects/analytics/data/users.parquet

attachments:
  duckdb:
    - alias: refdb
      path: /Users/john/shared/reference.duckdb
      read_only: true

After: Relative Paths¶

# new-catalog.yaml
version: 1

duckdb:
  database: catalog.duckdb

views:
  - name: events
    source: parquet
    uri: ./data/events.parquet

  - name: users
    source: parquet
    uri: ./data/users.parquet

attachments:
  duckdb:
    - alias: refdb
      path: ../shared/reference.duckdb
      read_only: true

Migration Script¶

import yaml
import re
from pathlib import Path
from duckalog.path_resolution import is_relative_path

def migrate_to_relative_paths(config_path: str, base_path: str = None):
    """Migrate absolute paths to relative paths in a configuration file"""

    config_path = Path(config_path)
    if base_path is None:
        base_path = config_path.parent

    # Load existing configuration
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)

    # Helper function to convert absolute to relative path
    def to_relative_path(abs_path: str) -> str:
        if not abs_path or not is_relative_path(abs_path):
            try:
                path_obj = Path(abs_path)
                if path_obj.is_absolute():
                    rel_path = path_obj.relative_to(base_path)
                    return str(rel_path)
            except (ValueError, OSError):
                pass
        return abs_path

    # Convert database path
    if 'duckdb' in config and 'database' in config['duckdb']:
        config['duckdb']['database'] = to_relative_path(
            config['duckdb']['database']
        )

    # Convert view URIs
    if 'views' in config:
        for view in config['views']:
            if 'uri' in view:
                view['uri'] = to_relative_path(view['uri'])

    # Convert attachment paths
    if 'attachments' in config:
        for attach_type in ['duckdb', 'sqlite']:
            if attach_type in config['attachments']:
                for attachment in config['attachments'][attach_type]:
                    if 'path' in attachment:
                        attachment['path'] = to_relative_path(attachment['path'])

    # Save migrated configuration
    backup_path = config_path.with_suffix('.yaml.backup')
    config_path.rename(backup_path)

    with open(config_path, 'w') as f:
        yaml.dump(config, f, default_flow_style=False)

    print(f"Configuration migrated to relative paths")
    print(f"Original saved to: {backup_path}")
    print(f"New config: {config_path}")

# Usage
migrate_to_relative_paths("old-catalog.yaml")

These examples demonstrate the flexibility and security of Duckalog's path resolution feature across various real-world scenarios. The key benefits are:

Portability: Configurations can be moved between environments
Security: Automatic protection against dangerous file access
Maintainability: Clear relative path structure
Flexibility: Works with local files, shared resources, and remote URIs
Cross-Platform: Consistent behavior across operating systems