How to Tune Performance¶
Optimize Duckalog performance for your specific workload and data characteristics through systematic tuning approaches.
Problem¶
Your Duckalog catalog builds or queries are running slowly, consuming excessive memory, or not scaling with your data volume.
Prerequisites¶
- Basic Duckalog and DuckDB knowledge
- Understanding of your data characteristics
- Access to performance metrics and monitoring
- Familiarity with SQL optimization concepts
Solution¶
1. Memory Optimization¶
Configure DuckDB memory settings for your workload:
Basic Memory Configuration¶
# config/performance.yaml
version: 1
duckdb:
database: analytics.duckdb
pragmas:
# Set memory limit (adjust based on available RAM)
- "SET memory_limit='4GB'"
# Enable memory-mapped I/O for large files
- "SET force_parallelism=true"
Advanced Memory Settings¶
# For memory-intensive workloads
duckdb:
database: analytics.duckdb
pragmas:
# Use all available memory
- "SET memory_limit='8GB'"
# Enable memory mapping for large datasets
- "SET enable_memory_map=true"
# Configure memory management
- "SET memory_allocator=arena'"
# Set checkpoint threshold
- "SET checkpoint_threshold='2GB'"
Memory Usage Monitoring¶
# Monitor memory usage during build
time -v duckalog run config/performance.yaml
# Check DuckDB memory usage
duckdb analytics.duckdb -c "
SELECT
current_memory_usage,
peak_memory_usage,
allocator_memory_limit
FROM pragma_memory_info();
"
2. Threading Optimization¶
Configure parallelism for your CPU and workload:
CPU-Based Threading¶
# For 4-core CPU
duckdb:
database: analytics.duckdb
pragmas:
# Use all available cores
- "SET threads=4"
# Enable parallel processing
- "SET enable_progress_bar=true"
# Configure parallel join behavior
- "SET parallel_join=true"
Workload-Specific Threading¶
# For I/O-heavy workloads (many small files)
duckdb:
database: analytics.duckdb
pragmas:
# More threads for parallel I/O
- "SET threads=8"
# Optimize for many small queries
- "SET enable_optimizer=true"
# Configure join order
- "SET join_order_type=automatic'"
Thread Testing and Validation¶
# Test optimal thread count
for threads in 1 2 4 6 8; do
echo "Testing with $threads threads..."
time duckalog run config/performance.yaml --threads $threads
done
# Monitor CPU usage
htop # Watch CPU utilization during builds
3. Query Optimization¶
Optimize SQL queries and view definitions:
Efficient View Design¶
# Good: Filtered views with proper indexing
views:
- name: recent_orders
source: parquet
uri: "data/orders.parquet"
sql: |
-- Filter early to reduce data processed
SELECT
order_id,
customer_id,
order_date,
amount
FROM orders
WHERE order_date >= '2023-01-01'
AND status = 'completed'
- name: customer_summary
source: duckdb
database: analytics
table: recent_orders # Use filtered view
sql: |
-- Aggregate pre-filtered data
SELECT
customer_id,
COUNT(*) as order_count,
SUM(amount) as total_spent
FROM recent_orders
GROUP BY customer_id
Column Pruning and Projection¶
# Good: Select only needed columns
views:
- name: order_metrics
sql: |
-- Select specific columns, not SELECT *
SELECT
order_id,
customer_id,
amount,
order_date
FROM orders
WHERE order_date >= '2023-01-01'
# Bad: Select all columns
views:
- name: order_all_columns
sql: |
-- Inefficient: selects all columns
SELECT * FROM orders
WHERE order_date >= '2023-01-01'
Join Optimization¶
# Good: Optimized joins with proper ordering
views:
- name: customer_orders_full
sql: |
-- Join on indexed columns with proper ordering
SELECT
c.customer_id,
c.customer_name,
o.order_id,
o.order_date,
o.amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2023-01-01'
ORDER BY c.customer_id, o.order_date # Helps with join ordering
4. File Format Optimization¶
Choose optimal file formats for your data:
Parquet Optimization¶
# For analytical workloads
views:
- name: analytics_data
source: parquet
uri: "data/analytics/*.parquet"
# Parquet configuration for performance
sql: |
SELECT * FROM analytics_data
WHERE event_date >= '2023-01-01'
File Organization¶
# Organize files for optimal access patterns
views:
- name: daily_events
source: parquet
# Date-partitioned files for partition pruning
uri: "data/events/year=2023/month=*/day=*/events.parquet"
sql: |
SELECT * FROM daily_events
WHERE event_date BETWEEN '2023-01-01' AND '2023-12-31'
5. DuckDB Extensions and Settings¶
Leverage DuckDB extensions for performance:
Enable Extensions¶
# For JSON and semi-structured data
duckdb:
database: analytics.duckdb
install_extensions:
- httpfs # For S3/cloud storage
- json # For JSON processing
- parquet # For Parquet optimization
load_extensions:
- httpfs
- json
- parquet
Extension-Specific Optimizations¶
# For full-text search
duckdb:
database: analytics.duckdb
install_extensions:
- fts # Full-text search
pragmas:
- "SET enable_fts=true"
views:
- name: searchable_documents
sql: |
-- Use FTS for text search
SELECT
doc_id,
title,
content
FROM documents
WHERE content MATCH 'search terms'
ORDER BY rank
6. Caching Strategies¶
Implement effective caching for repeated queries:
Pre-computed Views¶
# Create views for expensive computations using tables
views:
- name: customer_metrics_raw
source: parquet
uri: "data/orders.parquet"
sql: |
-- Expensive aggregation
SELECT
customer_id,
COUNT(*) as order_count,
SUM(amount) as total_spent,
AVG(amount) as avg_order_value
FROM orders
GROUP BY customer_id
- name: customer_metrics_table
sql: |
-- Create table for pre-computed results
CREATE OR REPLACE TABLE customer_metrics AS
SELECT * FROM customer_metrics_raw;
- name: customer_metrics
sql: |
-- Query pre-computed table
SELECT * FROM customer_metrics
WHERE customer_id = '{{customer_id}}'
Result Caching¶
# Cache query results
duckdb:
database: analytics.duckdb
pragmas:
- "SET enable_progress_bar=true"
- "SET preserve_insertion_order=true"
views:
- name: cached_results
sql: |
-- Use CREATE OR REPLACE for caching
CREATE OR REPLACE TABLE temp_results AS
SELECT * FROM expensive_computation
WHERE date_parameter = '{{date_param}}'
7. Build Performance Monitoring¶
Monitor and analyze build performance:
Build Timing¶
# Time the build process
time duckalog run config/performance.yaml
# Detailed timing with verbose output
duckalog run config/performance.yaml --verbose --timing
Resource Monitoring¶
# Monitor memory usage
/usr/bin/time -v duckalog run config/performance.yaml
# Monitor CPU usage
duckalog run config/performance.yaml &
BUILD_PID=$!
# Monitor in another terminal
top -p $BUILD_PID
# Check disk I/O
iotop -p $BUILD_PID
Performance Analysis¶
-- Analyze query performance
EXPLAIN ANALYZE SELECT * FROM expensive_view WHERE date_column = '2023-01-01';
-- Check for missing indexes
PRAGMA table_info('your_table_name');
-- Monitor memory usage
SELECT * FROM pragma_memory_info();
Verification¶
1. Performance Benchmarks¶
Establish baseline performance metrics:
# Benchmark current configuration
echo "Benchmarking current configuration..."
time duckalog run config/performance.yaml
# Test query performance
duckdb analytics.duckdb -c "
EXPLAIN ANALYZE SELECT * FROM your_view WHERE date_column = '2023-01-01';
"
# Record baseline metrics
echo "Build time: $BUILD_TIME seconds" > performance_baseline.txt
echo "Memory usage: $MEMORY_USAGE MB" >> performance_baseline.txt
2. Optimization Validation¶
Test each optimization individually:
# Test memory optimization
echo "Testing memory optimization..."
time duckalog run config/memory_optimized.yaml
# Test threading optimization
echo "Testing threading optimization..."
time duckalog run config/thread_optimized.yaml
# Test query optimization
echo "Testing query optimization..."
time duckalog run config/query_optimized.yaml
3. Comparative Analysis¶
Compare performance across configurations:
# Run comparative benchmarks
for config in baseline memory_optimized thread_optimized query_optimized; do
echo "Testing $config..."
time duckalog run configs/${config}.yaml
echo "---"
done
# Generate performance report
echo "Performance Comparison Results:" > performance_report.txt
echo "Configuration, Build Time, Memory Usage" >> performance_report.txt
# Add actual results from timing above
Common Variations¶
1. Large Dataset Optimization¶
For datasets larger than available memory:
# Out-of-core processing
duckdb:
database: analytics.duckdb
pragmas:
# Use disk-based processing
- "SET memory_limit='1GB'" # Low memory for metadata
- "SET force_parallelism=false" # Reduce memory pressure
- "SET enable_object_cache=true" # Cache metadata
- "SET object_cache_limit='256MB'" # Limit cache size
2. High-Concurrency Optimization¶
For many concurrent users:
# Concurrent access optimization
duckdb:
database: analytics.duckdb
pragmas:
# Enable read-only mode for readers
- "SET wal_autocheckpoint=1000000" # Less frequent checkpoints
- "SET enable_progress_bar=true" # Progress indication
- "SET preserve_insertion_order=false" # Better concurrency
3. Cloud Storage Optimization¶
For S3/cloud data sources:
# Cloud storage optimization
duckdb:
database: analytics.duckdb
install_extensions:
- httpfs
pragmas:
# S3-specific settings
- "SET s3_region='us-west-2'"
- "SET s3_access_key_id='${env:AWS_ACCESS_KEY_ID}'"
- "SET s3_secret_access_key='${env:AWS_SECRET_ACCESS_KEY}'"
# Connection pooling for cloud
- "SET s3_uploader_max_parts=8'"
- "SET enable_http_stats=true"
Troubleshooting¶
Memory Issues¶
Symptoms: Out of memory errors, slow performance
Solutions:
# Reduce memory usage
duckdb:
pragmas:
- "SET memory_limit='2GB'" # Reduce from 4GB
- "SET enable_memory_map=false" # Disable memory mapping
- "SET force_parallelism=false" # Reduce parallelism
Threading Issues¶
Symptoms: CPU not fully utilized, poor parallelism
Solutions:
# Optimize threading
duckdb:
pragmas:
- "SET threads=2" # Reduce from 8 to 2
- "SET enable_progress_bar=true" # Monitor progress
- "SET enable_optimizer=true" # Better query planning
I/O Bottlenecks¶
Symptoms: Slow builds with low CPU/memory usage
Solutions:
# Optimize I/O patterns
duckdb:
pragmas:
- "SET enable_progress_bar=true" # Monitor I/O
- "SET checkpoint_threshold='1GB'" # More frequent checkpoints
- "SET wal_autocheckpoint=100000" # Balance durability/performance
Best Practices¶
1. Performance Monitoring¶
- Establish baselines before optimization
- Monitor continuously in production
- Profile queries to identify bottlenecks
- Document performance characteristics of your data
2. Incremental Optimization¶
- Change one variable at a time for proper measurement
- Test each optimization with realistic workloads
- Roll back changes that don't improve performance
- Document successful optimizations for future reference
3. Data-Aware Tuning¶
- Understand your data characteristics (size, distribution, access patterns)
- Optimize for common queries rather than edge cases
- Use appropriate file formats for your data type
- Consider data growth when configuring resources
4. Environment-Specific Tuning¶
- Different settings for development vs production
- Scale resources based on actual workload
- Monitor resource usage in each environment
- Plan for scaling as data volume grows
Next Steps¶
After optimizing performance:
- Set up monitoring for ongoing performance tracking
- Create performance tests for regression detection
- Document optimization decisions and their impact
- Plan for scaling as data and users grow
- Explore advanced features like partitioning and indexing
You now have a comprehensive approach to optimize Duckalog performance for any workload and data characteristics.