DataStore Debugging

DataStore provides debugging tools to help you understand and optimize your data pipelines.

Debugging Tools Overview

Tool        Purpose                   When to Use
----        -------                   -----------
explain()   View execution plan       Understand what SQL will run
Profiler    Measure performance       Find slow operations
Logging     View execution details    Debug unexpected behavior

Quick Decision Matrix

Need                   Tool        Command
----                   ----        -------
See execution plan     explain()   ds.explain()
Measure performance    Profiler    config.enable_profiling()
Debug SQL queries      Logging     config.enable_debug()
All of the above       Combined    See below

Quick Setup

Enable All Debugging

from chdb import datastore as pd
from chdb.datastore.config import config

# Enable all debugging
config.enable_debug()        # Verbose logging
config.enable_profiling()    # Performance tracking

ds = pd.read_csv("data.csv")
result = ds.filter(ds['age'] > 25).groupby('city').agg({'salary': 'mean'})

# View execution plan
result.explain()

# Get profiler report
from chdb.datastore.config import get_profiler
profiler = get_profiler()
profiler.report()

explain() Method

View the execution plan before running a query.

ds = pd.read_csv("data.csv")

query = (ds
    .filter(ds['amount'] > 1000)
    .groupby('region')
    .agg({'amount': ['sum', 'mean']})
)

# View plan
query.explain()

Output:

Pipeline:
  Source: file('data.csv', 'CSVWithNames')
  Filter: amount > 1000
  GroupBy: region
  Aggregate: sum(amount), avg(amount)

Generated SQL:
SELECT region, SUM(amount) AS sum, AVG(amount) AS mean
FROM file('data.csv', 'CSVWithNames')
WHERE amount > 1000
GROUP BY region

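explain() also accepts verbose=True to include additional step metadata in the plan (see the summary table below):

# Include step metadata in the plan output
query.explain(verbose=True)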
See explain() Documentation for details.


Profiling

Measure execution time for each operation.

from chdb.datastore.config import config, get_profiler

# Enable profiling
config.enable_profiling()

# Run operations
ds = pd.read_csv("large_data.csv")
result = (ds
    .filter(ds['amount'] > 100)
    .groupby('category')
    .agg({'amount': 'sum'})
    .sort('sum', ascending=False)
    .head(10)
    .to_df()
)

# View report
profiler = get_profiler()
profiler.report(min_duration_ms=0.1)

Output:

Performance Report
==================
Step                          Duration    Calls
----                          --------    -----
read_csv                      1.234s      1
filter                        0.002s      1
groupby                       0.001s      1
agg                           0.089s      1
sort                          0.045s      1
head                          0.001s      1
to_df (SQL execution)         0.567s      1
----                          --------    -----
Total                         1.939s      7
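To discard accumulated timings between runs (for example, when comparing two versions of a pipeline), reset the profiler first:

# Clear timing data before the next measurement
get_profiler().reset()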

See Profiling Guide for details.


Logging

View detailed execution logs.

from chdb.datastore.config import config

# Enable debug logging
config.enable_debug()

# Run operations - logs will show:
# - SQL queries generated
# - Execution engine used
# - Cache hits/misses
# - Timing information

Log output example:

DEBUG - DataStore: Creating from file 'data.csv'
DEBUG - Query: SELECT region, SUM(amount) FROM ... WHERE amount > 1000 GROUP BY region
DEBUG - Engine: Using chdb for aggregation
DEBUG - Execution time: 0.089s
DEBUG - Cache: Storing result (key: abc123)
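If you want to keep these logs, you can attach a file handler through Python's standard logging module. This is a sketch that assumes DataStore emits its logs under a "chdb" logger namespace; adjust the logger name to match your installation:

import logging

# Assumption: DataStore logs under the "chdb" logger namespace
logger = logging.getLogger("chdb")
handler = logging.FileHandler("datastore_debug.log")
handler.setFormatter(logging.Formatter("%(levelname)s - %(name)s: %(message)s"))
logger.addHandler(handler)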

See Logging Configuration for details.


Common Debugging Scenarios

1. Query Not Returning Expected Results

# Step 1: View the execution plan
query = ds.filter(ds['age'] > 25).groupby('city').sum()
query.explain(verbose=True)

# Step 2: Enable logging to see SQL
config.enable_debug()

# Step 3: Run and check logs
result = query.to_df()
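You can also print the generated SQL without executing the query (to_sql() is covered under Best Practices below):

# Inspect the SQL that would run
print(query.to_sql())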

2. Query Running Slowly

# Step 1: Enable profiling
config.enable_profiling()

# Step 2: Run your query
result = process_data()

# Step 3: Check profiler report
profiler = get_profiler()
profiler.report()

# Step 4: Identify slow operations and optimize
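With long pipelines, you can filter the report to the slow steps by raising the minimum duration threshold:

# Show only steps that took at least 100 ms
profiler.report(min_duration_ms=100)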

3. Understanding Engine Selection

# Enable verbose logging
config.enable_debug()

# Run operations
result = ds.filter(ds['x'] > 10).apply(custom_func)

# Logs will show which engine was used for each operation:
# DEBUG - filter: Using chdb engine
# DEBUG - apply: Using pandas engine (custom function)

4. Debugging Cache Issues

# Enable debug to see cache operations
config.enable_debug()

# First run
result1 = ds.filter(ds['x'] > 10).to_df()
# LOG: Cache miss, executing query

# Second run (should use cache)
result2 = ds.filter(ds['x'] > 10).to_df()
# LOG: Cache hit, returning cached result

# If not caching when expected, check:
# - Are operations identical?
# - Is cache enabled? config.cache_enabled
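For the second check, you can read the flag directly:

# Confirm that result caching is turned on
print(config.cache_enabled)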

Best Practices

1. Debug in Development, Not Production

import logging

# Development
config.enable_debug()
config.enable_profiling()

# Production
config.set_log_level(logging.WARNING)
config.set_profiling_enabled(False)
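One way to wire this up is an environment switch. This is a sketch; APP_ENV is an arbitrary variable name chosen for the example, not something DataStore reads itself:

import logging
import os

from chdb.datastore.config import config

# APP_ENV is a hypothetical variable name used for this example
if os.environ.get("APP_ENV") == "development":
    config.enable_debug()
    config.enable_profiling()
else:
    config.set_log_level(logging.WARNING)
    config.set_profiling_enabled(False)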

2. Use explain() Before Running Large Queries

# Build query
query = ds.filter(...).groupby(...).agg(...)

# Check plan first
query.explain()

# If plan looks good, execute
result = query.to_df()

3. Profile Before Optimizing

# Don't guess what's slow - measure it
config.enable_profiling()
result = your_pipeline()
get_profiler().report()

4. Check SQL When Results Are Wrong

# View generated SQL
print(query.to_sql())

# Compare with expected SQL
# Run SQL directly in ClickHouse to verify
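Since DataStore compiles to SQL executed by chdb, one way to verify is to run the generated query through chdb directly. A sketch, assuming the source file is readable from the current working directory:

import chdb

# Execute the generated SQL and inspect the raw result
sql = query.to_sql()
print(chdb.query(sql, "CSV"))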

Debugging Tools Summary

Tool               Command                       Output
----               -------                       ------
Explain plan       ds.explain()                  Execution steps + SQL
Verbose explain    ds.explain(verbose=True)      Plan + step metadata
View SQL           ds.to_sql()                   SQL query string
Enable debug       config.enable_debug()         Detailed logs
Enable profiling   config.enable_profiling()     Timing data
Profiler report    get_profiler().report()       Performance summary
Clear profiler     get_profiler().reset()        Timing data cleared

Next Steps