Performance Guide

DataStore delivers significant performance improvements over pandas for many operations. This guide explains why and how to optimize your workloads.

Why DataStore Is Faster

1. SQL Pushdown

Operations are pushed down to the data source:

# pandas: Loads ALL data, then filters in memory
df = pd.read_csv("huge.csv")       # Load 10GB
df = df[df['year'] == 2024]        # Filter in Python

# DataStore: Filter at source (the same call returns a lazy DataStore)
ds = pd.read_csv("huge.csv")       # Just metadata, nothing loaded yet
ds = ds[ds['year'] == 2024]        # Filter expressed in SQL
df = ds.to_df()                    # Only load filtered data

2. Column Pruning

Only needed columns are read:

# DataStore: Only reads name, age columns
ds = pd.read_parquet("wide_table.parquet")
result = ds.select('name', 'age').to_df()

# vs pandas: Reads all 100 columns, then selects

3. Lazy Evaluation

Multiple operations compile to one query:

# DataStore: One optimized SQL query
result = (ds
    .filter(ds['amount'] > 100)
    .groupby('region')
    .agg({'amount': 'sum'})
    .sort('sum', ascending=False)
    .head(10)
    .to_df()
)

# Becomes:
# SELECT region, SUM(amount) FROM data
# WHERE amount > 100
# GROUP BY region ORDER BY sum DESC LIMIT 10

Benchmark: DataStore vs pandas

Test Environment

  • Data: 10 million rows
  • Hardware: Standard laptop
  • File format: CSV

Results

Operation                 pandas (ms)   DataStore (ms)   Winner
---------                 -----------   --------------   ------
GroupBy count             347           17               DataStore (19.93x)
Combined ops              1,535         234              DataStore (6.56x)
Complex pipeline          2,047         380              DataStore (5.39x)
MultiFilter+Sort+Head     1,963         366              DataStore (5.36x)
Filter+Sort+Head          1,537         350              DataStore (4.40x)
Head/Limit                166           45               DataStore (3.69x)
Ultra-complex (10+ ops)   1,070         338              DataStore (3.17x)
GroupBy agg               406           141              DataStore (2.88x)
Select+Filter+Sort        1,217         443              DataStore (2.75x)
Filter+GroupBy+Sort       466           184              DataStore (2.53x)
Filter+Select+Sort        1,285         533              DataStore (2.41x)
Sort (single)             1,742         1,197            DataStore (1.45x)
Filter (single)           276           526              Comparable
Sort (multiple)           947           1,477            Comparable

Key Insights

  1. GroupBy operations: DataStore up to 19.93x faster
  2. Complex pipelines: DataStore 5-6x faster (SQL pushdown benefit)
  3. Simple slice operations: Performance comparable - difference negligible
  4. Best use case: Multi-step operations with groupby/aggregation
  5. Zero-copy: to_df() has no data conversion overhead
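
Exact timings depend on your hardware and data, but a comparison like the one above is straightforward to reproduce with a small timing harness. The sketch below is a minimal example rather than the official benchmark: it assumes the DataStore class can be imported from chdb.datastore (adjust to your installation), uses a placeholder file name events.csv with region and amount columns, and relies only on the filter/groupby/agg/to_df calls shown elsewhere in this guide.

import time
import pandas
from chdb.datastore import DataStore   # import path assumed; adjust to your installation

pdf = pandas.read_csv("events.csv")    # plain pandas DataFrame; "events.csv" is a placeholder
ds = DataStore(pdf)                    # zero-copy wrap (see the Zero-Copy section below)

def timed(label, fn, repeat=3):
    # Run fn a few times and report the best wall-clock time in milliseconds
    best = float('inf')
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    print(f"{label}: {best * 1000:.0f} ms")

timed("pandas    filter+groupby", lambda: pdf[pdf['amount'] > 100].groupby('region')['amount'].sum())
timed("DataStore filter+groupby", lambda: ds.filter(ds['amount'] > 100).groupby('region').agg({'amount': 'sum'}).to_df())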

When DataStore Wins

Heavy Aggregations

# DataStore excels: 19.93x faster
result = ds.groupby('category')['amount'].sum()

Complex Pipelines

# DataStore excels: 5-6x faster
result = (ds
    .filter(ds['date'] >= '2024-01-01')
    .filter(ds['amount'] > 100)
    .groupby('region')
    .agg({'amount': ['sum', 'mean', 'count']})
    .sort('sum', ascending=False)
    .head(20)
)

Large File Processing

# DataStore: Only loads what you need
ds = pd.read_parquet("huge_file.parquet")
result = ds.filter(ds['id'] == 12345).to_df()  # Fast!

Multiple Column Operations

# DataStore: Combines into single SQL
ds['total'] = ds['price'] * ds['quantity']
ds['is_large'] = ds['total'] > 1000
ds = ds.filter(ds['is_large'])

When pandas Is Comparable

In most scenarios, DataStore matches or exceeds pandas performance. However, pandas may be slightly faster in these specific cases:

Small Datasets (<1,000 rows)

# For very small datasets, overhead is minimal for both
# Performance difference is negligible
small_df = pd.DataFrame({'x': range(100)})

Simple Slice Operations

# Single slice operations without aggregation
df = df[df['x'] > 10]  # pandas slightly faster
ds = ds[ds['x'] > 10]  # DataStore comparable

Custom Python Lambda Functions

# pandas is required when the logic is arbitrary Python code
def complex_function(row):
    return custom_logic(row)   # placeholder for logic that can't be expressed in SQL

df['result'] = df.apply(complex_function, axis=1)
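
When custom row-wise logic like this is unavoidable, one workable pattern (the "convert late" option in the Quick Decision Guide below) is to let DataStore do the SQL-friendly filtering and column pruning first and only then hand the reduced result to apply(). A small sketch reusing the calls shown above; the column names are illustrative:

# Convert late: reduce the data in DataStore, then run the Python-only logic
small_df = (ds
    .filter(ds['amount'] > 100)
    .select('name', 'amount')
    .to_df()                   # only the reduced result is materialized
)
small_df['result'] = small_df.apply(complex_function, axis=1)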

Important

Even in scenarios where DataStore is "slower", performance is typically on par with pandas; the difference is negligible for practical use. DataStore's advantages in complex operations far outweigh these edge cases.

For fine-grained control over execution, see Execution Engine Configuration.
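
As a quick taste of what that page covers, the same config object used in the Profiling section below exposes the engine setting referenced in the Tip at the end of this guide; only the 'auto' value shown there is assumed here:

from chdb.datastore.config import config

# 'auto' (the default) lets DataStore choose the execution engine per query
config.set_execution_engine('auto')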


Zero-Copy DataFrame Integration

DataStore uses zero-copy conversion when reading from and writing to pandas DataFrames. This means:

# to_df() does NOT copy data - it's a zero-copy operation
result = ds.filter(ds['x'] > 10).to_df()  # No data conversion overhead

# Same for creating DataStore from DataFrame
ds = DataStore(existing_df)  # No data copy

Key implications:

  • to_df() is essentially free - no serialization or memory copying
  • Creating DataStore from pandas DataFrame is instant
  • Memory is shared between DataStore and pandas views
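
A rough way to convince yourself of this is to time the wrap/unwrap round trip, which should stay essentially flat even for large frames. A minimal sketch, assuming the DataStore class is importable from chdb.datastore and that to_df() can be called on a freshly wrapped DataStore:

import time
import numpy as np
import pandas as pd
from chdb.datastore import DataStore   # import path assumed; adjust to your installation

df = pd.DataFrame({'x': np.arange(10_000_000)})

start = time.perf_counter()
ds = DataStore(df)            # wrap the existing DataFrame
round_trip = ds.to_df()       # materialize it back as pandas
elapsed_ms = (time.perf_counter() - start) * 1000

# With zero-copy conversion this should stay small regardless of row count
print(f"wrap + to_df() took {elapsed_ms:.1f} ms")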

Optimization Tips

1. Use Parquet Instead of CSV

# CSV: Slower, reads entire file
ds = pd.read_csv("data.csv")

# Parquet: Faster, columnar, compressed
ds = pd.read_parquet("data.parquet")

# Convert once, benefit forever
df = pd.read_csv("data.csv")
df.to_parquet("data.parquet")

Expected improvement: 3-10x faster reads

2. Filter Early

# Good: Filter first, then aggregate
result = (ds
    .filter(ds['date'] >= '2024-01-01')  # Reduce data early
    .groupby('category')['amount'].sum()
)

# Less optimal: Aggregate over all rows first, then filter the results
result = (ds
    .groupby('category')['amount'].sum()
    .filter(ds['sum'] > 1000)  # Filter too late
)

3. Select Only Needed Columns

# Good: Column pruning
result = ds.select('name', 'amount').filter(ds['amount'] > 100)

# Less optimal: All columns loaded
result = ds.filter(ds['amount'] > 100)  # Loads all columns

4. Leverage SQL Aggregations

# GroupBy is where DataStore shines
# Up to 20x speedup!
result = ds.groupby('category').agg({
    'amount': ['sum', 'mean', 'count', 'max'],
    'quantity': 'sum'
})

5. Use head() Instead of Full Queries

# Don't load entire result if you only need a sample
result = ds.filter(ds['type'] == 'A').head(100)  # LIMIT 100

# Avoid this for large results
# result = ds.filter(ds['type'] == 'A').to_df()  # Loads everything

6. Batch Operations

# Good: Single execution
result = ds.filter(ds['x'] > 10).filter(ds['y'] < 100).to_df()

# Bad: Multiple executions
result1 = ds.filter(ds['x'] > 10).to_df()  # Execute
result2 = result1[result1['y'] < 100]       # Execute again

7. Use explain() to Optimize

# View the query plan before executing
query = ds.filter(...).groupby(...).agg(...)
query.explain()  # Check if operations are pushed down

# Then execute
result = query.to_df()

Profiling Your Workload

Enable Profiling

from chdb.datastore.config import config, get_profiler

config.enable_profiling()

# Run your workload
result = your_pipeline()

# View report
profiler = get_profiler()
profiler.report()

Identify Bottlenecks

Performance Report
==================
Step                    Duration    % Total
----                    --------    -------
SQL execution           2.5s        62.5%     <- Bottleneck!
read_csv                1.2s        30.0%
Other                   0.3s        7.5%
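
The same information is available programmatically through the profiler's get_steps() method (used again in the next subsection). Assuming each step record is a dict containing at least a 'duration_ms' entry, as in the comparison below, you can pull out the slowest step in a script:

from chdb.datastore.config import get_profiler

# Find the step with the largest recorded duration
# (assumes at least one step has been recorded)
steps = get_profiler().get_steps()
slowest = max(steps, key=lambda step: step['duration_ms'])
print(f"Slowest step: {slowest['duration_ms']:.0f} ms")
print(slowest)   # inspect the full record for details about that step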

Compare Approaches

# Test approach 1
profiler.reset()
result1 = approach1()
time1 = profiler.get_steps()[-1]['duration_ms']

# Test approach 2
profiler.reset()
result2 = approach2()
time2 = profiler.get_steps()[-1]['duration_ms']

print(f"Approach 1: {time1:.0f}ms")
print(f"Approach 2: {time2:.0f}ms")

Best Practices Summary

Practice                     Impact
--------                     ------
Use Parquet files            3-10x faster reads
Filter early                 Reduce data processing
Select needed columns        Reduce I/O and memory
Use GroupBy/aggregations     Up to 20x faster
Batch operations             Avoid repeated execution
Profile before optimizing    Find real bottlenecks
Use explain()                Verify query optimization
Use head() for samples       Avoid full table scans

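Putting several of these practices together, a typical optimized pipeline might look like the sketch below; the file name and column names are illustrative, and every call used here appears earlier in this guide:

# Parquet source, column pruning, early filter, SQL aggregation, LIMIT
ds = pd.read_parquet("data.parquet")
query = (ds
    .select('region', 'amount')
    .filter(ds['amount'] > 100)
    .groupby('region')
    .agg({'amount': 'sum'})
    .sort('sum', ascending=False)
    .head(10)
)
query.explain()            # verify that the operations are pushed down
result = query.to_df()     # single execution of the whole pipeline
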
Quick Decision Guide

Your Workload                     Recommendation
-------------                     --------------
GroupBy/aggregation               Use DataStore
Complex multi-step pipeline       Use DataStore
Large files with filters          Use DataStore
Simple slice operations           Either (comparable performance)
Custom Python lambda functions    Use pandas or convert late
Very small data (<1,000 rows)     Either (negligible difference)

Tip

For automatic optimal engine selection, use config.set_execution_engine('auto') (default). See Execution Engine Configuration for details.