SQL Query Interface
The SQL Query Interface provides a DuckDB-like API for executing OLAP analytics queries on Arrow data using GPU/SIMD acceleration.
Overview
Status: Phase 1 Complete (v0.3.0) Coverage: 92.64% test coverage Performance: 2.78x faster aggregations (SIMD), 5-28x faster Top-K
Architecture
┌─────────────┐ ┌──────────────┐ ┌─────────────┐
│ SQL Query │────▶│ QueryEngine │────▶│ QueryPlan │
│ (String) │ │ (Parser) │ │ (AST) │
└─────────────┘ └──────────────┘ └─────────────┘
│
▼
┌─────────────┐
│ Query │
│ Executor │
└─────────────┘
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Filter │ │ Aggregate│ │ Top-K │
│ (WHERE) │ │ (SUM/AVG)│ │ (ORDER) │
└──────────┘ └──────────┘ └──────────┘
│ │ │
└─────────────────────┴─────────────────────┘
│
▼
┌─────────────┐
│ RecordBatch │
│ (Results) │
└─────────────┘
Supported Features
Phase 1 (Current)
✅ SELECT - Column projection
✅ WHERE - Filtering with comparison operators (>, >=, <, <=, =, !=/<>)
✅ Aggregations - SUM, AVG, COUNT, MIN, MAX
✅ ORDER BY - Ascending/descending sort with Top-K optimization
✅ LIMIT - Result set limiting
Phase 2 (Future)
❌ GROUP BY - Aggregations with grouping (planned) ❌ JOIN - Inner/outer joins (planned) ❌ Subqueries - Nested queries (planned) ❌ Window Functions - OVER clause (planned)
Quick Start
use trueno_db::query::{QueryEngine, QueryExecutor};
use trueno_db::storage::StorageEngine;
// Load data
let storage = StorageEngine::load_parquet("data/events.parquet")?;
// Initialize query engine
let engine = QueryEngine::new();
let executor = QueryExecutor::new();
// Parse and execute query
let plan = engine.parse("SELECT SUM(value) FROM events WHERE value > 100")?;
let result = executor.execute(&plan, &storage)?;
Examples
Example 1: Simple Aggregation
let sql = "SELECT COUNT(*), SUM(revenue), AVG(revenue) FROM sales";
let plan = engine.parse(sql)?;
let result = executor.execute(&plan, &storage)?;
// Extract results
let count = result.column(0).as_any().downcast_ref::<Int64Array>()?.value(0);
let sum = result.column(1).as_any().downcast_ref::<Float64Array>()?.value(0);
let avg = result.column(2).as_any().downcast_ref::<Float64Array>()?.value(0);
Example 2: Filtering
let sql = "SELECT user_id, score FROM leaderboard WHERE score > 1000";
let plan = engine.parse(sql)?;
let result = executor.execute(&plan, &storage)?;
Example 3: Top-K Query
Uses O(N log K) heap-based selection instead of O(N log N) full sort:
let sql = "SELECT player_id, score FROM rankings ORDER BY score DESC LIMIT 10";
let plan = engine.parse(sql)?;
let result = executor.execute(&plan, &storage)?;
Performance: 5-28x faster than sorting all rows for K << N.
Example 4: Filter + Aggregation
let sql = "SELECT MIN(price), MAX(price), AVG(price)
FROM products
WHERE category = 'Electronics'";
let plan = engine.parse(sql)?;
let result = executor.execute(&plan, &storage)?;
Performance Characteristics
Aggregations
- SIMD Acceleration: 2.78x faster than scalar baseline
- Backend Selection: Automatic GPU dispatch for large datasets (>10MB, >5x transfer cost)
- Data Types: Int32, Int64, Float32, Float64
Top-K Selection
- Algorithm: Heap-based selection (O(N log K))
- Speedup: 5-28x vs full sort for K << N
- Example: Top-100 from 1M rows = 28x faster
Filtering
- Zero-Copy: Uses Arrow compute functions
- SIMD: Vectorized comparison operations
- Selectivity: Best for <20% result sets
Implementation Details
Query Parsing
Uses sqlparser-rs for SQL parsing:
pub struct QueryPlan {
pub columns: Vec<String>, // SELECT columns
pub table: String, // FROM table
pub filter: Option<String>, // WHERE clause
pub aggregations: Vec<Aggregation>, // Aggregate functions
pub group_by: Vec<String>, // GROUP BY (Phase 2)
pub order_by: Vec<(String, OrderDirection)>, // ORDER BY
pub limit: Option<usize>, // LIMIT
}
Query Execution
- Parse: SQL → QueryPlan
- Combine Batches: Merge storage batches
- Filter: Apply WHERE predicate
- Aggregate: Execute SUM/AVG/COUNT/MIN/MAX
- Project: Select columns
- Sort: Apply Top-K if ORDER BY + LIMIT
- Return: Arrow RecordBatch
Backend Selection
// Cost-based dispatch
if data_size > 10MB && compute_time > 5x_transfer_time {
use_gpu();
} else {
use_simd();
}
Error Handling
match executor.execute(&plan, &storage) {
Ok(result) => /* process results */,
Err(Error::InvalidInput(msg)) => /* column not found, etc. */,
Err(Error::ParseError(msg)) => /* SQL syntax error */,
Err(e) => /* other errors */,
}
Testing
- 156 tests total (18 SQL-specific)
- Property-based tests with proptest
- Backend equivalence: GPU == SIMD == Scalar
- 92.64% code coverage
Benchmarks
Run SQL query benchmarks:
cargo bench --bench sql_query_benchmarks
Key benchmark groups:
sql_sum_aggregation: SUM performancesql_top_k_order_by_limit: Top-K optimizationsql_filter_aggregate: Combined operationsscalar_baseline_sum: Baseline comparison
Limitations (Phase 1)
- No GROUP BY: Only simple aggregations (all rows)
- No JOINs: Single table queries only
- Simple WHERE: Single predicate only (
col op value) - No type coercion: Explicit type matching required
Future Enhancements (Phase 2)
- GROUP BY with hash aggregation
- Hash JOIN (inner/outer)
- Complex WHERE (AND/OR/NOT)
- Window functions (OVER clause)
- GPU aggregation for large datasets
See Also
References
- sqlparser-rs: SQL parser library
- Apache Arrow: Columnar format
- DuckDB: API inspiration
- MonetDB/X100: Vectorized execution model