CQL (CSV Query Language) is a SQL-like query language designed specifically for querying and manipulating CSV files in PHP. It provides a familiar SQL syntax for filtering, joining, and transforming CSV data without requiring a database.
Important
This is v0.1.0 - a read-only query library. Write operations (INSERT, UPDATE, DELETE) are planned for future releases. See CHANGELOG.md for details.
- SQL-like Syntax - Write queries using familiar SELECT, FROM, WHERE, JOIN, and GROUP BY statements
- Aggregate Functions - COUNT, SUM, AVG, MIN, MAX for data analysis
- Date Functions - YEAR, MONTH, DAY, DATE for date-based analysis
- Multiple Data Sources - Define and query multiple CSV files in a single query
- JOIN Support - Perform INNER, LEFT, and RIGHT joins between CSV files
- Streaming Mode - Process large files (>50MB) with minimal memory usage
- Expression Engine - Support for mathematical and logical expressions in WHERE clauses
- Column Aliasing - Rename columns in your result set
- Wildcard Selection - Select all columns or prefix-specific columns with
*syntax - Header Detection - Automatically handle CSVs with or without headers
- Simple Facade API - Clean, intuitive interface with convenience methods
- Type-Safe - Built with PHP 8.4+ features including readonly classes and enums
- Extensible - Modular operator system for easy extension
- Well Tested - 50 tests with 186 assertions covering core functionality
- PHP 8.4 or higher
- Composer
composer require absorbing/cql-core<?php
require 'vendor/autoload.php';
use CQL\CQL;
// Create a CQL instance
$cql = new CQL();
// Execute a query
$results = $cql->execute("
DEFINE 'users.csv' AS users WITH HEADERS
SELECT name, age
FROM users
WHERE age > 25
");
// Iterate over results
foreach ($results as $row) {
echo "{$row['name']} is {$row['age']} years old\n";
}
// Or use convenience methods
$array = $cql->query($query); // Returns array
$first = $cql->first($query); // Returns first row
$count = $cql->count($query); // Returns countThe simplest valid CQL query requires:
DEFINE 'file.csv' AS alias WITH HEADERS
SELECT *
FROM aliasA query with all optional features:
DEFINE 'users.csv' AS users WITH HEADERS
DEFINE 'orders.csv' AS orders WITH HEADERS
SELECT users.name, orders.product, orders.amount
FROM users
LEFT JOIN orders ON users.id = orders.user_id
WHERE orders.amount > 100CQL queries follow a SQL-like syntax with specific requirements:
| Component | Required | Description |
|---|---|---|
DEFINE |
Yes | At least one data source definition |
SELECT |
Yes | Column selection |
FROM |
Yes | Primary data source |
WHERE |
No | Filter conditions |
JOIN |
No | Join additional data sources |
GROUP BY |
No | Group rows for aggregation |
AS |
No | Aliases for tables and columns |
Query Structure:
DEFINE 'file.csv' AS alias [WITH|WITHOUT HEADERS] [COLUMNS (col1, col2, ...)]
[DEFINE 'file2.csv' AS alias2 ...]
SELECT column1, column2, ... | *
FROM alias
[JOIN alias2 ON alias.key = alias2.key]
[WHERE condition]The CQL class provides a simple interface for executing queries:
use CQL\CQL;
// Basic usage
$cql = new CQL();
$results = $cql->execute($query);
// Convenience methods
$array = $cql->query($query); // Returns array
$first = $cql->first($query); // Returns first row
$count = $cql->count($query); // Returns row count
// Static factory methods
$cql = CQL::auto(); // Automatic mode (default)
$cql = CQL::streaming(); // Always stream
$cql = CQL::normal(); // Never stream
$cql = CQL::auto(10 * 1024 * 1024); // Custom 10MB threshold
// Configuration
$cql = new CQL([
'streaming' => true,
'autoStreamingThreshold' => 20 * 1024 * 1024
]);
// Fluent configuration
$cql = (new CQL())
->setStreaming(true)
->setAutoStreamingThreshold(30 * 1024 * 1024);The DEFINE statement declares a CSV data source and assigns it an alias.
Syntax:
DEFINE 'path/to/file.csv' AS alias [WITH|WITHOUT HEADERS] [COLUMNS (col1, col2, ...)]Parameters:
| Parameter | Required | Description | Default |
|---|---|---|---|
'path/to/file.csv' |
Yes | Path to CSV file (single or double quotes) | - |
AS alias |
No | Table alias | Auto-generated from filename |
WITH HEADERS |
No | CSV has header row | - |
WITHOUT HEADERS |
No | CSV has no headers (generates column_1, column_2, etc.) | Default if neither specified |
COLUMNS (...) |
No | Explicitly define column names | Auto-generated or from headers |
Examples:
-- Minimal: CSV with headers
DEFINE 'users.csv' AS users WITH HEADERS
-- CSV without headers (auto-generates column names)
DEFINE 'data.csv' AS data WITHOUT HEADERS
-- Explicitly define column names
DEFINE 'data.csv' AS data WITHOUT HEADERS COLUMNS (id, name, email)
-- Auto-generate alias from filename (becomes 'users')
DEFINE 'users.csv' WITH HEADERS
-- Multiple data sources
DEFINE 'users.csv' AS users WITH HEADERS
DEFINE 'orders.csv' AS orders WITH HEADERSRules:
- At least one DEFINE is required per query
- Aliases must be unique within a query
- Aliases are case-sensitive
- File paths are relative to execution directory
Select specific columns or use wildcards to select all columns.
Syntax:
SELECT column1, column2, ... | * | alias.*
SELECT column AS aliasOptions:
| Option | Description | Example |
|---|---|---|
column |
Select specific column by name | SELECT name |
alias.column |
Select column with table qualifier | SELECT users.name |
* |
Select all columns from all tables | SELECT * |
alias.* |
Select all columns from specific table | SELECT users.* |
column AS alias |
Rename column in results | SELECT name AS full_name |
Examples:
-- Select specific columns
SELECT name, age FROM users
-- Select all columns
SELECT * FROM users
-- Select all columns from specific table (useful in JOINs)
SELECT users.*, orders.total FROM users JOIN orders ON users.id = orders.user_id
-- Column aliasing
SELECT name AS full_name, age AS years FROM users
-- Qualified column names
SELECT users.name, users.email FROM users
-- Mix of qualified and unqualified
SELECT users.name, age FROM usersRules:
- At least one column or wildcard is required
- Column names are case-sensitive
- In JOINs, ambiguous columns should be qualified with table alias
- Wildcards can be mixed with specific columns
Specify the primary data source for the query.
Syntax:
FROM aliasParameters:
| Parameter | Required | Description |
|---|---|---|
alias |
Yes | Must match a DEFINE alias |
Examples:
FROM users
FROM data
FROM my_tableRules:
- Exactly one FROM clause is required
- Alias must be defined in a DEFINE statement
- Alias is case-sensitive
Filter rows based on conditions using comparison and logical operators.
Syntax:
WHERE condition
WHERE column operator value
WHERE expression operator expressionSupported Operators:
| Category | Operator | Description |
|---|---|---|
| Comparison | = |
Equal |
!=, <> |
Not equal | |
> |
Greater than | |
< |
Less than | |
>= |
Greater than or equal | |
<= |
Less than or equal | |
| Logical | AND |
Logical AND |
OR |
Logical OR | |
NOT |
Logical NOT | |
| Mathematical | + |
Addition |
- |
Subtraction | |
* |
Multiplication | |
/ |
Division | |
% |
Modulus | |
^ |
Power (exponentiation) |
Examples:
-- Simple comparison
WHERE age > 18
-- Logical operators
WHERE age > 18 AND status = 'active'
WHERE age < 18 OR age > 65
-- Mathematical expressions
WHERE price * quantity > 1000
WHERE (salary + bonus) > 100000
-- String comparison
WHERE name = 'John'
WHERE status != 'inactive'
-- Parentheses for precedence
WHERE (age > 18 AND age < 65) OR status = 'premium'Rules:
- WHERE clause is optional
- Only one condition is currently supported (no complex AND/OR chains)
- Column names must exist in the data source
- String values must be in single quotes
- Numeric values don't need quotes
Limitations:
- Complex nested conditions not yet supported
- IN operator not yet implemented
- LIKE operator not yet implemented
- BETWEEN operator not yet implemented
Join multiple CSV files together.
Syntax:
[INNER|LEFT|RIGHT] JOIN right_alias ON left_alias.key = right_alias.keyJoin Types:
| Join Type | Description |
|---|---|
INNER JOIN (default) |
Returns rows with matches in both tables |
LEFT JOIN |
Returns all rows from left table, matched rows from right |
RIGHT JOIN |
Returns all rows from right table, matched rows from left |
Parameters:
| Parameter | Required | Description |
|---|---|---|
right_alias |
Yes | Alias of table to join (must be defined) |
ON |
Yes | Join condition keyword |
left_alias.key |
Yes | Column from left table |
right_alias.key |
Yes | Column from right table |
Examples:
-- Inner join (default)
DEFINE 'users.csv' AS users WITH HEADERS
DEFINE 'orders.csv' AS orders WITH HEADERS
SELECT users.name, orders.product
FROM users
JOIN orders ON users.id = orders.user_id
-- Left join
SELECT users.name, orders.product
FROM users
LEFT JOIN orders ON users.id = orders.user_id
-- Multiple joins
DEFINE 'users.csv' AS users WITH HEADERS
DEFINE 'orders.csv' AS orders WITH HEADERS
DEFINE 'products.csv' AS products WITH HEADERS
SELECT users.name, orders.order_id, products.name
FROM users
JOIN orders ON users.id = orders.user_id
JOIN products ON orders.product_id = products.idRules:
- JOIN is optional
- Multiple JOINs are supported
- Both tables must be defined with DEFINE
- Join keys must exist in both tables
- Join condition must use
=operator - Column names in results are namespaced (e.g.,
users.name,orders.product)
Limitations:
- Only equality joins supported (no
>,<, etc.) - No self-joins
- FULL OUTER JOIN not yet implemented
Perform calculations across multiple rows.
Syntax:
SELECT function(column) [AS alias]
SELECT function(*)Supported Functions:
| Function | Description | Example |
|---|---|---|
COUNT(*) |
Count all rows | COUNT(*) AS total |
COUNT(column) |
Count non-null values | COUNT(name) AS users |
SUM(column) |
Sum of numeric values | SUM(amount) AS total_sales |
AVG(column) |
Average of numeric values | AVG(price) AS avg_price |
MIN(column) |
Minimum value | MIN(age) AS youngest |
MAX(column) |
Maximum value | MAX(salary) AS highest_paid |
Examples:
-- Overall statistics (no GROUP BY)
DEFINE 'sales.csv' AS sales WITH HEADERS
SELECT COUNT(*) AS total_orders,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value
FROM sales
-- Group by category
SELECT category,
COUNT(*) AS order_count,
SUM(quantity) AS total_quantity
FROM sales
GROUP BY category
-- Multiple aggregates with GROUP BY
SELECT region,
COUNT(*) AS orders,
AVG(price) AS avg_price,
MAX(quantity) AS max_quantity
FROM sales
GROUP BY region
-- Filter before aggregating
SELECT category, SUM(amount) AS revenue
FROM sales
WHERE amount > 100
GROUP BY categoryRules:
- Aggregate functions can be used with or without GROUP BY
- Without GROUP BY, aggregates apply to entire dataset
- With GROUP BY, aggregates apply to each group
- Can mix aggregates with GROUP BY columns in SELECT
- Aliases are recommended for aggregate results
Limitations:
- HAVING clause not yet implemented
- Cannot filter on aggregate results (use PHP post-processing)
- Cannot use aggregates in WHERE clause
Extract date components and perform date-based analysis.
Syntax:
SELECT function(date_column) [AS alias]
WHERE function(date_column) operator value
GROUP BY function(date_column)Supported Functions:
| Function | Description | Returns | Example |
|---|---|---|---|
YEAR(date) |
Extract year | Integer (2024) | YEAR(order_date) |
MONTH(date) |
Extract month | Integer (1-12) | MONTH(order_date) |
DAY(date) |
Extract day | Integer (1-31) | DAY(order_date) |
DATE(date) |
Format as date | String (YYYY-MM-DD) | DATE(created_at) |
Examples:
-- Extract date components
DEFINE 'orders.csv' AS orders WITH HEADERS
SELECT customer,
order_date,
YEAR(order_date) AS year,
MONTH(order_date) AS month,
DAY(order_date) AS day
FROM orders
-- Group by month
SELECT MONTH(order_date) AS month,
COUNT(*) AS order_count,
SUM(amount) AS revenue
FROM orders
GROUP BY MONTH(order_date)
-- Filter by date component
SELECT customer, amount
FROM orders
WHERE MONTH(order_date) = 2
-- Quarterly analysis
SELECT COUNT(*) AS q1_orders,
SUM(amount) AS q1_revenue
FROM orders
WHERE MONTH(order_date) <= 3Rules:
- Date functions work with standard date formats (YYYY-MM-DD, etc.)
- Can be used in SELECT, WHERE, and GROUP BY clauses
- Returns NULL for invalid dates
- Date parsing uses PHP's DateTime class
Supported Date Formats:
YYYY-MM-DD(2024-01-15)YYYY/MM/DD(2024/01/15)DD-MM-YYYY(15-01-2024)- Any format supported by PHP's
DateTimeconstructor
The following SQL features are not currently implemented:
Aggregation:
GROUP BY- Grouping rowsHAVING- Filter grouped resultsCOUNT(),SUM(),AVG(),MIN(),MAX()- Aggregate functionsDISTINCT- Remove duplicates
Sorting & Limiting:
ORDER BY- Sort resultsLIMIT- Limit number of resultsOFFSET- Skip results
Advanced Filtering:
IN (value1, value2, ...)- Match multiple valuesLIKE- Pattern matchingBETWEEN- Range queries- Complex nested WHERE conditions
Data Modification:
INSERT- Add rowsUPDATE- Modify rowsDELETE- Remove rows
Advanced Joins:
FULL OUTER JOIN- Full outer join- Self-joins
- Non-equality joins (>, <, etc.)
Other:
- Subqueries
UNION- Combine queries- Window functions
- CTEs (Common Table Expressions)
Many of these features can be implemented in PHP after querying:
$cql = new CQL();
$results = $cql->query($query);
// Sorting
usort($results, fn($a, $b) => $a['age'] <=> $b['age']);
// Limiting
$limited = array_slice($results, 0, 10);
// Distinct
$unique = array_unique($results, SORT_REGULAR);
// Aggregation
$total = array_sum(array_column($results, 'amount'));
$average = $total / count($results);
// Grouping
$grouped = [];
foreach ($results as $row) {
$grouped[$row['category']][] = $row;
}<?php
use CQL\CQL;
// users.csv:
// id,name,age,status
// 1,Alice,30,active
// 2,Bob,25,inactive
// 3,Charlie,35,active
$cql = new CQL();
$results = $cql->execute("
DEFINE 'users.csv' AS users WITH HEADERS
SELECT name, age
FROM users
WHERE age >= 30 AND status = 'active'
");
// Output:
// {"name":"Alice","age":"30"}
// {"name":"Charlie","age":"35"}<?php
use CQL\CQL;
// users.csv:
// id,name
// 1,Alice
// 2,Bob
// orders.csv:
// order_id,user_id,product,amount
// 101,1,Laptop,1200
// 102,1,Mouse,25
// 103,2,Keyboard,75
$cql = new CQL();
$results = $cql->execute("
DEFINE 'users.csv' AS users WITH HEADERS
DEFINE 'orders.csv' AS orders WITH HEADERS
SELECT users.name, orders.product, orders.amount
FROM users
JOIN orders ON users.id = orders.user_id
WHERE orders.amount > 50
");
// Output:
// {"users.name":"Alice","orders.product":"Laptop","orders.amount":"1200"}
// {"users.name":"Bob","orders.product":"Keyboard","orders.amount":"75"}<?php
use CQL\CQL;
$cql = new CQL();
$results = $cql->execute("
DEFINE 'employees.csv' AS emp WITH HEADERS
SELECT emp.first_name AS name, emp.salary AS pay
FROM emp
WHERE emp.department = 'Engineering'
");
// Output uses aliased column names:
// {"name":"John","pay":"95000"}
// {"name":"Jane","pay":"105000"}<?php
use CQL\CQL;
// products.csv:
// name,price,quantity
// Widget,10.50,100
// Gadget,25.00,50
// Doohickey,5.75,200
$cql = new CQL();
$results = $cql->execute("
DEFINE 'products.csv' AS products WITH HEADERS
SELECT name, price, quantity
FROM products
WHERE price * quantity > 1000
");
// Output:
// {"name":"Gadget","price":"25.00","quantity":"50"}
// {"name":"Doohickey","price":"5.75","quantity":"200"}<?php
use CQL\CQL;
// For large CSV files (>50MB), use streaming mode to reduce memory usage
// Explicit streaming
$cql = CQL::streaming();
$results = $cql->execute("
DEFINE 'large_dataset.csv' AS data WITH HEADERS
SELECT id, name, value
FROM data
WHERE value > 1000
");
// Automatic mode (recommended) - decides based on file size
$cql = CQL::auto();
$results = $cql->execute($query);
// Custom threshold
$cql = CQL::auto(10 * 1024 * 1024); // 10MB threshold
$results = $cql->execute($query);CQL Core is built with a clean, modular architecture consisting of three main components:
Breaks down query strings into tokens (keywords, identifiers, operators, etc.)
Converts tokens into an Abstract Syntax Tree (AST) representing the query structure
Executes the parsed query against CSV data sources and returns results
- CSVDataSource - Handles CSV file loading with streaming support
- Collection - Functional data manipulation with filter, map, and iteration
All of these components are accessed through the simple CQL facade class, which handles the entire pipeline automatically.
CQL Core provides specific exceptions for different error scenarios:
| Exception | Purpose |
|---|---|
DataSourceException |
CSV file errors (not found, permissions, format issues) |
InterpreterException |
Execution errors (undefined aliases, ambiguous columns) |
LexerException |
Tokenization errors |
ParserException |
Parsing errors (unexpected tokens, malformed syntax) |
SyntaxException |
Query syntax errors (invalid keywords, structure) |
use CQL\CQL;
use CQL\Exceptions\{
DataSourceException,
InterpreterException,
SyntaxException
};
try {
$cql = new CQL();
$results = $cql->execute($query);
} catch (DataSourceException $e) {
// Handle file not found, permission errors, etc.
echo "Data source error: " . $e->getMessage();
} catch (SyntaxException $e) {
// Handle invalid query syntax
echo "Syntax error: " . $e->getMessage();
} catch (InterpreterException $e) {
// Handle execution errors
echo "Execution error: " . $e->getMessage();
}Extend the operator system by implementing operator interfaces:
use CQL\Engine\Operators\BaseComparisonOperator;
use CQL\Engine\Operators\Contracts\ComparisonOperatorInterface;
class LikeOperator extends BaseComparisonOperator implements ComparisonOperatorInterface
{
public static function symbols(): array
{
return ['LIKE'];
}
public static function evaluate(mixed $left, mixed $right): bool
{
$pattern = str_replace('%', '.*', preg_quote($right, '/'));
return (bool) preg_match("/^{$pattern}$/i", $left);
}
}
// Register the operator
use CQL\Engine\Operators\Registry\OperatorRegistry;
OperatorRegistry::register(LikeOperator::class);Results are returned as a Collection object with helpful methods:
$cql = new CQL();
$results = $cql->execute($query);
// Get result count
$count = $results->count();
// Convert to array
$array = $results->toArray();
// Get first result
$first = $results->first();
// Chain operations
$filtered = $results
->filter(fn($row) => $row['age'] > 30)
->map(fn($row) => ['name' => strtoupper($row['name'])]);
// Iterate
foreach ($results as $row) {
// Process each row
}
// Or use convenience methods directly
$array = $cql->query($query); // Returns array
$first = $cql->first($query); // Returns first row
$count = $cql->count($query); // Returns countCQL offers three modes for handling CSV files:
1. Normal Mode - Explicitly loads entire CSV into memory
- Best for: Small to medium files (<50MB)
- Usage:
CQL::normal() - Pros: Faster query execution, supports complex operations
- Cons: High memory usage for large files
2. Streaming Mode - Explicitly processes rows one at a time
- Best for: Large files (>50MB)
- Usage:
CQL::streaming() - Pros: Minimal memory footprint, handles files of any size
- Cons: Slightly slower for small files, still loads results into memory
3. Automatic Mode - Automatically selects based on file size
- Best for: Unknown file sizes, dynamic workloads (recommended)
- Usage:
new CQL()orCQL::auto() - Usage:
new Interpreter($ast, streaming: null)ornew Interpreter($ast) - Default threshold: 50 MB (configurable)
- Pros: No manual decision needed, optimal for most cases
- Cons: Requires file size check
Mode Selection Examples:
use CQL\CQL;
// Explicit normal mode (always load into memory)
$cql = CQL::normal();
// Explicit streaming mode (always stream)
$cql = CQL::streaming();
// Automatic mode with default 50MB threshold (recommended)
$cql = new CQL();
// or
$cql = CQL::auto();
// Automatic mode with custom 10MB threshold
$cql = CQL::auto(10 * 1024 * 1024);
// Configuration via constructor
$cql = new CQL([
'streaming' => true,
'autoStreamingThreshold' => 20 * 1024 * 1024
]);Performance Comparison:
| File Size | Mode | Memory Usage | Speed |
|---|---|---|---|
| 1000 rows (~22KB) | Normal | ~500 KB | Fast |
| 1000 rows (~22KB) | Streaming | ~0 KB | Very Fast |
| 100,000 rows (~2MB) | Normal | ~50 MB | Fast |
| 100,000 rows (~2MB) | Streaming | ~5 MB | Fast |
| 1,000,000 rows (~20MB) | Normal | ~500 MB | Slow |
| 1,000,000 rows (~20MB) | Streaming | ~50 MB | Medium |
- Joins: Join operations create in-memory indexes. Multiple joins on large datasets may consume significant memory even in streaming mode.
- Filtering: WHERE clauses are applied during data loading in streaming mode, reducing memory usage.
- File Format: Ensure CSV files are well-formed to avoid parsing errors.
- Caching: Consider caching parsed ASTs for frequently-executed queries.
- No Aggregations: Currently no support for GROUP BY, COUNT, SUM, AVG, etc.
- No Sorting: ORDER BY is not yet implemented
- No Limits: LIMIT and OFFSET are not yet implemented
- Single WHERE Condition: Complex WHERE clauses with multiple conditions are limited
- No Subqueries: Nested queries are not supported
- No INSERT/UPDATE/DELETE: Read-only operations only
Future features under consideration:
- Aggregation functions (COUNT, SUM, AVG, MIN, MAX)
- GROUP BY and HAVING clauses
- ORDER BY with ASC/DESC
- LIMIT and OFFSET
- Complex WHERE conditions with nested logic
- Subqueries
- UNION operations
- DISTINCT keyword
- Additional operators (LIKE, BETWEEN, IN with arrays)
- Write operations (INSERT, UPDATE, DELETE)
- JSON and other data format support
Run the test suite:
composer testRun static analysis:
vendor/bin/phpstan analyzeRun code style checks:
vendor/bin/phpcs --standard=.phpcs.xml srcContributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Write tests for new functionality
- Ensure all tests pass
- Submit a pull request
MIT License - see LICENSE file for details
Developed by the Absorbing team.
- Issues: GitHub Issues
- Discussions: GitHub Discussions