PowerSQL
PowerSQL is an open source data transformation tool built for data analysts, engineers and scientists.
It utilizes the full power of SQL, so you can focus on improving the data and transforming the data to your needs.
PowerSQL helps you improving the quality of your data and code in the following ways:
- Automatic syntax and type checks based on your code. This helps mistakes from reaching
- Expressive data tests using
ASSERT
-tests. - Pure SQL, makes it easy to integrate with existing tooling like formatters and PowerSQL currently supports BigQuery and PostgreSQL as backends.
PowerSQL is fully open source, you can find it on GitHub
Installation
PowerSQL is primarily used as a command line tool. The current distribution is via a rust crate. For this you need first to install the build tool Cargo:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
Then install PowerSQL like this for BigQuery:
cargo install powersql --features bigquery
Or install PowerSQL like this for PostgreSQL:
cargo install powersql --features postgres
PostgreSQL
To get started with PostgreSQL, simply create a new project in a file called powersql.toml
:
[project]
name = "my_project"
models = ["models"]
tests = ["tests]
Now create one or more models in the models
directory:
CREATE VIEW my_model AS SELECT id, category from my_source;
CREATE TABLE category_stats AS SELECT COUNT(*) category_count FROM my_model GROUP BY category;
PowerSQL automatically will create a DAG based on the relations in your database.
To run against the database, provide the following environment variables:
- PG_HOSTNAME
- PG_USERNAME
- PG_PORT
- PG_DATABASE
- PG_PASSWORD
BigQuery
To run against the database, provide the following environment variables:
- GOOGLE_APPLICATION_CREDENTIALS
- PROJECT_ID
- DATASET_ID
- LOCATION
GOOGLE_APPLICATION_CREDENTIALS
should refer to an service account key file (this can be set by an appliation rather than locally).
PROJECT_ID
is the id (not number) of the project and DATASET_ID
is the name of the dataset that is used by default.
LOCATION
is an (optional) datacenter location id where the query is being executed.
Commands
powersql check
: This will load all your.sql
files in the directories listed inmodels
. It will check the syntax of the SQL statements. After this, it will check the DAG and report if there is a circular dependency. Finally, it will run a type checker and report any type errors.powersql run
: Loads and runs the entire DAG of SQL statements.powersql test
: Loads and runs the data tests. By runningpowersql test --fail-fast
powersql will stop at the first failure.
Data tests
Data tests are ASSERT
statements that you can run on your database tables and views and perform checks on data quality, recency, etc. Assert statements checks the result of a condition - a boolean expression.
Assert-based testing are enabled for every backend, they are translated by PowerSQL to queries return a
boolean.
Some examples:
-- Column should be NOT NULL
ASSERT NOT EXISTS(
SELECT X
FROM t
WHERE column IS NULL
) AS 'column should be non null';
ASSERT NOT EXISTS (
SELECT quantity
FROM rev_per_product
WHERE quantity <= 0
) AS 'quantity should be positive';
ASSERT NOT EXISTS (
SELECT product_id
FROM rev_per_product
WHERE product_id IS NULL
) AS 'product_id should be not null';
ASSERT (
SELECT COUNT (*)
FROM rev_per_product
WHERE quantity < 10
) >= 0.7 * (
SELECT COUNT(*)
FROM rev_per_product
) AS 'At least 70% should have a quantity lower than 10'