Release 0.3.0 (#575)
Updates the docs, changelog and README for the 0.3.0 release.

---------

Co-authored-by: Jonathan Dance (JD) <[email protected]>
JelteF and wuputah authored Feb 13, 2025
1 parent eb40e2f commit 953ffc4
Showing 8 changed files with 227 additions and 44 deletions.
57 changes: 57 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,60 @@
# 0.3.0 (2025-02-13)

## Added

- Support using Postgres indexes and reading from partitioned tables. ([#477])
- The `AS (id bigint, name text)` syntax is no longer supported when using `read_parquet`, `iceberg_scan`, etc. The new syntax is as follows: ([#531])

```sql
SELECT * FROM read_parquet('file.parquet');
SELECT r['id'], r['name'] FROM read_parquet('file.parquet') r WHERE r['age'] > 21;
```

- Add a `duckdb.query` function which allows using DuckDB query syntax in Postgres. ([#531])
- Support the `approx_count_distinct` DuckDB aggregate. ([#499])
- Support the `bytea` (aka blob), `uhugeint`, `jsonb`, `timestamp_ns`, `timestamp_ms`, `timestamp_s` & `interval` types. ([#511], [#525], [#513], [#534], [#573])
- Support DuckDB [json functions and aggregates](https://duckdb.org/docs/data/json/json_functions.html). ([#546])
- Add support for the `duckdb.allow_community_extensions` setting.
- We have an official logo! 🎉 ([#575])

## Changed

- Update to DuckDB 1.2.0. ([#548])
- Allow executing `duckdb.raw_query`, `duckdb.cache_info`, `duckdb.cache_delete` and `duckdb.recycle_db` as non-superusers. ([#572])
- Only sync MotherDuck catalogs when there is DuckDB query activity. ([#582])

## Fixed

- Correctly parse parameter lists in `COPY` commands. This allows using `PARTITION_BY` as one of the `COPY` options. ([#465])
- Correctly read cache metadata for files larger than 4GB. ([#494])
- Fix bug in parameter handling for prepared statements and PL/pgSQL functions. ([#491])
- Fix comparisons and operators on the `timestamp with time zone` type by enabling DuckDB's `icu` extension by default. ([#512])
- Allow using `read_parquet` functions when not using superuser privileges. ([#550])
- Fix some case insensitivity issues when reading from Postgres tables. ([#563])
- Fix case where cancel requests (e.g. triggered by pressing Ctrl+C in `psql`) would be ignored. ([#548], [#584], [#587])

[#477]: https://github.com/duckdb/pg_duckdb/pull/477
[#531]: https://github.com/duckdb/pg_duckdb/pull/531
[#499]: https://github.com/duckdb/pg_duckdb/pull/499
[#511]: https://github.com/duckdb/pg_duckdb/pull/511
[#525]: https://github.com/duckdb/pg_duckdb/pull/525
[#513]: https://github.com/duckdb/pg_duckdb/pull/513
[#534]: https://github.com/duckdb/pg_duckdb/pull/534
[#573]: https://github.com/duckdb/pg_duckdb/pull/573
[#546]: https://github.com/duckdb/pg_duckdb/pull/546
[#575]: https://github.com/duckdb/pg_duckdb/pull/575
[#548]: https://github.com/duckdb/pg_duckdb/pull/548
[#572]: https://github.com/duckdb/pg_duckdb/pull/572
[#582]: https://github.com/duckdb/pg_duckdb/pull/582
[#465]: https://github.com/duckdb/pg_duckdb/pull/465
[#494]: https://github.com/duckdb/pg_duckdb/pull/494
[#491]: https://github.com/duckdb/pg_duckdb/pull/491
[#512]: https://github.com/duckdb/pg_duckdb/pull/512
[#550]: https://github.com/duckdb/pg_duckdb/pull/550
[#563]: https://github.com/duckdb/pg_duckdb/pull/563
[#584]: https://github.com/duckdb/pg_duckdb/pull/584
[#587]: https://github.com/duckdb/pg_duckdb/pull/587

# 0.2.0 (2024-12-10)

## Added
22 changes: 12 additions & 10 deletions README.md
@@ -1,8 +1,11 @@
<p align="center">
<img width="500" src="logo.png" alt="temporary logo" />
<picture>
<source media="(prefers-color-scheme: dark)" srcset="logo-dark.svg">
<img width="800" src="logo-light.svg" alt="pg_duckdb logo" />
</picture>
</p>

0.2.0 release is here 🎉 Please [try](#installation) it out!
0.3.0 release is here 🎉 Please [try](#installation) it out!

# pg_duckdb: Official Postgres extension for DuckDB

@@ -19,9 +22,9 @@ See our [official documentation][docs] for further details.
- If DuckDB cannot support the query for any reason, execution falls back to Postgres.
- Read and Write support for object storage (AWS S3, Azure, Cloudflare R2, or Google GCS):
- Read parquet, CSV and JSON files:
- `SELECT n FROM read_parquet('s3://bucket/file.parquet') AS (n int)`
- `SELECT n FROM read_csv('s3://bucket/file.csv') AS (n int)`
- `SELECT n FROM read_json('s3://bucket/file.json') AS (n int)`
- `SELECT * FROM read_parquet('s3://bucket/file.parquet')`
- `SELECT r['id'], r['name'] FROM read_csv('s3://bucket/file.csv') r`
- `SELECT count(*) FROM read_json('s3://bucket/file.json')`
- You can pass globs and arrays to these functions, just like in DuckDB
- Enable the DuckDB Iceberg extension using `SELECT duckdb.install_extension('iceberg')` and read Iceberg files with `iceberg_scan`.
- Enable the DuckDB Delta extension using `SELECT duckdb.install_extension('delta')` and read Delta files with `delta_scan`.
@@ -32,8 +35,8 @@ See our [official documentation][docs] for further details.

```sql
COPY (
SELECT count(*), name
FROM read_parquet('s3://bucket/file.parquet') AS (name text)
SELECT count(*), r['name']
FROM read_parquet('s3://bucket/file.parquet') r
GROUP BY r['name']
ORDER BY count DESC
) TO 's3://bucket/results.parquet';
@@ -149,9 +152,8 @@ Querying data stored in Parquet, CSV, JSON, Iceberg and Delta format can be done
3. Perform analytics on your data.

```sql
SELECT SUM(price) AS total, item_id
FROM read_parquet('s3://your-bucket/purchases.parquet')
AS (price float, item_id int)
SELECT SUM(r['price']) AS total, r['item_id']
FROM read_parquet('s3://your-bucket/purchases.parquet') r
GROUP BY r['item_id']
ORDER BY total DESC
LIMIT 100;
102 changes: 81 additions & 21 deletions docs/functions.md
@@ -16,6 +16,16 @@ Note: `ALTER EXTENSION pg_duckdb WITH SCHEMA schema` is not currently supported.
| [`iceberg_snapshots`](#iceberg_snapshots) | Read Iceberg snapshot information |
| [`delta_scan`](#delta_scan) | Read a Delta dataset |

## JSON Functions

All of the DuckDB [JSON functions and aggregates](https://duckdb.org/docs/data/json/json_functions.html) are supported. Postgres JSON/JSONB functions are not.
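
For example, a minimal sketch using one of DuckDB's documented JSON functions:

```sql
SELECT json_extract('{"duck": 42}', '$.duck');
```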

## Aggregates

|Name|Description|
| :--- | :---------- |
|[`approx_count_distinct`](https://duckdb.org/docs/sql/functions/aggregates.html#approximate-aggregates)|Gives the approximate count of distinct elements using HyperLogLog|
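
For example (a sketch; `events.parquet` and its `user_id` column are hypothetical):

```sql
SELECT approx_count_distinct(r['user_id']) FROM read_parquet('events.parquet') r;
```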

## Cache Management Functions

| Name | Description |
@@ -29,7 +39,8 @@ Note: `ALTER EXTENSION pg_duckdb WITH SCHEMA schema` is not currently supported.
| Name | Description |
| :--- | :---------- |
| [`duckdb.install_extension`](#install_extension) | Installs a DuckDB extension |
| [`duckdb.raw_query`](#raw_query) | Runs a query directly against DuckDB (meant for debugging)|
| [`duckdb.query`](#query) | Runs a SELECT query directly against DuckDB |
| [`duckdb.raw_query`](#raw_query) | Runs any query directly against DuckDB (meant for debugging)|
| [`duckdb.recycle_ddb`](#recycle_ddb) | Force a reset of the DuckDB instance in the current connection (meant for debugging) |

## Motherduck Functions
@@ -40,14 +51,16 @@ Note: `ALTER EXTENSION pg_duckdb WITH SCHEMA schema` is not currently supported.

## Detailed Descriptions

#### <a name="read_parquet"></a>`read_parquet(path TEXT or TEXT[], /* optional parameters */) -> SETOF record`
#### <a name="read_parquet"></a>`read_parquet(path TEXT or TEXT[], /* optional parameters */) -> SETOF duckdb.row`

Reads a parquet file, either from a remote location (via httpfs) or a local file.

Returns a record set (`SETOF record`). Functions that return record sets need to have their columns and types specified using `AS`. You must specify at least one column and any columns used in your query. For example:
This returns DuckDB rows; you can expand them using `*`, or select specific columns using the `r['mycol']` syntax. If you want to select specific columns, give the function call a short alias, like `r`. For example:

```sql
SELECT COUNT(i) FROM read_parquet('file.parquet') AS (int i);
SELECT * FROM read_parquet('file.parquet');
SELECT r['id'], r['name'] FROM read_parquet('file.parquet') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM read_parquet('file.parquet');
```

Further information:
@@ -65,14 +78,16 @@

Optional parameters mirror [DuckDB's read_parquet function](https://duckdb.org/docs/data/parquet/overview.html#parameters). To specify optional parameters, use `parameter := 'value'`.
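
For instance, a sketch using DuckDB's `file_row_number` parameter (the file name is hypothetical):

```sql
SELECT * FROM read_parquet('file.parquet', file_row_number := true);
```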

#### <a name="read_csv"></a>`read_csv(path TEXT or TEXT[], /* optional parameters */) -> SETOF record`
#### <a name="read_csv"></a>`read_csv(path TEXT or TEXT[], /* optional parameters */) -> SETOF duckdb.row`

Reads a CSV file, either from a remote location (via httpfs) or a local file.

Returns a record set (`SETOF record`). Functions that return record sets need to have their columns and types specified using `AS`. You must specify at least one column and any columns used in your query. For example:
This returns DuckDB rows; you can expand them using `*`, or select specific columns using the `r['mycol']` syntax. If you want to select specific columns, give the function call a short alias, like `r`. For example:

```sql
SELECT COUNT(i) FROM read_csv('file.csv') AS (int i);
SELECT * FROM read_csv('file.csv');
SELECT r['id'], r['name'] FROM read_csv('file.csv') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM read_csv('file.csv');
```

Further information:
@@ -95,14 +110,16 @@ Compatibility notes:
* `columns` is not currently supported.
* `nullstr` must be an array (`TEXT[]`).
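
For example, a sketch combining optional parameters, including the `TEXT[]` form of `nullstr` (the file name and values are hypothetical):

```sql
SELECT * FROM read_csv('file.csv', header := true, nullstr := ARRAY['NA', '']);
```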

#### <a name="read_json"></a>`read_json(path TEXT or TEXT[], /* optional parameters */) -> SETOF record`
#### <a name="read_json"></a>`read_json(path TEXT or TEXT[], /* optional parameters */) -> SETOF duckdb.row`

Reads a JSON file, either from a remote location (via httpfs) or a local file.

Returns a record set (`SETOF record`). Functions that return record sets need to have their columns and types specified using `AS`. You must specify at least one column and any columns used in your query. For example:
This returns DuckDB rows; you can expand them using `*`, or select specific columns using the `r['mycol']` syntax. If you want to select specific columns, give the function call a short alias, like `r`. For example:

```sql
SELECT COUNT(i) FROM read_json('file.json') AS (int i);
SELECT * FROM read_json('file.json');
SELECT r['id'], r['name'] FROM read_json('file.json') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM read_json('file.json');
```

Further information:
@@ -123,7 +140,7 @@ Compatibility notes:

* `columns` is not currently supported.

#### <a name="iceberg_scan"></a>`iceberg_scan(path TEXT, /* optional parameters */) -> SETOF record`
#### <a name="iceberg_scan"></a>`iceberg_scan(path TEXT, /* optional parameters */) -> SETOF duckdb.row`

Reads an Iceberg table, either from a remote location (via httpfs) or a local directory.

@@ -133,10 +150,12 @@ To use `iceberg_scan`, you must enable the `iceberg` extension:
SELECT duckdb.install_extension('iceberg');
```

Returns a record set (`SETOF record`). Functions that return record sets need to have their columns and types specified using `AS`. You must specify at least one column and any columns used in your query. For example:
This returns DuckDB rows; you can expand them using `*`, or select specific columns using the `r['mycol']` syntax. If you want to select specific columns, give the function call a short alias, like `r`. For example:

```sql
SELECT COUNT(i) FROM iceberg_scan('data/iceberg/table') AS (int i);
SELECT * FROM iceberg_scan('data/iceberg/table');
SELECT r['id'], r['name'] FROM iceberg_scan('data/iceberg/table') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM iceberg_scan('data/iceberg/table');
```

Further information:
@@ -209,22 +228,25 @@ Optional parameters mirror DuckDB's `iceberg_metadata` function based on the Duc

TODO

#### <a name="delta_scan"></a>`delta_scan(path TEXT) -> SETOF record`
#### <a name="delta_scan"></a>`delta_scan(path TEXT) -> SETOF duckdb.row`

Reads a delta dataset, either from a remote (via httpfs) or a local location.

Returns a record set (`SETOF record`). Functions that return record sets need to have their columns and types specified using `AS`. You must specify at least one column and any columns used in your query. For example:

To use `delta_scan`, you must enable the `delta` extension:

```sql
SELECT duckdb.install_extension('delta');
```

This returns DuckDB rows; you can expand them using `*`, or select specific columns using the `r['mycol']` syntax. If you want to select specific columns, give the function call a short alias, like `r`. For example:

```sql
SELECT COUNT(i) FROM delta_scan('/path/to/delta/dataset') AS (int i);
SELECT * FROM delta_scan('/path/to/delta/dataset');
SELECT r['id'], r['name'] FROM delta_scan('/path/to/delta/dataset') r WHERE r['age'] > 21;
SELECT COUNT(*) FROM delta_scan('/path/to/delta/dataset');
```


Further information:

* [DuckDB Delta extension documentation](https://duckdb.org/docs/extensions/delta)
Expand All @@ -248,7 +270,6 @@ Note that cache management is not automated. Cached data must be deleted manuall
| path | text | The path to a remote httpfs location to cache. |
| type | text | File type, either `parquet` or `csv` |


#### <a name="cache_info"></a>`duckdb.cache_info() -> (remote_path text, cache_key text, cache_file_size BIGINT, cache_file_timestamp TIMESTAMPTZ)`

Inspects which remote files are currently cached in DuckDB. The returned data is as follows:
@@ -280,15 +301,54 @@ WHERE remote_path = '...';

#### <a name="install_extension"></a>`duckdb.install_extension(extension_name TEXT) -> bool`

TODO
Installs a DuckDB extension and configures it to be loaded automatically in
every session that uses pg_duckdb.

```sql
SELECT duckdb.install_extension('iceberg');
```

##### Security

Since this function can be used to install and download any of the official
extensions, it can only be executed by a superuser by default. To allow
execution by another admin user, such as `my_admin`, grant that user the
following permissions:

```sql
GRANT ALL ON FUNCTION duckdb.install_extension(TEXT) TO my_admin;
GRANT ALL ON TABLE duckdb.extensions TO my_admin;
GRANT ALL ON SEQUENCE duckdb.extensions_table_seq TO my_admin;
```

##### Required Arguments

| Name | Type | Description |
| :--- | :--- | :---------- |
| extension_name | text | The name of the extension to install |

#### <a name="query"></a>`duckdb.query(query TEXT) -> SETOF duckdb.row`

Executes the given SELECT query directly against DuckDB. This can be useful if DuckDB syntax makes the query easier to write, or if you want to use a DuckDB function that pg_duckdb does not expose yet. If the reason is a missing function, please also open an issue on the GitHub repository so that we can add support. For example, the query below puts `FROM` before `SELECT` and uses a list comprehension, neither of which is supported in Postgres.

```sql
SELECT * FROM duckdb.query('FROM range(10) as a(a) SELECT [a for i in generate_series(0, a)] as arr');
```

#### <a name="raw_query"></a>`duckdb.raw_query(extension_name TEXT) -> void`

TODO
Runs an arbitrary query directly against DuckDB. Compared to `duckdb.query`, this function can execute any query, not just SELECT queries. The main downside is that it doesn't return its result as rows; instead it sends the query result to the logs. The recommendation is therefore to use `duckdb.query` when possible, but if you need to run e.g. some DDL, this is the function to use.
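
A minimal sketch (the table name is hypothetical; the output only appears in the Postgres logs):

```sql
SELECT duckdb.raw_query('CREATE TABLE my_duckdb_table (a INT)');
```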

#### <a name="recycle_ddb"></a>`duckdb.recycle_ddb() -> void`

TODO
pg_duckdb keeps the DuckDB instance open in between transactions. This is done
to preserve session-level state, such as manually executed `SET` commands. If you want
to clear this session-level state for some reason, you can close the currently
open DuckDB instance using:

```sql
CALL duckdb.recycle_ddb();
```

#### <a name="force_motherduck_sync"></a>`duckdb.force_motherduck_sync(drop_with_cascade BOOLEAN DEFAULT false)`

14 changes: 11 additions & 3 deletions docs/settings.md
@@ -81,6 +81,14 @@ Whether known extensions are allowed to be automatically loaded when a DuckDB qu

Default: `true`

### `duckdb.allow_community_extensions`

Whether installing DuckDB community extensions is allowed. When `false` (the default), community extensions cannot be installed.

Default: `false`

Access: Superuser-only
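
A sketch of how a superuser might enable it for the current session:

```sql
SET duckdb.allow_community_extensions = true;
```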

### `duckdb.enable_external_access` (experimental)

Allow DuckDB to access external resources (e.g., HTTP, S3, etc.). This setting is not yet well tested, and disabling it may break `pg_duckdb` functionality in unintended ways.
@@ -109,11 +117,11 @@ Default: `-1`

Access: Superuser-only

### `duckdb.max_threads_per_postgres_scan` (experimental)
### `duckdb.max_workers_per_postgres_scan`

Maximum number of DuckDB threads used for a single Postgres scan on heap tables (Postgres its regular storage format). In early testing, setting this to `1` has shown to be faster in most cases (for now). So changing this setting to a higher value than the default is currently not recommended.
Maximum number of PostgreSQL workers used for a single Postgres scan. This is similar to Postgres's `max_parallel_workers_per_gather` setting.

Default: `1`
Default: `2`

Access: General
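
For example, to allow more parallel workers for scans in the current session (a sketch):

```sql
SET duckdb.max_workers_per_postgres_scan = 4;
```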

