Correctly handle tables with attndims=0
JelteF committed Feb 19, 2025
1 parent 82d3a0f commit 03686bf
Showing 6 changed files with 145 additions and 2 deletions.
7 changes: 6 additions & 1 deletion docs/types.md
@@ -11,7 +11,7 @@ Able to read many [data types](https://www.postgresql.org/docs/current/datatype.
- `boolean`
- `uuid`
- `json`/`jsonb`
- `arrays` for all of the above types
- `arrays` for all of the above types, but see limitations below about multi-dimensional arrays

## Known limitations

@@ -26,6 +26,11 @@ to fix these limitations:
5. `jsonb` columns are converted to `json` columns when reading from DuckDB. This is because DuckDB does not have a `jsonb` type.
6. Many Postgres `json` and `jsonb` functions and operators are not implemented in DuckDB. Instead you can use DuckDB json functions and operators. See the [DuckDB documentation](https://duckdb.org/docs/data/json/json_functions) for more information on these functions.
7. The DuckDB `tinyint` type is converted to a `char` type in Postgres. This is because Postgres does not have a `tinyint` type. This causes it to be displayed as a hex code instead of a regular number.
8. Conversion between Postgres multi-dimensional arrays and DuckDB nested `LIST`s can run into various problems, because the two types are not exact equivalents of each other. In Postgres, arrays in the same column may have different numbers of dimensions, e.g. `[1]` and `[[1], [2]]` can both occur in the same column. DuckDB does not allow that: the amount of nesting must always be the same. Conversely, DuckDB allows lists at the same nesting level to contain different numbers of elements, e.g. `[[1], [1, 2]]`, which Postgres does not allow. Conversion is therefore only possible for arrays that satisfy both sets of restrictions. Another possible problem is that pg_duckdb uses the Postgres column metadata to determine the number of dimensions of an array. Since Postgres does not complain when you insert arrays with a different number of dimensions, the number of dimensions in the column metadata may not match the actual number of dimensions stored in the column. To solve this you need to alter the column type:
```sql
-- This configures the column to be a 3-dimensional array of text
ALTER TABLE s ALTER COLUMN a SET DATA TYPE text[][][];
```
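As a minimal sketch of the mismatch described above (the table and column names are only illustrative), Postgres itself happily accepts arrays of different dimensionality in the same column:

```sql
-- Postgres accepts both of these rows in a text[] column, but DuckDB
-- cannot represent a column that mixes 1- and 2-dimensional arrays.
CREATE TABLE mixed (a text[]);
INSERT INTO mixed VALUES (ARRAY['x']);             -- 1-dimensional
INSERT INTO mixed VALUES (ARRAY[ARRAY['x', 'y']]); -- 2-dimensional
```

Reading such a column through DuckDB will fail for the rows whose dimensionality does not match the column metadata.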

## Special types

7 changes: 7 additions & 0 deletions include/pgduckdb/pg/types.hpp
@@ -0,0 +1,7 @@
#pragma once

#include "pgduckdb/pg/declarations.hpp"

namespace pgduckdb::pg {
bool IsArrayType(Oid type_oid);
}
16 changes: 16 additions & 0 deletions src/pg/types.cpp
@@ -0,0 +1,16 @@
#include "pgduckdb/pg/types.hpp"
#include "pgduckdb/pgduckdb_utils.hpp"
extern "C" {
#include "postgres.h"
#include "utils/lsyscache.h"
}

namespace pgduckdb::pg {

bool
IsArrayType(Oid type_oid) {
// inlined type_is_array
return PostgresFunctionGuard(get_element_type, type_oid) != InvalidOid;
}

} // namespace pgduckdb::pg
27 changes: 26 additions & 1 deletion src/pgduckdb_types.cpp
@@ -7,6 +7,7 @@
#include "pgduckdb/pgduckdb_types.hpp"
#include "pgduckdb/pgduckdb_utils.hpp"
#include "pgduckdb/scan/postgres_scan.hpp"
#include "pgduckdb/pg/types.hpp"

extern "C" {

@@ -982,9 +983,33 @@ ConvertPostgresToBaseDuckColumnType(Form_pg_attribute &attribute) {
duckdb::LogicalType
ConvertPostgresToDuckColumnType(Form_pg_attribute &attribute) {
auto base_type = ConvertPostgresToBaseDuckColumnType(attribute);

if (!pg::IsArrayType(attribute->atttypid)) {
return base_type;
}

auto dimensions = attribute->attndims;

/*
 * Multi-dimensional arrays in Postgres and nested lists in DuckDB behave
 * quite differently. We try to map them to each other anyway, because in
 * a lot of cases that works fine. But there are also quite a few cases
 * where users will get errors.
 *
 * To support multi-dimensional arrays that are stored in Postgres tables,
 * we assume that the attndims value is correct. That is the case if
 * people specified the matching number of [] when creating the table.
 * It's even possible to store arrays of different dimensions in a single
 * column. DuckDB does not support that.
 *
 * In certain cases (such as tables created by a CTAS) attndims can even be
 * 0 for array types. It's impossible for us to find out what the actual
 * dimensions are without reading the first row. Given that it's most
 * common to use single-dimensional arrays, we assume that such a column
 * stores those.
 */
if (dimensions == 0) {
	dimensions = 1;
}

for (int i = 0; i < dimensions; i++) {
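To see the `attndims` value this code relies on, you can inspect the `pg_attribute` catalog directly. A quick sketch, where `t` and `a` are placeholder table and column names:

```sql
-- attndims is 0 for a CTAS-created table, even though the column
-- actually stores one-dimensional arrays.
SELECT attndims
FROM pg_attribute
WHERE attrelid = 't'::regclass AND attname = 'a';
```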
57 changes: 57 additions & 0 deletions test/regression/expected/array_problems.out
@@ -0,0 +1,57 @@
set duckdb.force_execution TO FALSE;
CREATE TABLE s (a text[]);
INSERT INTO s VALUES (ARRAY['abc', 'def', 'ghi']);
-- Because the next table is created using a CTAS, attndims is set to 0. That
-- confused us in the past. See #556 for details. We assume it's single
-- dimensional now.
CREATE TABLE t AS TABLE s;
SELECT * FROM s;
a
---------------
{abc,def,ghi}
(1 row)

SELECT * FROM t;
a
---------------
{abc,def,ghi}
(1 row)

SET duckdb.force_execution TO true;
SELECT * FROM s;
a
---------------
{abc,def,ghi}
(1 row)

SELECT * FROM t;
a
---------------
{abc,def,ghi}
(1 row)

-- Processing arrays of different dimensions in the same column is not something
-- that DuckDB can handle.
INSERT INTO s VALUES(ARRAY[['a', 'b'],['c','d']]);
SELECT * FROM s;
ERROR: (PGDuckDB/Duckdb_ExecCustomScan_Cpp) Invalid Input Error: Dimensionality of the schema and the data does not match, data contains more dimensions than the amount of dimensions specified by the schema
TRUNCATE s;
-- And we assume that the table metadata is correct about the dimensionality.
-- So even if the stored dimensionality is consistently wrong we will throw an
-- error.
INSERT INTO s VALUES(ARRAY[['a', 'b'],['c','d']]);
SELECT * FROM s;
ERROR: (PGDuckDB/Duckdb_ExecCustomScan_Cpp) Invalid Input Error: Dimensionality of the schema and the data does not match, data contains more dimensions than the amount of dimensions specified by the schema
-- But if you change the definition of the table, we will be able to handle it.
ALTER TABLE s ALTER COLUMN a SET DATA TYPE text[][];
SELECT * FROM s;
a
---------------
{{a,b},{c,d}}
(1 row)

-- Similarly, Postgres cannot support nested lists where different sub-lists at
-- the same level have a different length.
SELECT * FROM duckdb.query($$ SELECT ARRAY[ARRAY[1,2], ARRAY[3,4,5]] arr $$);
ERROR: (PGDuckDB/Duckdb_ExecCustomScan_Cpp) Invalid Input Error: Expected 2 values in list at dimension 1, found 3 instead
DROP TABLE s, t;
33 changes: 33 additions & 0 deletions test/regression/sql/array_problems.sql
@@ -0,0 +1,33 @@
set duckdb.force_execution TO FALSE;
CREATE TABLE s (a text[]);
INSERT INTO s VALUES (ARRAY['abc', 'def', 'ghi']);
-- Because the next table is created using a CTAS, attndims is set to 0. That
-- confused us in the past. See #556 for details. We assume it's single
-- dimensional now.
CREATE TABLE t AS TABLE s;
SELECT * FROM s;
SELECT * FROM t;
SET duckdb.force_execution TO true;
SELECT * FROM s;
SELECT * FROM t;

-- Processing arrays of different dimensions in the same column is not something
-- that DuckDB can handle.
INSERT INTO s VALUES(ARRAY[['a', 'b'],['c','d']]);
SELECT * FROM s;
TRUNCATE s;

-- And we assume that the table metadata is correct about the dimensionality.
-- So even if the stored dimensionality is consistently wrong we will throw an
-- error.
INSERT INTO s VALUES(ARRAY[['a', 'b'],['c','d']]);
SELECT * FROM s;
-- But if you change the definition of the table, we will be able to handle it.
ALTER TABLE s ALTER COLUMN a SET DATA TYPE text[][];
SELECT * FROM s;

-- Similarly, Postgres cannot support nested lists where different sub-lists at
-- the same level have a different length.
SELECT * FROM duckdb.query($$ SELECT ARRAY[ARRAY[1,2], ARRAY[3,4,5]] arr $$);

DROP TABLE s, t;
