Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support for DuckDB #90

Open
johntruckenbrodt opened this issue Mar 18, 2024 · 5 comments
Open

support for DuckDB #90

johntruckenbrodt opened this issue Mar 18, 2024 · 5 comments

Comments

@johntruckenbrodt
Copy link

Hi,
I am currently trying to figure out a way to use pygeofilter for converting CQL2 to a DuckDB query using the spatial extension. This extension is quite prototypical and I stumble across a couple of caveats. Here's my current approach:

import duckdb

duckdb.install_extension('spatial')
duckdb.load_extension('spatial')

Define the CQL2 filter

cql2_filter = {
    "op": "and",
    "args": [
        {
            "op": ">=",
            "args": [
                {
                    "property": "end_datetime"
                },
                '2020-03-28T20:05:46+02'
            ]
        },
        {
            "op": "<=",
            "args": [
                {
                    "property": "start_datetime"
                },
                '2020-03-28T22:06:15+02'
            ]
        },
        {
            "op": "=",
            "args": [
                {
                    "property": "sar:instrument_mode"
                },
                'IW'
            ]
        }
    ]
}

Optionally add spatial filtering

# ext = None
ext = {'xmin': -4, 'xmax': -2, 'ymin': 6, 'ymax': 8}

if ext is not None:
    arg = {
        'op': 's_intersects',
        'args': [
            {
                'property': 'geometry'
            },
            {
                'type': 'Polygon',
                'coordinates': [[[ext['xmin'], ext['ymin']],
                                 [ext['xmin'], ext['ymax']],
                                 [ext['xmax'], ext['ymax']],
                                 [ext['xmax'], ext['ymin']],
                                 [ext['xmin'], ext['ymin']]]]
            }
        ]
    }
    cql2_filter['args'].append(arg)

Convert CQL2 filter to SQL where clause

from pygeofilter.parsers.cql2_json import parse as json_parse

filter = json_parse(cql2_filter)
from pygeofilter.backends.sql.evaluate import to_sql_where

sql_where = to_sql_where(filter, {
    's1:datatake': 's1:datatake',
    'datetime': 'datetime',
    'sar:instrument_mode': 'sar:instrument_mode',
    'end_datetime': 'end_datetime',
    'start_datetime': 'start_datetime',
    'geometry': 'geometry'
})

Create DuckDB query for a geoparquet file:
Here it gets ugly because (1) the WKB column needs to be converted to GEOMETRY type and (2) the DuckDB-spatial implementation of ST_GeomFromWKB cannot read the WKB-HEX representation returned by to_sql_where.

import re

sql_query = "SELECT * EXCLUDE geometry, ST_GeomFromWKB(geometry) AS geometry FROM '20200301_20200401.parquet' WHERE %s" % sql_where

if ext is not None:
    # convert WKB blob to GEOMETRY
    sql_query = sql_query.replace('ST_Intersects("geometry"', 'ST_Intersects(ST_GeomFromWKB(geometry)')
    
    # duckdb_spatial apparently cannot yet read wkb_hex representation -> convert it back to text
    spatial = ("ST_GeomFromText('POLYGON(({xmin} {ymin}, {xmin} {ymax}, "
               "{xmax} {ymax}, {xmax} {ymin}, {xmin} {ymin}))')")
    sql_query = re.sub(r'ST_GeomFromWKB\(x\'.*\'\)', spatial.format(**ext), sql_query)
sql_query

Execute the query:

df = duckdb.query(sql_query)

I wonder, how would you do this? Do you think there is anything that could/should be modified on the pygeofilter end or is it entirely up to the DuckDB-spatial package? I'd appreciate any help.

@constantinius
Copy link
Contributor

@johntruckenbrodt
Thanks for submitting this issue! I will take a look into this. I assume, that we could simply replace the mechanism to load the spatial values.

@johntruckenbrodt
Copy link
Author

Awesome! Thanks @constantinius. Let me know if there's anything I can do.

@constantinius
Copy link
Contributor

@johntruckenbrodt

The cleanest way I found would be to make a specific subclass for the evaluator and simply override the functions acting up. The following snippet works for me:

from pygeofilter.backends.sql.evaluate import to_sql_where, SQLEvaluator
from pygeofilter.backends.evaluator import handle
import shapely.geometry


class DuckDBEvaluator(SQLEvaluator):
    @handle(ast.Attribute)
    def attribute(self, node: ast.Attribute):
        if node.name == "geometry":
            return f'ST_GeomFromWKB({node.name})'
        return f'"{self.attribute_map[node.name]}"'

    @handle(values.Geometry)
    def geometry(self, node: values.Geometry):
        wkb_hex = shapely.geometry.shape(node).wkb_hex
        return f"ST_GeomFromHEXEWKB('{wkb_hex}')"

    @handle(values.Envelope)
    def envelope(self, node: values.Envelope):
        wkb_hex = shapely.geometry.box(node.x1, node.y1, node.x2, node.y2).wkb_hex
        return f"ST_GeomFromHEXEWKB('{wkb_hex}')"


sql_where = DuckDBEvaluator({
    's1:datatake': 's1:datatake',
    'datetime': 'datetime',
    'sar:instrument_mode': 'sar:instrument_mode',
    'end_datetime': 'end_datetime',
    'start_datetime': 'start_datetime',
    'geometry': 'geometry'
}, {}).evaluate(filter)

Afterwards, no string manipulation is required.

I'm actually not sure how to backport this into pygeofilter. A quick glance around, and I found vast differences in all of the different SQL based databases with spatial support. I'm thinking that there is no way to provide a general solution.

Please let me know if this solution works for you.

@johntruckenbrodt
Copy link
Author

This works like a charm! Thanks a lot @constantinius, this is of great help. I'll be using this solution and give credit of course. Can't this class find its way into pygeofilter.backends.sql.evaluate so that users can choose between different SQL dialects?
For me the issue is solved. I'll leave it up to you to close it.

@rockliffelewis
Copy link

rockliffelewis commented Nov 14, 2024

For those playing the duckdb game, you also need to put quotes around the datetime type like so:

    @handle(*values.LITERALS)
    def literal(self, node):
        if isinstance(node, str):
            return f"'{node}'"
        elif isinstance(node, datetime.datetime):
            return f"'{node}'"
        else:
            return node

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants