(support-features)=
(support-utilities)=
# Support Features

The package bundles a few support and utility functions that fill gaps you
may observe when working with CrateDB, a distributed OLAP database, because
it lacks certain features commonly found in traditional OLTP databases.

A few of the features outlined below are referred to as [polyfills]. They
emulate certain functionalities, for example to satisfy compatibility
requirements of downstream frameworks or test suites. Use them at your own
discretion, and be aware that some of them can seriously impact performance.

Other features are performance utilities for 3rd-party frameworks, which can
be used to increase throughput, mostly on INSERT operations.


(support-insert-bulk)=
## Bulk Support for pandas and Dask

:::{rubric} Background
:::
CrateDB's [](inv:crate-reference#http-bulk-ops) interface enables efficient
INSERT, UPDATE, and DELETE operations on batches of data. Each bulk operation
is executed as a single call to the database server.
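
To illustrate, the bulk interface consumes a single statement together with a
list of parameter rows, submitted as one request. A minimal sketch of the
payload shape (the `bulk_payload` helper and table name are illustrative, not
part of the package):

```python
# Assemble a CrateDB bulk-operations payload: one SQL statement, many
# parameter rows, sent to the server as a single request.
def bulk_payload(stmt, rows):
    return {"stmt": stmt, "bulk_args": [list(row) for row in rows]}

payload = bulk_payload(
    "INSERT INTO testdrive (x, y) VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "c")],
)
print(payload["bulk_args"])  # → [[1, 'a'], [2, 'b'], [3, 'c']]
```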

:::{rubric} Utility
:::
The `insert_bulk` utility provides efficient bulk data transfers when using
dataframe libraries like pandas and Dask. {ref}`dataframe` dedicates a whole
page to these topics, including choosing the right chunk sizes, concurrency
settings, and beyond.

:::{rubric} Synopsis
:::
Use `method=insert_bulk` with pandas' or Dask's `to_sql()` method.
```python
import sqlalchemy as sa
from sqlalchemy_cratedb.support import insert_bulk
from pueblo.testing.pandas import makeTimeDataFrame

# Create a pandas DataFrame, and connect to CrateDB.
df = makeTimeDataFrame(nper=42, freq="S")
engine = sa.create_engine("crate://")

# Insert content of DataFrame using batches of records.
df.to_sql(
    name="testdrive",
    con=engine,
    if_exists="replace",
    index=False,
    method=insert_bulk,
)
```

(support-autoincrement)=
## Synthetic Autoincrement using Timestamps

:::{rubric} Background
:::
CrateDB does not provide traditional sequences or `SERIAL` data type support,
which would enable automatically assigning incremental values when inserting
records.


:::{rubric} Utility
:::
- The `patch_autoincrement_timestamp` utility emulates autoincrement /
  sequential ID behavior for designated columns, by assigning timestamps
  on record insertion.
- It simply assigns `sa.func.now()` as a column `default` on the ORM model
  column.
- It works on the SQLAlchemy column types `sa.BigInteger`, `sa.DateTime`,
  and `sa.String`.
- You can use it if adjusting ORM models for your database adapter is not
  an option.

:::{rubric} Synopsis
:::
After activating the patch, you can use `autoincrement=True` on column definitions.
```python
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base
from sqlalchemy_cratedb.support import patch_autoincrement_timestamp

# Enable patch.
patch_autoincrement_timestamp()

# Define database schema.
Base = declarative_base()

class FooBar(Base):
    __tablename__ = "foobar"
    id = sa.Column(sa.DateTime, primary_key=True, autoincrement=True)
```

:::{warning}
CrateDB's [`TIMESTAMP`](inv:crate-reference#type-timestamp) data type provides
millisecond granularity. This has to be considered when evaluating collision
safety in high-traffic environments.
:::
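
The risk can be demonstrated without a database: two events arriving within
the same millisecond map to the same synthetic identifier. A sketch, where the
`synthetic_id` helper is illustrative only:

```python
# Derive a millisecond-granularity identifier from an epoch timestamp,
# mirroring the resolution of CrateDB's TIMESTAMP type.
def synthetic_id(epoch_seconds: float) -> int:
    return int(epoch_seconds * 1000)

# Two events only 400 microseconds apart collide on the same identifier.
a = synthetic_id(1.7000005)
b = synthetic_id(1.7000009)
print(a == b)  # → True
```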


(support-synthetic-refresh)=
## Synthetic Table REFRESH after DML

:::{rubric} Background
:::
CrateDB is [eventually consistent]: data written by one statement is not
guaranteed to be visible to an immediately following SELECT statement on the
affected rows.

Data written to CrateDB is flushed periodically; the refresh interval is
1000 milliseconds by default, and can be changed. More details can be found in
the reference documentation about [table refreshing](inv:crate-reference#refresh_data).

There are situations where stronger consistency is required, for example when
satisfying the test suites of 3rd-party frameworks, which usually do not take
such special behavior of CrateDB into consideration.

:::{rubric} Utility
:::
- The `refresh_after_dml` utility configures an SQLAlchemy engine or session
  to automatically invoke `REFRESH TABLE` statements after each DML
  operation (INSERT, UPDATE, DELETE).
- Only relevant (dirty) entities/tables will be refreshed.

:::{rubric} Synopsis
:::
```python
import sqlalchemy as sa
from sqlalchemy_cratedb.support import refresh_after_dml

engine = sa.create_engine("crate://")
refresh_after_dml(engine)
```

```python
import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker
from sqlalchemy_cratedb.support import refresh_after_dml

engine = sa.create_engine("crate://")
session = sessionmaker(bind=engine)()
refresh_after_dml(session)
```
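
The mechanism can be pictured as follows: after a flush, collect the tables
touched by dirty entities, and emit one `REFRESH TABLE` statement per table.
An illustrative sketch, not the package's actual implementation:

```python
# Render one REFRESH TABLE statement per distinct dirty table,
# sorted for deterministic output.
def refresh_statements(dirty_tables):
    return [f'REFRESH TABLE "{name}"' for name in sorted(set(dirty_tables))]

print(refresh_statements(["testdrive", "testdrive", "other"]))
# → ['REFRESH TABLE "other"', 'REFRESH TABLE "testdrive"']
```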

:::{warning}
Refreshing the table after each DML operation can cause serious performance
degradation, and should only be used on low-volume, low-traffic data, and
if you know what you are doing.
:::


(support-unique)=
## Synthetic UNIQUE Constraints

:::{rubric} Background
:::
CrateDB does not provide `UNIQUE` constraints in DDL statements. Because of its
distributed nature, supporting such a feature natively would cause expensive
database cluster operations, negating many benefits of using database clusters
in the first place.

:::{rubric} Utility
:::
- The `check_uniqueness_factory` utility emulates "unique constraints"
  functionality by querying the table for unique values before invoking
  SQL `INSERT` operations.
- It uses SQLAlchemy [](inv:sa#orm_event_toplevel), more specifically
  the [before_insert] mapper event.
- When the uniqueness constraint is violated, the adapter will raise a
  corresponding exception.
  ```python
  IntegrityError: DuplicateKeyException in table 'foobar' on constraint 'name'
  ```

:::{rubric} Synopsis
:::
```python
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base
from sqlalchemy.event import listen
from sqlalchemy_cratedb.support import check_uniqueness_factory

# Define database schema.
Base = declarative_base()

class FooBar(Base):
    __tablename__ = "foobar"
    id = sa.Column(sa.String, primary_key=True)
    name = sa.Column(sa.String)

# Add synthetic UNIQUE constraint on `name` column.
listen(FooBar, "before_insert", check_uniqueness_factory(FooBar, "name"))
```
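
Conceptually, the factory produces a listener that queries for the candidate
value before the insert, and raises when a match exists. A self-contained
sketch of that idea, using an in-memory list of dicts instead of a real table
(all names here are illustrative):

```python
class IntegrityError(Exception):
    """Raised when the synthetic uniqueness check fails."""

def check_unique(rows, field, value):
    # Query the existing "table" for the candidate value before inserting.
    if any(row[field] == value for row in rows):
        raise IntegrityError(f"DuplicateKeyException on constraint '{field}'")

rows = [{"id": "1", "name": "hotzenplotz"}]
check_unique(rows, "name", "kasperl")   # passes: value not present yet
try:
    check_unique(rows, "name", "hotzenplotz")
except IntegrityError as ex:
    print(ex)  # → DuplicateKeyException on constraint 'name'
```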

[before_insert]: https://docs.sqlalchemy.org/en/20/orm/events.html#sqlalchemy.orm.MapperEvents.before_insert

:::{note}
This feature only works well if table data is consistent, which can be
ensured by invoking a `REFRESH TABLE` statement after any DML operation.
For conveniently enabling "always refresh", please refer to the documentation
section about [](#support-synthetic-refresh).
:::

:::{warning}
Querying the table before each INSERT operation can cause serious performance
degradation, and should only be used on low-volume, low-traffic data, and
if you know what you are doing.
:::


[eventually consistent]: https://en.wikipedia.org/wiki/Eventual_consistency
[polyfills]: https://en.wikipedia.org/wiki/Polyfill_(programming)