
Commit decce23

amotl and hlcianfagna committed
SQLAlchemy: Add insert_bulk fast-path INSERT method for pandas
This method supports efficient batch inserts using CrateDB's bulk operations
endpoint: https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations

Co-authored-by: hlcianfagna <[email protected]>
1 parent d3f511f commit decce23
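
For context, the bulk operations endpoint referenced above accepts a single SQL
statement together with a list of parameter lists, and this commit routes pandas'
`to_sql()` onto exactly that code path. A minimal sketch of what the fast path
boils down to, using the DB-API client directly (table name and records are
hypothetical, and a CrateDB instance is assumed on localhost:4200):

    from crate import client

    connection = client.connect("http://localhost:4200")
    cursor = connection.cursor()

    # One statement, many parameter lists: a single HTTP round trip per batch.
    cursor.execute(
        sql='INSERT INTO "my_table" (id, name) VALUES (?, ?)',
        bulk_parameters=[[1, "Aldebaran"], [2, "Algol"]],
    )

    cursor.close()
    connection.close()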

File tree: 8 files changed, +248 −4 lines changed

CHANGES.txt (3 additions, 0 deletions)

@@ -5,6 +5,9 @@ Changes for crate
 Unreleased
 ==========

+- SQLAlchemy: Added ``insert_bulk`` fast-path ``INSERT`` method for pandas, in
+  order to support efficient batch inserts using CrateDB's bulk operations endpoint.
+
 2023/04/18 0.31.1
 =================

docs/by-example/index.rst (1 addition, 0 deletions)

@@ -48,6 +48,7 @@ its corresponding API interfaces, see also :ref:`sqlalchemy-support`.
 sqlalchemy/working-with-types
 sqlalchemy/advanced-querying
 sqlalchemy/inspection-reflection
+sqlalchemy/dataframe


 .. _Python DB API: https://peps.python.org/pep-0249/
docs/by-example/sqlalchemy/dataframe.rst (new file, 138 additions)

@@ -0,0 +1,138 @@
.. _sqlalchemy-pandas:
.. _sqlalchemy-dataframe:

================================
SQLAlchemy: DataFrame operations
================================

About
=====

This section of the documentation demonstrates support for efficient batch/bulk
``INSERT`` operations with `pandas`_ and `Dask`_, using the CrateDB SQLAlchemy dialect.

Efficient bulk operations are needed for typical `ETL`_ batch processing and
data streaming workloads, for example to move data in and out of OLAP data
warehouses, as contrasted to interactive online transaction processing (OLTP)
applications. The strategy of `batching`_ together series of records to
improve performance is also referred to as `chunking`_.


Introduction
============

The :ref:`pandas DataFrame <pandas:api.dataframe>` is a structure that contains
two-dimensional data and its corresponding labels. DataFrames are widely used
in data science, machine learning, scientific computing, and many other
data-intensive fields.

DataFrames are similar to SQL tables or the spreadsheets that you work with in
Excel or Calc. In many cases, DataFrames are faster, easier to use, and more
powerful than tables or spreadsheets, because they are an integral part of the
`Python`_ and `NumPy`_ ecosystems.

The :ref:`pandas I/O subsystem <pandas:api.io>` for `relational databases`_
using `SQL`_ is based on `SQLAlchemy`_.


.. rubric:: Table of Contents

.. contents::
   :local:


Efficient ``INSERT`` operations with pandas
===========================================

The package provides an ``insert_bulk`` function to use the
:meth:`pandas:pandas.DataFrame.to_sql` method more efficiently, based on the
`CrateDB bulk operations`_ endpoint. It will effectively split your insert
workload across multiple batches, using a defined chunk size.

>>> import sqlalchemy as sa
>>> from pandas._testing import makeTimeDataFrame
>>> from crate.client.sqlalchemy.support import insert_bulk
...
>>> # Define the number of records, and the chunk size.
>>> INSERT_RECORDS = 42
>>> CHUNK_SIZE = 8
...
>>> # Create a pandas DataFrame, and connect to CrateDB.
>>> df = makeTimeDataFrame(nper=INSERT_RECORDS, freq="S")
>>> engine = sa.create_engine(f"crate://{crate_host}")
...
>>> # Insert the content of the DataFrame using batches of records.
>>> # Effectively, six batches will be submitted, because ceil(42 / 8) = 6.
>>> df.to_sql(
...     name="test-testdrive",
...     con=engine,
...     if_exists="replace",
...     index=False,
...     chunksize=CHUNK_SIZE,
...     method=insert_bulk,
... )

.. TIP::

    You will observe that the optimal chunk size highly depends on the shape of
    your data, specifically the width of each record, i.e. the number of columns
    and their individual sizes. You will need to determine a good chunk size by
    running corresponding experiments on your own behalf. For that purpose, you
    can use the `insert_pandas.py`_ program as a blueprint.

A few details should be taken into consideration when determining the optimal
chunk size for a specific dataset. We are outlining the two major ones.

- First, when working with data larger than the main memory available on your
  machine, each chunk should be small enough to fit into memory, but large
  enough to minimize the overhead of a single data insert operation. Depending
  on whether you are running other workloads on the same machine, you should
  also account for the total share of heap memory you assign to each domain,
  to prevent overloading the system as a whole.

- Second, as each batch is submitted using HTTP, you should know about the request
  size limits and other constraints of your HTTP infrastructure, which may include
  any type of HTTP intermediary relaying information between your database client
  application and your CrateDB cluster. For example, HTTP proxy servers or load
  balancers not optimally configured for performance, or web application firewalls
  and intrusion prevention systems, may hamper HTTP communication, sometimes in
  subtle ways, for example based on request size constraints or throttling
  mechanisms. If you are working with very busy systems, hosted on shared
  infrastructure, details like `SNAT port exhaustion`_ may also come into play.

It is a good idea to start your explorations with a chunk size of 5_000, and
then see if performance improves when you increase or decrease that figure.
Chunk sizes of 20_000 may also be applicable, but make sure to take the limits
of your HTTP infrastructure into consideration.
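
To find that figure for your own dataset and infrastructure, you can time a few
candidate chunk sizes before settling on one. This is a minimal sketch, not part
of this package, which reuses the ``df``, ``engine``, and ``insert_bulk`` objects
from the example above::

    import time

    for chunk_size in [1_000, 5_000, 10_000, 20_000]:
        start = time.perf_counter()
        df.to_sql(
            name="test-testdrive",
            con=engine,
            if_exists="replace",
            index=False,
            chunksize=chunk_size,
            method=insert_bulk,
        )
        duration = time.perf_counter() - start
        print(f"chunk size {chunk_size}: {duration:.2f} seconds")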

In order to learn more about what wide- vs. long-form (tidy, stacked, narrow)
data means in the context of `DataFrame computing`_, let us refer you to `a
general introduction <wide-narrow-general_>`_, the corresponding section in
the `Data Computing book <wide-narrow-data-computing_>`_, and a `pandas
tutorial <wide-narrow-pandas-tutorial_>`_ about the same topic.


.. hidden: Disconnect from database

    >>> engine.dispose()

.. _batching: https://en.wikipedia.org/wiki/Batch_processing#Common_batch_processing_usage
.. _chunking: https://en.wikipedia.org/wiki/Chunking_(computing)
.. _CrateDB bulk operations: https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations
.. _Dask: https://en.wikipedia.org/wiki/Dask_(software)
.. _DataFrame computing: https://realpython.com/pandas-dataframe/
.. _ETL: https://en.wikipedia.org/wiki/Extract,_transform,_load
.. _insert_pandas.py: https://github.com/crate/crate-python/blob/master/examples/insert_pandas.py
.. _NumPy: https://en.wikipedia.org/wiki/NumPy
.. _pandas: https://en.wikipedia.org/wiki/Pandas_(software)
.. _pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
.. _Python: https://en.wikipedia.org/wiki/Python_(programming_language)
.. _relational databases: https://en.wikipedia.org/wiki/Relational_database
.. _SNAT port exhaustion: https://learn.microsoft.com/en-us/azure/load-balancer/troubleshoot-outbound-connection
.. _SQL: https://en.wikipedia.org/wiki/SQL
.. _SQLAlchemy: https://aosabook.org/en/v2/sqlalchemy.html
.. _wide-narrow-general: https://en.wikipedia.org/wiki/Wide_and_narrow_data
.. _wide-narrow-data-computing: https://dtkaplan.github.io/DataComputingEbook/chap-wide-vs-narrow.html#chap:wide-vs-narrow
.. _wide-narrow-pandas-tutorial: https://anvil.works/blog/tidy-data

docs/conf.py (2 additions, 1 deletion)

@@ -12,7 +12,8 @@
 intersphinx_mapping.update({
     'py': ('https://docs.python.org/3/', None),
     'sa': ('https://docs.sqlalchemy.org/en/14/', None),
-    'urllib3': ('https://urllib3.readthedocs.io/en/1.26.13/', None)
+    'urllib3': ('https://urllib3.readthedocs.io/en/1.26.13/', None),
+    'pandas': ('https://pandas.pydata.org/docs/', None),
 })
setup.py (1 addition, 0 deletions)

@@ -70,6 +70,7 @@ def read(path):
     'createcoverage>=1,<2',
     'stopit>=1.1.2,<2',
     'flake8>=4,<7',
+    'pandas>=2,<3',
     'pytz',
     # `test_http.py` needs `setuptools.ssl_support`
     'setuptools<57',
src/crate/client/sqlalchemy/support.py (new file, 62 additions)

@@ -0,0 +1,62 @@
# -*- coding: utf-8; -*-
#
# Licensed to CRATE Technology GmbH ("Crate") under one or more contributor
# license agreements. See the NOTICE file distributed with this work for
# additional information regarding copyright ownership. Crate licenses
# this file to you under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License. You may
# obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.
#
# However, if you have executed another commercial license agreement
# with Crate these terms will supersede the license and you may use the
# software solely pursuant to the terms of the relevant commercial agreement.
import logging


logger = logging.getLogger(__name__)


def insert_bulk(pd_table, conn, keys, data_iter):
    """
    Use CrateDB's "bulk operations" endpoint as a fast path for pandas' and Dask's `to_sql()` [1] method.

    The idea is to break out of SQLAlchemy, compile the insert statement, and use the raw
    DBAPI connection client, in order to invoke a request using `bulk_parameters` [2]::

        cursor.execute(sql=sql, bulk_parameters=data)

    The vanilla implementation, used by SQLAlchemy, is::

        data = [dict(zip(keys, row)) for row in data_iter]
        conn.execute(pd_table.table.insert(), data)

    Batch chunking will happen outside of this function, for example [3] demonstrates
    the relevant code in `pandas.io.sql`.

    [1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html
    [2] https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations
    [3] https://github.com/pandas-dev/pandas/blob/v2.0.1/pandas/io/sql.py#L1011-L1027
    """

    # Compile the SQL statement, and materialize the batch.
    sql = str(pd_table.table.insert().compile(bind=conn))
    data = list(data_iter)

    # For debugging and tracing the batches running through this method.
    # Because it is a performance-optimized code path, the log statements are
    # not active by default.
    # logger.info(f"Bulk SQL:     {sql}")
    # logger.info(f"Bulk records: {len(data)}")
    # logger.info(f"Bulk data:    {data}")

    # Invoke the bulk insert operation.
    cursor = conn._dbapi_connection.cursor()
    cursor.execute(sql=sql, bulk_parameters=data)
    cursor.close()
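
The docstring above mentions that this fast path also works for Dask's `to_sql()`.
As a usage sketch, not part of this commit, and assuming `dask` is installed and a
CrateDB instance is listening on localhost:4200, it can be wired up like this; note
that Dask accepts a database URI rather than an engine object:

    import dask.dataframe as dd
    from pandas._testing import makeTimeDataFrame
    from crate.client.sqlalchemy.support import insert_bulk

    # Create a Dask DataFrame with a handful of partitions.
    df = makeTimeDataFrame(nper=42, freq="S")
    ddf = dd.from_pandas(df, npartitions=4)

    # Each partition is inserted in chunks, with `insert_bulk` as the fast path.
    ddf.to_sql(
        "test-testdrive",
        uri="crate://localhost:4200",
        if_exists="replace",
        index=False,
        chunksize=8,
        method=insert_bulk,
        parallel=True,
    )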

src/crate/client/sqlalchemy/tests/bulk_test.py (40 additions, 3 deletions)

@@ -18,7 +18,7 @@
 # However, if you have executed another commercial license agreement
 # with Crate these terms will supersede the license and you may use the
 # software solely pursuant to the terms of the relevant commercial agreement.
-
+import math
 from unittest import TestCase, skipIf
 from unittest.mock import patch, MagicMock

@@ -36,8 +36,7 @@

 fake_cursor = MagicMock(name='fake_cursor')
-FakeCursor = MagicMock(name='FakeCursor', spec=Cursor)
-FakeCursor.return_value = fake_cursor
+FakeCursor = MagicMock(name='FakeCursor', spec=Cursor, return_value=fake_cursor)


 class SqlAlchemyBulkTest(TestCase):

@@ -168,3 +167,41 @@ def test_bulk_save_modern(self):
             'Callisto', 37,
         )
         self.assertSequenceEqual(expected_bulk_args, bulk_args)
+
+    @patch('crate.client.connection.Cursor', mock_cursor=FakeCursor)
+    def test_bulk_save_pandas(self, mock_cursor):
+        """
+        Verify bulk INSERT with pandas.
+        """
+        import sqlalchemy as sa
+        from pandas._testing import makeTimeDataFrame
+        from crate.client.sqlalchemy.support import insert_bulk
+
+        # 42 records / 8 chunksize = 5.25, which means 6 batches will be emitted.
+        INSERT_RECORDS = 42
+        CHUNK_SIZE = 8
+        OPCOUNT = math.ceil(INSERT_RECORDS / CHUNK_SIZE)
+
+        # Create a DataFrame to feed into the database.
+        df = makeTimeDataFrame(nper=INSERT_RECORDS, freq="S")
+
+        dburi = "crate://localhost:4200"
+        engine = sa.create_engine(dburi, echo=True)
+        retval = df.to_sql(
+            name="test-testdrive",
+            con=engine,
+            if_exists="replace",
+            index=False,
+            chunksize=CHUNK_SIZE,
+            method=insert_bulk,
+        )
+        self.assertIsNone(retval)
+
+        # Initializing the query incurs an overhead of two calls to the cursor
+        # object: probably one initial connection from the DB-API driver, to
+        # inquire the database version, and another one for SQLAlchemy, which
+        # uses it to inquire the table schema using `information_schema`, and
+        # to eventually issue the `CREATE TABLE ...` statement.
+        effective_op_count = mock_cursor.call_count - 2
+
+        # Verify the number of batches.
+        self.assertEqual(effective_op_count, OPCOUNT)

src/crate/client/tests.py (1 addition, 0 deletions)

@@ -385,6 +385,7 @@ def test_suite():
     'docs/by-example/sqlalchemy/working-with-types.rst',
     'docs/by-example/sqlalchemy/advanced-querying.rst',
     'docs/by-example/sqlalchemy/inspection-reflection.rst',
+    'docs/by-example/sqlalchemy/dataframe.rst',
     module_relative=False,
     setUp=setUpCrateLayerSqlAlchemy,
     tearDown=tearDownDropEntitiesSqlAlchemy,
