@@ -82,12 +82,25 @@ workload across multiple batches, using a defined chunk size.
and their individual sizes, which will in the end determine the total size of
each batch/chunk.
- Two specific things should be taken into consideration when determining the
- optimal chunk size for a specific dataset. First, when working with data
- larger than main memory available on your machine, each chunk should be small
- enough to fit into the memory, but large enough to minimize the overhead of a
- single data insert operation. Second, as each batch is submitted using HTTP,
- you should know about the request size limits of your HTTP infrastructure.
+ A few details should be taken into consideration when determining the optimal
+ chunk size for a specific dataset. The two major ones are outlined below,
+ followed by a short code sketch.
+
+ - First, when working with data larger than the main memory available on your
+   machine, each chunk should be small enough to fit into memory, but large
+   enough to minimize the overhead of a single data insert operation. Depending
+   on whether you are running other workloads on the same machine, you should
+   also account for the total share of memory you will assign to each of them,
+   to prevent overloading the system as a whole.
+
+ - Second, as each batch is submitted using HTTP, you should know about the
+   request size limits and other constraints of your HTTP infrastructure. This
+   may include any kind of HTTP intermediary relaying information between your
+   database client application and your CrateDB cluster. For example, HTTP
+   proxy servers or load balancers that are not optimally configured for
+   performance, as well as web application firewalls and intrusion prevention
+   systems, may hamper HTTP communication, sometimes in subtle ways, for
+   example through request size constraints or throttling mechanisms. If you
+   are operating very busy systems hosted on shared infrastructure, details
+   like `SNAT port exhaustion`_ may also come into play.
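+
+ As a minimal sketch, the following code splits the insert workload into
+ batches by using pandas' ``to_sql`` with its ``chunksize`` parameter. The
+ table name ``demo_table``, the connection URL, and the chunk size of 5,000
+ rows are illustrative assumptions, not recommendations.
+
+ .. code-block:: python
+
+     import pandas as pd
+     import sqlalchemy as sa
+
+     # Assumes the CrateDB SQLAlchemy dialect is installed, and a CrateDB
+     # cluster is listening on localhost:4200.
+     engine = sa.create_engine("crate://localhost:4200")
+
+     # A stand-in dataset; in practice, this is your real DataFrame.
+     df = pd.DataFrame({"value": range(100_000)})
+
+     # Insert in batches of 5,000 rows. With method="multi", pandas
+     # submits multiple rows per INSERT statement, so each chunk maps to
+     # one batched insert operation, and thus to one HTTP request.
+     df.to_sql(
+         "demo_table",
+         engine,
+         index=False,
+         if_exists="append",
+         chunksize=5_000,
+         method="multi",
+     )
+
+ Larger values of ``chunksize`` amortize the per-request overhead, while
+ smaller values bound both memory consumption and HTTP request size.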
You will need to determine a good chunk size by running corresponding experiments
on your own behalf. For that purpose, you can use the `insert_pandas.py`_ program
@@ -122,6 +135,7 @@ workload across multiple batches, using a defined chunk size.
.. _pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
.. _Python: https://en.wikipedia.org/wiki/Python_(programming_language)
.. _relational databases: https://en.wikipedia.org/wiki/Relational_database
+ .. _SNAT port exhaustion: https://learn.microsoft.com/en-us/azure/load-balancer/troubleshoot-outbound-connection
.. _SQL: https://en.wikipedia.org/wiki/SQL
.. _SQLAlchemy: https://aosabook.org/en/v2/sqlalchemy.html
.. _the chunksize should not be too small: https://acepor.github.io/2017/08/03/using-chunksize/