Commit 13f4b7b

fixup! SQLAlchemy: Add insert_bulk fast-path INSERT method for pandas
1 parent 9ee5cd4 commit 13f4b7b

1 file changed, with 20 additions and 6 deletions.

docs/by-example/sqlalchemy/dataframe.rst

@@ -82,12 +82,25 @@ workload across multiple batches, using a defined chunk size.
 and their individual sizes, which will in the end determine the total size of
 each batch/chunk.
 
-Two specific things should be taken into consideration when determining the
-optimal chunk size for a specific dataset. First, when working with data
-larger than main memory available on your machine, each chunk should be small
-enough to fit into the memory, but large enough to minimize the overhead of a
-single data insert operation. Second, as each batch is submitted using HTTP,
-you should know about the request size limits of your HTTP infrastructure.
+A few details should be taken into consideration when determining the optimal
+chunk size for a specific dataset. We are outlining the two major ones.
+
+- First, when working with data larger than the main memory available on your
+  machine, each chunk should be small enough to fit into the memory, but large
+  enough to minimize the overhead of a single data insert operation. Depending
+  on whether you are running other workloads on the same machine, you should
+  also account for the total share of heap memory you will assign to each domain,
+  to prevent overloading the system as a whole.
+
+- Second, as each batch is submitted using HTTP, you should know about the request
+  size limits and other constraints of your HTTP infrastructure, which may include
+  any types of HTTP intermediaries relaying information between your database client
+  application and your CrateDB cluster. For example, HTTP proxy servers or load
+  balancers not optimally configured for performance, or web application firewalls
+  and intrusion prevention systems may hamper HTTP communication, sometimes in
+  subtle ways, for example based on request size constraints, or throttling
+  mechanisms. If you are working with very busy systems, and hosting it on shared
+  infrastructure, details like `SNAT port exhaustion`_ may also come into play.
 
 You will need to determine a good chunk size by running corresponding experiments
 on your own behalf. For that purpose, you can use the `insert_pandas.py`_ program
@@ -122,6 +135,7 @@ workload across multiple batches, using a defined chunk size.
 .. _pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
 .. _Python: https://en.wikipedia.org/wiki/Python_(programming_language)
 .. _relational databases: https://en.wikipedia.org/wiki/Relational_database
+.. _SNAT port exhaustion: https://learn.microsoft.com/en-us/azure/load-balancer/troubleshoot-outbound-connection
 .. _SQL: https://en.wikipedia.org/wiki/SQL
 .. _SQLAlchemy: https://aosabook.org/en/v2/sqlalchemy.html
 .. _the chunksize should not be too small: https://acepor.github.io/2017/08/03/using-chunksize/
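
The two constraints outlined in the added documentation text (a memory budget per chunk and the request size limits of the HTTP path) can be turned into a rough starting value for the experiments the text recommends. The following is a minimal sketch, not part of the commit: the function name ``estimate_chunk_size``, its parameters, and the default ``headroom`` of 0.8 are hypothetical, and it assumes that the JSON-encoded size of a few sample rows is a usable proxy for the per-row payload size.

```python
import json
import math


def estimate_chunk_size(sample_rows, max_request_bytes, max_memory_bytes, headroom=0.8):
    """Return a rough starting chunk size (rows per batch).

    Hypothetical helper for illustration only. All byte limits are
    assumptions you need to look up for your own infrastructure.
    """
    if not sample_rows:
        raise ValueError("need at least one sample row")

    # Approximate the per-row payload by JSON-encoding each sample row.
    total_bytes = sum(len(json.dumps(row).encode("utf-8")) for row in sample_rows)
    avg_row_bytes = total_bytes / len(sample_rows)

    # The tighter of the two budgets wins; keep some headroom for
    # protocol overhead and other workloads on the same machine.
    budget = min(max_request_bytes, max_memory_bytes) * headroom

    # Never return less than one row per batch.
    return max(1, math.floor(budget / avg_row_bytes))


if __name__ == "__main__":
    # Sample rows shaped like the records you intend to insert.
    rows = [[i, "sensor-%04d" % i, 20.5] for i in range(100)]
    # Example limits: 1 MiB per request, 64 MiB of memory per chunk.
    print(estimate_chunk_size(rows, 1024 * 1024, 64 * 1024 * 1024))
```

The result is only a first guess. It could be passed as the ``chunksize`` argument of pandas' ``DataFrame.to_sql`` and then adjusted based on measurements, since the in-memory size of a chunk and the behavior of intermediaries such as proxies or load balancers are not captured by this estimate.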
