@@ -82,12 +82,25 @@ workload across multiple batches, using a defined chunk size.
and their individual sizes, which will in the end determine the total size of
each batch/chunk.
- Two specific things should be taken into consideration when determining the
- optimal chunk size for a specific dataset. First, when working with data
- larger than main memory available on your machine, each chunk should be small
- enough to fit into the memory, but large enough to minimize the overhead of a
- single data insert operation. Second, as each batch is submitted using HTTP,
- you should know about the request size limits of your HTTP infrastructure.
+ A few details should be taken into consideration when determining the optimal
+ chunk size for a specific dataset. The two major ones are outlined below,
+ followed by a short code sketch.
+
+ - First, when working with data larger than the main memory available on your
+   machine, each chunk should be small enough to fit into memory, but large
+   enough to minimize the overhead of a single data insert operation. Depending
+   on whether you are running other workloads on the same machine, you should
+   also account for the total share of memory you will assign to each of them,
+   to prevent overloading the system as a whole.
+
+ - Second, as each batch is submitted using HTTP, you should know about the
+   request size limits and other constraints of your HTTP infrastructure. This
+   may include any kind of HTTP intermediary relaying information between your
+   database client application and your CrateDB cluster. For example, HTTP
+   proxy servers or load balancers that are not optimally configured for
+   performance, as well as web application firewalls and intrusion prevention
+   systems, may hamper HTTP communication, sometimes in subtle ways, for
+   example through request size constraints or throttling mechanisms. If you
+   are operating very busy systems hosted on shared infrastructure, details
+   like `SNAT port exhaustion`_ may also come into play.
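+
+ As a minimal sketch, the following code splits the insert workload into
+ batches by using pandas' ``to_sql`` with its ``chunksize`` parameter. The
+ table name ``demo_table``, the connection URL, and the chunk size of 5,000
+ rows are illustrative assumptions, not recommendations.
+
+ .. code-block:: python
+
+     import pandas as pd
+     import sqlalchemy as sa
+
+     # Assumes the CrateDB SQLAlchemy dialect is installed, and a CrateDB
+     # cluster is listening on localhost:4200.
+     engine = sa.create_engine("crate://localhost:4200")
+
+     # A stand-in dataset; in practice, this is your real DataFrame.
+     df = pd.DataFrame({"value": range(100_000)})
+
+     # Insert in batches of 5,000 rows. With method="multi", pandas
+     # submits multiple rows per INSERT statement, so each chunk maps to
+     # one batched insert operation, and thus to one HTTP request.
+     df.to_sql(
+         "demo_table",
+         engine,
+         index=False,
+         if_exists="append",
+         chunksize=5_000,
+         method="multi",
+     )
+
+ Larger values of ``chunksize`` amortize the per-request overhead, while
+ smaller values bound both memory consumption and HTTP request size.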
You will need to determine a good chunk size by running corresponding experiments
on your own behalf. For that purpose, you can use the `insert_pandas.py`_ program
@@ -122,6 +135,7 @@ workload across multiple batches, using a defined chunk size.
.. _pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
.. _Python: https://en.wikipedia.org/wiki/Python_(programming_language)
.. _relational databases: https://en.wikipedia.org/wiki/Relational_database
+ .. _SNAT port exhaustion: https://learn.microsoft.com/en-us/azure/load-balancer/troubleshoot-outbound-connection
.. _SQL: https://en.wikipedia.org/wiki/SQL
.. _SQLAlchemy: https://aosabook.org/en/v2/sqlalchemy.html
.. _the chunksize should not be too small: https://acepor.github.io/2017/08/03/using-chunksize/