
Commit 55928e9

Documentation: Improve section about batch operations with pandas
Specifically, outline _two_ concrete considerations for determining the optimal chunk size.
1 parent dbf9293 commit 55928e9

1 file changed: +21 -7 lines changed

docs/by-example/sqlalchemy/dataframe.rst

@@ -70,14 +70,26 @@ multiple batches, using a defined chunk size.
 
 You will observe that the optimal chunk size highly depends on the shape of
 your data, specifically the width of each record, i.e. the number of columns
-and their individual sizes. You will need to determine a good chunk size by
-running corresponding experiments on your own behalf. For that purpose, you
-can use the `insert_pandas.py`_ program as a blueprint.
+and their individual sizes, which will ultimately determine the total size of
+each batch/chunk.
 
-It is a good idea to start your explorations with a chunk size of 5000, and
+Two concrete considerations apply when determining the optimal chunk size
+for a given dataset. First, when working with data larger than the main
+memory available on your machine, each chunk should be small enough to fit
+into memory, but large enough to minimize the overhead of a single data
+insert operation. Second, as each batch is submitted using HTTP, you should
+know about the request size limits of your HTTP infrastructure.
+
+You will need to determine a good chunk size by running corresponding
+experiments on your own. For that purpose, you can use the
+`insert_pandas.py`_ program as a blueprint.
+
+It is a good idea to start your explorations with a chunk size of 5_000, and
 then see if performance improves when you increase or decrease that figure.
-Chunk sizes of 20000 may also be applicable, but make sure to take the limits
-of your HTTP infrastructure into consideration.
+Some users report that 10_000 is their optimal setting, but if you have,
+for example, just three columns, you may also experiment with `leveling up to
+200_000`_, because `the chunksize should not be too small`_. If it is too
+small, the I/O cost will be too high to overcome the benefit of batching.
 
 In order to learn more about what wide- vs. long-form (tidy, stacked, narrow)
 data means in the context of `DataFrame computing`_, let us refer you to `a
@@ -93,14 +105,16 @@ multiple batches, using a defined chunk size.
 
 .. _CrateDB bulk operations: https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations
 .. _DataFrame computing: https://realpython.com/pandas-dataframe/
-.. _insert_pandas.py: https://github.com/crate/crate-python/blob/master/examples/insert_pandas.py
+.. _insert_pandas.py: https://github.com/crate/cratedb-examples/blob/main/by-language/python/insert_pandas.py
+.. _leveling up to 200_000: https://acepor.github.io/2017/08/03/using-chunksize/
 .. _NumPy: https://en.wikipedia.org/wiki/NumPy
 .. _pandas: https://en.wikipedia.org/wiki/Pandas_(software)
 .. _pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
 .. _Python: https://en.wikipedia.org/wiki/Python_(programming_language)
 .. _relational databases: https://en.wikipedia.org/wiki/Relational_database
 .. _SQL: https://en.wikipedia.org/wiki/SQL
 .. _SQLAlchemy: https://aosabook.org/en/v2/sqlalchemy.html
+.. _the chunksize should not be too small: https://acepor.github.io/2017/08/03/using-chunksize/
 .. _wide-narrow-general: https://en.wikipedia.org/wiki/Wide_and_narrow_data
 .. _wide-narrow-data-computing: https://dtkaplan.github.io/DataComputingEbook/chap-wide-vs-narrow.html#chap:wide-vs-narrow
 .. _wide-narrow-pandas-tutorial: https://anvil.works/blog/tidy-data
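To make the advice in this change concrete, here is a minimal sketch of such a chunked insert. It assumes a CrateDB instance reachable at localhost:4200 and the `insert_bulk` helper shipped with the CrateDB SQLAlchemy dialect; the table name `testdrive` and the synthetic three-column data are illustrative assumptions, not part of the commit.

    import pandas as pd
    import sqlalchemy as sa
    # Assumption: the batching helper shipped with the CrateDB SQLAlchemy dialect.
    from crate.client.sqlalchemy.support import insert_bulk

    # Synthetic, narrow data: 100_000 records, three columns.
    df = pd.DataFrame({
        "id": range(100_000),
        "value": 42.42,
        "name": "hello",
    })

    # Assumption: CrateDB listening on localhost:4200.
    engine = sa.create_engine("crate://localhost:4200")

    # Submit the DataFrame in batches of 5_000 records each; every chunk
    # becomes one bulk insert operation submitted over HTTP.
    df.to_sql(
        name="testdrive",
        con=engine,
        if_exists="replace",
        index=False,
        chunksize=5_000,
        method=insert_bulk,
    )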

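Because each batch travels as a single HTTP request, it can help to sanity-check the approximate payload of one chunk against the request size limits of your HTTP infrastructure (reverse proxies, load balancers). A rough sketch, reusing `df` from the snippet above; note that the in-memory footprint is only a proxy for the serialized request body, not its exact size.

    # Approximate bytes per row, derived from the in-memory footprint.
    bytes_per_row = df.memory_usage(deep=True).sum() / len(df)
    chunksize = 5_000
    # Rough estimate for one batch; compare against the request size limit
    # of your HTTP infrastructure before settling on a chunk size.
    print(f"approx. {bytes_per_row * chunksize / 1024 ** 2:.1f} MiB per batch")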
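Finally, a sketch of the experiments the section recommends, in the spirit of the `insert_pandas.py` blueprint: time the same insert across a few candidate chunk sizes and compare. It reuses `df`, `engine`, and `insert_bulk` from the first snippet; the candidate sizes simply follow the figures quoted in the change.

    import time

    def time_insert(frame, engine, chunksize):
        """Insert `frame` with the given chunk size; return elapsed seconds."""
        start = time.perf_counter()
        frame.to_sql(
            name="testdrive",
            con=engine,
            if_exists="replace",
            index=False,
            chunksize=chunksize,
            method=insert_bulk,
        )
        return time.perf_counter() - start

    # Start near 5_000 and probe in both directions: with very wide records,
    # smaller chunks may win; with narrow ones, much larger chunks may.
    for chunksize in (1_000, 5_000, 10_000, 20_000):
        print(f"chunksize={chunksize:>6}: {time_insert(df, engine, chunksize):.2f}s")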