Commit 6bfc869 (1 parent: decce23)

Documentation: Improve section about batch operations with pandas

Specifically, outline _two_ concrete considerations for determining the
optimal chunk size, and improve wording.

1 file changed: docs/by-example/sqlalchemy/dataframe.rst (+10 −6)
@@ -76,9 +76,8 @@ workload across multiple batches, using a defined chunk size.
 
 You will observe that the optimal chunk size highly depends on the shape of
 your data, specifically the width of each record, i.e. the number of columns
-and their individual sizes. You will need to determine a good chunk size by
-running corresponding experiments on your own behalf. For that purpose, you
-can use the `insert_pandas.py`_ program as a blueprint.
+and their individual sizes, which ultimately determine the total size of
+each batch/chunk.
 
 A few details should be taken into consideration when determining the optimal
 chunk size for a specific dataset. We are outlining the two major ones.
@@ -106,8 +105,11 @@ workload across multiple batches, using a defined chunk size.
 
 It is a good idea to start your explorations with a chunk size of 5_000, and
 then see if performance improves when you increase or decrease that figure.
-Chunk sizes of 20000 may also be applicable, but make sure to take the limits
-of your HTTP infrastructure into consideration.
+Users report that 10_000-20_000 is their optimal setting, but if you
+process, for example, just three "small" columns, you may also experiment with
+`leveling up to 200_000`_, because `the chunksize should not be too small`_.
+If it is too small, the I/O overhead will outweigh the benefit of
+batching.
 
 In order to learn more about what wide- vs. long-form (tidy, stacked, narrow)
 data means in the context of `DataFrame computing`_, let us refer you to `a
@@ -125,14 +127,16 @@ workload across multiple batches, using a defined chunk size.
 .. _chunking: https://en.wikipedia.org/wiki/Chunking_(computing)
 .. _CrateDB bulk operations: https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations
 .. _DataFrame computing: https://realpython.com/pandas-dataframe/
-.. _insert_pandas.py: https://github.com/crate/crate-python/blob/master/examples/insert_pandas.py
+.. _insert_pandas.py: https://github.com/crate/cratedb-examples/blob/main/by-language/python-sqlalchemy/insert_pandas.py
+.. _leveling up to 200_000: https://acepor.github.io/2017/08/03/using-chunksize/
 .. _NumPy: https://en.wikipedia.org/wiki/NumPy
 .. _pandas: https://en.wikipedia.org/wiki/Pandas_(software)
 .. _pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
 .. _Python: https://en.wikipedia.org/wiki/Python_(programming_language)
 .. _relational databases: https://en.wikipedia.org/wiki/Relational_database
 .. _SQL: https://en.wikipedia.org/wiki/SQL
 .. _SQLAlchemy: https://aosabook.org/en/v2/sqlalchemy.html
+.. _the chunksize should not be too small: https://acepor.github.io/2017/08/03/using-chunksize/
 .. _wide-narrow-general: https://en.wikipedia.org/wiki/Wide_and_narrow_data
 .. _wide-narrow-data-computing: https://dtkaplan.github.io/DataComputingEbook/chap-wide-vs-narrow.html#chap:wide-vs-narrow
 .. _wide-narrow-pandas-tutorial: https://anvil.works/blog/tidy-data
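
For orientation, here is a minimal sketch of what the guidance in this change
amounts to in code, using pandas' standard `to_sql` method and its `chunksize`
parameter. The connection URL, table name, and data are hypothetical, and the
`crate://` URL assumes a CrateDB SQLAlchemy dialect is installed; the
maintained, full example is the `insert_pandas.py`_ program linked above.

    # Sketch: batch-insert a DataFrame in chunks via pandas' `to_sql`.
    # Connection URL, table name, and data are hypothetical.
    import pandas as pd
    import sqlalchemy as sa

    # A hypothetical narrow DataFrame; with wider records (more columns,
    # larger values), a smaller chunk size is usually appropriate.
    df = pd.DataFrame({"sensor_id": [1, 2, 3], "reading": [0.1, 0.2, 0.3]})

    # Assumes the CrateDB SQLAlchemy dialect is installed.
    engine = sa.create_engine("crate://localhost:4200")

    # Start with a chunk size of 5_000, then measure and adjust.
    df.to_sql("sensor_readings", engine, if_exists="append", index=False,
              chunksize=5_000)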
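
And a sketch of the kind of experiment the documentation recommends for
finding the optimal figure, reusing `df` and `engine` from the sketch above;
the candidate chunk sizes are illustrative only.

    # Time the same insert with several candidate chunk sizes and compare.
    import time

    for chunksize in (1_000, 5_000, 10_000, 20_000):
        start = time.perf_counter()
        df.to_sql("sensor_readings", engine, if_exists="append",
                  index=False, chunksize=chunksize)
        print(f"chunksize={chunksize}: {time.perf_counter() - start:.2f}s")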
