Commit c1a52cf

Documentation: Improve section about batch operations with pandas
Specifically, outline _two_ concrete considerations for determining the optimal chunk size, and improve wording.
1 parent dbf9293 commit c1a52cf

File tree

1 file changed: +22, -7 lines

docs/by-example/sqlalchemy/dataframe.rst

@@ -70,14 +70,27 @@ multiple batches, using a defined chunk size.
 
 You will observe that the optimal chunk size highly depends on the shape of
 your data, specifically the width of each record, i.e. the number of columns
-and their individual sizes. You will need to determine a good chunk size by
-running corresponding experiments on your own behalf. For that purpose, you
-can use the `insert_pandas.py`_ program as a blueprint.
+and their individual sizes, which will in the end determine the total size of
+each batch/chunk.
 
-It is a good idea to start your explorations with a chunk size of 5000, and
+Two specific things should be taken into consideration when determining the
+optimal chunk size for a specific dataset. First, when working with data
+larger than the main memory available on your machine, each chunk should be
+small enough to fit into memory, but large enough to minimize the overhead of
+a single data insert operation. Second, as each batch is submitted using HTTP,
+you should know about the request size limits of your HTTP infrastructure.
+
+You will need to determine a good chunk size by running your own
+experiments. For that purpose, you can use the `insert_pandas.py`_ program
+as a blueprint.
+
+It is a good idea to start your explorations with a chunk size of 5_000, and
 then see if performance improves when you increase or decrease that figure.
-Chunk sizes of 20000 may also be applicable, but make sure to take the limits
-of your HTTP infrastructure into consideration.
+People report that 10_000-20_000 is their optimal setting, but if you
+process, for example, just three "small" columns, you may also experiment with
+`leveling up to 200_000`_, because `the chunksize should not be too small`_.
+If it is too small, the I/O cost will be too high to outweigh the benefit of
+batching.
 
 In order to learn more about what wide- vs. long-form (tidy, stacked, narrow)
 data means in the context of `DataFrame computing`_, let us refer you to `a
@@ -93,14 +106,16 @@ multiple batches, using a defined chunk size.
 
 .. _CrateDB bulk operations: https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations
 .. _DataFrame computing: https://realpython.com/pandas-dataframe/
-.. _insert_pandas.py: https://github.com/crate/crate-python/blob/master/examples/insert_pandas.py
+.. _insert_pandas.py: https://github.com/crate/cratedb-examples/blob/main/by-language/python-sqlalchemy/insert_pandas.py
+.. _leveling up to 200_000: https://acepor.github.io/2017/08/03/using-chunksize/
 .. _NumPy: https://en.wikipedia.org/wiki/NumPy
 .. _pandas: https://en.wikipedia.org/wiki/Pandas_(software)
 .. _pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
 .. _Python: https://en.wikipedia.org/wiki/Python_(programming_language)
 .. _relational databases: https://en.wikipedia.org/wiki/Relational_database
 .. _SQL: https://en.wikipedia.org/wiki/SQL
 .. _SQLAlchemy: https://aosabook.org/en/v2/sqlalchemy.html
+.. _the chunksize should not be too small: https://acepor.github.io/2017/08/03/using-chunksize/
 .. _wide-narrow-general: https://en.wikipedia.org/wiki/Wide_and_narrow_data
 .. _wide-narrow-data-computing: https://dtkaplan.github.io/DataComputingEbook/chap-wide-vs-narrow.html#chap:wide-vs-narrow
 .. _wide-narrow-pandas-tutorial: https://anvil.works/blog/tidy-data
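
To complement the guidance added above, here is a minimal sketch of such a
chunk size experiment, using plain ``pandas.DataFrame.to_sql``. It assumes a
CrateDB instance listening on ``localhost:4200`` and the ``crate://``
SQLAlchemy dialect from the ``crate[sqlalchemy]`` package; the table name
``demo``, the candidate chunk sizes, and the synthetic three-column frame are
illustrative, not part of the committed documentation::

    import time

    import pandas as pd
    import sqlalchemy as sa

    # Connect to CrateDB; host and port assume a local instance.
    engine = sa.create_engine("crate://localhost:4200")

    # A synthetic frame with three "small" columns, as discussed above.
    df = pd.DataFrame({
        "a": range(100_000),
        "b": range(100_000),
        "c": range(100_000),
    })

    # Insert the frame in batches of `chunksize` rows each, and time the
    # whole operation for a few candidate sizes.
    for chunksize in (1_000, 5_000, 20_000):
        start = time.perf_counter()
        df.to_sql("demo", engine, if_exists="replace", index=False,
                  chunksize=chunksize)
        print(f"chunksize={chunksize}: {time.perf_counter() - start:.2f}s")

Whether larger chunks actually pay off will depend on the record width and on
the request size limits of your HTTP infrastructure, as outlined above.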
