@@ -76,9 +76,8 @@ workload across multiple batches, using a defined chunk size.

You will observe that the optimal chunk size highly depends on the shape of
your data, specifically the width of each record, i.e. the number of columns
- and their individual sizes. You will need to determine a good chunk size by
- running corresponding experiments on your own behalf. For that purpose, you
- can use the `insert_pandas.py`_ program as a blueprint.
+ and their individual sizes, which will in the end determine the total size of
+ each batch/chunk.

A few details should be taken into consideration when determining the optimal
chunk size for a specific dataset. We are outlining the two major ones.
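
For orientation, here is a minimal sketch of such a chunked batch insert. It is
not the `insert_pandas.py`_ program itself; the connection URL, table name, and
DataFrame shape are illustrative assumptions, and it only relies on the standard
``chunksize`` and ``method="multi"`` parameters of ``pandas.DataFrame.to_sql``,
assuming the CrateDB SQLAlchemy dialect is installed.

.. code-block:: python

    import pandas as pd
    import sqlalchemy as sa

    # Illustrative connection URL and table name; adjust to your environment.
    engine = sa.create_engine("crate://localhost:4200")

    # A demo DataFrame; the number of columns and their individual sizes
    # determine how large each transmitted batch/chunk becomes.
    df = pd.DataFrame({f"col_{i}": range(50_000) for i in range(10)})

    # Split the insert into batches of 5_000 records each.
    df.to_sql(
        "demo_table",
        engine,
        index=False,
        if_exists="replace",
        chunksize=5_000,
        method="multi",
    )
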
@@ -106,8 +105,11 @@ workload across multiple batches, using a defined chunk size.

It is a good idea to start your explorations with a chunk size of 5_000, and
then see if performance improves when you increase or decrease that figure.
- Chunk sizes of 20000 may also be applicable, but make sure to take the limits
- of your HTTP infrastructure into consideration.
+ People are reporting that 10_000-20_000 is their optimal setting, but if you
+ process, for example, just three "small" columns, you may also experiment with
+ `leveling up to 200_000`_, because `the chunksize should not be too small`_.
+ If it is too small, the I/O cost will be too high to overcome the benefit of
+ batching.

In order to learn more about what wide- vs. long-form (tidy, stacked, narrow)
data means in the context of `DataFrame computing`_, let us refer you to `a
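
To find a suitable setting empirically, you can time a few candidate chunk
sizes against your target table, as sketched below. The candidate values,
connection URL, table name, and DataFrame shape are assumptions for
illustration only; the optimum will depend on your data shape and on the
limits of your HTTP infrastructure.

.. code-block:: python

    import time

    import pandas as pd
    import sqlalchemy as sa

    # Illustrative setup; adjust URL, table name, and data shape to your case.
    engine = sa.create_engine("crate://localhost:4200")
    df = pd.DataFrame({f"col_{i}": range(200_000) for i in range(3)})

    # Time a handful of candidate chunk sizes to find a sweet spot.
    for chunksize in (5_000, 10_000, 20_000, 200_000):
        start = time.perf_counter()
        df.to_sql(
            "chunksize_probe",
            engine,
            index=False,
            if_exists="replace",
            chunksize=chunksize,
            method="multi",
        )
        print(f"chunksize={chunksize}: {time.perf_counter() - start:.2f}s")
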
@@ -125,14 +127,16 @@ workload across multiple batches, using a defined chunk size.
.. _chunking: https://en.wikipedia.org/wiki/Chunking_(computing)
.. _CrateDB bulk operations: https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations
.. _DataFrame computing: https://realpython.com/pandas-dataframe/
- .. _insert_pandas.py: https://github.com/crate/crate-python/blob/master/examples/insert_pandas.py
+ .. _insert_pandas.py: https://github.com/crate/cratedb-examples/blob/main/by-language/python-sqlalchemy/insert_pandas.py
+ .. _leveling up to 200_000: https://acepor.github.io/2017/08/03/using-chunksize/
.. _NumPy: https://en.wikipedia.org/wiki/NumPy
.. _pandas: https://en.wikipedia.org/wiki/Pandas_(software)
.. _pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
.. _Python: https://en.wikipedia.org/wiki/Python_(programming_language)
.. _relational databases: https://en.wikipedia.org/wiki/Relational_database
.. _SQL: https://en.wikipedia.org/wiki/SQL
.. _SQLAlchemy: https://aosabook.org/en/v2/sqlalchemy.html
+ .. _the chunksize should not be too small: https://acepor.github.io/2017/08/03/using-chunksize/
.. _wide-narrow-general: https://en.wikipedia.org/wiki/Wide_and_narrow_data
.. _wide-narrow-data-computing: https://dtkaplan.github.io/DataComputingEbook/chap-wide-vs-narrow.html#chap:wide-vs-narrow
.. _wide-narrow-pandas-tutorial: https://anvil.works/blog/tidy-data