@@ -70,14 +70,27 @@ multiple batches, using a defined chunk size.

You will observe that the optimal chunk size highly depends on the shape of
your data, specifically the width of each record, i.e. the number of columns
- and their individual sizes. You will need to determine a good chunk size by
- running corresponding experiments on your own behalf. For that purpose, you
- can use the `insert_pandas.py`_ program as a blueprint.
+ and their individual sizes, which will ultimately determine the total size
+ of each batch/chunk.

- It is a good idea to start your explorations with a chunk size of 5000, and
+ Two factors should be taken into consideration when determining the optimal
+ chunk size for a specific dataset. First, when working with data larger than
+ the main memory available on your machine, each chunk should be small enough
+ to fit into memory, but large enough to minimize the overhead of a single
+ data insert operation. Second, as each batch is submitted via HTTP, you
+ should be aware of the request size limits of your HTTP infrastructure.
+
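+ As a rough heuristic for balancing those two constraints, you can measure
+ the in-memory footprint of a data sample and derive a chunk size from a
+ byte budget. The following is a minimal sketch, not a definitive recipe:
+ the file name ``data.csv``, the CSV source, and the 32 MiB budget are
+ assumptions to adapt to your own environment.
+
+ .. code-block:: python
+
+     import pandas as pd
+
+     # Read a small sample to estimate the size of a single record.
+     sample = pd.read_csv("data.csv", nrows=1_000)
+     bytes_per_row = sample.memory_usage(deep=True, index=False).sum() / len(sample)
+
+     # Target batches of roughly 32 MiB, an arbitrary budget intended to
+     # stay well below typical HTTP request size limits.
+     budget = 32 * 1024 * 1024
+     chunksize = max(1, int(budget // bytes_per_row))
+     print(f"{bytes_per_row:.0f} bytes/row -> chunksize {chunksize}")
+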
+ You will need to determine a good chunk size by running your own
+ experiments. For that purpose, you can use the `insert_pandas.py`_ program
+ as a blueprint.
+
+ It is a good idea to start your explorations with a chunk size of 5_000, and
then see if performance improves when you increase or decrease that figure.
- Chunk sizes of 20000 may also be applicable, but make sure to take the limits
- of your HTTP infrastructure into consideration.
+ Users report that 10_000 to 20_000 is their optimal setting, but if you
+ process, for example, just three "small" columns, you may also experiment
+ with `leveling up to 200_000`_, because `the chunksize should not be too small`_.
+ If it is too small, the I/O cost will outweigh the benefit of batching.
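+
+ The following sketch shows such a batched insert, using the ``chunksize``
+ parameter of ``pandas.DataFrame.to_sql``. It assumes the CrateDB SQLAlchemy
+ dialect is installed; the connection URL, file name, and table name are
+ placeholders to adapt to your environment.
+
+ .. code-block:: python
+
+     import pandas as pd
+     import sqlalchemy as sa
+
+     # Connect to CrateDB through its SQLAlchemy dialect.
+     engine = sa.create_engine("crate://localhost:4200")
+
+     df = pd.read_csv("data.csv")
+
+     # Insert the DataFrame in batches of 5_000 records each; vary the
+     # figure and measure how throughput changes.
+     df.to_sql("testdrive", engine, if_exists="append", index=False,
+               chunksize=5_000)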

In order to learn more about what wide- vs. long-form (tidy, stacked, narrow)
data means in the context of `DataFrame computing`_, let us refer you to `a
@@ -93,14 +106,16 @@ multiple batches, using a defined chunk size.

.. _CrateDB bulk operations: https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations
.. _DataFrame computing: https://realpython.com/pandas-dataframe/
- .. _insert_pandas.py: https://github.com/crate/crate-python/blob/master/examples/insert_pandas.py
+ .. _insert_pandas.py: https://github.com/crate/cratedb-examples/blob/main/by-language/python-sqlalchemy/insert_pandas.py
+ .. _leveling up to 200_000: https://acepor.github.io/2017/08/03/using-chunksize/
.. _NumPy: https://en.wikipedia.org/wiki/NumPy
.. _pandas: https://en.wikipedia.org/wiki/Pandas_(software)
.. _pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
.. _Python: https://en.wikipedia.org/wiki/Python_(programming_language)
.. _relational databases: https://en.wikipedia.org/wiki/Relational_database
.. _SQL: https://en.wikipedia.org/wiki/SQL
.. _SQLAlchemy: https://aosabook.org/en/v2/sqlalchemy.html
+ .. _the chunksize should not be too small: https://acepor.github.io/2017/08/03/using-chunksize/
.. _wide-narrow-general: https://en.wikipedia.org/wiki/Wide_and_narrow_data
.. _wide-narrow-data-computing: https://dtkaplan.github.io/DataComputingEbook/chap-wide-vs-narrow.html#chap:wide-vs-narrow
.. _wide-narrow-pandas-tutorial: https://anvil.works/blog/tidy-data