@@ -70,14 +70,26 @@ multiple batches, using a defined chunk size.
  You will observe that the optimal chunk size highly depends on the shape of
  your data, specifically the width of each record, i.e. the number of columns
- and their individual sizes. You will need to determine a good chunk size by
- running corresponding experiments on your own behalf. For that purpose, you
- can use the `insert_pandas.py`_ program as a blueprint.
+ and their individual sizes, which will, in the end, determine the total size
+ of each batch/chunk.

- It is a good idea to start your explorations with a chunk size of 5000, and
+ Two specific things should be taken into consideration when determining the
+ optimal chunk size for a specific dataset. First, when working with data
+ larger than the main memory available on your machine, each chunk should be
+ small enough to fit into memory, but large enough to minimize the overhead of
+ a single data insert operation. Second, because each batch is submitted over
+ HTTP, you should be aware of the request size limits of your HTTP
+ infrastructure.
+
+ You will need to determine a good chunk size by running corresponding
+ experiments on your own behalf. For that purpose, you can use the
+ `insert_pandas.py`_ program as a blueprint.
+
+ It is a good idea to start your explorations with a chunk size of 5_000, and
  then see if performance improves when you increase or decrease that figure.
- Chunk sizes of 20000 may also be applicable, but make sure to take the limits
- of your HTTP infrastructure into consideration.
+ Some users report that 10_000 is their optimal setting, but if you have,
+ for example, just three columns, you may also experiment with `leveling up to
+ 200_000`_, because `the chunksize should not be too small`_. If it is too
+ small, the I/O cost will be too high to overcome the benefit of batching.


  In order to learn more about what wide- vs. long-form (tidy, stacked, narrow)
  data means in the context of `DataFrame computing`_, let us refer you to `a
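The two sizing considerations above (memory footprint per chunk, HTTP request size) can be turned into a rough starting estimate. The following sketch derives a chunk size from a DataFrame's measured per-row footprint; the 10 MiB per-batch budget is an assumed example figure, not a documented CrateDB limit:

```python
import pandas as pd

# Example DataFrame standing in for your dataset.
df = pd.DataFrame({"id": range(100_000), "payload": ["x" * 20] * 100_000})

# Measure the average in-memory size of one row, including string contents.
bytes_per_row = df.memory_usage(deep=True).sum() / len(df)

# Assumed per-batch budget; adjust to your HTTP infrastructure's limits.
target_batch_bytes = 10 * 1024 * 1024

# Derive a chunk size that keeps each batch within the budget.
chunksize = max(1, int(target_batch_bytes // bytes_per_row))
print(chunksize)
```

Treat the result only as a starting point for the experiments described above; the serialized HTTP payload size differs from the in-memory footprint.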
@@ -93,14 +105,16 @@ multiple batches, using a defined chunk size.
  .. _CrateDB bulk operations: https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations
  .. _DataFrame computing: https://realpython.com/pandas-dataframe/
- .. _insert_pandas.py: https://github.com/crate/crate-python/blob/master/examples/insert_pandas.py
+ .. _insert_pandas.py: https://github.com/crate/cratedb-examples/blob/main/by-language/python/insert_pandas.py
+ .. _leveling up to 200_000: https://acepor.github.io/2017/08/03/using-chunksize/
  .. _NumPy: https://en.wikipedia.org/wiki/NumPy
  .. _pandas: https://en.wikipedia.org/wiki/Pandas_(software)
  .. _pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
  .. _Python: https://en.wikipedia.org/wiki/Python_(programming_language)
  .. _relational databases: https://en.wikipedia.org/wiki/Relational_database
  .. _SQL: https://en.wikipedia.org/wiki/SQL
  .. _SQLAlchemy: https://aosabook.org/en/v2/sqlalchemy.html
+ .. _the chunksize should not be too small: https://acepor.github.io/2017/08/03/using-chunksize/
  .. _wide-narrow-general: https://en.wikipedia.org/wiki/Wide_and_narrow_data
  .. _wide-narrow-data-computing: https://dtkaplan.github.io/DataComputingEbook/chap-wide-vs-narrow.html#chap:wide-vs-narrow
  .. _wide-narrow-pandas-tutorial: https://anvil.works/blog/tidy-data
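The batched insert pattern this change documents can be sketched with pandas' ``to_sql()`` and its ``chunksize`` parameter. An in-memory SQLite engine stands in for CrateDB so the example is self-contained; the ``crate://localhost:4200`` connection string in the comment is an assumption that requires the CrateDB SQLAlchemy dialect to be installed:

```python
import pandas as pd
import sqlalchemy as sa

# In-memory SQLite engine as a stand-in for CrateDB. With the CrateDB
# SQLAlchemy dialect installed, you would instead use something like
# sa.create_engine("crate://localhost:4200") (assumed connection string).
engine = sa.create_engine("sqlite://")

df = pd.DataFrame({"id": range(20_000), "value": [i * 0.5 for i in range(20_000)]})

# chunksize=5_000 submits the 20_000 rows in four batched INSERT operations,
# the suggested starting point before tuning up or down.
df.to_sql("demo", engine, index=False, if_exists="append", chunksize=5_000)

with engine.connect() as conn:
    count = conn.execute(sa.text("SELECT COUNT(*) FROM demo")).scalar_one()
print(count)  # 20000
```

Measuring wall-clock time of the ``to_sql()`` call while varying ``chunksize`` is the simplest form of the experiment recommended above.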