.. _sqlalchemy-pandas:
.. _sqlalchemy-dataframe:

================================
SQLAlchemy: DataFrame operations
================================

About
=====

This section of the documentation demonstrates support for efficient batch
``INSERT`` operations with `pandas`_, using the CrateDB SQLAlchemy dialect.


Introduction
============

The :ref:`pandas DataFrame <pandas:api.dataframe>` is a structure that contains
two-dimensional data and its corresponding labels. DataFrames are widely used
in data science, machine learning, scientific computing, and many other
data-intensive fields.

DataFrames are similar to SQL tables or the spreadsheets that you work with in
Excel or Calc. In many cases, DataFrames are faster, easier to use, and more
powerful than tables or spreadsheets because they are an integral part of the
`Python`_ and `NumPy`_ ecosystems.

The :ref:`pandas I/O subsystem <pandas:api.io>` for `relational databases`_
using `SQL`_ is based on `SQLAlchemy`_.
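
Because pandas speaks SQLAlchemy, a table can be read into a DataFrame through
an SQLAlchemy engine object created with the CrateDB dialect. The following
snippet is a minimal sketch only: it assumes a CrateDB instance listening on
``localhost:4200``, and an existing table named ``testdrive``.

.. code-block:: python

    import pandas as pd
    import sqlalchemy as sa

    # Connect to CrateDB using the SQLAlchemy dialect.
    # Host and port are assumptions; adjust them to your setup.
    engine = sa.create_engine("crate://localhost:4200")

    # Read the contents of a table into a DataFrame.
    # The table name "testdrive" is only used for illustration.
    df = pd.read_sql(sql="SELECT * FROM testdrive", con=engine)
    print(df)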


.. rubric:: Table of Contents

.. contents::
   :local:


Efficient ``INSERT`` operations with pandas
===========================================

The package provides an ``insert_bulk`` support function that lets the
:meth:`pandas:pandas.DataFrame.to_sql` method use the `CrateDB bulk
operations`_ endpoint. It splits your insert workload into multiple batches,
based on a defined chunk size.

    >>> import sqlalchemy as sa
    >>> from pandas._testing import makeTimeDataFrame
    >>> from crate.client.sqlalchemy.support import insert_bulk
    ...
    >>> # Define the number of records, and the chunk size.
    >>> INSERT_RECORDS = 42
    >>> CHUNK_SIZE = 8
    ...
    >>> # Connect to CrateDB, and create a pandas DataFrame.
    >>> df = makeTimeDataFrame(nper=INSERT_RECORDS, freq="S")
    >>> engine = sa.create_engine(f"crate://{crate_host}")
    ...
    >>> # Insert records in batches. 42 records at a chunk size of 8 yield
    >>> # six batches, because 42 / 8 = 5.25 is rounded up.
    >>> df.to_sql(
    ...     name="test-testdrive",
    ...     con=engine,
    ...     if_exists="replace",
    ...     index=False,
    ...     chunksize=CHUNK_SIZE,
    ...     method=insert_bulk,
    ... )
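
To verify the ingest, you can subsequently count the number of records in the
target table. The snippet below is a sketch only, and is not part of the
doctest narrative: it reuses the ``engine`` and the ``test-testdrive`` table
from the example above. Because CrateDB makes newly written records visible to
queries only after a table refresh, the table is refreshed explicitly before
counting.

.. code-block:: python

    import sqlalchemy as sa

    # Count the inserted records, reusing the engine and table name from above.
    with engine.connect() as connection:
        # Make recently inserted records visible to the subsequent query.
        connection.execute(sa.text('REFRESH TABLE "test-testdrive"'))
        result = connection.execute(sa.text('SELECT COUNT(*) FROM "test-testdrive"'))
        print(result.scalar())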

.. TIP::

    The optimal chunk size strongly depends on the shape of your data,
    specifically the width of each record, i.e. the number of columns and
    their individual sizes. You will need to determine a good chunk size by
    running corresponding experiments yourself. For that purpose, you can use
    the `insert_pandas.py`_ program as a blueprint; a minimal sketch for such
    an experiment is shown after this tip.

    It is a good idea to start your explorations with a chunk size of 5000,
    and then see if performance improves when you increase or decrease that
    figure. Chunk sizes of 20000 may also work well, but make sure to take
    the limits of your HTTP infrastructure into account.

    To learn more about what wide- vs. long-form (tidy, stacked, narrow) data
    means in the context of `DataFrame computing`_, see `a general
    introduction <wide-narrow-general_>`_, the corresponding section in the
    `Data Computing book <wide-narrow-data-computing_>`_, and a `pandas
    tutorial <wide-narrow-pandas-tutorial_>`_ about the same topic.
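
The sketch below outlines how such a chunk size experiment could look. It is
illustrative only: the ``benchmark_chunk_sizes`` helper, the table name
``chunk-size-benchmark``, and the candidate chunk sizes are made up, and a
CrateDB instance is assumed to be listening on ``localhost:4200``.

.. code-block:: python

    import time

    import sqlalchemy as sa
    from pandas._testing import makeTimeDataFrame
    from crate.client.sqlalchemy.support import insert_bulk


    def benchmark_chunk_sizes(df, engine, chunk_sizes):
        """Time `DataFrame.to_sql` for a range of candidate chunk sizes."""
        for chunk_size in chunk_sizes:
            start = time.perf_counter()
            df.to_sql(
                name="chunk-size-benchmark",
                con=engine,
                if_exists="replace",
                index=False,
                chunksize=chunk_size,
                method=insert_bulk,
            )
            elapsed = time.perf_counter() - start
            print(f"chunksize={chunk_size:>6}: {elapsed:.2f} seconds")


    # Generate a synthetic DataFrame, and connect to CrateDB.
    df = makeTimeDataFrame(nper=75_000, freq="S")
    engine = sa.create_engine("crate://localhost:4200")

    # Compare ingest performance around the suggested starting point of 5000.
    benchmark_chunk_sizes(df, engine, chunk_sizes=[1_000, 5_000, 10_000, 20_000])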

.. hidden: Disconnect from database

    >>> engine.dispose()


.. _CrateDB bulk operations: https://crate.io/docs/crate/reference/en/latest/interfaces/http.html#bulk-operations
.. _DataFrame computing: https://realpython.com/pandas-dataframe/
.. _insert_pandas.py: https://github.com/crate/crate-python/blob/master/examples/insert_pandas.py
.. _NumPy: https://en.wikipedia.org/wiki/NumPy
.. _pandas: https://en.wikipedia.org/wiki/Pandas_(software)
.. _pandas DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
.. _Python: https://en.wikipedia.org/wiki/Python_(programming_language)
.. _relational databases: https://en.wikipedia.org/wiki/Relational_database
.. _SQL: https://en.wikipedia.org/wiki/SQL
.. _SQLAlchemy: https://aosabook.org/en/v2/sqlalchemy.html
.. _wide-narrow-general: https://en.wikipedia.org/wiki/Wide_and_narrow_data
.. _wide-narrow-data-computing: https://dtkaplan.github.io/DataComputingEbook/chap-wide-vs-narrow.html#chap:wide-vs-narrow
.. _wide-narrow-pandas-tutorial: https://anvil.works/blog/tidy-data