feat: High performance pandas integration. #24
Merged
149 commits
12660f3
Testing tweak.
amunra bdbd283
Merge remote-tracking branch 'origin/main' into pandas_integration
amunra 9e93993
Updated to c-questdb-client 2.1.1
amunra e56027c
Some progress..
amunra ed1658e
Merge remote-tracking branch 'origin/main' into pandas_integration
amunra d692098
Fixed broken check.
amunra a92a858
symbols validation.
amunra bebbc8a
Added repl command to proj script.
amunra ab05dec
Progress with pandas method input validation.
amunra dc65b2c
Moved Python IntVec to C int_vec.
amunra a46490a
More code to avoid python lists.
amunra 737f970
CI fixup.
amunra d574486
CI fixup 2.
amunra b1c17b0
CI fixup 3.
amunra fc99054
CI fixup 4.
amunra 3654938
CI fixup 5.
amunra 7102ddd
Types and buffers.
amunra 5d6d4be
Improved column index handling, actually getting buffers from numpy a…
amunra 95a66bc
Introducing a small rust lib to convert python strings to UTF-8 witho…
amunra e43e564
Implemented conversions for UCS1/2/4.
amunra 25bc49b
Renamed rust lib and wired up the build and linkage bits in setup.py
amunra b199db6
Auxilliary rust lib rename.
amunra 7b5200e
Reworked API for better buffer reuse and for individual UCS1, 2, 4 fu…
amunra e945702
cbindgen to generate C .h and Cython .pxd headers.
amunra 161f6ae
cbindgen fixup & including lib headers from setup.py
amunra 5b5d58b
Wrote Cython function to invoke 'pystr-to-utf8' lib. Additional refac…
amunra 12598c4
Transitioned all string buffer conversions to new Rust code lib.
amunra 3544312
Made pystr_to_utf8 addresses stable.
amunra 9fce68b
Rust str to utf8 lib fixes (but still broken - ongoing)
amunra 9155357
Minor unicode test improvement. Transcoding works now.
amunra f953ad0
Rust PyStr lib tests and a few bugfixes.
amunra ab1566c
Updated pystr lib readme.
amunra d79955f
More pystr-to_utf8 tests and improvements.
amunra e8980d5
Added UCS-4 tests.
amunra 644bdba
More unicode testing.
amunra b071edf
Fixed include for cython generation compatability.
amunra 091e4c8
Table name columns, symbols and timestamps now work!
amunra 29895f6
Handling null column values in strings.
amunra ef6735b
Added arrow C data interface type definitions.
amunra ead851d
Code reorg.
amunra d14eea9
Consolidated approach writeup and code reorg into .pxi files.
amunra 94378a0
Undid removal of ingress.c from gitignore.
amunra 939c50d
More writeup with final types.
amunra 2355b3d
Categories added to write-up.
amunra 1c14cb7
Consolidated Pandas logic into single .pxi file.
amunra 4f7ef81
File renaming.
amunra de4b471
Reorganised existing logic into a sorted array of col_t types. Some p…
amunra ef2424d
Documented float16, added col_source_t.
amunra 993dd5f
Beginning to resolve columns.
amunra 8387209
More array extraction logic.
amunra 5e5158f
Updating of types, updating tech doc for timezone timestamps.
amunra 2a26d8f
Fixed up most cython build issues. Mostly enum usage issues.
amunra e9890a6
Code builds again finally.
amunra 7e75260
Dead code removal.
amunra 0e67480
Types to dispatch codes to functions.
amunra 41d3e6b
Some test fixup
amunra 7a1c0b0
Yay, segfault!
amunra 3af4899
Fixed a few segfaults, got some more.
amunra 9dc17a3
Fixed segfaults.
amunra 9665d9f
Added missing dispatch codes and lots of TODOs.
amunra 194ad25
Got rid of a lot of INCREF/DECREF silliness.
amunra 9ea4a0e
Ohh look. Tests pass again.
amunra 7c03446
Fixed another segfault.
amunra 80748ad
Another bug bites the dust.
amunra 01d8732
Implemented symbols='auto' and i32 column support.
amunra 579e649
Swapped out error prone 'bint / except False' declarations with 'void…
amunra df823b7
More string trouble.
amunra cb97a20
Normality restored.
amunra 5c1aaff
py obj to symbols.
amunra dc5795f
Done timestamp at and columns. Found out that timezone timestamps are…
amunra b7cfd50
Added some testing notes.
amunra 12c6eec
TODO fixup.
amunra 8593b75
Added support for datetimes with timezones (only nanosecond based) vi…
amunra 7f6503d
Bool column support from Python objects.
amunra 25c91be
Added arrow-based boolean pandas datatype column support.
amunra 8b31010
Support for arrow integer columns.
amunra 6cad41b
Progress handling strings.
amunra 1499cff
Support for objects with integers.
amunra 0e9e287
Float object support.
amunra a1e7d07
arrow f32 and f64
amunra 898b157
str column pyarrow.
amunra 88e618a
LTO, basic perf tests, removed debug logging, fixed a bug in string c…
amunra 0158fad
Tests for categories.
amunra fd01ef3
Releasing and reacquiring GIL to avoid starving other threads.
amunra dfdd302
Fully releasing GIL whenever possible. This was fiddly to get working.
amunra 5041d26
Refactoring out benchmarks, refactoring Py str to UTF8 rust impl.
amunra e4135d0
8% perf improvements in Python string to UTF-8 conversions.
amunra 754534c
Multithreading benchmark.
amunra 5915a00
Implemented column (arrow and pybuffer) cleanup.
amunra 9b75a12
Formatting.
amunra 7f8dab4
Tested all-nulls column is altogether skipped.
amunra 21f39f8
Refactoring and sorting columns in C.
amunra f28903e
Updated c-questdb-client submodule: Latest perf improvements.
amunra 8b6652a
Fixed broken build.
amunra cba2aaf
Single logic to infer object column types.
amunra e07bb42
Tests fixup.
amunra bd4bca6
More tests.
amunra 7c734b5
Fixed a bug passing None in datetime columns.
amunra 480343d
Tests for degenerate pandas dataframes.
amunra b1f2ebf
Informative message for row of nulls.
amunra 9811007
Mandating pyarrow dependency for pandas functionality.
amunra c03c4ed
There's a chance this will fix CI.
amunra b1a4dc7
Second attempt to fix up the CI.
amunra 8b3e45d
Third attempt to fix up the CI.
amunra 15330b0
Reduced stack size in case of errors to aid legibility.
amunra cee6d4b
Fourth attempt to fix up the CI.
amunra 5668b57
Fifth attempt to fix up the CI.
amunra 77a612c
Sixth attempt to fix up the CI.
amunra 57ae0b8
Progress on API docs.
amunra a2763f9
Found and fixed a memory leak.
amunra 960cd74
More fuzzing.
amunra 5998a1c
Added support from taking the table name from the df.index.name, rena…
amunra be0407c
General fixes and testcases for handling timestamps.
amunra 24ee3cd
Should fix tests in CI.
amunra e8d8daa
Extra testing of 'TimestampXXX.now()' and hopefully fixing CI.
amunra 5f8e8ee
CI fixup attempt.
amunra 6b77da9
Fixing broken 32-bit binaries.
amunra aaf7e95
Slimmed down 'col_t' type.
amunra b9b2081
Implemented (but not yet tested) pandas auto-flush logic. Also releas…
amunra dcccabd
Tweak to pandas auto-flush logic.
amunra 98a5496
Basic pandas end-to-end test.
amunra 02d49dd
Tests (and bugfixes) for panda's auto-flush.
amunra 1801f10
Pandas API docs.
amunra 45aa14b
Renamed '.pandas()' to '.dataframe()'.
amunra ada3ac8
Int object int64 bounds check tests.
amunra 89c50b4
Test strided numpy array with zero-copy into pandas.
amunra 64f14fa
Serializing subset of dataframe rows.
amunra def3887
Improved error messaging.
amunra 81d6cb8
Testing chunked arrow arrays.
amunra 83a937a
Removed completed TODOs
amunra 712ec1d
Hopefully fixing CI.
amunra 88b043e
Dataframe API doc fixup.
amunra 58de10c
Fixing the CI
amunra b461557
Parquet rountrip test.
amunra 9625447
Added missing libs in dev_requirements.txt
amunra 02da96b
CI fixup (hopefully)
amunra e26f5fe
CI fixup (hopefully, again)
amunra 6dd6cf6
CI fixup (once more, with feeling)
amunra 67cedd9
More examples.
amunra 0c7b6ef
Parquet data example.
amunra cd97af2
Updated parquet example, added to docs.
amunra ab69e9c
Updated examples manifest to hint at more examples for Pandas datafra…
amunra 3af8c85
Disabled bytecode file gen for install_rust.py
amunra 25d4e2b
Updated CHANGELOG.rst
amunra 39dc427
Minor error reporting bugfix.
amunra 7818149
Improved docs.
amunra 46999e7
Updated c-questdb-client dependency.
amunra 32f3394
Exception type tidy-up.
amunra 38eb382
Fixed typos spotted during the code review.
amunra
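Several commits above mention converting Python strings stored as UCS1, UCS2 or UCS4 into UTF-8. As background, CPython (PEP 393) stores each `str` with 1, 2 or 4 bytes per code point, chosen from the largest code point in the string. The following is a minimal pure-Python sketch of that classification, not the PR's Rust implementation; `ucs_kind` is a hypothetical helper name introduced here for illustration.

```python
def ucs_kind(s: str) -> int:
    """Bytes per code point CPython would use to store `s` (PEP 393): 1, 2 or 4.

    Hypothetical helper for illustration; it is not part of the PR.
    """
    max_cp = max(map(ord, s), default=0)
    if max_cp < 0x100:
        return 1  # UCS1: ASCII / Latin-1 range
    if max_cp < 0x10000:
        return 2  # UCS2: Basic Multilingual Plane
    return 4      # UCS4: anything beyond the BMP


samples = ["abc", "caf\u00e9", "\u20ac", "\U0001F600"]
for text in samples:
    # The UTF-8 length differs from the in-memory width, which is why a
    # transcoding step is needed before the bytes can be sent.
    print(ascii(text), ucs_kind(text), len(text.encode("utf-8")))
```

The in-memory width and the UTF-8 byte length are independent: `"\u20ac"` is stored as UCS2 but needs three bytes in UTF-8.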
@@ -1,3 +1,7 @@
 {
-    "esbonio.sphinx.confDir": ""
+    "esbonio.sphinx.confDir": "",
+    "cmake.configureOnOpen": false,
+    "files.associations": {
+        "ingress_helper.h": "c"
+    }
 }
Submodule c-questdb-client updated, 3 files:
+39 −1 | cpp_test/test_line_sender.cpp
+28 −21 | include/questdb/ilp/line_sender.hpp
+31 −17 | questdb-rs/src/ingress/mod.rs
@@ -0,0 +1,71 @@
import sys
import subprocess
import shlex
import textwrap
import platform


class UnsupportedDependency(Exception):
    pass


def pip_install(package):
    args = [
        sys.executable,
        '-m', 'pip', 'install',
        '--upgrade',
        '--only-binary', ':all:',
        package]
    args_s = ' '.join(shlex.quote(arg) for arg in args)
    sys.stderr.write(args_s + '\n')
    res = subprocess.run(
        args,
        stderr=subprocess.STDOUT,
        stdout=subprocess.PIPE)
    if res.returncode == 0:
        return
    output = res.stdout.decode('utf-8')
    if 'Could not find a version that satisfies the requirement' in output:
        raise UnsupportedDependency(output)
    else:
        sys.stderr.write(output + '\n')
        sys.exit(res.returncode)


def try_pip_install(package):
    try:
        pip_install(package)
    except UnsupportedDependency as e:
        msg = textwrap.indent(str(e), ' ' * 8)
        sys.stderr.write(f' Ignored unsatisfiable dependency:\n{msg}\n')


def ensure_timezone():
    try:
        import zoneinfo
        if platform.system() == 'Windows':
            pip_install('tzdata')  # for zoneinfo
    except ImportError:
        pip_install('pytz')


def main():
    ensure_timezone()
    try_pip_install('pandas')
    try_pip_install('numpy')
    try_pip_install('pyarrow')

    on_linux_is_glibc = (
        (not platform.system() == 'Linux') or
        (platform.libc_ver()[0] == 'glibc'))
    is_64bits = sys.maxsize > 2**32
    is_cpython = platform.python_implementation() == 'CPython'
    if on_linux_is_glibc and is_64bits and is_cpython:
        # Ensure that we've managed to install the expected dependencies.
        import pandas
        import numpy
        import pyarrow


if __name__ == "__main__":
    main()
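The script above mixes three decisions with their side effects: which pip command to run, whether a failure means "no binary wheel for this platform", and whether the current platform is one where the imports are expected to succeed. The sketch below separates those decisions into pure functions so they can be exercised without invoking pip. The function names `build_pip_args`, `is_unsatisfiable` and `expects_binary_wheels` are hypothetical and do not appear in the PR; the logic is copied from the script.

```python
import platform
import shlex
import sys


def build_pip_args(package):
    """Command line the script runs: binary wheels only, never build from source."""
    return [
        sys.executable,
        '-m', 'pip', 'install',
        '--upgrade',
        '--only-binary', ':all:',
        package]


def is_unsatisfiable(output):
    """True if pip's combined output indicates that no matching wheel exists."""
    return 'Could not find a version that satisfies the requirement' in output


def expects_binary_wheels(system, libc, maxsize, implementation):
    """Mirror of the guard in main(): glibc (if on Linux), 64-bit, CPython."""
    on_linux_is_glibc = (system != 'Linux') or (libc == 'glibc')
    is_64bits = maxsize > 2**32
    is_cpython = implementation == 'CPython'
    return on_linux_is_glibc and is_64bits and is_cpython


# Show the command that would run and the verdict for the current interpreter.
print(' '.join(shlex.quote(arg) for arg in build_pip_args('pandas')))
print(expects_binary_wheels(
    platform.system(),
    platform.libc_ver()[0],
    sys.maxsize,
    platform.python_implementation()))
```

On platforms where the guard is false (for example a non-glibc Linux or a 32-bit interpreter), the script tolerates missing wheels and skips the import check.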
@@ -1,8 +1,10 @@
 setuptools>=45.2.0
 Cython>=0.29.32
 wheel>=0.34.2
-cibuildwheel>=2.11.1
+cibuildwheel>=2.11.2
 Sphinx>=5.0.2
 sphinx-rtd-theme>=1.0.0
 twine>=4.0.1
 bump2version>=1.0.1
+pandas>=1.3.5
+numpy>=1.21.6
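The two added lines set lower bounds for `pandas` and `numpy`. As a rough way to check a development environment against those bounds, the sketch below compares installed versions using only the standard library. The helpers `version_tuple` and `check_minimums` are hypothetical, and the version comparison is deliberately naive (it looks only at the leading numeric components and ignores pre-release tags), so treat it as an illustration rather than a replacement for pip's resolver.

```python
import re
from importlib import metadata

# Lower bounds taken from the requirements diff above.
MINIMUMS = {'pandas': '1.3.5', 'numpy': '1.21.6'}


def version_tuple(version):
    """Leading numeric components, e.g. '1.21.6' -> (1, 21, 6). Naive by design."""
    public = version.split('+')[0]  # drop any local version suffix
    return tuple(int(part) for part in re.findall(r'\d+', public)[:3])


def check_minimums(minimums):
    """Map each package to True/False, or None when it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
            continue
        report[name] = version_tuple(installed) >= version_tuple(minimum)
    return report


print(check_minimums(MINIMUMS))
```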
@@ -0,0 +1,28 @@
# Profiling with Linux Perf

https://juanjose.garciaripoll.com/blog/profiling-code-with-linux-perf/index.html

```bash
$ TEST_QUESTDB_PATCH_PATH=1 perf record -g --call-graph dwarf python3 test/benchmark.py -v TestBencharkPandas.test_string_encoding_1m
test_string_encoding_1m (__main__.TestBencharkPandas.test_string_encoding_1m) ... Time: 4.682273147998785, size: 4593750000
ok

----------------------------------------------------------------------
Ran 1 test in 10.166s

OK
[ perf record: Woken up 1341 times to write data ]
Warning:
Processed 54445 events and lost 91 chunks!

Check IO/CPU overload!

[ perf record: Captured and wrote 405.575 MB perf.data (50622 samples) ]
```

# Rendering results

```bash
$ perf script | python3 perf/gprof2dot.py --format=perf | dot -Tsvg > perf/profile_graph.svg
$ (cd perf && python3 -m http.server)
```
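The benchmark line in the transcript above reports a time and a size, which can be turned into a throughput figure for comparing runs. The sketch below parses that line and computes the rate. It assumes that `Time` is in seconds and `size` is in bytes, which the transcript does not state; `parse_benchmark_line` and `throughput_mib_per_s` are hypothetical helpers, not part of the PR.

```python
import re

# Line copied from the transcript above.
LINE = ('test_string_encoding_1m (__main__.TestBencharkPandas.'
        'test_string_encoding_1m) ... Time: 4.682273147998785, size: 4593750000')


def parse_benchmark_line(line):
    """Extract (seconds, size) from a 'Time: <float>, size: <int>' line."""
    match = re.search(r'Time: ([0-9.]+), size: (\d+)', line)
    if match is None:
        raise ValueError('no benchmark figures found in line')
    return float(match.group(1)), int(match.group(2))


def throughput_mib_per_s(seconds, size_bytes):
    """Throughput in MiB/s, assuming `size_bytes` really is a byte count."""
    return size_bytes / seconds / (1024 * 1024)


seconds, size = parse_benchmark_line(LINE)
print(f'{throughput_mib_per_s(seconds, size):.1f} MiB/s')
```

Note that the reported time (about 4.7 s) is shorter than the 10.166 s test run, so the figure covers only the timed section of the benchmark, and the run was recorded under `perf`, which adds overhead.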