Comparison of memory efficiency of LMDB, S3, and Pandas for read/write operations on dataframes (new approach to making sure we accurately measure peakmem with ASV) #2204
base: master
Conversation
I am becoming less and less convinced by the setup-as-classes approach.
It is starting to generate a lot of boilerplate code due to unnecessary abstractions.
Maybe we should have a separate task to refactor the approach and make it more maintainable/easier to use.
- read/write dataframe to arcticdb
- read/write dataframe to arcticdb LMDB

The above problem is interesting in 2 ways:
This might be a bit of a nitpick, but I don't think this is a useful comment here.
It would be better to put it in a proper document, probably in the ASV wiki.
b) how to represent the comparison information effectively

To accurately measure the peakmem of any operation with ASV we have to exclude
the base memory, that is, the object memory space, from the calculation.
See comment above
self.ac = Arctic(RealComparisonBenchmarks.URL)
self.lib = self.ac[RealComparisonBenchmarks.LIB_NAME]
self.path = f"{tempfile.gettempdir()}/df.parquet"
self.path_to_read = f"{tempfile.gettempdir()}/df_to_read.parquet"
self.path / self.path_to_read are not great names; change them to something more meaningful, e.g. parquet_to_write / parquet_to_read (see the sketch below).
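A minimal sketch of the suggested rename, assuming the same setup() method as in the PR; the new attribute names are only illustrative:

import tempfile

def setup(self, *args):
    # hypothetical rename of the parquet paths used by the benchmark setup
    self.parquet_to_write = f"{tempfile.gettempdir()}/df.parquet"
    self.parquet_to_read = f"{tempfile.gettempdir()}/df_to_read.parquet"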
self.pid = os.getpid()
# With shared storage we create different libs for each process,
# therefore we initialize the symbol here also, not in setup_cache
self.s3_lib = RealComparisonBenchmarks.REAL_STORAGE_SETUP.get_modifiable_library(self.pid)
Why do we need self.pid here?
IIRC the tests are executed sequentially, so this shouldn't be needed.
All ASV code must always assume tests are executed in separate processes. That prevents errors that will arise if and when this happens, intentionally or unintentionally (e.g. a change in the ASV execution policy).
So I would actually make the opposite comment and require a code change in any future test that does not conform to this policy.
In other words, that assumption is there for safety reasons and allows engineers to always write code that will keep working despite any changes that may occur in the future.
This will also become part of the wiki I started to write for the ASV tests that run on many types of storages.
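A minimal sketch of the per-process isolation being discussed; the helper name and prefixes below are illustrative, not part of the PR:

import os

def per_process_names(lib_prefix="comparison_bench", symbol_prefix="sym"):
    # Derive a library name and symbol name from the current PID so that
    # benchmark code running in separate processes never collides on
    # shared storage.
    pid = os.getpid()
    return f"{lib_prefix}_{pid}", f"{symbol_prefix}_{pid}"

lib_name, symbol = per_process_names()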
elif btype == PANDAS_PARQUET:
    pd.read_parquet(self.path_to_read)
elif btype == ARCTICDB_LMDB:
    self.lib.read(RealComparisonBenchmarks.SYMBOL)
Why are we using RealComparisonBenchmarks.SYMBOL here?
Isn't this for LMDB?
Yes, I believe it is used exactly for LMDB.
    return next


class GeneralSetupOfLibrariesWithSymbols(EnvConfigurationBase):
Why do we even need GeneralSetupOfLibrariesWithSymbols and GeneralAppendSetup?
I don't see them used anywhere.
As mentioned, there is another PR in review (#2185) where those are used in actual tests, so the helper modules are taken from there. This file is "shared" and any modifications to it will go with the other PR.
This means that this PR will be merged only after the previous one, as that one carries the needed functionality.
I chose to separate this test from the others because it carries a new approach to ASV tests, while the other PR is for tests that we know well and that exist in an LMDB-only version.
# With shared storage we create different libs for each process,
# therefore we initialize the symbol here also, not in setup_cache
self.s3_lib = RealComparisonBenchmarks.REAL_STORAGE_SETUP.get_modifiable_library(self.pid)
self.s3_symbol = RealComparisonBenchmarks.REAL_STORAGE_SETUP.get_symbol_name_template(self.pid)
This is very hard to read:
RealComparisonBenchmarks.REAL_STORAGE_SETUP.get_symbol_name_template
Either:
- define this as a variable and use it that way, or
- refactor the setup to be functions rather than a class, given that they are just abstracting pretty simple calls to the Arctic client.
A sketch of both options follows below.
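A minimal sketch of the two options, assuming the same setup attributes as in the PR; the function bodies in option 2 are illustrative, not the PR's implementation:

# Option 1: pull the long attribute chain into a local variable.
storage_setup = RealComparisonBenchmarks.REAL_STORAGE_SETUP
self.s3_lib = storage_setup.get_modifiable_library(self.pid)
self.s3_symbol = storage_setup.get_symbol_name_template(self.pid)

# Option 2 (illustrative): plain functions wrapping the simple Arctic
# client calls instead of a setup class.
def get_modifiable_library(ac, pid):
    return ac.get_library(f"comparison_bench_{pid}", create_if_missing=True)

def get_symbol_name(pid):
    return f"sym_{pid}"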
Reference Issues/PRs
NOTE: only the tests are part of the review; utility.py and environment_setup.py are part of #2185 and are here only to support the creation of this PR.
IMPORTANT READ:
from: https://asv.readthedocs.io/en/stable/writing_benchmarks.html#peak-memory
Note
The peak memory benchmark also counts memory usage during the setup routine, which may confound the benchmark results. One way to avoid this is to use setup_cache instead.
My takeaways:
We have many tests which, unless they make heavy use of setup(), have no way to set up preconditions that cannot be done in setup_cache, because the benchmarks run in different processes. Therefore:
- we cannot assume that the number a peakmem benchmark reports relates only to the work done inside the peakmem function itself (see the sketch below).
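A minimal sketch of one way to separate an operation's peak memory from the base process memory, assuming an ASV-style benchmark class; this is only an illustration of the idea, not the PR's implementation:

import tempfile
import numpy as np
import pandas as pd

class PeakMemBaselineSketch:
    def setup(self):
        # setup() runs in the same process as the benchmark, so whatever is
        # allocated here is counted in every peakmem_* number as well.
        self.df = pd.DataFrame(np.random.rand(1_000_000, 10))
        self.path = f"{tempfile.gettempdir()}/sketch.parquet"

    def peakmem_baseline(self):
        # Does nothing: its result approximates the base (setup) memory.
        pass

    def peakmem_write_parquet(self):
        # Subtracting peakmem_baseline from this result approximates the
        # memory used by the write itself.
        self.df.to_parquet(self.path)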
Change Type (Required)
Any other comments?
Checklist
Checklist for code changes...