
Comparison of memory efficiency LMDB, S3, Pandas for read/write operations on dataframes (new approach on making sure we accurately measure peakmem with ASV) #2204

Open · wants to merge 3 commits into master
Conversation

@grusev (Collaborator) commented Feb 26, 2025

Reference Issues/PRs

NOTE: Only the tests are part of this review; utility.py and environment_setup.py belong to #2185 and are included here only to support the creation of this PR.

The test aims to compare the memory efficiency of read and write operations, all on one graph:
    - creation of a dataframe
    - read/write of a dataframe
    - read/write of a dataframe to ArcticDB LMDB
    - read/write of a dataframe to ArcticDB Amazon S3

The above problem is interesting in two ways:
 a) how to accurately measure the memory of a given operation with ASV
 b) how to represent the comparison information effectively

To accurately measure the peakmem of any operation with ASV, we have to exclude the base memory (the process's baseline object memory) from the calculation.

To be able to do that, we need to introduce a memory baseline: a run of the test which actually does nothing. It will produce a result X.

Each subsequent run of the same test with an operation that does anything will then produce a measurement X + Z, where Z is the actual memory of that specific operation.

The only logical way to represent the information is then to put everything on one graph (a sketch follows below):
  - the baseline
  - operation A
  - operation B
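
A minimal sketch of this layout, assuming standard ASV conventions; the Arctic URI, library name, symbol name, and parquet path are illustrative stand-ins, not the PR's actual code:

```python
import pandas as pd
from arcticdb import Arctic

NO_OP = "no-operation-load"
CREATE_DF = "create-df-pandas-from_dict"
PANDAS_PARQUET = "pandas-parquet"
ARCTICDB_LMDB = "arcticdb-lmdb"


class PeakMemComparison:
    # One parameterized benchmark means ASV plots every variant on the
    # same graph, with the no-op run serving as the baseline X.
    params = [NO_OP, CREATE_DF, PANDAS_PARQUET, ARCTICDB_LMDB]
    param_names = ["backend_type"]

    def setup(self, backend_type):
        # Assumes a pre-created LMDB library and parquet file (hypothetical).
        self.lib = Arctic("lmdb:///tmp/benchmarks")["comparison"]

    def peakmem_read(self, backend_type):
        if backend_type == NO_OP:
            pass  # baseline: measures only X, the base process memory
        elif backend_type == CREATE_DF:
            pd.DataFrame({"a": range(2_000_000)})  # X + Z(create)
        elif backend_type == PANDAS_PARQUET:
            pd.read_parquet("/tmp/df_to_read.parquet")  # X + Z(parquet read)
        elif backend_type == ARCTICDB_LMDB:
            self.lib.read("df")  # X + Z(arcticdb read)
```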

Sample measurement:

         ============================ ===========
         backend_type                 peak memory
         ---------------------------- -----------
         no-operation-load            2.66G
         create-df-pandas-from_dict   3.17G
         pandas-parquet               5.60G
         arcticdb-lmdb                5.01G
         ============================ ===========

IMPORTANT READ:
from: https://asv.readthedocs.io/en/stable/writing_benchmarks.html#peak-memory

Note
The peak memory benchmark also counts memory usage during the setup routine, which may confound the benchmark results. One way to avoid this is to use setup_cache instead.
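
A minimal sketch of that workaround, assuming ASV's documented behaviour that setup_cache runs once in its own process and its pickled return value is passed to the benchmark:

```python
import pandas as pd


class PeakMemWithSetupCache:
    def setup_cache(self):
        # Runs once in a separate process; building the dataframe here
        # does not count toward the peakmem result of the benchmark
        # process (only unpickling the returned value does).
        return pd.DataFrame({"a": range(2_000_000)})

    def peakmem_sum(self, df):
        # Only allocations made in this process are measured.
        df["a"].sum()
```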

My takeaways:
We have many tests that, unless they make heavy use of setup(), have no way to establish their preconditions, because these cannot be done in setup_cache as the benchmarks run in different processes. Therefore:
  • we cannot assume that the number we get is related only to the process inside peakmem
  • the process peak memory should be significant compared to the base memory (keep this in mind when planning tests)
  • if we get a signal for a 10% increase in memory, it might not be directly related to the actual test, but also to operations in setup
  • peakmem can give us only a clue about the trend, but perhaps not the actual numbers, unless a baseline as above is established; this makes the memray or memory_profiler approach better suited for actual analysis (see Compare memory needs for dataframe vs read/write dataframe in lmdb and aws s3 #2135, and the sketch below)
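
For comparison, a minimal sketch of the baseline-subtraction idea with memory_profiler; `operation` is a hypothetical stand-in for the code under test, and max_usage=True returns the call's peak RSS in MiB:

```python
from memory_profiler import memory_usage


def _baseline():
    pass  # does nothing: measures only the base process memory X


def peak_delta_mib(operation):
    # Subtract the no-op baseline X from the measured X + Z to isolate
    # Z, the memory attributable to the operation itself.
    base = memory_usage((_baseline, (), {}), max_usage=True)
    total = memory_usage((operation, (), {}), max_usage=True)
    return total - base
```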

Change Type (Required)

  • Patch (Bug fix or non-breaking improvement)
  • Minor (New feature, but backward compatible)
  • Major (Breaking changes)
  • Cherry pick

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@grusev changed the title from "initial version" to "Comparison of memory efficiency LMDB, S3, Pandas for read/write operations on dataframes (new approach on making sure we accurately measure peakmem with ASV)" on Feb 26, 2025
@grusev added the labels "test engineer" (Issue that is good for an engineer in test to work on) and "patch" (Small change, should increase patch version) on Feb 26, 2025
@G-D-Petrov (Collaborator) left a comment:

I am becoming less and less convinced by the setup-as-classes approach.
It is starting to generate a lot of boilerplate code due to unnecessary abstractions.
Maybe we should have a separate task to refactor the approach and make it more maintainable/easier to use.

- read/write dataframe to arcticdb
- read/write dataframe to arcticdb LMDB

The above problem is interesting in 2 ways:
Collaborator:

This might be a bit of a nitpick, but I don't think this is a useful comment here.
It would be better to put it in a proper document, probably in the ASV wiki.

b) how to represent information of comparison effectively

To accurately measure the peakmem of any operation with ASV we have to exclude
to base memory that is the object memory space from the calculation.
Collaborator:

See comment above

self.ac = Arctic(RealComparisonBenchmarks.URL)
self.lib = self.ac[RealComparisonBenchmarks.LIB_NAME]
self.path = f"{tempfile.gettempdir()}/df.parquet"
self.path_to_read = f"{tempfile.gettempdir()}/df_to_read.parquet"
Collaborator:

self.path / self.path_to_read are not great names; change them to something more meaningful, e.g. parquet_to_write / parquet_to_read.

self.pid = os.getpid()
# With shared storage we create different libs for each process
# therefore we initialize the symbol here also not in setup_cache
self.s3_lib = RealComparisonBenchmarks.REAL_STORAGE_SETUP.get_modifiable_library(self.pid)
Collaborator:

Why do we need self.pid here?
IIRC the tests are executed sequentially so this shouldn't be needed

Collaborator (Author):

All ASV code must always assume that tests are executed in separate processes. That prevents errors that would arise if and when this happens, intentionally or unintentionally (e.g. a change in ASV's execution policy).

So I would actually make the opposite comment, and require a code change in any future test that does not conform to this policy.

In other words, that assumption is there for safety reasons and lets engineers write code that will keep working despite any changes that may occur in the future. A sketch of the convention follows below.

This will also become part of the wiki I started writing for the ASV tests that run on many types of storage.
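
A minimal sketch of that per-process convention, using ArcticDB's public get_library call; the library and symbol naming here is illustrative, not the PR's actual helper bodies:

```python
import os

from arcticdb import Arctic


def get_modifiable_library(ac: Arctic, pid=None):
    # One library per (potential) benchmark process: even if ASV ever
    # runs tests concurrently, processes cannot collide on shared storage.
    pid = os.getpid() if pid is None else pid
    return ac.get_library(f"comparison_bench_{pid}", create_if_missing=True)


def get_symbol_name_template(pid):
    return f"symbol_{pid}"
```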

elif btype == PANDAS_PARQUET:
pd.read_parquet(self.path_to_read)
elif btype == ARCTICDB_LMDB:
self.lib.read(RealComparisonBenchmarks.SYMBOL)
Collaborator:

Why are we using RealComparisonBenchmarks.SYMBOL here?
Isn't this for LMDB?

Collaborator (Author):

Yes, I believe it is used exactly for LMDB.

return next


class GeneralSetupOfLibrariesWithSymbols(EnvConfigurationBase):
Collaborator:

Why do we even need GeneralSetupOfLibrariesWithSymbols and GeneralAppendSetup?
I don't see them used anywhere.

Collaborator (Author):

As mentioned, there is another PR in review where those are used in actual tests, and the helper modules are taken from there. So this file is "shared", and any modifications to it will go with the other PR, #2185.

This means that this PR will be merged only after the previous one, as that one carries the needed functionality.

I chose to separate this test from the others because it carries a new approach to ASV tests, while the other PR is for tests that we know well and that exist in an LMDB-only version.

# With shared storage we create different libs for each process
# therefore we initialize the symbol here also not in setup_cache
self.s3_lib = RealComparisonBenchmarks.REAL_STORAGE_SETUP.get_modifiable_library(self.pid)
self.s3_symbol = RealComparisonBenchmarks.REAL_STORAGE_SETUP.get_symbol_name_template(self.pid)
Collaborator:

This is very hard to read:
RealComparisonBenchmarks.REAL_STORAGE_SETUP.get_symbol_name_template

Either:

  • define this as a variable and use it that way (see the sketch below), or
  • refactor the setup to be functions rather than a class, given that they are just abstracting pretty simple calls to the Arctic client.
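
For example, the first option is a small change inside setup() (a sketch, keeping the PR's names):

```python
# Bind the long attribute chain once, then reuse the short name.
storage_setup = RealComparisonBenchmarks.REAL_STORAGE_SETUP
self.s3_lib = storage_setup.get_modifiable_library(self.pid)
self.s3_symbol = storage_setup.get_symbol_name_template(self.pid)
```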
