
Comparison of memory efficiency LMDB, S3, Pandas for read/write operations on dataframes (new approach on making sure we accurately measure peakmem with ASV) #2204

Open · wants to merge 3 commits into master
Conversation

@grusev (Collaborator) commented Feb 26, 2025

Reference Issues/PRs

NOTE: Only the tests are part of this review; utility.py and environment_setup.py belong to #2185 and are included here only to support the creation of this PR.

The test aims to compare the memory efficiency of read and write operations, all on one graph:
    - creation of a dataframe
    - read/write of a dataframe
    - read/write of a dataframe to ArcticDB LMDB
    - read/write of a dataframe to ArcticDB Amazon S3

The above problem is interesting in two ways:
 a) how to accurately measure the memory of a given operation with ASV
 b) how to represent the comparison information effectively

To accurately measure the peakmem of any operation with ASV, we have to exclude the base memory (the process's baseline object memory) from the calculation.

To be able to do that, we need to introduce a memory baseline: a run of the test which actually does nothing. It will produce a result X.

Each subsequent run of the same test with an operation that does anything will then produce a measurement X + Z, where Z is the actual memory of that specific operation.

The only logical way to represent the information is then to put everything on one graph (a sketch follows below):
  - the baseline
  - operation A
  - operation B
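
A minimal sketch of this layout, assuming standard ASV conventions; the Arctic URI, library name, symbol name, and parquet path are illustrative stand-ins, not the PR's actual code:

```python
import pandas as pd
from arcticdb import Arctic

NO_OP = "no-operation-load"
CREATE_DF = "create-df-pandas-from_dict"
PANDAS_PARQUET = "pandas-parquet"
ARCTICDB_LMDB = "arcticdb-lmdb"


class PeakMemComparison:
    # One parameterized benchmark means ASV plots every variant on the
    # same graph, with the no-op run serving as the baseline X.
    params = [NO_OP, CREATE_DF, PANDAS_PARQUET, ARCTICDB_LMDB]
    param_names = ["backend_type"]

    def setup(self, backend_type):
        # Assumes a pre-created LMDB library and parquet file (hypothetical).
        self.lib = Arctic("lmdb:///tmp/benchmarks")["comparison"]

    def peakmem_read(self, backend_type):
        if backend_type == NO_OP:
            pass  # baseline: measures only X, the base process memory
        elif backend_type == CREATE_DF:
            pd.DataFrame({"a": range(2_000_000)})  # X + Z(create)
        elif backend_type == PANDAS_PARQUET:
            pd.read_parquet("/tmp/df_to_read.parquet")  # X + Z(parquet read)
        elif backend_type == ARCTICDB_LMDB:
            self.lib.read("df")  # X + Z(arcticdb read)
```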

Sample measurement:

         ============================ ===========
         backend_type                 peak memory
         ---------------------------- -----------
         no-operation-load            2.66G
         create-df-pandas-from_dict   3.17G
         pandas-parquet               5.60G
         arcticdb-lmdb                5.01G
         ============================ ===========

IMPORTANT READ:
from: https://asv.readthedocs.io/en/stable/writing_benchmarks.html#peak-memory

Note
The peak memory benchmark also counts memory usage during the setup routine, which may confound the benchmark results. One way to avoid this is to use setup_cache instead.
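
A minimal sketch of that workaround, assuming ASV's documented behaviour that setup_cache runs once in its own process and its pickled return value is passed to the benchmark:

```python
import pandas as pd


class PeakMemWithSetupCache:
    def setup_cache(self):
        # Runs once in a separate process; building the dataframe here
        # does not count toward the peakmem result of the benchmark
        # process (only unpickling the returned value does).
        return pd.DataFrame({"a": range(2_000_000)})

    def peakmem_sum(self, df):
        # Only allocations made in this process are measured.
        df["a"].sum()
```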

My takeaways:
We have many tests that, unless they make heavy use of setup(), have no way to establish their preconditions, because these cannot be done in setup_cache as the benchmarks run in different processes. Therefore:
  • we cannot assume that the number we get is related only to the process inside peakmem
  • the process peak memory should be significant compared to the base memory (keep this in mind when planning tests)
  • if we get a signal for a 10% increase in memory, it might not be directly related to the actual test, but also to operations in setup
  • peakmem can give us only a clue about the trend, but perhaps not the actual numbers, unless a baseline as above is established; this makes the memray or memory_profiler approach better suited for actual analysis (see Compare memory needs for dataframe vs read/write dataframe in lmdb and aws s3 #2135, and the sketch below)
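
For comparison, a minimal sketch of the baseline-subtraction idea with memory_profiler; `operation` is a hypothetical stand-in for the code under test, and max_usage=True returns the call's peak RSS in MiB:

```python
from memory_profiler import memory_usage


def _baseline():
    pass  # does nothing: measures only the base process memory X


def peak_delta_mib(operation):
    # Subtract the no-op baseline X from the measured X + Z to isolate
    # Z, the memory attributable to the operation itself.
    base = memory_usage((_baseline, (), {}), max_usage=True)
    total = memory_usage((operation, (), {}), max_usage=True)
    return total - base
```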

Change Type (Required)

  • Patch (Bug fix or non-breaking improvement)
  • Minor (New feature, but backward compatible)
  • Major (Breaking changes)
  • Cherry pick

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@grusev changed the title from "initial version" to "Comparison of memory efficiency LMDB, S3, Pandas for read/write operations on dataframes (new approach on making sure we accurately measure peakmem with ASV)" on Feb 26, 2025
@grusev added the labels "test engineer" (Issue that is good for an engineer in test to work on) and "patch" (Small change, should increase patch version) on Feb 26, 2025
@G-D-Petrov (Collaborator) left a comment:

I am becoming less and less convinced by the setup-as-classes approach.
It is starting to generate a lot of boilerplate code due to unnecessary abstractions.
Maybe we should have a separate task to refactor the approach and make it more maintainable/easier to use.

- read/write dataframe to arcticdb
- read/write dataframe to arcticdb LMDB

The above problem is interesting in 2 ways:
Collaborator:

This might be a bit of a nitpick, but I don't think this is a useful comment here.
It would be better to put it in a proper document, probably in the ASV wiki.

b) how to represent information of comparison effectively

To accurately measure the peakmem of any operation with ASV we have to exclude
to base memory that is the object memory space from the calculation.
Collaborator:

See comment above

self.ac = Arctic(RealComparisonBenchmarks.URL)
self.lib = self.ac[RealComparisonBenchmarks.LIB_NAME]
self.path = f"{tempfile.gettempdir()}/df.parquet"
self.path_to_read = f"{tempfile.gettempdir()}/df_to_read.parquet"
Collaborator:

self.path / self.path_to_read are not great names; change them to something more meaningful, e.g. parquet_to_write / parquet_to_read.

self.pid = os.getpid()
# With shared storage we create different libs for each process
# therefore we initialize the symbol here also not in setup_cache
self.s3_lib = RealComparisonBenchmarks.REAL_STORAGE_SETUP.get_modifiable_library(self.pid)
Collaborator:

Why do we need self.pid here?
IIRC the tests are executed sequentially so this shouldn't be needed

Collaborator (Author):

All ASV code must always assume that tests are executed in separate processes. That prevents errors that would arise if and when this happens, intentionally or unintentionally (e.g. a change in ASV's execution policy).

So I would actually make the opposite comment, and require a code change in any future test that does not conform to this policy.

In other words, that assumption is there for safety reasons and lets engineers write code that will keep working despite any changes that may occur in the future. A sketch of the convention follows below.

This will also become part of the wiki I started writing for the ASV tests that run on many types of storage.
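
A minimal sketch of that per-process convention, using ArcticDB's public get_library call; the library and symbol naming here is illustrative, not the PR's actual helper bodies:

```python
import os

from arcticdb import Arctic


def get_modifiable_library(ac: Arctic, pid=None):
    # One library per (potential) benchmark process: even if ASV ever
    # runs tests concurrently, processes cannot collide on shared storage.
    pid = os.getpid() if pid is None else pid
    return ac.get_library(f"comparison_bench_{pid}", create_if_missing=True)


def get_symbol_name_template(pid):
    return f"symbol_{pid}"
```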

elif btype == PANDAS_PARQUET:
pd.read_parquet(self.path_to_read)
elif btype == ARCTICDB_LMDB:
self.lib.read(RealComparisonBenchmarks.SYMBOL)
Collaborator:

Why are we using RealComparisonBenchmarks.SYMBOL here?
Isn't this for LMDB?

Collaborator (Author):

Yes, I believe it is used exactly for LMDB.

return next


class GeneralSetupOfLibrariesWithSymbols(EnvConfigurationBase):
Collaborator:

Why do we even need GeneralSetupOfLibrariesWithSymbols and GeneralAppendSetup?
I don't see them used anywhere.

Collaborator (Author):

As mentioned, there is another PR in review where those are used in actual tests, and the helper modules are taken from there. So this file is "shared", and any modifications to it will go with the other PR, #2185.

This means that this PR will be merged only after the previous one, as that one carries the needed functionality.

I chose to separate this test from the others because it carries a new approach to ASV tests, while the other PR is for tests that we know well and that exist in an LMDB-only version.

# With shared storage we create different libs for each process
# therefore we initialize the symbol here also not in setup_cache
self.s3_lib = RealComparisonBenchmarks.REAL_STORAGE_SETUP.get_modifiable_library(self.pid)
self.s3_symbol = RealComparisonBenchmarks.REAL_STORAGE_SETUP.get_symbol_name_template(self.pid)
Collaborator:

This is very hard to read:
RealComparisonBenchmarks.REAL_STORAGE_SETUP.get_symbol_name_template

Either:

  • define this as a variable and use it that way (see the sketch below), or
  • refactor the setup to be functions rather than a class, given that they are just abstracting pretty simple calls to the Arctic client.
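
For example, the first option is a small change inside setup() (a sketch, keeping the PR's names):

```python
# Bind the long attribute chain once, then reuse the short name.
storage_setup = RealComparisonBenchmarks.REAL_STORAGE_SETUP
self.s3_lib = storage_setup.get_modifiable_library(self.pid)
self.s3_symbol = storage_setup.get_symbol_name_template(self.pid)
```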
