Append seems to lose data #43
Comments
@bigtonylewis did you ever find a resolution for this?
No, I moved on and used something else.
Seems like it is dropping duplicates when appending, and it doesn't take the timestamp index into consideration.
@bigtonylewis what solution did you switch to? Did that solution retain the pystore structure/syntax? Not being able to increment some massive datasets is causing problems for us, but we do not want to change datasets built on the pystore framework. Any help is appreciated!
Hi - can you share the code piece that causes that? I can't seem to be able to replicate this scenario. Thanks!
@ranaroussi Here's the code from the first post: https://gist.github.com/bigtonylewis/eb2913814869416ccbb82944c3662d32
@payas-parab-92 I just whipped up my own code, a very much reduced subset of this that fit my purpose. It won't scale well.
I personally would be stoked to see this bug fixed, as it was a show-stopper for me too!
@ranaroussi I have also detailed the issue a little bit further in another issue: #48 (comment). Similar to @jeffneuen, this is a big pain point for us and is becoming critical. I will be playing around with this in the next few days and will circle back to this thread if I figure anything out.
Edit: I added PR #57 to address this problem.
@ranaroussi I'm quite confident that I figured out why pystore is losing data when appending. The problem here is that you are calling drop_duplicates() on the concatenated dataframe. The documentation of drop_duplicates() states that only the column values are considered when looking for duplicates; the index is ignored. So rows with identical column values but different timestamps are treated as duplicates, and all but one of them are dropped when you append. There is no way to force drop_duplicates() to take the index into account, so the fix is to reset the index, which inserts the original index as a column, before calling drop_duplicates(), and to restore it afterwards.
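A minimal sketch of the failure mode with made-up prices (not the data from the gist), using plain pandas:

```python
import pandas as pd

# Two chunks with different timestamps but identical column values,
# e.g. a flat price series -- perfectly valid time-series data.
old = pd.DataFrame({"close": [100.0, 100.0]},
                   index=pd.to_datetime(["2020-01-01", "2020-01-02"]))
new = pd.DataFrame({"close": [100.0, 100.0]},
                   index=pd.to_datetime(["2020-01-03", "2020-01-04"]))

combined = pd.concat([old, new])
print(len(combined))                    # 4 rows before deduplication
print(len(combined.drop_duplicates()))  # 1 row: the index is ignored, so
                                        # identical column values collapse
```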
With the index reset, rows that share the same timestamp but have different column values are all kept, because the timestamp now takes part in the comparison. Note how this keeps rows with the same index but different column values. For data that can only have unique timestamps (OHLC, EOD, etc.) it is enough to additionally drop duplicated indexes afterwards, which leaves a single row per timestamp.
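A rough pandas sketch of that approach (the actual change in PR #57 may differ in the details):

```python
import pandas as pd

def append_dedup(old: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    combined = pd.concat([old, new])
    idx_name = combined.index.name or "index"

    # Move the index into a column so drop_duplicates() can see it.
    combined = combined.reset_index().drop_duplicates(keep="last")
    combined = combined.set_index(idx_name)

    # For data that must have unique timestamps (OHLC, EOD, ...),
    # additionally drop rows that still share an index value.
    return combined[~combined.index.duplicated(keep="last")]
```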
For pandas there is actually no need to copy the index, as you can use the duplicated() method of the index directly. To control the behavior you could add a keyword to the append() method.
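Something along these lines; the keep_duplicates keyword is hypothetical, only to illustrate making the behaviour selectable:

```python
import pandas as pd

def combine(old: pd.DataFrame, new: pd.DataFrame,
            keep_duplicates: bool = False) -> pd.DataFrame:
    """Hypothetical helper: column-based dedup (old behaviour) or
    index-based dedup without copying the index into a column."""
    combined = pd.concat([old, new])
    if keep_duplicates:
        # Keep repeated column values as long as the timestamps differ;
        # only rows sharing a timestamp are collapsed.
        return combined[~combined.index.duplicated(keep="last")]
    # Old behaviour: deduplicate on column values only.
    return combined.drop_duplicates(keep="last")
```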
To increase speed on big dataframes it would be nice to apply a Boolean mask of index duplicates first and then look for duplicated rows only among those candidates. It avoids copying the index and reduces the number of rows (with all columns) that have to be compared. Unfortunately, Dask hasn't implemented the corresponding method on its index yet.
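A rough sketch of that two-step idea in pandas (not the PR code):

```python
import pandas as pd

def dedup_fast(combined: pd.DataFrame) -> pd.DataFrame:
    # Step 1: cheap index-only mask -- only rows whose timestamp occurs
    # more than once can possibly be duplicates.
    dup_idx = combined.index.duplicated(keep=False)
    unique_rows = combined[~dup_idx]   # never compared column by column
    candidates = combined[dup_idx]

    # Step 2: full index+column comparison only on the small candidate set.
    idx_name = candidates.index.name or "index"
    candidates = (candidates.reset_index()
                            .drop_duplicates(keep="last")
                            .set_index(idx_name))

    return pd.concat([unique_rows, candidates]).sort_index()
```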
On my 15,338,798 rows × 7 columns dataframe its mask:
Hi everyone, for information, I have started an alternative lib, oups, that has some similarities with pystore. @ranaroussi, I am aware this post may not be welcome and I am sorry if it is a bit rude. Please remove it if so.
It seems that when using collection.append, some data is not written. If I do a collection.write(), then one or more collection.append(), and then read it back, I get fewer rows than I put in. Here's some code that demonstrates it: https://gist.github.com/bigtonylewis/eb2913814869416ccbb82944c3662d32
When I iterate over two 1000-row dataframes, including splitting, writing and appending them, I get about 1800 rows back out of 2000.
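A condensed sketch of the repro (the full version is in the gist above); it assumes the usual pystore API and uses coarse integer values so that column values repeat:

```python
import numpy as np
import pandas as pd
import pystore

pystore.set_path("./pystore_demo")
collection = pystore.store("demo_store").collection("demo_collection")

# 2000 rows with a datetime index; values repeat, which is what makes
# the column-only drop_duplicates() in append() discard rows.
idx = pd.date_range("2020-01-01", periods=2000, freq="min")
df = pd.DataFrame({"value": np.random.randint(0, 10, size=2000)}, index=idx)

collection.write("ITEM", df.iloc[:1000], overwrite=True)
collection.append("ITEM", df.iloc[1000:])

restored = collection.item("ITEM").to_pandas()
print(len(restored))  # expected 2000, but comes back short on affected versions
```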