-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Open
Listed in
Labels
Microsoft.Data.AnalysisAll DataFrame related issues and PRsAll DataFrame related issues and PRsbugSomething isn't workingSomething isn't working
Milestone
Description
.Net Core 3.1
Microsoft.Data.Analysis Nuget package version: 0.19.1
This code results in a hang in DataFrame.Join:
using Microsoft.Data.Analysis;
using System;
using System.Linq;
namespace TestDataFRame
{
internal class Program
{
static void Main(string[] args)
{
DateTime?[] dates1 = { new DateTime(2022, 03, 01), new DateTime(2022, 03, 02), new DateTime(2022, 03, 03) };
double?[] closePrices = { 10.5, 12.4, 11.3 };
DateTime?[] dates2 = { new DateTime(2022, 03, 01), new DateTime(2022, 03, 02), new DateTime(2022, 03, 03), new DateTime(2022, 03, 04) };
double[] shortPercentages = { 2.34, 2.36, 3.01, 3.04 };
DataFrame dataFrame1 = new DataFrame();
dataFrame1.Columns.Add(new PrimitiveDataFrameColumn<DateTime>("Date", dates1));
dataFrame1.Columns.Add(new DoubleDataFrameColumn("ClosePrice", closePrices));
var numbers1 = dataFrame1.Columns.GetDoubleColumn("ClosePrice").ToArray();
DataFrame dataFrame2 = new DataFrame();
dataFrame2.Columns.Add(new PrimitiveDataFrameColumn<DateTime>("Date", dates2));
dataFrame2.Columns.Add(new DoubleDataFrameColumn("ShortPercentage", shortPercentages));
var numbers2 = dataFrame2.Columns.GetDoubleColumn("ShortPercentage").ToArray();
DataFrame dataFrame = dataFrame1.Join(dataFrame2, "Date", "Date", joinAlgorithm: JoinAlgorithm.FullOuter);
}
}
}
Metadata
Metadata
Assignees
Labels
Microsoft.Data.AnalysisAll DataFrame related issues and PRsAll DataFrame related issues and PRsbugSomething isn't workingSomething isn't working
Type
Projects
Relationships
Development
Select code repository
Activity
[-]System.Data.Analysis DataFrame.Join infinite loop[/-][+]Microsoft.Data.Analysis DataFrame.Join infinite loop[/+]luisquintanilla commentedon Mar 15, 2022
Hi @olavt
Sorry to hear you're having issues joining your data. I was able to repro. When using the
Join
method, it seems like the library has a hard time performing the operation when the columns from both DataFrames have the same name. If the column names are different, the operation completes successfully. Thanks for reporting this issue.olavt commentedon Mar 15, 2022
Yes, joining the datasets with different column names works. I would however expect to be able to join the datasets and have one resulting Date column rather than two different ones. How can I achieve that?
luisquintanilla commentedon Mar 15, 2022
In the current implementation of the DataFrame, both columns are persisted. One way for you to get a single columns is, after the merge, create a new column with the distinct date values and drop the original columns from the merged DataFrame.
olavt commentedon Mar 15, 2022
Hmm, that solution does not look pretty. What is the difference between Merge and Join? I would have expected the Join to be able to join on a "common" column in two DataFrames and create one resulting column. I would be scared reading the above code and it's unclear that the dates in the resulting column would match the rows.
luisquintanilla commentedon Mar 16, 2022
I believe the main difference is Join combines the data based on the index and Merge does it based on the columns you specify.
This is what a Joined DataFrame would look like.
This is what a Merged DataFrame would look like
olavt commentedon Mar 16, 2022
According to my testing, both the Merge and Join produces the same result in the code below:
michaelgsharp commentedon Mar 18, 2022
Yeah, we have an issue with the naming of the join/merge API's. I haven't looked closely enough at join to know exactly how it works, but it does NOT work on column names. If you want to do it based on column names you need to use merge instead. This threw me off, and will probably throw of anyone familiar with this type of stuff IMO. Especially because if @luisquintanilla is correct in that join is based on index no one will expect that I don't think. Not only will they probably use it incorrectly, the join results could verily easily be not at all what they are expecting. Honestly I would vote for a rename in this case, but its something that we need to discuss.
The reason you are getting an infinite loop is that for join it doesn't want the name of the columns you want to join on (because join here is not based on the columns...), it wants the suffix to append to the columns of same name in the 2 datasets. So when you provide a value that is the same for both suffixes, our code is looping until all the columns are unique, and this can never happen when you continue to append identical values to both sides.
For the behavior you are looking for, you are going to want to use merge and not join. There is a simpler workaround than what luis suggested for all types except a full outer join (not sure this would work there as I haven't fully tested it but I don't think it will). Since in left/right/inner joins at least 1 of the columns will be 100% correct, you can do this:
That will remove the extra column and rename the remaining one back to your original value. Just change the "left/right" based on the type of join you perform. Again, this won't work on an outer join, but for for left/right/inner it will work fine.
luisquintanilla commentedon Mar 21, 2022
@olavt That's correct, because you have a different number of indices, so it looks the same when you use
Join
. If you were to add another row to the the first DataFrame, you'd get the results I shared where the 3rd index contains different dates. If theDate
column was used as the index,Join
might make sense and work as expected but that's not the case here causing the results where indices have different data to not work as expected.