Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[native] Add row expression optimizer #22927

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

pramodsatya
Copy link
Contributor

@pramodsatya pramodsatya commented Jun 5, 2024

Description

Introduces capability to optimize and constant fold row expressions in the Presto native sidecar.
Adds class RowExpressionConverter to convert a velox exec::Expr to a Presto protocol::RowExpression.
Adds class RowExpressionOptimizer to optimize and constant fold a Presto RowExpression in the native sidecar (using RowExpressionConverter).

Motivation and Context

Please refer to #24126 for full context of the changes, this is as described in RFC-0006.

Test Plan

Unit tests for simple expression conversions are added in RowExpressionConverter.cpp.
Unit tests for simple expression optimizations are added in RowExpressionOptimizerTest.cpp.
Exhaustive end-to-end tests will be added in TestNativeExpressionOptimizer.java (please refer to the PR #24238).

Release Notes

== NO RELEASE NOTE ==

@pramodsatya pramodsatya changed the title [WIP] Add proxygen endpoint for expression evaluation [native] Add proxygen endpoint for expression evaluation Aug 5, 2024
@tdcmeehan tdcmeehan self-assigned this Aug 5, 2024
@pramodsatya pramodsatya force-pushed the expr_endpt branch 2 times, most recently from 822f79f to 2352cd4 Compare September 11, 2024 14:52
Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pramodsatya : Have done a first round of comments. Will read your tests more closely once you address the comments here.

Copy link
Contributor Author

@pramodsatya pramodsatya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @aditi-pandit, addressed the comments. Could you please take another look?

facebook-github-bot pushed a commit to facebookincubator/velox that referenced this pull request Nov 16, 2024
Summary:
prestodb/presto#23331 adds a native expression optimizer to delegate expression evaluation to the native sidecar. This is used to constant fold expressions on the presto native sidecar, instead of on the presto java coordinator (which is the current behavior). prestodb/presto#22927 implements a proxygen endpoint to accept `RowExpression`s from `NativeSidecarExpressionInterpreter`, optimize them if possible (rewrite special form expressions), and compile the `RowExpression` to a velox expression with constant folding enabled. This velox expression is then converted back to a `RowExpression` and returned by the sidecar to the coordinator.

When the constant folded velox expression is of type `velox::exec::ConstantExpr`, we need to return a `RowExpression` of type `ConstantExpression`. This requires us to serialize the constant value from `velox::exec::ConstantExpr` into `protocol::ConstantExpression::valueBlock`. This can be done by serializing the constant value vector to presto SerializedPage::column format, followed by base 64 encoding the result (reverse engineering the logic from `Base64Util.cpp::readBlock`).

This PR adds a new function, `serializeSingleColumn`, to `PrestoVectorSerde`. This can be used to serialize input data from vectors containing a single element into a single PrestoPage column format (without the PrestoPage header).
This function is not added to `PrestoBatchVectorSerializer` alongside the existing `serialize` function since that would require adding it as a virtual function in `BatchVectorSerializer` as well, and this is not desirable since the `PrestoPage` format is not relevant in this base class. There is an existing function `deserializeSingleColumn` in `PrestoVectorSerde` which is used to deserialize data from a single column, since `serializeSingleColumn` performs the inverse operation to this function, it is added alongside it in `PrestoVectorSerde`.

Pull Request resolved: #10657

Reviewed By: amitkdutta

Differential Revision: D66044754

Pulled By: pedroerp

fbshipit-source-id: e509605067920f8207e5a3fa67552badc2ce0eba
@pramodsatya pramodsatya changed the title [native] Add proxygen endpoint for expression evaluation [native] Add row expression optimizer Dec 10, 2024
Copy link
Contributor Author

@pramodsatya pramodsatya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @aditi-pandit, addressed the review comments. Could you please take another look?

@pramodsatya pramodsatya marked this pull request as ready for review December 10, 2024 17:49
@pramodsatya pramodsatya requested a review from a team as a code owner December 10, 2024 17:49
@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Dec 10, 2024
@prestodb-ci prestodb-ci requested review from a team and imjalpreet and removed request for a team December 10, 2024 17:49
@pramodsatya pramodsatya force-pushed the expr_endpt branch 2 times, most recently from 416d768 to b244385 Compare December 11, 2024 17:28
Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pramodsatya. Have reviewed the RowExpressionConverter class so far.

@pramodsatya pramodsatya force-pushed the expr_endpt branch 2 times, most recently from 6297ece to 7cbdc9c Compare January 6, 2025 23:31
Copy link
Contributor Author

@pramodsatya pramodsatya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestions @aditi-pandit. Refactored RowExpressionConverter and RowExpressionOptimizer into separate files in presto_cpp/main/types, and addressed the remaining comments. Could you please take another look?

presto-native-execution/presto_cpp/main/PrestoServer.cpp Outdated Show resolved Hide resolved
// case that can be simplified since 'a' is a variable here, so the WHEN clauses
// that are required by Presto as switch expression arguments are returned in
// the field 'arguments'.
struct SwitchFormArguments {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needn't be defined in the header here. It seems to have local usage only in the cpp file. Please move it there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SwitchFormArguments is needed in the header because the member functions getSimpleSwitchFormArgs and getSwitchSpecialFormArgs return values of this type. Is it fine to retain it here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry missed that. You can move this struct within the RowExpressionConverter class.

athmaja-n pushed a commit to athmaja-n/velox that referenced this pull request Jan 10, 2025
…bator#10657)

Summary:
prestodb/presto#23331 adds a native expression optimizer to delegate expression evaluation to the native sidecar. This is used to constant fold expressions on the presto native sidecar, instead of on the presto java coordinator (which is the current behavior). prestodb/presto#22927 implements a proxygen endpoint to accept `RowExpression`s from `NativeSidecarExpressionInterpreter`, optimize them if possible (rewrite special form expressions), and compile the `RowExpression` to a velox expression with constant folding enabled. This velox expression is then converted back to a `RowExpression` and returned by the sidecar to the coordinator.

When the constant folded velox expression is of type `velox::exec::ConstantExpr`, we need to return a `RowExpression` of type `ConstantExpression`. This requires us to serialize the constant value from `velox::exec::ConstantExpr` into `protocol::ConstantExpression::valueBlock`. This can be done by serializing the constant value vector to presto SerializedPage::column format, followed by base 64 encoding the result (reverse engineering the logic from `Base64Util.cpp::readBlock`).

This PR adds a new function, `serializeSingleColumn`, to `PrestoVectorSerde`. This can be used to serialize input data from vectors containing a single element into a single PrestoPage column format (without the PrestoPage header).
This function is not added to `PrestoBatchVectorSerializer` alongside the existing `serialize` function since that would require adding it as a virtual function in `BatchVectorSerializer` as well, and this is not desirable since the `PrestoPage` format is not relevant in this base class. There is an existing function `deserializeSingleColumn` in `PrestoVectorSerde` which is used to deserialize data from a single column, since `serializeSingleColumn` performs the inverse operation to this function, it is added alongside it in `PrestoVectorSerde`.

Pull Request resolved: facebookincubator#10657

Reviewed By: amitkdutta

Differential Revision: D66044754

Pulled By: pedroerp

fbshipit-source-id: e509605067920f8207e5a3fa67552badc2ce0eba
Copy link
Contributor Author

@pramodsatya pramodsatya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aditi-pandit, addressed the comments, could you please take another look?

presto-native-execution/presto_cpp/main/PrestoServer.cpp Outdated Show resolved Hide resolved
// case that can be simplified since 'a' is a variable here, so the WHEN clauses
// that are required by Presto as switch expression arguments are returned in
// the field 'arguments'.
struct SwitchFormArguments {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SwitchFormArguments is needed in the header because the member functions getSimpleSwitchFormArgs and getSwitchSpecialFormArgs return values of this type. Is it fine to retain it here?

// case that can be simplified since 'a' is a variable here, so the WHEN clauses
// that are required by Presto as switch expression arguments are returned in
// the field 'arguments'.
struct SwitchFormArguments {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry missed that. You can move this struct within the RowExpressionConverter class.

@pramodsatya pramodsatya force-pushed the expr_endpt branch 3 times, most recently from 0b96443 to 60eb032 Compare January 30, 2025 21:42
@pramodsatya pramodsatya force-pushed the expr_endpt branch 3 times, most recently from 0da5052 to 6ef10b9 Compare February 4, 2025 17:42
@@ -34,6 +34,15 @@ add_library(presto_velox_conversion OBJECT VeloxPlanConversion.cpp)

target_link_libraries(presto_velox_conversion velox_type)

add_library(presto_expression_converter RowExpressionConverter.cpp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both files can be compiled in a single library.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we don't need two objects here. Logically these two belong together.

@@ -22,6 +22,61 @@

namespace facebook::presto {

static const std::unordered_map<std::string, std::string> prestoOperatorMap() {
static const std::string prestoDefaultNamespacePrefix =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the function body to PrestoToVeloxExpr.cpp

public:
explicit RowExpressionConverter(memory::MemoryPool* pool) : pool_(pool) {}

/// Converts a velox constant expression of type exec::ConstantExpr to a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can omit the types in the comments..."Converts a velox constant expression constantExpr to a Presto protocol constant expression". The types are readable from the function signatures.

result["arguments"] = switchResult.arguments;
}
} else {
auto exprInputs = expr->inputs();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of expressions could fall in the else part of this condition ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presto special form expressions that aren't of form SWITCH, such as COALESCE, IN, AND, OR etc,. are handled by this else part. Added a comment to indicate the same.

Copy link
Contributor

@czentgr czentgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. My initial review to get started. I haven't looked into all of the details.

veloxExprConverter_(pool, &typeParser_),
rowExpressionConverter_(RowExpressionConverter(pool)) {}

/// Optimizes all expressions from the input json array. If the expression
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (grammar): we don't need to use "would", "will" etc. We can outright state it.

If the expression optimization fails for any of the input expressions, the second value in
the returned pair is set to false and the returned json contains the exception.

Similar for the next sentence.

if (result.second) {
VELOX_USER_CHECK(
result.first.is_array(),
"Output json should be an array of row expressions");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "Output json should be an array of row expressions"
so it is not an arrow of row expressions? Why not state the fact?
"The output json is not an array of expressions".

@@ -34,6 +34,15 @@ add_library(presto_velox_conversion OBJECT VeloxPlanConversion.cpp)

target_link_libraries(presto_velox_conversion velox_type)

add_library(presto_expression_converter RowExpressionConverter.cpp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we don't need two objects here. Logically these two belong together.


protocol::TypeSignature getTypeSignature(const TypePtr& type) {
std::string typeSignature;
if (type->parameters().empty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to separate the complex types from the scalar types, right?
If so, then you can use type->isPrimitiveType().
Then you don't check on the parameters and need a special case for DECIMAL (and any other primitive parameterized future type) - any special case would require changes here.

VELOX_USER_FAIL("Invalid type {}", type->toString());
}

typeSignature = complexTypeString + "(";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is this logic to do all this rebuilding of the signature.
The velox type implementation provides the toString function.
The main issue is that it uses <> instead of () to handle the child types of the complex types, for example, MAP<INTEGER, INTEGER> instead of map(integer, integer) which is what this function produces if I am not mistaken.

So why don't we use the toString function on all types and then do the lower case and replacement of <> with () for the complex types. Would this be easier than all this code to build a type signature? Or am I missing something?

Copy link
Contributor Author

@pramodsatya pramodsatya Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, Velox complex type's implementation of toString introduces these additional characters (<, >) and whitespaces that are not required in the Presto protocol TypeSignature. Instead of finding and replacing these characters, it would be simpler to construct the TypeSignature as per Presto's requirement in this helper function. In case the velox type's toString implementation changes in the future, this function would not need any changes.
Is it fine if this function is retained for now? As a follow-up, I could refactor the velox type's toString function to take in a custom delimiter as an argument, for this use-case it would be the pair (, ) and the default delimiters could be <, >?

if (input->_type == kSpecial) {
auto inputSpecialForm =
dynamic_cast<protocol::SpecialFormExpression*>(input.get());
VELOX_USER_CHECK(inputSpecialForm);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you using the VELOX_USER_CHECK to check for not null? Would be better to use VELOX_NOT_NULL instead.
We don't expect this to be user input that could be bad but we still need to protect if something went wrong. I guess this is debatable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced VELOX_USER_CHECK with VELOX_CHECK for cases where the input doesn't come from the user.

const std::vector<RowExpressionPtr>& arguments) {
SwitchFormArguments result;
// The switch case's 'expression' in simple form of the conditional cannot be
// inferred from Velox since it could evaluate all 'when' clauses to true or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: since -> because.

I don't think I understand the comment.
From what I understand it means the velox switchExpr doesn't contain all the info to rebuild the switch row expression, right? So some info is taken from the input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, for a simple form switch expression such as case 1 when 1 then 32 + 1 when 1 then 34 end, the Velox switch expression inputs contain the following equal expressions: eq(1, 1), eq(1, 1). It is not possible to get the expression value in simple form of CASE conditional if all these equal expressions in velox evaluate to either true or false during constant folding. Hence, this expression is obtained from the input Presto switch expression. I have updated the comments with examples for different cases here, please let me know if any further changes are needed.

// Velox expression to Presto RowExpression conversion for different types of
// expressions can be found in TestDelegatingExpressionOptimizer.java in
// presto-native-sidecar-plugin.
class RowExpressionConverterTest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I missing something or this converter test doesn't test switch expressions?

Copy link
Contributor Author

@pramodsatya pramodsatya Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit tests were not added for special form expressions that require the input Presto protocol RowExpression for conversion (such as SWITCH) in the interest of avoiding large test json files. Added tests for SWITCH, IN special form expressions and BETWEEN call expression to RowExpressionOptimizerTest.cpp. Adding these tests to RowExpressionConverterTest.cpp would result in the test code duplicating much of the RowExpressionOptimizer logic with code repetition, so I feel it would be better to have these tests in RowExpressionOptimizerTest.cpp. Would that be fine?
End-to-end tests for various other expression types will be added in the subsequent Presto PR.

/// simplified, and the switch expression arguments required by Presto are
/// returned in 'arguments'.
struct SwitchFormArguments {
bool isSimplified = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this ever being set to true in the new code. When processing the getSimpleSwitchFormArgs function it is never set but maybe it should be because otherwise why have this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being set to true in the following check in getSimpleSwitchFormArgs:

...
        // If this is the first switch case that evaluates to true, return the
        // expression corresponding to this case.
        if (constVector->valueAt(0) && result.arguments.size() == 1) {
          return {
              true,
              veloxToPrestoRowExpression(caseValue, inputWhenArgs[1]),
              json::array()};
        } 
...

Copy link
Contributor Author

@pramodsatya pramodsatya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @aditi-pandit, @czentgr. I have addressed the review comments, could you please take another look?

VELOX_USER_FAIL("Invalid type {}", type->toString());
}

typeSignature = complexTypeString + "(";
Copy link
Contributor Author

@pramodsatya pramodsatya Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, Velox complex type's implementation of toString introduces these additional characters (<, >) and whitespaces that are not required in the Presto protocol TypeSignature. Instead of finding and replacing these characters, it would be simpler to construct the TypeSignature as per Presto's requirement in this helper function. In case the velox type's toString implementation changes in the future, this function would not need any changes.
Is it fine if this function is retained for now? As a follow-up, I could refactor the velox type's toString function to take in a custom delimiter as an argument, for this use-case it would be the pair (, ) and the default delimiters could be <, >?

if (input->_type == kSpecial) {
auto inputSpecialForm =
dynamic_cast<protocol::SpecialFormExpression*>(input.get());
VELOX_USER_CHECK(inputSpecialForm);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced VELOX_USER_CHECK with VELOX_CHECK for cases where the input doesn't come from the user.

const std::vector<RowExpressionPtr>& arguments) {
SwitchFormArguments result;
// The switch case's 'expression' in simple form of the conditional cannot be
// inferred from Velox since it could evaluate all 'when' clauses to true or
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, for a simple form switch expression such as case 1 when 1 then 32 + 1 when 1 then 34 end, the Velox switch expression inputs contain the following equal expressions: eq(1, 1), eq(1, 1). It is not possible to get the expression value in simple form of CASE conditional if all these equal expressions in velox evaluate to either true or false during constant folding. Hence, this expression is obtained from the input Presto switch expression. I have updated the comments with examples for different cases here, please let me know if any further changes are needed.

/// simplified, and the switch expression arguments required by Presto are
/// returned in 'arguments'.
struct SwitchFormArguments {
bool isSimplified = false;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is being set to true in the following check in getSimpleSwitchFormArgs:

...
        // If this is the first switch case that evaluates to true, return the
        // expression corresponding to this case.
        if (constVector->valueAt(0) && result.arguments.size() == 1) {
          return {
              true,
              veloxToPrestoRowExpression(caseValue, inputWhenArgs[1]),
              json::array()};
        } 
...

result["arguments"] = switchResult.arguments;
}
} else {
auto exprInputs = expr->inputs();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presto special form expressions that aren't of form SWITCH, such as COALESCE, IN, AND, OR etc,. are handled by this else part. Added a comment to indicate the same.

// Velox expression to Presto RowExpression conversion for different types of
// expressions can be found in TestDelegatingExpressionOptimizer.java in
// presto-native-sidecar-plugin.
class RowExpressionConverterTest
Copy link
Contributor Author

@pramodsatya pramodsatya Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit tests were not added for special form expressions that require the input Presto protocol RowExpression for conversion (such as SWITCH) in the interest of avoiding large test json files. Added tests for SWITCH, IN special form expressions and BETWEEN call expression to RowExpressionOptimizerTest.cpp. Adding these tests to RowExpressionConverterTest.cpp would result in the test code duplicating much of the RowExpressionOptimizer logic with code repetition, so I feel it would be better to have these tests in RowExpressionOptimizerTest.cpp. Would that be fine?
End-to-end tests for various other expression types will be added in the subsequent Presto PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
from:IBM PR from IBM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants