Great question! The get_id_params(), get_op_params(), and universal_identifier() methods are not well documented, but each plays an important (and distinct) role in PZ. I'd like to take this opportunity to expand on their purposes as they relate to the Optimizer, and maybe as a group we can come up with a better approach that is slightly less opaque.
At a very high level, the Optimizer's job is to find the best physical plan for a user program. To do this, we follow a recipe that looks something like this:
1. Convert the user program to a (set of) logical plan(s)
2. Convert those logical plan(s) into a set of physical plans
3. Cost the physical plans and return the best one
While this sounds simple, there are a lot of implementation details that you need to get right. One such detail is determining whether two physical operators across two different physical plans are actually the same operator.
Why is this important? Currently in PZ, our cost model is built around sampled data. As the number of physical operators grows, it becomes expensive to sample every physical operator (let alone every physical plan) to get sample cost, runtime, and quality information. Thus, in an effort to be economical with our samples, we need to re-use the cost information gathered about a physical operator in one plan and apply it to that operator in other plans. (This of course rests on an assumption that operator performance is independent of its placement within a plan -- which is not always true -- but for now this is an assumption I am willing to make).
Consider the following four plans:
Plan 1: scan --> convert A1 --> filter B1
Plan 2: scan --> convert A2 --> filter B2
Plan 3: scan --> filter B1 --> convert A1
Plan 4: scan --> filter B2 --> convert A1
Let's say that we sample Plans 1 and 2 by processing a few inputs with each plan. Now, without having sampled Plans 3 and 4, we want to produce a cost estimate for them. Let's assume our cost model for Plans 3 and 4 is:
plan_dollar_cost = sum(op.cost for op in plan)
plan_runtime = sum(op.runtime for op in plan)
plan_quality = math.prod(op.quality for op in plan)  # product of per-operator qualities; requires `import math`
(I realize that this^ cost model is not perfect, but let's just assume it is our cost model of choice for now.)
When we sample Plans 1 and 2, we compute the average cost, runtime, and quality of each operator and use this as our estimate for that operator. We store this information in a dictionary:
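As a minimal sketch, such a dictionary could be built as shown below, assuming each sampled execution yields one record per operator. The record layout, the op-id strings, and all of the numbers are made up for illustration; only the name operator_to_stats matches the lookups used in this discussion.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample records: (op_id, cost in dollars, runtime in seconds, quality).
samples = [
    ("scan", 0.0, 0.1, 1.0),
    ("convert_A1", 0.010, 2.0, 0.90),
    ("convert_A1", 0.014, 3.0, 0.80),
    ("filter_B1", 0.002, 0.5, 1.00),
    ("convert_A2", 0.001, 0.4, 0.60),
    ("filter_B2", 0.004, 1.0, 0.90),
]


def build_operator_to_stats(records):
    """Average the sampled cost, runtime, and quality for each operator id."""
    grouped = defaultdict(list)
    for op_id, cost, runtime, quality in records:
        grouped[op_id].append((cost, runtime, quality))

    stats = {}
    for op_id, rows in grouped.items():
        costs, runtimes, qualities = zip(*rows)
        stats[op_id] = {
            "cost": mean(costs),
            "runtime": mean(runtimes),
            "quality": mean(qualities),
        }
    return stats


operator_to_stats = build_operator_to_stats(samples)
```

The keys here are plain strings for readability; in practice the key would be whatever op.get_op_id() returns for the sampled operator.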
Now, to estimate Plan 3 using our cost model above, we simply need to compute:
plan3_dollar_cost = sum(operator_to_stats[op.get_op_id()]["cost"] for op in plan3)
plan3_runtime = sum(operator_to_stats[op.get_op_id()]["runtime"] for op in plan3)
plan3_quality = math.prod(operator_to_stats[op.get_op_id()]["quality"] for op in plan3)  # requires `import math`
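Put together, the estimate for an unsampled plan might look like the sketch below. The Op class, the op-id strings, and the statistics are hypothetical stand-ins chosen so the arithmetic is easy to follow; this is not the actual PZ cost model code.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """Hypothetical stand-in for a physical operator."""
    op_id: str

    def get_op_id(self) -> str:
        return self.op_id


# Made-up per-operator averages, as if gathered by sampling Plan 1.
operator_to_stats = {
    "scan": {"cost": 0.0, "runtime": 0.5, "quality": 1.0},
    "convert_A1": {"cost": 0.25, "runtime": 2.0, "quality": 0.5},
    "filter_B1": {"cost": 0.125, "runtime": 1.0, "quality": 0.5},
}


def estimate_plan(plan, stats):
    """Additive cost and runtime, multiplicative quality, looked up by op id."""
    per_op = [stats[op.get_op_id()] for op in plan]
    return {
        "cost": sum(s["cost"] for s in per_op),
        "runtime": sum(s["runtime"] for s in per_op),
        "quality": math.prod(s["quality"] for s in per_op),
    }


# Plan 3 was never sampled, but each of its operators was sampled in Plan 1.
plan1 = [Op("scan"), Op("convert_A1"), Op("filter_B1")]
plan3 = [Op("scan"), Op("filter_B1"), Op("convert_A1")]
plan3_estimate = estimate_plan(plan3, operator_to_stats)
```

Because this model ignores operator order, Plan 3 gets the same estimate as Plan 1 here, which is exactly the placement-independence assumption mentioned earlier.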
Here's the key issue: in order for operator_to_stats[op.get_op_id()] to look up the correct entry in the dictionary, it must be the case that op.get_op_id() is:
1. Independent of the operator's location in the plan
2. Dependent on the setting of the operator's parameters
Criterion (1) implies that op.get_op_id() cannot be a function of:
the op_id of the parent operator
the input_schema of the operator, and
the output_schema of the operator, as sem_add_columns() now computes output_schema = self.schema.union(output_schema) (i.e. it also contains the input schema's fields in order to accurately reflect the schema of a DataRecord after being processed by that operator).
Criterion (2) implies that we still need to specify which of an operator's parameters make that operator unique -- without depending upon location within the plan. For example, because the target_cache_id is a function of the parent operator's target_cache_id (which violates criterion (1)), it cannot be included in the set of parameters that make the operator unique.
As a result of this^, the approach I settled on was to have each subclass of PhysicalOperator define the set of parameters that make that operator unique, independent of its location within the plan. These should not include things like input_schema, output_schema, and target_cache_id, but they should include things like model, filter, agg_func, etc. This is the purpose that get_id_params() serves. Its output is a dictionary containing these unique key-value pairs for each physical operator.
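To make this concrete, here is a minimal sketch of the pattern. The class bodies, the LLMFilter name, the filter_condition argument, and the model strings are all hypothetical, and hashing the serialized dictionary is my assumption for the sketch (all that matters is that get_op_id() turns the get_id_params() output into a string).

```python
import hashlib
import json


class PhysicalOperator:
    """Hypothetical, heavily simplified base class (not the actual PZ code)."""

    def __init__(self, input_schema, output_schema, target_cache_id=None):
        self.input_schema = input_schema
        self.output_schema = output_schema
        self.target_cache_id = target_cache_id

    def get_id_params(self) -> dict:
        # Each subclass must state explicitly which parameters make it unique.
        raise NotImplementedError

    def get_op_id(self) -> str:
        # Serialize the id params deterministically, then hash (an assumption).
        payload = json.dumps(
            {"op": type(self).__name__, **self.get_id_params()}, sort_keys=True
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


class LLMFilter(PhysicalOperator):
    """Hypothetical filter operator parameterized by a model and a predicate."""

    def __init__(self, model, filter_condition, **kwargs):
        super().__init__(**kwargs)
        self.model = model
        self.filter_condition = filter_condition

    def get_id_params(self) -> dict:
        # Location-dependent fields (schemas, target_cache_id) are left out.
        return {"model": self.model, "filter": self.filter_condition}


# The same filter placed after a convert (wider schema) and directly after a scan.
b1_late = LLMFilter("model-x", "is about databases",
                    input_schema=["title", "abstract"],
                    output_schema=["title", "abstract"], target_cache_id="abc")
b1_early = LLMFilter("model-x", "is about databases",
                     input_schema=["title"], output_schema=["title"],
                     target_cache_id="def")
b2 = LLMFilter("model-y", "is about databases",
               input_schema=["title"], output_schema=["title"])
```

Here b1_late and b1_early share an op id even though their schemas and cache ids differ, while b2 gets a different id because its model differs.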
Admittedly, this approach places a burden on the PZ developer, because it requires them to implement this dictionary for each new physical operator that they create. One might be tempted to instead enumerate the set of physical operator parameters which should NOT be included (e.g. input_schema, output_schema, target_cache_id, etc.) and compute an operator's get_id_params() as the set of all kwargs which are not in this disallow list. However, with this^ approach we still need someone (the developer or PR reviewer) to check whether a parameter of the new physical operator needs to be added to the disallow list.
Personally, I opted for the get_id_params() approach (i.e. having the developer explicitly state the unique parameters), because I figured that -- at a minimum -- this would force a new PZ developer to stop and ask "Hey, what should this function return?". In the other approach, I think it's too easy for a new parameter to sneak into the set of unique params without first being added to a disallow list.
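A small sketch of the failure mode I have in mind with the disallow-list alternative. The helper function, the parent_op_id parameter, and the kwargs below are hypothetical and exist only to illustrate the point.

```python
# Hypothetical disallow list of location-dependent parameters.
DISALLOWED = {"input_schema", "output_schema", "target_cache_id"}


def id_params_from_disallow_list(op_kwargs, disallowed=DISALLOWED):
    """Implicit alternative: every kwarg not on the disallow list is an id param."""
    return {k: v for k, v in op_kwargs.items() if k not in disallowed}


# The same (hypothetical) operator at two different positions in two plans.
# `parent_op_id` is a made-up new parameter that depends on plan position.
op_in_plan_a = {"model": "model-x", "input_schema": ("title",),
                "parent_op_id": "scan"}
op_in_plan_b = {"model": "model-x", "input_schema": ("title", "year"),
                "parent_op_id": "filter_B1"}

leaky_a = id_params_from_disallow_list(op_in_plan_a)
leaky_b = id_params_from_disallow_list(op_in_plan_b)
same_op_before_fix = leaky_a == leaky_b  # False: parent_op_id leaked in

# Only once someone remembers to extend the list do the two copies match.
fixed = DISALLOWED | {"parent_op_id"}
same_op_after_fix = (
    id_params_from_disallow_list(op_in_plan_a, fixed)
    == id_params_from_disallow_list(op_in_plan_b, fixed)
)
```

Nothing fails loudly in the leaky case; the sampled statistics for the operator simply stop being shared across plans, which is why I prefer forcing the explicit choice.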
Finally, this raises the questions:
1. What does get_op_id() do?
2. What do get_logical_op_id() and get_logical_id_params() do?
3. What do get_op_params() / get_logical_op_params() do?
4. What do universal_identifier() / get_node_uid() do?
The answers are:
1. get_op_id() calls get_id_params() and converts the output into a string.
2. get_logical_op_id() and get_logical_id_params() are the logical operator equivalents of get_op_id() and get_id_params().
3. get_op_params() / get_logical_op_params() return the full set of physical / logical operator parameters, for when we need to make a copy of an operator. This should include parameters like input_schema, output_schema, etc.
4. There are some cases where we do want a universally unique identifier for a Dataset / logical operator (e.g. when computing the target_cache_id, or when disambiguating between two identical field names in two different output schemas). universal_identifier() computes a hash which includes the parent's universal_identifier() (and all Dataset parameters), so it can be used for both target_cache_id and field disambiguation.
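For contrast with get_op_id(), here is a minimal sketch of a position-dependent identifier. The Dataset class below is a hypothetical simplification, and the choice of JSON plus SHA-256 is an assumption made for the sketch, not a description of the actual PZ implementation.

```python
import hashlib
import json


class Dataset:
    """Hypothetical, simplified Dataset node with an optional parent."""

    def __init__(self, params: dict, parent=None):
        self.params = params
        self.parent = parent

    def universal_identifier(self) -> str:
        # Fold the parent's identifier into the hash, so the result depends on
        # the entire chain of ancestors and not only on this node's parameters.
        parent_uid = self.parent.universal_identifier() if self.parent else None
        payload = json.dumps(
            {"parent": parent_uid, "params": self.params}, sort_keys=True
        )
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


source = Dataset({"source": "papers"})
convert = Dataset({"convert": "extract authors"}, parent=source)

# Identical filter parameters, but different positions in the program.
filter_after_source = Dataset({"filter": "is about databases"}, parent=source)
filter_after_convert = Dataset({"filter": "is about databases"}, parent=convert)
```

The two filters have identical parameters but different identifiers, which is what we want for target_cache_id and field disambiguation, and exactly what we do not want for get_op_id().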
I hope this^ provides more context on some of these opaque functions within PZ. I do not believe that I have found a perfect implementation, but with this context in mind, maybe we can all come up with a cleaner implementation of these stated goals.
(The text above is from a review comment on #118.)
FYI: @sivaprasadsudhir @mikecafarella @vitaglianog @Tranway1