-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Component to extract CSV files from PostgreSQL database search (1st CLAIMED component) #19
base: main
Are you sure you want to change the base?
Component to extract CSV files from PostgreSQL database search (1st CLAIMED component) #19
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: romeokienzler The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@ckadner description and readme markdown added |
Signed-off-by: Romeo Kienzler <[email protected]>
Signed-off-by: Romeo Kienzler <[email protected]>
77ef9d5
to
69052ed
Compare
- -ec | ||
- | | ||
mkdir -p `echo $0 |sed -e 's/\/[a-zA-Z0-9]*$//'` | ||
wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just wondering: since this component (and probably other Claimed compoinents?) is running a notebook, would there be a benefit to adding this functionality as a Notebook
asset type, instead of a Component
in the MLX Katalog?
The benefit of running as a notebook, would be additional graphics, tables, debug statements that could be more conveniently surfaced to the user?
If the functionality is very narrowly defined, like it seems to be the case here, a simple component might make more sense though.
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
name: Input Postgresql | ||
description: This notebook pulls data from a postgresql database as CSV on a given SQL statement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can add some filter_categories
(manually) to this component (maybe all our components) to mark which step/stage in the ML Pipeline this (or any) component fits into, i.e. this one could be:
filter_categories:
...
pipeline_stage: "data collection"
@@ -0,0 +1,3 @@ | |||
# Input PostgreSQL | |||
This component pulls data from a postgresql database as CSV on a given SQL statement. Parameters like | |||
host, database, user, password and sql need to be set. Please note that data is processed in-memory (pandas) and can't spill on disk (spark) yet. Therefore, the queried data must fit onto main memory (of the POD in case running within KubeFlow context. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another nit-pick: there is an opening parenthesis (
that is not needed:
... fit into main memory (of the pod ...
- {name: password, type: String, description: 'db password'} | ||
- {name: port, type: Integer, description: 'db port'} | ||
- {name: sql, type: String, description: 'sql query statement to be executed'} | ||
- {name: data_dir, type: String, description: 'temporal data storage for local execution'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't mention data_dir
in the README so I assume it's optional?
I think in MLX, we have the convention of marking input parameters as Required
in the description. Optional parameters should have a default
value.
inputs:
- {name: token, description: 'Required. GitHub token for accessing private repository'}
- {name: url, description: 'Required. GitHub raw path for accessing the credential file'}
- {name: name, description: 'Required. Secret Name to be stored in Kubernetes'}
The KFP docs say this https://www.kubeflow.org/docs/components/pipelines/sdk/component-development/#define-your-components-interface:
optional
: Specifies if this input is optional. The value of this attribute is of typeBool
, and defaults toFalse
.
@romeokienzler -- any reason to restrict this new component to extract CSV from a PostgrSQL database? Could it be generalized to work with any JDBC type database connection? |
This is the first component created by C3 - the CLAIMED component compiler - entirely from a notebook. This is the prototype component and PR - once accepted will PR the remaining 30+ components from CLAIMED. More on CLAIMED
Here a summary video ~10min about how CLAIMED, Elyra, JupyterLab, KubeFlow, MLX and Kubernetes play together https://youtu.be/H8WskMEUI74
Here the same in written form https://romeokienzler.medium.com/create-component-oriented-data-science-pipelines-using-claimed-elyra-kubeflow-pipelines-mlx-and-f17eab24b91c