Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Component to extract CSV files from PostgreSQL database search (1st CLAIMED component) #19

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

romeokienzler
Copy link
Member

@romeokienzler romeokienzler commented Aug 23, 2021

This is the first component created by C3 - the CLAIMED component compiler - entirely from a notebook. This is the prototype component and PR - once accepted will PR the remaining 30+ components from CLAIMED. More on CLAIMED

Here a summary video ~10min about how CLAIMED, Elyra, JupyterLab, KubeFlow, MLX and Kubernetes play together https://youtu.be/H8WskMEUI74

Here the same in written form https://romeokienzler.medium.com/create-component-oriented-data-science-pipelines-using-claimed-elyra-kubeflow-pipelines-mlx-and-f17eab24b91c

@mlx-bot
Copy link
Collaborator

mlx-bot commented Sep 1, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: romeokienzler
To complete the pull request process, please assign yhwang after the PR has been reviewed.
You can assign the PR to them by writing /assign @yhwang in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@romeokienzler
Copy link
Member Author

@ckadner description and readme markdown added

Signed-off-by: Romeo Kienzler <[email protected]>
Signed-off-by: Romeo Kienzler <[email protected]>
- -ec
- |
mkdir -p `echo $0 |sed -e 's/\/[a-zA-Z0-9]*$//'`
wget https://raw.githubusercontent.com/IBM/claimed/master/component-library/input/input-postgresql.ipynb
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just wondering: since this component (and probably other Claimed compoinents?) is running a notebook, would there be a benefit to adding this functionality as a Notebook asset type, instead of a Component in the MLX Katalog?

The benefit of running as a notebook, would be additional graphics, tables, debug statements that could be more conveniently surfaced to the user?

If the functionality is very narrowly defined, like it seems to be the case here, a simple component might make more sense though.

# See the License for the specific language governing permissions and
# limitations under the License.
name: Input Postgresql
description: This notebook pulls data from a postgresql database as CSV on a given SQL statement
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add some filter_categories (manually) to this component (maybe all our components) to mark which step/stage in the ML Pipeline this (or any) component fits into, i.e. this one could be:

filter_categories:
  ...
  pipeline_stage: "data collection"

@@ -0,0 +1,3 @@
# Input PostgreSQL
This component pulls data from a postgresql database as CSV on a given SQL statement. Parameters like
host, database, user, password and sql need to be set. Please note that data is processed in-memory (pandas) and can't spill on disk (spark) yet. Therefore, the queried data must fit onto main memory (of the POD in case running within KubeFlow context.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another nit-pick: there is an opening parenthesis ( that is not needed:

... fit into main memory (of the pod ...

- {name: password, type: String, description: 'db password'}
- {name: port, type: Integer, description: 'db port'}
- {name: sql, type: String, description: 'sql query statement to be executed'}
- {name: data_dir, type: String, description: 'temporal data storage for local execution'}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't mention data_dir in the README so I assume it's optional?

I think in MLX, we have the convention of marking input parameters as Required in the description. Optional parameters should have a default value.

inputs:
  - {name: token,           description: 'Required. GitHub token for accessing private repository'}
  - {name: url,             description: 'Required. GitHub raw path for accessing the credential file'}
  - {name: name,            description: 'Required. Secret Name to be stored in Kubernetes'}

The KFP docs say this https://www.kubeflow.org/docs/components/pipelines/sdk/component-development/#define-your-components-interface:

optional: Specifies if this input is optional. The value of this attribute is of type Bool, and defaults to False.

@ckadner ckadner changed the title add input-postgresql.yaml - 1st CLAIMED component Component to extract CSV files from PostgreSQL database search (1st CLAIMED component) Sep 9, 2021
@ckadner
Copy link
Member

ckadner commented Sep 9, 2021

@romeokienzler -- any reason to restrict this new component to extract CSV from a PostgrSQL database? Could it be generalized to work with any JDBC type database connection?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants