[Bug] cannot import name 'FdedupRayTransformConfiguration' from 'fdedup_transform_ray' #898

Open
1 of 2 tasks
MFahadShahid opened this issue Dec 20, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@MFahadShahid

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/universal/fdedup

What happened + What you expected to happen

I set up a virtual environment and followed the documented steps for installing data-prep-kit. While testing the end-to-end pipeline examples (sample notebook and demo-with-launcher), I get the following error:
cannot import name 'FdedupRayTransformConfiguration' from 'fdedup_transform_ray' (/opt/conda/envs/data-prep-kit/lib/python3.11/site-packages/fdedup_transform_ray.py)
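
The failure can be checked in isolation, without running the whole notebook. The following is a minimal sketch using only the Python standard library; `describe_import` is a hypothetical helper written for this illustration (it is not part of data-prep-kit), and `dist_name` must be whatever distribution name was actually passed to pip.

```python
import importlib
import importlib.metadata
from typing import Optional


def describe_import(module_name: str, symbol: str, dist_name: Optional[str] = None) -> dict:
    """Report whether `module_name` imports and whether it exposes `symbol`.

    Hypothetical diagnostic helper; not part of data-prep-kit.
    """
    report = {"module_found": False, "symbol_found": False, "version": None}
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is missing or fails to import.
        return report
    report["module_found"] = True
    report["symbol_found"] = hasattr(module, symbol)
    if dist_name is not None:
        try:
            # Version of the installed distribution, if pip knows it by this name.
            report["version"] = importlib.metadata.version(dist_name)
        except importlib.metadata.PackageNotFoundError:
            pass
    return report


print(describe_import("fdedup_transform_ray", "FdedupRayTransformConfiguration"))
```

If `module_found` is true but `symbol_found` is false, the module is installed but no longer exports that class, which matches the error above.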

Reproduction script

input_folder = "sample_data/docid_out"
output_folder = "sample_data/fdedup_out"

import os
import sys

from data_processing.utils import ParamsUtils
# RayTransformLauncher is used below but was not imported in the original script
from data_processing_ray.runtime.ray import RayTransformLauncher
# The next import is the one that raises the ImportError
from fdedup_transform_ray import FdedupRayTransformConfiguration

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus": 0.8}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}
fdedup_params = {
    # columns used
    "fdedup_doc_column": "contents",
    "fdedup_id_column": "int_id_column",
    "fdedup_cluster_column": "hash_column",
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
}

# common_config_params is not defined in this excerpt; presumably it comes from an earlier notebook cell
params = common_config_params | fdedup_params

# Pass commandline params
sys.argv = ParamsUtils.dict_to_req(d=params)

# launch
fdedup_launcher = RayTransformLauncher(FdedupRayTransformConfiguration())
fdedup_launcher.launch()

Anything else

No response

OS

Red Hat Enterprise Linux (RHEL)

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@MFahadShahid MFahadShahid added the bug Something isn't working label Dec 20, 2024
@cmadam
Collaborator

cmadam commented Dec 20, 2024

@MFahadShahid: which version of data-prep-kit are you using? We recently changed the implementation of fuzzy dedup: it is now a pipeline of 4 transforms (signature calculation, cluster analysis, get duplicate list, and data cleaning). As such, FdedupRayTransformConfiguration no longer exists. Which documentation did you follow? We may need to update any outdated docs.

@MFahadShahid
Author

I'm using data-prep-kit version 0.2.3 and following the documentation on the main page (https://ibm.github.io/data-prep-kit/).

I'm currently at the "Run your first data prep pipeline" section, as shown in the attached image. This section includes two notebooks (sample-notebook.ipynb and demo_with_launcher.ipynb), both of which use FdedupRayTransformConfiguration. Could these notebooks be outdated?

Screenshot 2024-12-23 130816

@touma-I
Collaborator

touma-I commented Dec 23, 2024

@MFahadShahid The fdedup transform has undergone a number of improvements and we have not yet updated the documentation. Sorry about the confusion. For an example of how to use it with the 0.2.3 release, please see this notebook:

https://github.com/IBM/data-prep-kit/blob/releases/v0.2.3/transforms/universal/fdedup/fdedup_ray.ipynb

cc: @cmadam @revit13

@revit13
Collaborator

revit13 commented Dec 24, 2024

@cmadam is there an example similar to this local_python ededup example where "data_local_config" is used? Thanks

@cmadam
Collaborator

cmadam commented Dec 26, 2024

@revit13: please take a look at these notebooks; they show how to launch (run) fuzzy dedup locally for:

@azka2001

@cmadam @touma-I Could you please also provide sample Ray notebooks for Document Quality and Filtering? I'm getting errors in this outdated notebook as well, and I couldn't find notebooks that use the Ray implementation.

@matouma
Contributor

matouma commented Jan 8, 2025

@azka2001 Thank you for raising this issue. I submitted a PR with a fix. In the meantime, if you want early access to the notebooks before the PR is merged, please see the following two links:
https://github.com/matouma/data-prep-kit/blob/3b53e0cd8aff59b74887f5c969c6532ea8a660e1/transforms/language/doc_quality/doc_quality-ray.ipynb
https://github.com/matouma/data-prep-kit/blob/3b53e0cd8aff59b74887f5c969c6532ea8a660e1/transforms/universal/filter/filter-ray.ipynb

@burn2l

burn2l commented Jan 16, 2025

I see that the missing class is also used in examples/notebooks/intro/dpk_intro_1_ray.ipynb

It's unfortunate that fuzzy dedup doesn't follow the same naming conventions and invocation as the other transforms, especially exact dedup. Could you not create a wrapper around the 4 steps and make it look like the other transforms?
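
To make the suggestion concrete, here is a rough sketch of what such a wrapper could look like. It is a toy, in-memory illustration only: `FuzzyDedupPipeline` and the four stage functions are hypothetical names chosen to mirror the four steps listed earlier in this thread, they are not data-prep-kit classes or its actual algorithm, and word-shingle Jaccard similarity stands in for whatever signature scheme the real transforms use.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Set


def signature_calculation(docs: Dict[int, str], k: int = 3) -> Dict[int, Set[str]]:
    """Stage 1 (toy): represent each document by its set of word k-shingles."""
    sigs = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        if len(words) < k:
            sigs[doc_id] = {" ".join(words)}
        else:
            sigs[doc_id] = {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}
    return sigs


def cluster_analysis(sigs: Dict[int, Set[str]], threshold: float) -> List[Set[int]]:
    """Stage 2 (toy): union-find over pairs with Jaccard similarity >= threshold."""
    parent = {doc_id: doc_id for doc_id in sigs}

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(sigs, 2):
        union = sigs[a] | sigs[b]
        jaccard = len(sigs[a] & sigs[b]) / len(union) if union else 1.0
        if jaccard >= threshold:
            parent[find(b)] = find(a)

    clusters: Dict[int, Set[int]] = {}
    for doc_id in sigs:
        clusters.setdefault(find(doc_id), set()).add(doc_id)
    return list(clusters.values())


def get_duplicate_list(clusters: List[Set[int]]) -> Set[int]:
    """Stage 3 (toy): keep the smallest id per cluster, mark the rest as duplicates."""
    duplicates: Set[int] = set()
    for cluster in clusters:
        duplicates |= cluster - {min(cluster)}
    return duplicates


def data_cleaning(docs: Dict[int, str], duplicates: Set[int]) -> Dict[int, str]:
    """Stage 4 (toy): drop the documents marked as duplicates."""
    return {doc_id: text for doc_id, text in docs.items() if doc_id not in duplicates}


@dataclass
class FuzzyDedupPipeline:
    """Hypothetical single entry point that chains the four stages."""
    threshold: float = 0.5

    def launch(self, docs: Dict[int, str]) -> Dict[int, str]:
        sigs = signature_calculation(docs)
        clusters = cluster_analysis(sigs, self.threshold)
        duplicates = get_duplicate_list(clusters)
        return data_cleaning(docs, duplicates)


docs = {
    1: "the quick brown fox jumps over the lazy dog",
    2: "the quick brown fox jumps over the lazy cat",
    3: "completely different text about data preparation pipelines",
}
print(sorted(FuzzyDedupPipeline(threshold=0.5).launch(docs)))
```

With these three sample documents, documents 1 and 2 share 6 of 8 distinct shingles, so document 2 is dropped and documents 1 and 3 are kept. The point is only that callers would see one `launch()` call, as with the other transforms, while the four steps stay separate internally.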

7 participants