Error in running Ray version of pdf2parquet on Google Colab #940

Open
1 of 2 tasks
shahrokhDaijavad opened this issue Jan 14, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@shahrokhDaijavad
Member

shahrokhDaijavad commented Jan 14, 2025

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

What happened + What you expected to happen

When we run the notebook https://github.com/IBM/data-prep-kit/blob/dev/transforms/transforms-1.0-lang-ray.ipynb on Google Colab and reach the cell that executes pdf2parquet, we run into an error. We tried two parameter configurations:
1) With runtime_num_workers = 2:

from data_processing.utils import GB
from dpk_pdf2parquet.ray.transform import Pdf2Parquet

Pdf2Parquet(
    input_folder="files-web2parquet",
    output_folder="files-pdf2parquet",
    data_files_to_use=['.pdf'],
    run_locally=True,
    num_cpus=1,
    memory=2 * GB,
    runtime_num_workers=2,
    pdf2parquet_contents_type='text/markdown',
).transform()

and got the following error:

23:17:31 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8}
INFO:dpk_pdf2parquet.transform:pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8}
23:17:31 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
23:17:31 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
23:17:31 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}
INFO:data_processing_ray.runtime.ray.execution_configuration:number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}
23:17:31 INFO - actor creation delay 0
INFO:data_processing_ray.runtime.ray.execution_configuration:actor creation delay 0
23:17:31 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}
INFO:data_processing_ray.runtime.ray.execution_configuration:job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}
23:17:31 INFO - data factory data_ is using local data access: input_folder - files-web2parquet output_folder - files-pdf2parquet
INFO:data_processing.data_access.data_access_factory_basef02c183b-e9e7-42ae-adee-2d2478d8bc43:data factory data_ is using local data access: input_folder - files-web2parquet output_folder - files-pdf2parquet
23:17:31 INFO - data factory data_ max_files -1, n_sample -1
INFO:data_processing.data_access.data_access_factory_basef02c183b-e9e7-42ae-adee-2d2478d8bc43:data factory data_ max_files -1, n_sample -1
23:17:31 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
INFO:data_processing.data_access.data_access_factory_basef02c183b-e9e7-42ae-adee-2d2478d8bc43:data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
23:17:31 INFO - Running locally
INFO:data_processing_ray.runtime.ray.transform_launcher:Running locally
2025-01-14 23:17:31,876 WARNING worker.py:1451 -- SIGTERM handler is not set because current thread is not the main thread.
2025-01-14 23:17:35,633 INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(orchestrate pid=1877) 23:17:45 INFO - orchestrator started at 2025-01-14 23:17:45
(orchestrate pid=1877) 23:17:45 INFO - Number of files is 1, source profile {'max_file_size': 5.308699607849121, 'min_file_size': 5.308699607849121, 'total_file_size': 5.308699607849121}
(orchestrate pid=1877) 23:17:45 INFO - Cluster resources: {'cpus': 2, 'gpus': 0, 'memory': 7.459043885581195, 'object_store': 3.729521941393614}
(orchestrate pid=1877) 23:17:45 INFO - Number of workers - 2 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each

(orchestrate pid=1877) created [Actor(RayTransformFileProcessor, e903cdc1ae855e22e8a9a94001000000), Actor(RayTransformFileProcessor, 07414990a5043c4c25469d9401000000)], alive []

(orchestrate pid=1877) Traceback (most recent call last):
(orchestrate pid=1877) File "/usr/local/lib/python3.10/dist-packages/data_processing_ray/runtime/ray/transform_orchestrator.py", line 96, in orchestrate
(orchestrate pid=1877) processors = RayUtils.create_actors(
(orchestrate pid=1877) File "/usr/local/lib/python3.10/dist-packages/data_processing_ray/runtime/ray/ray_utils.py", line 129, in create_actors
(orchestrate pid=1877) raise UnrecoverableException(f"out of {len(actors)} created actors only {len(alive)} alive")
(orchestrate pid=1877) data_processing.utils.unrecoverable.UnrecoverableException: out of 2 created actors only 0 alive
(orchestrate pid=1877) 23:19:47 ERROR - Exception during execution out of 2 created actors only 0 alive: None
23:19:57 INFO - Completed execution in 2.428 min, execution result 1
INFO:data_processing_ray.runtime.ray.transform_launcher:Completed execution in 2.428 min, execution result 1

2) With runtime_num_workers = 1 and all other parameters unchanged:

from data_processing.utils import GB
from dpk_pdf2parquet.ray.transform import Pdf2Parquet

Pdf2Parquet(
    input_folder="files-web2parquet",
    output_folder="files-pdf2parquet",
    data_files_to_use=['.pdf'],
    run_locally=True,
    num_cpus=1,
    memory=2 * GB,
    runtime_num_workers=1,
    pdf2parquet_contents_type='text/markdown',
).transform()

and got the same kind of error:

23:37:14 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8}
INFO:dpk_pdf2parquet.transform:pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8}
23:37:14 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
23:37:14 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
23:37:14 INFO - number of workers 1 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}
INFO:data_processing_ray.runtime.ray.execution_configuration:number of workers 1 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}
23:37:14 INFO - actor creation delay 0
INFO:data_processing_ray.runtime.ray.execution_configuration:actor creation delay 0
23:37:14 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}
INFO:data_processing_ray.runtime.ray.execution_configuration:job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}
23:37:14 INFO - data factory data_ is using local data access: input_folder - files-web2parquet output_folder - files-pdf2parquet
INFO:data_processing.data_access.data_access_factory_basef02c183b-e9e7-42ae-adee-2d2478d8bc43:data factory data_ is using local data access: input_folder - files-web2parquet output_folder - files-pdf2parquet
23:37:14 INFO - data factory data_ max_files -1, n_sample -1
INFO:data_processing.data_access.data_access_factory_basef02c183b-e9e7-42ae-adee-2d2478d8bc43:data factory data_ max_files -1, n_sample -1
23:37:14 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
INFO:data_processing.data_access.data_access_factory_basef02c183b-e9e7-42ae-adee-2d2478d8bc43:data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
23:37:14 INFO - Running locally
INFO:data_processing_ray.runtime.ray.transform_launcher:Running locally
2025-01-14 23:37:14,637 WARNING worker.py:1451 -- SIGTERM handler is not set because current thread is not the main thread.
2025-01-14 23:37:17,911 INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(orchestrate pid=7013) 23:37:29 INFO - orchestrator started at 2025-01-14 23:37:29
(orchestrate pid=7013) 23:37:29 INFO - Number of files is 1, source profile {'max_file_size': 5.308699607849121, 'min_file_size': 5.308699607849121, 'total_file_size': 5.308699607849121}
(orchestrate pid=7013) 23:37:29 INFO - Cluster resources: {'cpus': 2, 'gpus': 0, 'memory': 7.458782959729433, 'object_store': 3.729391478933394}
(orchestrate pid=7013) 23:37:29 INFO - Number of workers - 1 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each
(orchestrate pid=7013) Traceback (most recent call last):
(orchestrate pid=7013) File "/usr/local/lib/python3.10/dist-packages/data_processing_ray/runtime/ray/transform_orchestrator.py", line 96, in orchestrate
(orchestrate pid=7013) processors = RayUtils.create_actors(
(orchestrate pid=7013) File "/usr/local/lib/python3.10/dist-packages/data_processing_ray/runtime/ray/ray_utils.py", line 129, in create_actors
(orchestrate pid=7013) raise UnrecoverableException(f"out of {len(actors)} created actors only {len(alive)} alive")
(orchestrate pid=7013) data_processing.utils.unrecoverable.UnrecoverableException: out of 1 created actors only 0 alive
(orchestrate pid=7013) 23:39:31 ERROR - Exception during execution out of 1 created actors only 0 alive: None

(orchestrate pid=7013) created [Actor(RayTransformFileProcessor, e07380e85cfc70a32291429501000000)], alive []

23:39:41 INFO - Completed execution in 2.45 min, execution result 1
INFO:data_processing_ray.runtime.ray.transform_launcher:Completed execution in 2.45 min, execution result 1
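
As a triage aid, here is a minimal sketch that compares each worker request against the totals from the "Cluster resources" log lines above. It is not a diagnosis. `request_fits` and `reserved_cpus` are hypothetical names introduced here, not part of data-prep-kit or Ray, and the sketch assumes the reported memory figure is in GB.

```python
# Sketch only: checks whether a worker request fits the totals Ray reported.
# request_fits and reserved_cpus are hypothetical, for illustration.

GB = 1024 ** 3  # consistent with the log: 2 * GB was reported as 2147483648


def request_fits(cluster_cpus, cluster_memory_gb, num_workers,
                 cpus_per_worker, memory_per_worker_bytes, reserved_cpus=0):
    """Return (fits, cpus_needed, memory_needed_gb) for a worker request.

    reserved_cpus is an assumption: CPUs held by anything other than the
    workers (for example an orchestrator task). Leave it at 0 to ignore that.
    """
    cpus_needed = num_workers * cpus_per_worker + reserved_cpus
    memory_needed_gb = num_workers * memory_per_worker_bytes / GB
    fits = (cpus_needed <= cluster_cpus
            and memory_needed_gb <= cluster_memory_gb)
    return fits, cpus_needed, memory_needed_gb


# Totals taken from the logs above: 2 CPUs and roughly 7.459 GB of memory.
two_workers = request_fits(2, 7.459, num_workers=2, cpus_per_worker=1,
                           memory_per_worker_bytes=2 * GB)
one_worker = request_fits(2, 7.459, num_workers=1, cpus_per_worker=1,
                          memory_per_worker_bytes=2 * GB)
print(two_workers)
print(one_worker)
```

On the raw totals both requests fit, so this arithmetic alone does not explain why no actor came alive. If one CPU is assumed to be held elsewhere (`reserved_cpus=1`), the two-worker request no longer fits, while the one-worker request still does.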

Reproduction script

See above.

Anything else

No response

OS

MacOS (limited support)

Python

3.10.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
shahrokhDaijavad added the bug label on Jan 14, 2025