Search before asking
I searched the issues and found no similar issues.
Component
Transforms/Other
What happened + What you expected to happen
When we run the notebook https://github.com/IBM/data-prep-kit/blob/dev/transforms/transforms-1.0-lang-ray.ipynb on Google Colab and reach the cell that executes pdf2parquet, we run into an error. We tried two parameter configurations:
1)
from data_processing.utils import GB
from dpk_pdf2parquet.ray.transform import Pdf2Parquet
Pdf2Parquet(input_folder="files-web2parquet",
            output_folder="files-pdf2parquet",
            data_files_to_use=['.pdf'],
            run_locally=True,
            num_cpus=1,
            memory=2 * GB,
            runtime_num_workers=2,
            pdf2parquet_contents_type='text/markdown').transform()
and got the following error:
23:17:31 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8}
INFO:dpk_pdf2parquet.transform:pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8}
23:17:31 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
23:17:31 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
23:17:31 INFO - number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}
INFO:data_processing_ray.runtime.ray.execution_configuration:number of workers 2 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}
23:17:31 INFO - actor creation delay 0
INFO:data_processing_ray.runtime.ray.execution_configuration:actor creation delay 0
23:17:31 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}
INFO:data_processing_ray.runtime.ray.execution_configuration:job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}
23:17:31 INFO - data factory data_ is using local data access: input_folder - files-web2parquet output_folder - files-pdf2parquet
INFO:data_processing.data_access.data_access_factory_basef02c183b-e9e7-42ae-adee-2d2478d8bc43:data factory data_ is using local data access: input_folder - files-web2parquet output_folder - files-pdf2parquet
23:17:31 INFO - data factory data_ max_files -1, n_sample -1
INFO:data_processing.data_access.data_access_factory_basef02c183b-e9e7-42ae-adee-2d2478d8bc43:data factory data_ max_files -1, n_sample -1
23:17:31 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
INFO:data_processing.data_access.data_access_factory_basef02c183b-e9e7-42ae-adee-2d2478d8bc43:data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
23:17:31 INFO - Running locally
INFO:data_processing_ray.runtime.ray.transform_launcher:Running locally
2025-01-14 23:17:31,876 WARNING worker.py:1451 -- SIGTERM handler is not set because current thread is not the main thread.
2025-01-14 23:17:35,633 INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(orchestrate pid=1877) 23:17:45 INFO - orchestrator started at 2025-01-14 23:17:45
(orchestrate pid=1877) 23:17:45 INFO - Number of files is 1, source profile {'max_file_size': 5.308699607849121, 'min_file_size': 5.308699607849121, 'total_file_size': 5.308699607849121}
(orchestrate pid=1877) 23:17:45 INFO - Cluster resources: {'cpus': 2, 'gpus': 0, 'memory': 7.459043885581195, 'object_store': 3.729521941393614}
(orchestrate pid=1877) 23:17:45 INFO - Number of workers - 2 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each
(orchestrate pid=1877) created [Actor(RayTransformFileProcessor, e903cdc1ae855e22e8a9a94001000000), Actor(RayTransformFileProcessor, 07414990a5043c4c25469d9401000000)], alive []
(orchestrate pid=1877) Traceback (most recent call last):
(orchestrate pid=1877) File "/usr/local/lib/python3.10/dist-packages/data_processing_ray/runtime/ray/transform_orchestrator.py", line 96, in orchestrate
(orchestrate pid=1877) processors = RayUtils.create_actors(
(orchestrate pid=1877) File "/usr/local/lib/python3.10/dist-packages/data_processing_ray/runtime/ray/ray_utils.py", line 129, in create_actors
(orchestrate pid=1877) raise UnrecoverableException(f"out of {len(actors)} created actors only {len(alive)} alive")
(orchestrate pid=1877) data_processing.utils.unrecoverable.UnrecoverableException: out of 2 created actors only 0 alive
(orchestrate pid=1877) 23:19:47 ERROR - Exception during execution out of 2 created actors only 0 alive: None
23:19:57 INFO - Completed execution in 2.428 min, execution result 1
INFO:data_processing_ray.runtime.ray.transform_launcher:Completed execution in 2.428 min, execution result 1
2)
from data_processing.utils import GB
from dpk_pdf2parquet.ray.transform import Pdf2Parquet
Pdf2Parquet(input_folder="files-web2parquet",
            output_folder="files-pdf2parquet",
            data_files_to_use=['.pdf'],
            run_locally=True,
            num_cpus=1,
            memory=2 * GB,
            runtime_num_workers=1,
            pdf2parquet_contents_type='text/markdown').transform()
and got the following error:
23:37:14 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8}
INFO:dpk_pdf2parquet.transform:pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8}
23:37:14 INFO - pipeline id pipeline_id
INFO:data_processing.runtime.execution_configuration:pipeline id pipeline_id
23:37:14 INFO - code location None
INFO:data_processing.runtime.execution_configuration:code location None
23:37:14 INFO - number of workers 1 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}
INFO:data_processing_ray.runtime.ray.execution_configuration:number of workers 1 worker options {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1}
23:37:14 INFO - actor creation delay 0
INFO:data_processing_ray.runtime.ray.execution_configuration:actor creation delay 0
23:37:14 INFO - job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}
INFO:data_processing_ray.runtime.ray.execution_configuration:job details {'job category': 'preprocessing', 'job name': 'pdf2parquet', 'job type': 'ray', 'job id': 'job_id'}
23:37:14 INFO - data factory data_ is using local data access: input_folder - files-web2parquet output_folder - files-pdf2parquet
INFO:data_processing.data_access.data_access_factory_basef02c183b-e9e7-42ae-adee-2d2478d8bc43:data factory data_ is using local data access: input_folder - files-web2parquet output_folder - files-pdf2parquet
23:37:14 INFO - data factory data_ max_files -1, n_sample -1
INFO:data_processing.data_access.data_access_factory_basef02c183b-e9e7-42ae-adee-2d2478d8bc43:data factory data_ max_files -1, n_sample -1
23:37:14 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
INFO:data_processing.data_access.data_access_factory_basef02c183b-e9e7-42ae-adee-2d2478d8bc43:data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
23:37:14 INFO - Running locally
INFO:data_processing_ray.runtime.ray.transform_launcher:Running locally
2025-01-14 23:37:14,637 WARNING worker.py:1451 -- SIGTERM handler is not set because current thread is not the main thread.
2025-01-14 23:37:17,911 INFO worker.py:1777 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(orchestrate pid=7013) 23:37:29 INFO - orchestrator started at 2025-01-14 23:37:29
(orchestrate pid=7013) 23:37:29 INFO - Number of files is 1, source profile {'max_file_size': 5.308699607849121, 'min_file_size': 5.308699607849121, 'total_file_size': 5.308699607849121}
(orchestrate pid=7013) 23:37:29 INFO - Cluster resources: {'cpus': 2, 'gpus': 0, 'memory': 7.458782959729433, 'object_store': 3.729391478933394}
(orchestrate pid=7013) 23:37:29 INFO - Number of workers - 1 with {'num_cpus': 1, 'memory': 2147483648, 'max_restarts': -1} each
(orchestrate pid=7013) Traceback (most recent call last):
(orchestrate pid=7013) File "/usr/local/lib/python3.10/dist-packages/data_processing_ray/runtime/ray/transform_orchestrator.py", line 96, in orchestrate
(orchestrate pid=7013) processors = RayUtils.create_actors(
(orchestrate pid=7013) File "/usr/local/lib/python3.10/dist-packages/data_processing_ray/runtime/ray/ray_utils.py", line 129, in create_actors
(orchestrate pid=7013) raise UnrecoverableException(f"out of {len(actors)} created actors only {len(alive)} alive")
(orchestrate pid=7013) data_processing.utils.unrecoverable.UnrecoverableException: out of 1 created actors only 0 alive
(orchestrate pid=7013) 23:39:31 ERROR - Exception during execution out of 1 created actors only 0 alive: None
(orchestrate pid=7013) created [Actor(RayTransformFileProcessor, e07380e85cfc70a32291429501000000)], alive []
23:39:41 INFO - Completed execution in 2.45 min, execution result 1
INFO:data_processing_ray.runtime.ray.transform_launcher:Completed execution in 2.45 min, execution result 1
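As a rough sanity check on the numbers in the two logs above, the sketch below compares the resources requested for the workers with the cluster totals the orchestrator printed, and measures how long each run waited between "orchestrator started" and the exception. It uses only the Python standard library. The helper names (`requested_vs_cluster`, `seconds_between`) are hypothetical and not part of data-prep-kit or Ray, and the assumption that the logged cluster memory is in GiB is ours. It does not identify the root cause.

```python
from datetime import datetime

GIB = 1024 ** 3


def requested_vs_cluster(cluster, num_workers, worker_options):
    """Return {resource: (requested, available, fits)} for CPUs and memory."""
    cpus_requested = num_workers * worker_options["num_cpus"]
    # Worker memory is logged in bytes; cluster memory is ASSUMED to be in GiB.
    mem_requested = num_workers * worker_options["memory"] / GIB
    return {
        "cpus": (cpus_requested, cluster["cpus"], cpus_requested <= cluster["cpus"]),
        "memory": (mem_requested, cluster["memory"], mem_requested <= cluster["memory"]),
    }


def seconds_between(start, end, fmt="%H:%M:%S"):
    """Elapsed seconds between two same-day log timestamps."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds()


# Values copied from the logs of the two runs.
worker_options = {"num_cpus": 1, "memory": 2147483648, "max_restarts": -1}
run1 = requested_vs_cluster({"cpus": 2, "memory": 7.459043885581195}, 2, worker_options)
run2 = requested_vs_cluster({"cpus": 2, "memory": 7.458782959729433}, 1, worker_options)

# Time from "orchestrator started" to the "Exception during execution" line.
wait1 = seconds_between("23:17:45", "23:19:47")
wait2 = seconds_between("23:37:29", "23:39:31")

for name, result, wait in (("run 1", run1, wait1), ("run 2", run2, wait2)):
    for resource, (requested, available, fits) in result.items():
        print(f"{name}: {resource} requested={requested} available={available:.2f} fits={fits}")
    print(f"{name}: waited {wait:.0f} s before the exception")
```

On paper the requests fit in both runs, although the first configuration asks for every CPU the cluster reports. Both runs fail the same number of seconds after the orchestrator starts, which may point to a fixed wait for the actors to come up rather than an immediate crash; we have not confirmed this against the library code.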
Reproduction script
See above.
Anything else
No response
OS
MacOS (limited support)
Python
3.10.x
Are you willing to submit a PR?
Yes I am willing to submit a PR!