-
Hi @roll, continuing the discussion from Discord. Now considering the BulkCm XML format as described in the specification; a raw file example can be found there.

A usual approach would be to create a Resource per XML element (e.g. bulkCmConfigDataFile), each with at least a CSV file. These BulkCm files can become quite large, 1GB or more, and contain hundreds of Resources, each with thousands of rows.

My initial approach to integrate with the frictionless framework would be to create an extractor which reads the BulkCm file(s) and writes CSV files to local disk, then use the framework to create datapackage(s). That is, a regular ETL pipeline: first transform the raw file to CSV, and from there use frictionless to validate, transform and publish (a sketch of this follows below).

A possible alternative would be to create a BulkCmXmlPlugin, BulkCmXmlDialect and BulkCmXmlParser. The difficulty with this approach lies mostly in the write_row_stream parser method, which receives a single Resource object and must output a valid BulkCm file. I'm not sure I understand the Parser class and how to use it for BulkCm. Any help or guidance is appreciated.
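For context, a minimal sketch of the extractor idea, using only the standard library. The element name and columns are hypothetical placeholders, not taken from the BulkCm specification; iterparse streams the document, so files of 1GB or more never need to fit in memory:

```python
# Sketch of the extractor step: stream-parse a BulkCm XML file and
# append one CSV row per element of interest. Element and column
# names are hypothetical placeholders.
import csv
import xml.etree.ElementTree as ET

def extract_bulkcm(xml_path: str, csv_path: str, tag: str) -> None:
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["id", "text"])  # hypothetical header
        # iterparse streams the file, so a 1GB+ BulkCm file never
        # has to be loaded into memory at once
        for _, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag.endswith(tag):  # tag may carry an XML namespace prefix
                writer.writerow([elem.get("id"), (elem.text or "").strip()])
                elem.clear()  # free the subtree we just processed

extract_bulkcm("bulkcm.xml", "managedElement.csv", "ManagedElement")
```

From there the generated CSVs can be described, validated and packaged with frictionless as usual.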
-
When considering a BulkCmParser, @roll suggested creating a transformation step for outputting rows to the appropriate Resources. This transformation would have to execute just before the parser's call to write_row_stream, correct? So it wouldn't be part of the parser itself.
-
If I got it correctly, your case needs more flexibility in the first place, e.g.:

```python
from pprint import pprint
from frictionless import Package, Resource, transform, steps

def step(resource):
    # custom step: consume the row stream and store the data
    # according to business logic
    with resource:
        for row in resource.row_stream:
            pprint(row)  # placeholder: store the row here instead

source = Resource("data/source.xml", format="bulkcm")  # powered by a custom parser (read-only)
transform(source, steps=[step])
# Or you can work with the resource directly
```
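To make the "store the data according to business logic" comment concrete, one possible shape for the step is to fan rows out into one CSV file per target Resource. The resource_name routing column is a hypothetical assumption here; the custom parser would have to emit something like it:

```python
import csv

def step(resource):
    # Fan rows out into one CSV file per target Resource, keyed by a
    # hypothetical "resource_name" column emitted by the custom parser.
    writers, files = {}, {}
    with resource:
        for row in resource.row_stream:
            name = row["resource_name"]
            if name not in writers:
                f = open(f"{name}.csv", "w", newline="")
                files[name] = f
                writers[name] = csv.DictWriter(f, fieldnames=list(row.keys()))
                writers[name].writeheader()
            writers[name].writerow(dict(row))
    for f in files.values():
        f.close()
```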
-
But based on the size of the potential resources (they can be more than 1GB),
I would also go with temporary CSV files or a database (see the sketch below).
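A similarly hedged sketch of the database option, spooling rows into SQLite via the stdlib, with the same hypothetical resource_name routing column as above:

```python
import json
import sqlite3

def step(resource):
    # Spool rows into a staging table instead of temporary CSV files;
    # "resource_name" is the same hypothetical routing column as above.
    con = sqlite3.connect("bulkcm_staging.db")
    con.execute(
        "CREATE TABLE IF NOT EXISTS rows (resource_name TEXT, payload TEXT)"
    )
    with resource:
        con.executemany(
            "INSERT INTO rows VALUES (?, ?)",
            (
                (row["resource_name"], json.dumps(dict(row), default=str))
                for row in resource.row_stream
            ),
        )
    con.commit()
    con.close()
```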