- Take any CSV or text file of 2 GB or more.
- Read the file and present the approach used to read it.
- Try different methods of reading the file (e.g., Dask, Modin, Ray, pandas) and present your findings in terms of computational efficiency.
- Perform basic validation on the column names, e.g., remove special characters and whitespace.
- Since the schema is already known, create a YAML file that defines the column names along with the separators used for reading and writing the file.
- Validate the number of columns and the column names of the ingested file against the YAML file.
- Write the file as a pipe-separated (|) text file in gz format.
- Create a summary of the file: total number of rows, total number of columns, and file size.
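
The validation, write, and summary steps in the list above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the repository's actual implementation: the `CONFIG` dictionary stands in for the YAML file (in practice it would probably be loaded with something like PyYAML's `yaml.safe_load`), and the key names, column names, and function names are all hypothetical. It streams the file row by row with `csv` rather than using pandas, Dask, Modin, or Ray, so it says nothing about their relative speed.

```python
import csv
import gzip
import os
import re

# Hypothetical schema. In the workflow above this would live in a YAML file;
# the key names and column names here are assumptions made for illustration.
CONFIG = {
    "inbound_delimiter": ",",
    "outbound_delimiter": "|",
    "columns": ["id", "first_name", "amount_usd"],
}


def clean_column_name(name):
    """Trim, lowercase, and collapse runs of non-alphanumerics into '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip().lower())
    return cleaned.strip("_")


def validate_header(header, expected):
    """Return the cleaned header, or raise ValueError if it does not match."""
    cleaned = [clean_column_name(col) for col in header]
    if len(cleaned) != len(expected):
        raise ValueError(
            f"expected {len(expected)} columns, found {len(cleaned)}"
        )
    if cleaned != expected:
        raise ValueError(f"column names differ: {cleaned} != {expected}")
    return cleaned


def ingest(src_path, dst_path, config):
    """Validate the header, rewrite the file as gzipped text, and summarize."""
    with open(src_path, newline="", encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter=config["inbound_delimiter"])
        writer = csv.writer(
            dst, delimiter=config["outbound_delimiter"], lineterminator="\n"
        )
        columns = validate_header(next(reader), config["columns"])
        writer.writerow(columns)
        rows = 0
        for row in reader:  # stream rows so the whole file never sits in memory
            writer.writerow(row)
            rows += 1
    return {
        "rows": rows,
        "columns": len(columns),
        "source_size_bytes": os.path.getsize(src_path),
    }
```

Calling `ingest("input.csv", "output.txt.gz", CONFIG)` (both paths are placeholders) returns a dictionary with the row count, column count, and source file size in bytes. Streaming keeps memory use flat for a multi-gigabyte input, at the cost of the vectorized operations a DataFrame library would offer.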
murattkiran/File-ingestion-and-schema-validation