Description: This repo is the central hub for the initial phase of data exploration and assessment of a project aimed at comprehensive data management and processing of the PSA Field data. It focuses on addressing the initial backlog of images and metadata while emphasizing a deep understanding of our existing Field images. This phase is critical for laying the groundwork for advanced data processing and management in later stages of the project.
Key Features:
- Data Volume Assessment: Contains tools and methodologies for evaluating the size and complexity of the existing data pool, ensuring an understanding of the scale of data we're managing.
- Data Visualization and Status Reporting: Features scripts and resources for creating visual representations of the data's current status and contents. This aids in identifying patterns, anomalies, and key areas requiring attention.
- Metadata Quality Review and Image Sampling: Offers guidelines and tools for inspecting metadata accuracy and completeness, along with methods for performing quality checks on image samples.
- Data Exploration and Reporting: Includes exploratory data analysis tools for understanding the characteristics and structure of the dataset, as well as reporting mechanisms for documenting findings and progress.
- Data Organization and Issue Resolution: Provides strategies and scripts for identifying immediate dataset issues, along with solutions to these challenges.
To manage the project's dependencies efficiently, we use Conda, a powerful package manager and environment manager. Follow these steps to install Conda if you haven't already:
- Download the appropriate version of Miniconda for your operating system from the official Miniconda website.
- Follow the installation instructions provided on the website for your OS. This typically involves running the installer from the command line and following the on-screen prompts.
- Once installed, open a new terminal window and run `conda list` to ensure Conda was installed correctly. You should see a list of installed packages.
After installing Conda, you can set up an environment for this project using an environment file, which specifies all necessary dependencies. Here's how:
- Clone this repository to your local machine.
- Navigate to the repository directory in your terminal.
- Locate the `environment.yaml` file in the repository. This file contains the list of packages needed for the project.
- Create a new Conda environment by running the following command:
  ```
  conda env create -f environment.yaml
  ```
  This command reads the `environment.yaml` file and creates an environment with the name and dependencies specified within it.
- Once the environment is created, activate it with:
  ```
  conda activate <env_name>
  ```
  Replace `<env_name>` with the name of the environment specified in the `environment.yaml` file.
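For reference, a Conda environment file generally looks like the sketch below. The name `field` matches the activation command shown later in this README, but the package list here is illustrative, not the project's actual dependency pins:

```yaml
name: field
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas              # tabular metadata handling
  - pillow              # reading JPG EXIF data
  - matplotlib          # plotting image samples
  - azure-storage-blob  # access to the Azure blob containers
```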
With the environment set up and activated, you can run the scripts provided in the repository to begin data exploration and analysis:
- Ensure your Conda environment is activated:
  ```
  conda activate field
  ```
- To run a pipeline script, use the following command syntax:
  ```
  sh run_volume_assessment.sh
  ```
- [NOTE] Set up the pipeline in the main config first. To run the report script, use the following command syntax:
  ```
  python FIELD_REPORT.py
  ```
This script's main purpose is to quickly gather image DateTime information that can later be used to form batches. For each image, it downloads the JPG from the Azure blob container, reads the EXIF data to get the capture datetime, and then removes the downloaded JPG. The extracted date-time metadata is appended to an existing persistent CSV table. This is a crucial step that allows us to form "batches" for preprocessing. The script must be run before the "create_batches" task. If no new data is found, nothing happens.
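A minimal sketch of this extraction step, assuming the `azure-storage-blob` and Pillow packages; the container client setup, CSV layout, and function names are illustrative rather than the script's actual code:

```python
import csv
import io

from azure.storage.blob import ContainerClient
from PIL import Image

DATETIME_TAG = 306  # EXIF tag id for "DateTime" in the base IFD

def extract_capture_datetime(container: ContainerClient, blob_name: str):
    """Download a JPG blob into memory, read its EXIF DateTime, and discard the bytes."""
    data = container.download_blob(blob_name).readall()
    exif = Image.open(io.BytesIO(data)).getexif()
    return exif.get(DATETIME_TAG)  # e.g. "2023:08:14 10:32:05", or None if absent

def append_datetimes(container: ContainerClient, blob_names, csv_path: str) -> None:
    """Append (blob_name, capture_datetime) rows to the persistent CSV table."""
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        for name in blob_names:
            writer.writerow([name, extract_capture_datetime(container, name)])
```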
Features of Note:
- The script finds the most recent "merged..." CSV file in "data/processed_data" and merges it with persistent data for continuous updates.
- It processes each JPG image to extract and append EXIF DateTime metadata to the relevant records in the CSV.
- Concurrency Handling: `concurrent.futures.ThreadPoolExecutor` is used to download images and read their EXIF data in parallel.
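A sketch of these two behaviors, reusing `extract_capture_datetime` from the snippet above; the directory layout and worker count are assumptions:

```python
import glob
import os
from concurrent.futures import ThreadPoolExecutor

def latest_merged_csv(processed_dir: str = "data/processed_data") -> str:
    """Return the most recently modified 'merged...' CSV in the processed-data folder."""
    candidates = glob.glob(os.path.join(processed_dir, "merged*.csv"))
    return max(candidates, key=os.path.getmtime)

def read_all_datetimes(container, blob_names, max_workers: int = 8):
    """Download blobs and read their EXIF datetimes in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda name: extract_capture_datetime(container, name),
                             blob_names))
```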
This script creates batches using the updated DateTime information from `append_datetime`, organizes raw images into "batches", and copies those image batches to the field-batches blob container. It adjusts and groups images into "batches" based on metadata and filters out already processed or duplicate batches. Batch groupings are based on State, capture date, and 3-hour capture time intervals. It offers the flexibility to process data either concurrently or sequentially.
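The grouping rule can be expressed compactly with pandas; the column names here (`State`, `CaptureDateTime`) are assumptions about the CSV layout, not the script's actual schema:

```python
import pandas as pd

def assign_batches(df: pd.DataFrame) -> pd.DataFrame:
    """Label each image row with a batch id: State + capture date + 3-hour window."""
    # EXIF datetimes use the "YYYY:MM:DD HH:MM:SS" format
    dt = pd.to_datetime(df["CaptureDateTime"], format="%Y:%m:%d %H:%M:%S")
    out = df.copy()
    out["batch_id"] = (
        df["State"].astype(str)
        + "_" + dt.dt.strftime("%Y-%m-%d")
        + "_" + (dt.dt.hour // 3).astype(str)  # 3-hour interval index (0-7)
    )
    return out
```

Already-processed or duplicate batch ids can then be filtered out before the copy to the field-batches container.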
This script is designed to facilitate the quality check process by performing the following functions:
- Random Image Selection: Automatically selects up to 15 images that have been uploaded in the past 15 days from a merged data table.
- Image Plotting with Metadata: For each selected image, the script generates a plot that includes the image itself along with key metadata fields. Plots are saved in the `report/<date>/inspection` folder.
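A sketch of the sampling-and-plotting flow, assuming a pandas table with `UploadDateTime`, `ImagePath`, `State`, and `CaptureDateTime` columns (all hypothetical names):

```python
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
from PIL import Image

def plot_inspection_sample(df: pd.DataFrame, report_root: Path,
                           n: int = 15, days: int = 15) -> None:
    """Sample up to `n` images uploaded in the last `days` days and save one
    annotated plot per image under report/<date>/inspection."""
    cutoff = pd.Timestamp.now() - pd.Timedelta(days=days)
    recent = df[pd.to_datetime(df["UploadDateTime"]) >= cutoff]
    sample = recent.sample(n=min(n, len(recent)))

    out_dir = report_root / pd.Timestamp.now().strftime("%Y-%m-%d") / "inspection"
    out_dir.mkdir(parents=True, exist_ok=True)

    for _, row in sample.iterrows():
        fig, ax = plt.subplots(figsize=(8, 8))
        ax.imshow(Image.open(row["ImagePath"]))
        ax.axis("off")
        ax.set_title(f"{row['State']}  {row['CaptureDateTime']}")  # key metadata fields
        fig.savefig(out_dir / f"{Path(row['ImagePath']).stem}.png")
        plt.close(fig)
```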