diff --git a/README.md b/README.md
new file mode 100644
index 0000000..5f62db2
--- /dev/null
+++ b/README.md
@@ -0,0 +1,76 @@
+# Dataset, model weights, and source code for the paper "HistGen: Histopathology Report Generation via Local-Global Feature Encoding and Cross-modal Context Interaction"
+
+## Prerequisites
+Follow these instructions to create the conda environment and install the necessary packages:
+```
+git clone https://github.com/dddavid4real/HistGen.git
+cd HistGen
+conda env create -f requirements.yml
+```
+## HistGen WSI-report dataset
+Our curated dataset can be downloaded from .
+
+The structure of this folder is as follows:
+```
+HistGen WSI-report dataset/
+|-- WSIs
+|    |-- slide_1.svs
+|    |-- slide_2.svs
+|    ╵-- ...
+|-- dinov2_vitl
+|    |-- slide_1.pt
+|    |-- slide_2.pt
+|    ╵-- ...
+╵-- annotation.json
+```
+Here, **WSIs** contains the original WSI data from TCGA, **dinov2_vitl** contains the features extracted from the original WSIs by our pre-trained DINOv2 ViT-L backbone, and **annotation.json** contains the diagnostic reports and the case ids of their corresponding WSIs. Concretely, this file is structured as follows:
+```
+{
+    "train": [
+        {
+            "id": "TCGA-A7-A6VW-01Z-00-DX1.1BC4790C-DB45-4A3D-9C97-92C92C03FF60",
+            "report": "Final Surgical Pathology Report Procedure: Diagnosis A. Sentinel lymph node, left axilla ...",
+            "image_path": [
+                "/storage/Pathology/wsi-report/wsi/TCGA-A7-A6VW-01Z-00-DX1.1BC4790C-DB45-4A3D-9C97-92C92C03FF60.pt"
+            ],
+            "split": "train"
+        },
+        ...
+    ],
+
+    "val": [
+        {
+            "id": "...",
+            "report": "...",
+            "image_path": ["..."],
+            "split": "val"
+        },
+        ...
+    ],
+
+    "test": [
+        {
+            "id": "...",
+            "report": "...",
+            "image_path": ["..."],
+            "split": "test"
+        },
+        ...
+    ]
+}
+```
+The data is already split into train/val/test subsets with a ratio of 8:1:1. "id" is the case id of the WSI corresponding to the report, "report" is the full refined text produced by our proposed report cleaning pipeline, and "image_path" can be ignored.
+
+
+
+## Pre-trained DINOv2 ViT-L Feature Extractor
+We are organizing the training details, the datasets used, and other information needed to release the pre-trained model. Please stay tuned for updates.
+
+## HistGen WSI Report Generation Model
+To train, validate, and test our model, run the following commands:
+```
+cd HistGen
+conda activate histgen
+sh train_wsi_report.sh
+```
+Before running the script, set the paths and other hyperparameters in `train_wsi_report.sh`.
\ No newline at end of file
diff --git a/replace_pt_path.py b/replace_pt_path.py
new file mode 100644
index 0000000..9e46ea1
--- /dev/null
+++ b/replace_pt_path.py
@@ -0,0 +1,22 @@
+import json
+
+def update_image_path(json_file, old_path, new_path):
+    # Read the JSON file
+    with open(json_file, 'r') as file:
+        data = json.load(file)
+
+    # Update 'image_path' in 'train', 'val', and 'test'
+    for key in ['train', 'val', 'test']:
+        if key in data:
+            for item in data[key]:
+                item['image_path'] = [path.replace(old_path, new_path) for path in item['image_path']]
+
+    # Write the updated data back to the JSON file
+    with open(json_file, 'w') as file:
+        json.dump(data, file, indent=4)
+
+# Usage
+json_file = 'your_json_file.json' # Replace with your JSON file path
+old_path = '/storage/Pathology/wsi-report/wsi'
+new_path = '/new/path/here' # Replace with your new path
+update_image_path(json_file, old_path, new_path)
diff --git a/requirements.yml b/requirements.yml
index 27971f1..e399b44 100644
--- a/requirements.yml
+++ b/requirements.yml
@@ -1,4 +1,4 @@
-name: r2gen
+name: histgen
 channels:
   - pytorch
   - nvidia
@@ -163,4 +163,4 @@ dependencies:
       - threadpoolctl==3.3.0
       - tqdm==4.66.2
       - tzdata==2024.1
-prefix: /home/zguobc/miniconda3/envs/r2gen
+prefix: /home/zguobc/miniconda3/envs/histgen
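
The `annotation.json` layout described in the README of this patch can be sanity-checked with a short script. The following is a minimal sketch, not part of the patch: the function name `summarize_annotation` is hypothetical, and it assumes each feature file is stored as `<feature_dir>/<id>.pt`, which the README's directory listing and sample `image_path` suggest but do not state.

```python
import json
from pathlib import Path


def summarize_annotation(annotation_file, feature_dir):
    """Count cases per split and list (split, id) pairs with no feature file.

    Hypothetical helper; assumes features are named <feature_dir>/<id>.pt.
    """
    with open(annotation_file, "r") as f:
        data = json.load(f)

    counts = {}
    missing = []
    for split in ("train", "val", "test"):
        items = data.get(split, [])
        counts[split] = len(items)
        for item in items:
            # Case ids contain dots, so append ".pt" rather than replacing a suffix.
            feature_file = Path(feature_dir) / f"{item['id']}.pt"
            if not feature_file.is_file():
                missing.append((split, item["id"]))
    return counts, missing
```

Calling `summarize_annotation("annotation.json", "dinov2_vitl")` from inside the dataset folder returns the split sizes, which can be compared against the stated 8:1:1 ratio, together with any cases whose features are not found locally.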