
Dataset#


Overview#


SED comes with the ability to download and extract any URL-based dataset. By default, the following datasets are available:

  • WSe2
  • TaS2
  • Gd_W110
  • W110

It is easy to extend this list using a JSON file.
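
The Example: Adding Custom Datasets section below walks through this in detail. As a minimal preview, a new entry can be registered with the DatasetsManager; the dataset name and URL here are hypothetical, for illustration only:

from sed.dataset import DatasetsManager

# Hypothetical dataset name and URL; levels=["user"] writes the entry to the user-level datasets.json
DatasetsManager.add(
    data_name="MyData",
    info={"url": "https://example.com/MyData.zip", "subdirs": ["scan_1"]},
    levels=["user"],
)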


Getting Datasets#


Importing Required Modules#

import os
from sed.dataset import dataset

get() Method#


The get method requires only the dataset name; an alternative root_dir can optionally be provided.


Try interrupting the download process and restarting it to see that it resumes from where it stopped.

dataset.get("WSe2", remove_zip=False)

Example Output:

Using default data path for "WSe2": "<user_path>/datasets/WSe2"

3%|▎         | 152M/5.73G [00:02<01:24, 71.3MB/s]

Using default data path for "WSe2": "<user_path>/datasets/WSe2"

100%|██████████| 5.73G/5.73G [01:09<00:00, 54.3MB/s]

Download complete.

By default (when remove_zip=False is not passed), the zip file is deleted after extraction:

dataset.get("WSe2")

Setting use_existing=False allows downloading the data to a new location instead of using existing data.

dataset.get("WSe2", root_dir="new_datasets", use_existing=False)

Example Output:

Using specified data path for "WSe2": "<user_path>/new_datasets/datasets/WSe2"
Created new directory at <user_path>/new_datasets/datasets/WSe2

Interrupting extraction here behaves similarly, resuming from where it stopped.


If the extracted files are deleted, rerunning the command below re-extracts them from the zip file:

dataset.get("WSe2", remove_zip=False)

Example Output:

Using default data path for "WSe2": "<user_path>/datasets/WSe2"
WSe2 data is already fully downloaded.

5.73GB [00:00, 12.6MB/s]

Download complete.
Extracting WSe2 data...

100%|██████████| 113/113 [02:41<00:00, 1.43s/file]

WSe2 data extracted successfully.

Since remove_zip=False was passed, the zip file is not deleted.
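
If the retained archive is no longer needed, it can be deleted manually. A minimal sketch, assuming the leftover zip file sits inside the dataset directory reported by dataset.dir (this location is an assumption and may differ):

from pathlib import Path

# Assumption: the downloaded .zip is stored alongside the extracted data in dataset.dir
for zip_file in Path(dataset.dir).glob("*.zip"):
    zip_file.unlink()  # delete the leftover archive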

+
+
+
+
+

remove() Method#


The remove method allows removing some or all instances of existing data.


Remove only one instance:

dataset.remove("WSe2", instance=dataset.existing_data_paths[0])

Example Output:

Removed <user_path>/datasets/WSe2

Remove all instances:

dataset.remove("WSe2")

Example Output:

WSe2 data is not present.

Useful Attributes#


Available Datasets#

dataset.available

Example Output:

['WSe2', 'TaS2', 'Gd_W110', 'W110']

Data Directory#

dataset.dir

Example Output:

'<user_path>/datasets/WSe2'

Subdirectories#

dataset.subdirs

Example Output:

['<user_path>/datasets/WSe2/Scan049_1',
 '<user_path>/datasets/WSe2/energycal_2019_01_08']

Existing Data Paths#

dataset.existing_data_paths

Example Output:

['<user_path>/new_dataset/datasets/WSe2',
 '<user_path>/datasets/WSe2']

Example: Adding Custom Datasets#


DatasetsManager#


The DatasetsManager allows adding or removing datasets in a datasets.json file at different levels (module, user, folder). It also checks all levels to list the available datasets.

import os
from sed.dataset import DatasetsManager

Adding a New Dataset#


This example adds a dataset to both the folder and user levels. Setting rearrange_files=True moves all files from subfolders into the main dataset directory.

example_dset_name = "Example"
example_dset_info = {
    "url": "https://example-dataset.com/download",  # Not a real path
    "subdirs": ["Example_subdir"],
    "rearrange_files": True
}

DatasetsManager.add(data_name=example_dset_name, info=example_dset_info, levels=["folder", "user"])

Example Output:

Added Example dataset to folder datasets.json
Added Example dataset to user datasets.json

Verify that datasets.json has been created:

assert os.path.exists("./datasets.json")
dataset.available

Example Output:

['Example', 'WSe2', 'TaS2', 'Gd_W110']

Removing a Dataset#


Remove the Example dataset from the user JSON file:

DatasetsManager.remove(data_name=example_dset_name, levels=["user"])

Example Output:

Removed Example dataset from user datasets.json

Adding an already existing dataset will result in an error:

DatasetsManager.add(data_name=example_dset_name, info=example_dset_info, levels=["folder"])

Example Output:

ValueError: Dataset Example already exists in folder datasets.json.

Downloading the Example Dataset#

dataset.get("Example")

Example Output:

Using default data path for "Example": "<user_path>/datasets/Example"
Created new directory at <user_path>/datasets/Example
Download complete.
Extracting Example data...

100%|██████████| 4/4 [00:00<00:00, 28.10file/s]

Example data extracted successfully.

Download to Another Location#

dataset.get("Example", root_dir="new_datasets", use_existing=False)

Example Output:

Using specified data path for "Example": "<user_path>/new_datasets/datasets/Example"
Created new directory at <user_path>/new_datasets/datasets/Example

Removing an Instance#

print(dataset.existing_data_paths)
path_to_remove = dataset.existing_data_paths[0]
dataset.remove(data_name="Example", instance=path_to_remove)

Example Output:

Removed <user_path>/new_datasets/datasets/Example

Verify that the path was removed:

assert not os.path.exists(path_to_remove)
print(dataset.existing_data_paths)

Example Output:

['<user_path>/datasets/Example']

Default datasets.json#

{
    "WSe2": {
        "url": "https://zenodo.org/record/6369728/files/WSe2.zip",
        "subdirs": [
            "Scan049_1",
            "energycal_2019_01_08"
        ]
    },
    "Gd_W110": {
        "url": "https://zenodo.org/records/10658470/files/single_event_data.zip",
        "subdirs": [
            "analysis_data",
            "calibration_data"
        ],
        "rearrange_files": true
    },
    "W110": {
        "url": "https://zenodo.org/records/12609441/files/single_event_data.zip",
        "subdirs": [
            "analysis_data",
            "calibration_data"
        ],
        "rearrange_files": true
    },
    "TaS2": {
        "url": "https://zenodo.org/records/10160182/files/TaS2.zip",
        "subdirs": [
            "Scan0121_1",
            "energycal_2020_07_20"
        ]
    },
    "Au_Mica": {
        "url": "https://zenodo.org/records/13952965/files/Au_Mica_SXP.zip"
    },
    "Test": {
        "url": "http://test.com/files/file.zip",
        "subdirs": [
            "subdir"
        ],
        "rearrange_files": true
    }
}
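
For reference, a datasets.json file like the one above can be inspected directly with Python's json module. A minimal sketch; the path below is hypothetical and depends on where the file actually lives:

import json
from pathlib import Path

# Hypothetical path, for illustration only
datasets_json = Path("path/to/datasets.json")

with datasets_json.open() as f:
    default_datasets = json.load(f)

print(list(default_datasets.keys()))  # ['WSe2', 'Gd_W110', 'W110', 'TaS2', 'Au_Mica', 'Test']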