Skip to content

Commit

Permalink
move dataset docs to user guide
Browse files Browse the repository at this point in the history
  • Loading branch information
zain-sohail committed Feb 25, 2025
1 parent aeffacf commit 660a2b0
Show file tree
Hide file tree
Showing 3 changed files with 280 additions and 323 deletions.
323 changes: 0 additions & 323 deletions docs/sed/dataset.rst
Original file line number Diff line number Diff line change
@@ -1,326 +1,3 @@
Dataset
===================================================

SED comes with the ability to download and extract any URL based
datasets. By default, user can the “WSe2”, “TaS2” and “Gd_W110” datasets
but easy to extend this list.

Getting datasets
------------------------

.. code:: python
import os
from sed.dataset import dataset
get()
^^^^^

The “get” just needs the data name, but another root_dir can be provided.

Try to interrupt the download process and restart to see that it continues the download from where it stopped

.. code:: python
dataset.get("WSe2", remove_zip = False)
.. parsed-literal::
Using default data path for "WSe2": "<user_path>/datasets/WSe2"
3%|▎ | 152M/5.73G [00:02<01:24, 71.3MB/s]
Using default data path for "WSe2": "<user_path>/datasets/WSe2"
100%|██████████| 5.73G/5.73G [01:09<00:00, 54.3MB/s]
Download complete.
Not providing “remove_zip” at all will by default delete the zip file after extraction

.. code:: python
dataset.get("WSe2")
Setting the “use_existing” keyword to False allows to download the data in another location. Default is to use existing data

.. code:: python
dataset.get("WSe2", root_dir = "new_datasets", use_existing=False)
.. parsed-literal::
Using specified data path for "WSe2": "<user_path>/new_datasets/datasets/WSe2"
Created new directory at <user_path>/new_datasets/datasets/WSe2
3%|▎ | 152M/5.73G [00:02<01:24, 71.3MB/s]
Interrupting extraction has similar behavior to download and just continues from where it stopped.

Or if user deletes the extracted documents, it re-extracts from zip file

.. code:: python
dataset.get("WSe2", remove_zip = False)
## Try to remove some files and rerun this command.
.. parsed-literal::
Using default data path for "WSe2": "<user_path>/datasets/WSe2"
WSe2 data is already fully downloaded.
5.73GB [00:00, 12.6MB/s]
Download complete.
Extracting WSe2 data...
100%|██████████| 113/113 [02:41<00:00, 1.43s/file]
WSe2 data extracted successfully.
remove()
^^^^^^^^

“remove” allows removal of some or all instances of existing data

This would remove only one of the two existing paths

.. code:: python
dataset.remove("WSe2", instance = dataset.existing_data_paths[0])
.. parsed-literal::
Removed <user_path>/datasets/WSe2
This removes all instances, if any present

.. code:: python
dataset.remove("WSe2")
.. parsed-literal::
WSe2 data is not present.
Attributes useful for user
^^^^^^^^^^^^^^^^^^^^^^^^^^

All available datasets after looking at module, user and folder levels

.. code:: python
dataset.available
.. parsed-literal::
['WSe2', 'TaS2', 'Gd_W110']
The dir and subdirs where data is located

.. code:: python
dataset.dir
.. parsed-literal::
'<user_path>/datasets/WSe2'
.. code:: python
dataset.subdirs
.. parsed-literal::
['<user_path>/datasets/WSe2/Scan049_1',
'<user_path>/datasets/WSe2/energycal_2019_01_08']
Existing locations where data is present

.. code:: python
dataset.existing_data_paths
.. parsed-literal::
['<user_path>/new_dataset/datasets/WSe2',
'<user_path>/datasets/WSe2']
Example of adding custom datasets
---------------------------------

DatasetsManager
^^^^^^^^^^^^^^^

Allows to add or remove datasets in json file at any level (module, user, folder).

Looks at all levels to give the available datasets

.. code:: python
import os
from sed.dataset import DatasetsManager
We add a new dataset to both folder and user levels

This dataset also has “rearrange_files” set to True, which takes all files in subfolders and puts them in the main dataset specific directory

.. code:: python
example_dset_name = "Example"
example_dset_info = {}
example_dset_info["url"] = "https://example-dataset.com/download" # not a real path
example_dset_info["subdirs"] = ["Example_subdir"]
example_dset_info["rearrange_files"] = True
DatasetsManager.add(data_name=example_dset_name, info=example_dset_info, levels=["folder", "user"])
.. parsed-literal::
Added Example dataset to folder datasets.json
Added Example dataset to user datasets.json
datasets.json should be available in execution folder after this

.. code:: python
assert os.path.exists("./datasets.json")
dataset.available
.. parsed-literal::
['Example', 'WSe2', 'TaS2', 'Gd_W110']
This will remove the Example dataset from the user json file

.. code:: python
DatasetsManager.remove(data_name=example_dset_name, levels=["user"])
.. parsed-literal::
Removed Example dataset from user datasets.json
Adding dataset that already exists will give an error. Likewise, removing one that doesn’t exist

.. code:: python
# This should give an error
DatasetsManager.add(data_name=example_dset_name, info=example_dset_info, levels=["folder"])
.. parsed-literal::
ValueError: Dataset Example already exists in folder datasets.json.
Now that dataset.json with Example exists in current dir, lets try to fetch it

.. code:: python
dataset.get("Example")
.. parsed-literal::
Using default data path for "Example": "<user_path>/datasets/Example"
Created new directory at <user_path>/datasets/Example
Download complete.
Extracting Example data...
100%|██████████| 4/4 [00:00<00:00, 28.10file/s]
Example data extracted successfully.
Removed Example.zip file.
Rearranging files in Example_subdir.
100%|██████████| 3/3 [00:00<00:00, 696.11file/s]
File movement complete.
Rearranging complete.
.. code:: python
print(dataset.dir)
print(dataset.subdirs)
.. parsed-literal::
<user_path>/datasets/Example
[]
lets download to another location

.. code:: python
dataset.get("Example", root_dir = "new_datasets", use_existing = False)
.. parsed-literal.. parsed-literal::
Using specified data path for "Example": "<user_path>/new_datasets/datasets/Example"
Created new directory at <user_path>/new_datasets/datasets/Example
Download complete.
Extracting Example data...
100%|██████████| 4/4 [00:00<00:00, 28.28file/s]
Example data extracted successfully.
Removed Example.zip file.
Rearranging files in Example_subdir.
100%|██████████| 3/3 [00:00<00:00, 546.16file/s]
File movement complete.
Rearranging complete.
we can remove one instance

.. code:: python
print(dataset.existing_data_paths)
path_to_remove = dataset.existing_data_paths[0]
.. parsed-literal::
['<user_path>/new_datasets/datasets/Example', '<user_path>/datasets/Example']
.. code:: python
dataset.remove(data_name="Example", instance=path_to_remove)
.. parsed-literal::
Removed <user_path>/new_datasets/datasets/Example
.. code:: python
assert not os.path.exists(path_to_remove)
.. code:: python
print(dataset.existing_data_paths)
.. parsed-literal::
['<user_path>/datasets/Example']
Default datasets.json
---------------------

.. literalinclude:: ../../src/sed/config/datasets.json
:language: json

API
------------------------
.. automodule:: sed.dataset.dataset
Expand Down
Loading

0 comments on commit 660a2b0

Please sign in to comment.