precommit

lhparker1 · lhparker1 · commit 4cb27d5a9863 · 2024-06-07T15:21:02.000-04:00
diff --git a/README.md b/README.md
@@ -15,7 +15,7 @@ pip install --upgrade eventlet torch lightning[extra]
 pip install -e .
 ```
 
-The package expects to load models and data by default from 
+The package expects to load models and data by default from
 ```bash
 {ASTROCLIP_ROOT}
 ```
@@ -31,7 +31,7 @@ If no environment is specified, the default path at Flatiron will be assumed.
 
 ## Pretrained Models
 
-We provide the pretrained AstroCLIP model on the Huggingface model hub for easy access. Additionally, we provide the pretrained single-modal models for galaxy images and spectra as well. Model details, checkpoints, configs and logs are below.  
+We provide the pretrained AstroCLIP model on the Huggingface model hub for easy access. Additionally, we provide the pretrained single-modal models for galaxy images and spectra as well. Model details, checkpoints, configs and logs are below.
 
 <table>
   <tr>
@@ -154,7 +154,7 @@ Below, we include a high-level performance overview of our models on a variety o
   <tr>
 </table>
 
-We report R-squared metrics on redshift and galaxy property estimation (averaged across all properties) and accuracy on galaxy morphology classification (averaged across all labels). Our models are marked with an asterisk (*). 
+We report R-squared metrics on redshift and galaxy property estimation (averaged across all properties) and accuracy on galaxy morphology classification (averaged across all labels). Our models are marked with an asterisk (*).
 
 ## Data Access
 
@@ -184,7 +184,7 @@ The directory is organized into south and north surveys, where each survey is sp
 
 AstroCLIP is trained using a two-step process:
 
-1. We pre-train a single-modal galaxy image encoder and a single-modal galaxy spectrum encoder separately. 
+1. We pre-train a single-modal galaxy image encoder and a single-modal galaxy spectrum encoder separately.
 2. We CLIP-align these two encoders on a paired image-spectrum dataset.
 
 ### Single-Modal Pretraining
@@ -196,10 +196,10 @@ Model training can be launched with the following command:
 ```
 image_trainer -c astroclip/astrodino/config.yaml
 ```
-We train the model using 20 A100 GPUs (on 5 nodes) for 250k steps which takes roughly 46 hours. 
+We train the model using 20 A100 GPUs (on 5 nodes) for 250k steps which takes roughly 46 hours.
 
 #### Spectrum Pretraining - Masked Modelling Transformer:
-AstroCLIP uses a 1D Transformer to encode galaxy spectra. Pretraining is performed using a masked-modeling objective, whereby the 1D spectrum is split into contiguous, overlapping patches. 
+AstroCLIP uses a 1D Transformer to encode galaxy spectra. Pretraining is performed using a masked-modeling objective, whereby the 1D spectrum is split into contiguous, overlapping patches.
 
 Model training can be launched with the following command:
 ```
@@ -213,17 +213,17 @@ Once pretrained, we align the image and spectrum encoder using cross-attention p
 ```
 spectrum_trainer fit -c config/astroclip.yaml
 ```
-We train the model using 4 A100 GPUs (on 1 node) for 25k steps or until the validation loss does not increase for a fixed number of steps. This takes roughly 12 hours. 
+We train the model using 4 A100 GPUs (on 1 node) for 25k steps or until the validation loss does not increase for a fixed number of steps. This takes roughly 12 hours.
 
 ## Downstream Tasks
 
 TODO
 
 ## Acknowledgements
-This reposity uses datasets and contrastive augmentations from [Stein, et al. (2022)](https://github.com/georgestein/ssl-legacysurvey/tree/main). The image pretraining is built on top of the [DINOv2](https://github.com/facebookresearch/dinov2/) framework; we also thank Piotr Bojanowski for valuable conversations around image pretraining. 
+This reposity uses datasets and contrastive augmentations from [Stein, et al. (2022)](https://github.com/georgestein/ssl-legacysurvey/tree/main). The image pretraining is built on top of the [DINOv2](https://github.com/facebookresearch/dinov2/) framework; we also thank Piotr Bojanowski for valuable conversations around image pretraining.
 
 ## License
 AstroCLIP code and model weights are released under the MIT license. See [LICENSE](https://github.com/PolymathicAI/AstroCLIP/blob/main/LICENSE) for additional details.
 
 ## Citations
-TODO
+TODO
diff --git a/astroclip/data/crossmatch_scripts/README.md b/astroclip/data/crossmatch_scripts/README.md
@@ -2,12 +2,11 @@
 The following scripts are used to generate the datasets used in the paper:
 
 ```python
-- `cross_match_data.py`: Finds spectra for objects in the Legacy Survey 
+- `cross_match_data.py`: Finds spectra for objects in the Legacy Survey
 data prepared by George Stein (https://github.com/georgestein/ssl-legacysurvey/tree/main)
 
 - `export_data.py`: Exports the combination of images and spectra into
 a single HDF5 file.
 ```
 
 In principle you should not need to run these scripts, as the datasets are already provided by the resulting HuggingFace datasets. However, these scripts are provided for reproducibility purposes.
-
diff --git a/astroclip/data/crossmatch_scripts/cross_match_data.py b/astroclip/data/crossmatch_scripts/cross_match_data.py
@@ -1,70 +1,91 @@
-import numpy as np
+import glob
+
 import h5py
+import numpy as np
+import pandas as pd
 from astropy.table import Table, join, vstack
-from dl import authClient as ac, queryClient as qc
+from dl import authClient as ac
+from dl import queryClient as qc
 from sparcl.client import SparclClient
 from tqdm import tqdm
-import pandas as pd
-import glob
 
-DATA_DIR='/mnt/home/flanusse/ceph'
+DATA_DIR = "/mnt/home/flanusse/ceph"
 
 client = SparclClient()
-inc = ['specid', 'redshift', 'flux', 'ra', 'dec',
-       'wavelength', 'spectype', 'specprimary',
-       'survey', 'program', 'targetid', 'coadd_fiberstatus']
+inc = [
+    "specid",
+    "redshift",
+    "flux",
+    "ra",
+    "dec",
+    "wavelength",
+    "spectype",
+    "specprimary",
+    "survey",
+    "program",
+    "targetid",
+    "coadd_fiberstatus",
+]
 
 
 print("Retrieving all objects in the DESI data release...")
 query = """
 SELECT phot.targetid, phot.brickid, phot.brick_objid, phot.release, zpix.healpix
-FROM desi_edr.photometry AS phot 
+FROM desi_edr.photometry AS phot
 INNER JOIN desi_edr.zpix ON phot.targetid = zpix.targetid
 WHERE (zpix.coadd_fiberstatus = 0 AND zpix.sv_primary)
 """
-cat = qc.query(sql = query, fmt = 'table')
+cat = qc.query(sql=query, fmt="table")
 print("done")
 # Building search key based on brick ids
-cat['key'] = ['%d_%d_%d'%(cat['release'][i], cat['brickid'][i], cat['brick_objid'][i]) for i in range(len(cat))]
+cat["key"] = [
+    "%d_%d_%d" % (cat["release"][i], cat["brickid"][i], cat["brick_objid"][i])
+    for i in range(len(cat))
+]
 
 merged_cat = None
 
 # Looping over the downloaded image files
-for file in tqdm(glob.glob(DATA_DIR+'/*.h5')):
+for file in tqdm(glob.glob(DATA_DIR + "/*.h5")):
     try:
         with h5py.File(file) as d:
-            # search key 
-            d_key = np.array(['%d_%d_%d'%(d['release'][i], d['brickid'][i], d['objid'][i]) for i in range(len(d['brickid']))])
-            t = Table(data=[d['inds'][:], d_key], names=['inds', 'key'])
+            # search key
+            d_key = np.array(
+                [
+                    "%d_%d_%d" % (d["release"][i], d["brickid"][i], d["objid"][i])
+                    for i in range(len(d["brickid"]))
+                ]
+            )
+            t = Table(data=[d["inds"][:], d_key], names=["inds", "key"])
     except:
         continue
-    file_cat = join(cat, t, keys=['key'])
-    file_cat['image_file'] = file
-    file_cat.sort('healpix')
-    
+    file_cat = join(cat, t, keys=["key"])
+    file_cat["image_file"] = file
+    file_cat.sort("healpix")
+
     # Retrieving spectra associated with this file
-    target_ids = [int(i) for i in file_cat['targetid']]
+    target_ids = [int(i) for i in file_cat["targetid"]]
     records = None
-    for i in tqdm(range(len(target_ids)//500 + 1)):
-        start = i*500
-        end = min((i+1)*500, len(target_ids)-1)
+    for i in tqdm(range(len(target_ids) // 500 + 1)):
+        start = i * 500
+        end = min((i + 1) * 500, len(target_ids) - 1)
 
-        res = client.retrieve_by_specid(specid_list = target_ids[start:end],
-                                    include = inc,
-                                    dataset_list = ['DESI-EDR'])
+        res = client.retrieve_by_specid(
+            specid_list=target_ids[start:end], include=inc, dataset_list=["DESI-EDR"]
+        )
         if records is None:
             records = Table.from_pandas(pd.DataFrame.from_records(res.records))
         else:
             r = Table.from_pandas(pd.DataFrame.from_records(res.records))
             records = vstack([records, r])
 
     # Merging catalogs
-    file_cat = join(file_cat, records, keys=['targetid'])
-     
+    file_cat = join(file_cat, records, keys=["targetid"])
+
     if merged_cat is None:
         merged_cat = file_cat
     else:
         merged_cat = vstack([merged_cat, file_cat])
-        
+
     # Saving the results
-    merged_cat.to_pandas().to_parquet('matched_catalog.pq')
+    merged_cat.to_pandas().to_parquet("matched_catalog.pq")
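
Note on the batching loop in `cross_match_data.py`, whose logic this formatting commit leaves unchanged: Python slices exclude the end index, so `end = min((i + 1) * 500, len(target_ids) - 1)` leaves the last target ID out of the final batch. When the list length is an exact multiple of 500, the loop also issues one extra request with an empty list. Below is a minimal sketch of a chunking helper that avoids both problems. The name `chunked` is hypothetical and not part of the repository, and the IDs are illustrative.

```python
def chunked(items, size=500):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        # Slicing past the end of a list is safe in Python, so no clamp is needed.
        yield items[start : start + size]


# Illustrative IDs only; the real script takes them from the joined catalog.
target_ids = list(range(1000))
batches = list(chunked(target_ids))
```

Each batch could then be passed as `specid_list` in place of `target_ids[start:end]`.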
diff --git a/astroclip/data/crossmatch_scripts/export_data.py b/astroclip/data/crossmatch_scripts/export_data.py
@@ -1,36 +1,46 @@
 import h5py
-from astropy.table import Table, join
 import numpy as np
 import pandas as pd
+from astropy.table import Table, join
 from tqdm import tqdm
 
-DATA_DIR='/mnt/home/flanusse/ceph'
+DATA_DIR = "/mnt/home/flanusse/ceph"
 
 # Open matched catalog
-joint_cat = pd.read_parquet(DATA_DIR+'/matched_catalog.pq').drop_duplicates(subset=["key"])
+joint_cat = pd.read_parquet(DATA_DIR + "/matched_catalog.pq").drop_duplicates(
+    subset=["key"]
+)
 
 # Create randomized indices to shuffle the dataset
 rng = np.random.default_rng(seed=42)
 indices = rng.permutation(len(joint_cat))
 joint_cat = joint_cat.iloc[indices]
 
-with h5py.File(DATA_DIR+'/exported_data.h5', 'w') as f:
+with h5py.File(DATA_DIR + "/exported_data.h5", "w") as f:
     for i in range(10):
-        print("Processing file %d"%i)
+        print("Processing file %d" % i)
         # Considering only the objects that are in the current file
-        sub_cat = joint_cat[joint_cat['inds'] // 1000000 == i]
+        sub_cat = joint_cat[joint_cat["inds"] // 1000000 == i]
         images = []
         spectra = []
         redshifts = []
         targetids = []
-        with h5py.File(DATA_DIR+'/images_npix152_0%02d000000_0%02d000000.h5'%(i,i+1)) as d:
+        with h5py.File(
+            DATA_DIR + "/images_npix152_0%02d000000_0%02d000000.h5" % (i, i + 1)
+        ) as d:
             for j in tqdm(range(len(sub_cat))):
-                images.append(np.array(d['images'][sub_cat['inds'].iloc[j] % 1000000]).T.astype('float32'))
-                spectra.append(np.reshape(sub_cat['flux'].iloc[j], [-1, 1]).astype('float32'))
-                redshifts.append(sub_cat['redshift'].iloc[j])
-                targetids.append(sub_cat['targetid'].iloc[j])
+                images.append(
+                    np.array(d["images"][sub_cat["inds"].iloc[j] % 1000000]).T.astype(
+                        "float32"
+                    )
+                )
+                spectra.append(
+                    np.reshape(sub_cat["flux"].iloc[j], [-1, 1]).astype("float32")
+                )
+                redshifts.append(sub_cat["redshift"].iloc[j])
+                targetids.append(sub_cat["targetid"].iloc[j])
         f.create_group(str(i))
-        f[str(i)].create_dataset('images', data=images)
-        f[str(i)].create_dataset('spectra', data=spectra)
-        f[str(i)].create_dataset('redshifts', data=redshifts)
-        f[str(i)].create_dataset('targetids', data=targetids)
+        f[str(i)].create_dataset("images", data=images)
+        f[str(i)].create_dataset("spectra", data=spectra)
+        f[str(i)].create_dataset("redshifts", data=redshifts)
+        f[str(i)].create_dataset("targetids", data=targetids)
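
For readers following the index arithmetic in `export_data.py`: the script treats `inds` as a global image index, using `inds // 1000000` to select the source image file and `inds % 1000000` to select the row inside it. The sketch below restates that mapping as a standalone function. The name `locate` and the constant `ROWS_PER_FILE` are hypothetical, introduced only for illustration; the file name pattern is copied from the script.

```python
ROWS_PER_FILE = 1_000_000  # matches the `// 1000000` and `% 1000000` in export_data.py


def locate(ind):
    """Map a global image index to (source file name, row within that file)."""
    file_no = ind // ROWS_PER_FILE
    row = ind % ROWS_PER_FILE
    # Same pattern the script uses to open each image file.
    name = "images_npix152_0%02d000000_0%02d000000.h5" % (file_no, file_no + 1)
    return name, row


name, row = locate(3_000_042)
```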
diff --git a/downstream_tasks/similarity_search/README.md b/downstream_tasks/similarity_search/README.md
@@ -1,5 +1,5 @@
 ## In-Modal and Cross-Modal Retrieval
-AstroCLIP enables researchers to easily find similar galaxies to a query galaxy by simply exploiting the cosine similarity between galaxy embeddings in embedding space. Because AstroCLIP's embedding space is shared between both galaxy images and optical spectra, retrieval can be performed for both in-modal and cross-modal similarity searches. 
+AstroCLIP enables researchers to easily find similar galaxies to a query galaxy by simply exploiting the cosine similarity between galaxy embeddings in embedding space. Because AstroCLIP's embedding space is shared between both galaxy images and optical spectra, retrieval can be performed for both in-modal and cross-modal similarity searches.
 
 ### Embedding the dataset
 To perform retrieval on the held-out validation set, it is important to first generate AstroCLIP embeddings of the galaxy images and spectra. We provide the already-embedded held-out validation set here:
@@ -12,4 +12,4 @@ python embed_astroclip.py [save_path]
 ```
 
 ### Similarity Search
-Once embedded, the ```similarity_search.ipynb``` jupyter notebook contains a brief tutorial that demonstrates the retrieval abilities of the model.
+Once embedded, the ```similarity_search.ipynb``` jupyter notebook contains a brief tutorial that demonstrates the retrieval abilities of the model.
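
As a rough illustration of the retrieval idea described in this README (ranking galaxies by cosine similarity in the shared embedding space), here is a minimal pure-Python sketch. It is not the notebook's implementation: the function names `cosine_similarity` and `top_k` are hypothetical, and the toy 3-dimensional vectors stand in for real AstroCLIP embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, embeddings, k=2):
    """Return the indices of the k embeddings most similar to `query`, best first."""
    scores = [cosine_similarity(query, e) for e in embeddings]
    return sorted(range(len(embeddings)), key=lambda i: scores[i], reverse=True)[:k]


# Toy vectors standing in for image embeddings and a spectrum query embedding.
image_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
spectrum_query = [0.9, 0.1, 0.0]
nearest = top_k(spectrum_query, image_embeddings)
```

Because both modalities share one embedding space, the same ranking works whether the query and the candidates come from the same modality or from different ones.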
diff --git a/downstream_tasks/similarity_search/similarity_search.ipynb b/downstream_tasks/similarity_search/similarity_search.ipynb