Inquiry About "auraro" Model Detail #47
Comments
Hey @qqydss! Thank you for your very thorough questions. Just a quick message to let you know that we've seen this. :) We will get back to you shortly!
Great to hear that you've received my questions and will get back to me soon. Looking forward to your response. Thanks!
Hey @qqydss! Apologies for the delay in getting back to you. We made a big push to get a new version of the paper out on arXiv, which we're pretty thrilled about. Let me answer your questions in order.
I hope this answers all your questions. Please let me know if anything remains unclear. :)
Introduction:
I have been following the work on Microsoft's weather model "Aurora" and have carefully read through the paper and code. I am writing to seek clarification on some details of the experimental setup and model architecture. I would greatly appreciate your insights on the following questions:
1. When using dataset configuration C4 for pretraining, if the inputs come from different data sources, is it required that the corresponding future ground truth (GT) always comes from the ERA5 dataset? In other words, could there be inputs with the same timestamp that differ slightly but correspond to the same GT? If so, could this be considered a form of data augmentation, similar to distorting images in CV classification?
2. In the "Comparison with AI models at 0.25° resolution" section, Figure 4 shows token_num on the x-axis. Could you please explain how this number is calculated?
3. For the dataset labeled C3, whose ensemble mode data has only 3 pressure levels: when a batch retrieves ensemble mode data, does the corresponding future GT also have only 3 levels? If so, does it use the same weights for the latent level queries and the atmospheric keys and values shown in Figure 6 of the paper as when the input data has 13 pressure levels?
4. In Figure 4b, is the input for Aurora the "HRES Analysis" from HRES_T0 in 2022, and is the ground truth ERA5?
5. In the fine-tuning settings of Aurora-0.1°, is the GT ERA5?
6. In Figure 3b, is the input for Aurora the "HRES Analysis" from HRES-T0? As I understand it, HRES is initialized every 12 hours, so there are only two zero-lead-time fields per day (00/12). Is the evaluation in Figure 3b conducted every 12 hours?
7. In supplement B.7, formula (9), is x the raw data or the normalized data? Additionally, I plotted x_transformed against x and found that the mapping is not a monotonic bijection, so multiple values of x can map to the same x_transformed, which loses information. Has the impact of this on model performance been considered?
8. Could you please elaborate on the process of "embedding dependent on the pressure level" in supplement B.7? For example, how does the tensor shape change? Is this operation applied only to the pollution variables or also to U, V, T, Q, Z? Are the embeddings for U, V, T, Q, Z initialized from the weights of a 12-hour pretrained model, while those for the pollution variables are initialized from scratch?
9. In D.3 (CAMS 0.4° Analysis), how are the learning rates for the backbone and the perceiver decoder set?
10. In B.7, "Additional static variables" introduces two constant masks for the timestamp. However, in the code both the encoder and swin3d_backbone (AdaptiveLayerNorm) already use a Fourier encoding of the timestamp. Why reintroduce a timestamp mask in the input for pollution forecasting?
11. In model/film.py, AdaptiveLayerNorm initializes the weights and bias of self.ln_modulation to 0, meaning the shift and scale are 0 at the start of training, which makes the backbone almost equivalent to an identity mapping at the beginning. What is the rationale or empirical support behind this initialization method?
12. In the pollution forecasting experiments, the static variables (z, slt, lsm) are concatenated with the atmospheric variables instead of the surface variables. What benefit does this bring: a performance improvement or computational efficiency?
13. In the fine-tuning of Aurora-0.1°, when the patch size is increased from 4 to 10, is my understanding correct that 10×10 patches are interpolated to 4×4 patches before entering the embedding module, and that in the perceiver decoder stage these 4×4 patches are interpolated back to 10×10 before being unpatchified into the forecast field? If my understanding is incorrect, could you provide the correct procedure?
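To make question 11 concrete, here is a minimal NumPy sketch of what I understand the zero initialization to imply. It is not the code in model/film.py: the class name `AdaptiveLayerNormSketch`, the helper `residual_block`, the placement of the norm on the residual branch, and the choice to multiply by the scale directly (rather than by 1 + scale) are all my assumptions, made only to illustrate the question.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Plain layer norm over the last axis, without learned affine parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class AdaptiveLayerNormSketch:
    """Hypothetical stand-in for a context-modulated layer norm."""

    def __init__(self, dim, context_dim):
        # Zero initialization of the modulation layer, as described in question 11.
        self.weight = np.zeros((context_dim, 2 * dim))
        self.bias = np.zeros(2 * dim)

    def __call__(self, x, context):
        # x: (batch, tokens, dim), context: (batch, context_dim)
        modulation = context @ self.weight + self.bias  # (batch, 2 * dim)
        shift, scale = np.split(modulation[:, None, :], 2, axis=-1)
        # ASSUMPTION: the scale multiplies the normalized input directly.
        return layer_norm(x) * scale + shift


def residual_block(x, context, norm, branch):
    # ASSUMPTION: the norm is applied to the branch output before the residual add.
    return x + norm(branch(x), context)


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 8))
context = rng.normal(size=(2, 16))
norm = AdaptiveLayerNormSketch(dim=8, context_dim=16)
branch = lambda h: np.tanh(h) * 3.0  # arbitrary stand-in for attention or MLP

out = residual_block(x, context, norm, branch)
print(np.allclose(out, x))  # prints True
```

Under these assumptions the modulated branch contributes nothing at initialization, so each block passes its input through unchanged regardless of the context, which is the behavior I am asking about.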
Thank you very much for your time and consideration. I look forward to learning from your insights!