Add D-FINE to KerasHub #2318
base: master
Conversation
The tests fail due to pending reviews on the HGNetV2 dependency. Once it is merged, the D-FINE PR will be open for review and the tests will pass, as demonstrated in the notebook.
/gemini review
Code Review
This pull request introduces the D-FINE model to KerasHub, including its architecture, layers, tests, and a checkpoint conversion script. The implementation is comprehensive and well-structured. I've provided a few suggestions to improve code clarity, maintainability, and correctness. Overall, this is a solid contribution.
@divyashreepathihalli @mattdangerw D-FINE is ready for its first round of reviews!
Thanks! Nice work. Just some initial comments.
In general, now that this is up and working, let's see where we can cut complexity. Anything we can do to save lines of code (without playing code golf) will probably help keep this maintainable in the future.
    def __init__(
        self,
        decoder_in_channels,
This is a pretty massive list of arguments.
First, let's try to whittle this down to the stuff we absolutely need. For some of the _scale/_temperature/etc. arguments, it might be fine to leave them as default args on the relevant component layers and not expose them if they won't vary in any of our presets (we can always add them later if someone asks!).
Second, we could consider taking in sub-models directly. We did this for SD3, which was facing a similar arg explosion. Maybe not for everything, but could consider this for the encoder/decoder and HGNetV2Backbone. Take a look at how SD3 does this https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/stable_diffusion_3/stable_diffusion_3_backbone.py#L641-L654. Anything that we take in as an argument must be exposed as a public symbol.
But overall, let's try to simplify this arg list as much as we can. We want a practical list of things users might actually change, not an exhaustive list of everything that could possibly change.
Agreed!
I've kept the "practical list from the user's perspective" approach you mentioned above in mind. I'll walk you through my thought process so you can evaluate the approach, since I'd love to know your thoughts.
As you suggested, HGNetV2Backbone can be passed as an argument, which I've implemented following the SD3 pattern you shared, including the serialization and deserialization paradigm. That said, I think the following makes it unnecessary to do the same for the encoder and the decoder:
a. I checked for invariants across the 13 KerasHub presets for D-FINE and found many low-level arguments that users wouldn't practically need to vary, such as the {}_prob, {}_scale, and {}_temperature args, which I've removed.
b. I've retained a select few arguments even though they are invariant across this set of presets, for example lqe_hidden_dim, because users may still want to configure them. Likewise, the denoising box_noise_scale and label_noise_ratio are seemingly invariant, but depending on a user's training scenario they might want to tune them, so I've kept them in.
This way we don't play code golf, and users don't have to instantiate the encoder and decoder separately, which keeps the API symmetric (otherwise it might feel inconsistent, e.g., why instantiate only the encoder manually?). At the same time, this reduces the argument count: we trim roughly 30 arguments just by doing this.
I also rechecked for numerics consistency, inference, preset loading, etc., and everything is maintained, so we're good.
I'd love to know your thoughts on the same, thanks!
    # Metadata for loading pretrained model weights.
    backbone_presets = {
probably easiest to leave this empty and only add once we have the actual kaggle handles
I’ve done it, but may I ask why we can’t have the presets with placeholders for the Kaggle handles just yet? Thanks!
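For context, the reviewer's suggestion amounts to shipping an empty registry until real handles exist. A minimal sketch of what that looks like; the commented-out entry is hypothetical (the preset name and handle are placeholders, not real Kaggle paths):

```python
# Metadata for loading pretrained model weights.
# Kept empty until the weights are actually uploaded to Kaggle;
# a placeholder handle would fail at `from_preset()` load time,
# which is presumably why real handles are wanted first.
backbone_presets = {}

# Hypothetical shape of an entry once a real handle exists:
# backbone_presets = {
#     "dfine_large_coco": {
#         "metadata": {"description": "D-FINE-L trained on COCO."},
#         "kaggle_handle": "kaggle://<org>/<model>/keras/<preset>/<version>",
#     },
# }
```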
Thanks for the reviews @mattdangerw. Yes, let's definitely cut down the complexity wherever possible for maintainability; I'll look into it!
@mattdangerw Could you please check whether all your comments have been addressed when you have the time? Thanks a lot!
Description of the change
Welcome D-FINE to the KerasHub family of models!
D-FINE, a powerful real-time object detector, sets a new state-of-the-art benchmark for object detection on KerasHub. It achieves outstanding localization precision by redefining the bounding box regression task in DETR models. Additionally, it incorporates lightweight optimizations in computationally intensive modules and operations, striking a better balance between speed and accuracy. Specifically, D-FINE-L/X achieves 54.0%/55.8% AP on the COCO dataset at 124/78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L/X attains 57.1%/59.3% AP, surpassing all existing real-time detectors.
Closes the second half of, and thus completes, issue #2271.
KerasHub's D-FINE in action
Colab Notebook
D-FINE: Complete Workflow with Predictions and Numerics Matching
Checklist