For context, here is the all_gather variant of the SigLIP sigmoid loss that I tried; it is discussed further below.
```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# `global_manager` and `all_gather_objects` are project-specific distributed helpers
# (they expose rank / world_size and an all_gather over Python objects).


class SigLipLossAllGather(nn.Module):
    """Sigmoid Loss for Language Image Pre-Training (SigLIP) - https://arxiv.org/abs/2303.15343"""

    def __init__(
        self,
        logit_scale=np.log(10),  # used here as fixed constants; the paper learns t (init log 10) and applies exp(t)
        logit_bias=-10,
    ):
        super().__init__()
        self.logit_scale = logit_scale
        self.logit_bias = logit_bias
        self.labels = {}

    def get_ground_truth(self, device, dtype, num_logits, negative_only=False) -> torch.Tensor:
        # -1 everywhere, +1 on the diagonal (unless the chunk contains only negatives).
        labels = -torch.ones((num_logits, num_logits), device=device, dtype=dtype)
        if not negative_only:
            labels = 2 * torch.eye(num_logits, device=device, dtype=dtype) + labels
        return labels

    def get_logits(self, image_features, text_features, logit_scale, logit_bias=None):
        logits = logit_scale * image_features @ text_features.T
        if logit_bias is not None:
            logits += logit_bias
        return logits

    def _loss(self, image_features, text_features, logit_scale, logit_bias=None, negative_only=False):
        logits = self.get_logits(image_features, text_features, logit_scale, logit_bias)
        labels = self.get_ground_truth(
            image_features.device,
            image_features.dtype,
            image_features.shape[0],
            negative_only=negative_only,
        )
        loss = -F.logsigmoid(labels * logits).sum() / image_features.shape[0]
        return loss

    def forward(self, image_features, text_features):
        # Loss for the local block: positive pairs + local negatives.
        loss = self._loss(image_features, text_features, self.logit_scale, self.logit_bias)
        if global_manager.world_size > 1:
            # Gather text features from all ranks.
            text_features_dict = {'text_features': text_features, 'global_rank': global_manager.rank}
            all_text_features_dict = all_gather_objects(text_features_dict)
            # Compute loss against negative text features from all other ranks.
            for i in range(len(all_text_features_dict)):
                neigh_rank = all_text_features_dict[i]['global_rank']
                neigh_text_features = all_text_features_dict[i]['text_features']
                if neigh_rank != global_manager.rank:
                    loss += self._loss(image_features, neigh_text_features, self.logit_scale,
                                       self.logit_bias, negative_only=True)
        return {"loss": loss}
```
Thank you for the great work on the SigLIP paper. In particular, the gains in metrics at small batch sizes are impressive.
Sigmoid Loss for Language Image Pre-Training
I tried using this PyTorch-based distributed chunked implementation in the open_clip repo, which uses torch.distributed.P2POp:
https://github.com/mlfoundations/open_clip/blob/fc5a37b72d705f760ebbc7915b84729816ed471f/src/open_clip/loss.py#L307
I took a ViT-B vision encoder and an XLM-RoBERTa text encoder and trained them with both the CLIP softmax loss and the SigLIP sigmoid loss on an in-house dataset of 10M image-text pairs, at an effective batch size of 9k (on V100 GPUs). I observed that the CLIP softmax loss still performs better than the SigLIP sigmoid loss on the nDCG metric.
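For readers comparing the two objectives, here is a toy, self-contained contrast of what is being swapped; the batch size, random similarities, and the fixed -10 bias are illustrative only, not my training configuration:

```python
import torch
import torch.nn.functional as F

# Toy contrast of the two objectives on one in-batch (B, B) similarity matrix.
B = 4
logits = torch.randn(B, B)   # scaled image-text similarities; diagonal = matching pairs
targets = torch.arange(B)

# CLIP: symmetric softmax cross-entropy (image->text and text->image).
clip_loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# SigLIP: every pair is an independent binary problem, +1 on the diagonal, -1 elsewhere.
labels = 2 * torch.eye(B) - 1
siglip_loss = -F.logsigmoid(labels * (logits - 10.0)).sum() / B  # -10 is the paper's bias init
```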
I was wondering whether there is any error in the above P2POp-based implementation. I also tried an all_gather to collect the negative text_features from the other GPUs (the SigLipLossAllGather class at the top of this post), but the behavior seems to be the same.
The implementation in this repo appears to be the non-chunked version:
big_vision/big_vision/trainers/proj/image_text/siglip.py, lines 287 to 308 (at commit 46b2456)
I am wondering if you could add a chunked implementation of the SigLIP sigmoid loss.
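To make the request concrete, here is a rough sketch of the kind of chunked version I have in mind, written against plain torch.distributed point-to-point ops. This is only an illustration under a few assumptions: the process group is already initialized (NCCL, so tensors live on GPU), every rank holds an equally sized L2-normalized feature chunk, and gradients do not flow back through the exchanged chunks (which, if I read the open_clip code correctly, it handles with a custom autograd function). The function names are mine, not from any repo:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F


def sigmoid_loss(image_features, text_features, logit_scale, logit_bias, negative_only=False):
    # Pairwise sigmoid loss for one (local images) x (one text chunk) block.
    logits = logit_scale * image_features @ text_features.T + logit_bias
    labels = -torch.ones_like(logits)
    if not negative_only:
        labels = labels + 2 * torch.eye(logits.shape[0], device=logits.device, dtype=logits.dtype)
    return -F.logsigmoid(labels * logits).sum() / image_features.shape[0]


def chunked_siglip_loss(image_features, text_features, logit_scale, logit_bias):
    # Ring-style chunked loss: each rank holds at most one remote text chunk at a time
    # instead of gathering the full global batch of text features.
    rank, world_size = dist.get_rank(), dist.get_world_size()
    loss = sigmoid_loss(image_features, text_features, logit_scale, logit_bias)  # positives + local negatives
    text_chunk = text_features.contiguous()
    for _ in range(world_size - 1):
        # Send the current chunk to the next rank, receive one from the previous rank.
        recv_chunk = torch.empty_like(text_chunk)
        ops = [
            dist.P2POp(dist.isend, text_chunk, (rank + 1) % world_size),
            dist.P2POp(dist.irecv, recv_chunk, (rank - 1) % world_size),
        ]
        for req in dist.batch_isend_irecv(ops):
            req.wait()
        # Every received chunk contains only negatives for the local images.
        loss = loss + sigmoid_loss(image_features, recv_chunk, logit_scale, logit_bias, negative_only=True)
        text_chunk = recv_chunk
    return loss
```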