Video-to-video retrieval?

Hi authors!

Thanks for the great work! I saw that the paper is evaluated on all kinds of video-to-text datasets. The CLIP model itself works pretty well for image-to-image retrieval, even though it is trained on image-text pairs. Similarly, I wonder whether CLIP4Clip would also work for video-to-video retrieval?
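
To make the question concrete, here is a rough sketch of what I mean by video-to-video retrieval. It assumes the per-frame embeddings have already been extracted with the visual encoder, mean-pools them into one vector per video, and ranks gallery videos by cosine similarity to the query. The function names (`mean_pool`, `cosine`, `rank_videos`) and the toy 2-D vectors are hypothetical and only for illustration; they are not taken from this repo.

```python
import math


def mean_pool(frame_embeddings):
    """Average per-frame embedding vectors into one video-level vector."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[i] for frame in frame_embeddings) / n for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_videos(query_frames, gallery):
    """Return gallery video ids sorted from most to least similar to the query."""
    query_vec = mean_pool(query_frames)
    scores = {vid: cosine(query_vec, mean_pool(frames)) for vid, frames in gallery.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data standing in for real frame embeddings (assumed, not real model output).
query = [[1.0, 0.0], [0.8, 0.2]]
gallery = {
    "video_a": [[0.9, 0.1], [1.0, 0.0]],
    "video_b": [[0.0, 1.0], [0.1, 0.9]],
}
ranking = rank_videos(query, gallery)
print(ranking)
```

In other words, no text is involved at query time; both the query and the gallery go through the video branch only.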