The performance of small batch sizes with MoE models is not very good in the CUDA backend. There is a fast GEMV implementation, but it only works with bs=1; as the MUL_MAT_ID benchmark results below show, throughput drops sharply as soon as n > 1:

  MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=256,n_used=8,b=1,m=2048,n=1,k=7168):                14058 runs -    73.25 us/run - 234.88 MFLOP/run -   3.21 TFLOPS
  MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=256,n_used=8,b=1,m=2048,n=2,k=7168):                 2343 runs -   432.78 us/run - 469.76 MFLOP/run -   1.09 TFLOPS
  MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=256,n_used=8,b=1,m=2048,n=3,k=7168):                 1420 runs -   711.23 us/run - 704.64 MFLOP/run - 990.74 GFLOPS
  MUL_MAT_ID(type_a=q4_0,type_b=f32,n_mats=256,n_used=8,b=1,m=2…
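The MFLOP/run and TFLOPS figures above follow directly from the shapes: each MUL_MAT_ID run routes the input through n_used of the n_mats expert matrices, and each selected expert performs an m×k by k×n matmul, so the work is roughly 2·m·n·k·n_used floating-point operations. A minimal sketch reproducing the first row's numbers (function names are illustrative, not from the benchmark code):

```python
# Sketch: recompute the MFLOP/run and TFLOPS figures reported for the
# MUL_MAT_ID rows above. Names here are illustrative, not llama.cpp APIs.

def mul_mat_id_flops(m: int, n: int, k: int, n_used: int) -> float:
    # Each of the n_used selected experts does an (m x k) @ (k x n) matmul:
    # 2 ops (multiply + add) per output element per k step.
    return 2.0 * m * n * k * n_used

def tflops(flops_per_run: float, us_per_run: float) -> float:
    # Convert FLOP per run and microseconds per run into TFLOPS.
    return flops_per_run / (us_per_run * 1e-6) / 1e12

# n=1 row: expect ~234.88 MFLOP/run and ~3.21 TFLOPS at 73.25 us/run.
f1 = mul_mat_id_flops(m=2048, n=1, k=7168, n_used=8)
print(f1 / 1e6, tflops(f1, 73.25))
```

Note that the FLOP count scales linearly with n while the runtime grows faster than linearly here (73.25 us at n=1 vs. 432.78 us at n=2), which is why the measured TFLOPS collapse once the fast bs=1 GEMV path no longer applies.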

Answer selected by tylike