MatMul op results mismatch (NPU/Numpy) #145

Open
kballeda opened this issue Dec 8, 2024 · 2 comments

Comments


kballeda commented Dec 8, 2024

I attempted to compare the performance of NPU and NumPy-based dot-product computations using float16, but the results from the NPU and the CPU (NumPy) do not match. I am using the NPU on ARL.

To Reproduce
Steps to reproduce the behavior:

  1. Copy the code below and run it with python <filename.py>
from intel_npu_acceleration_library.backend import MatMul
import numpy as np
import time

inC = 8
outC = 8
batch = 1

X1 = np.random.uniform(-1, 1, (batch, inC)).astype(np.float16)
X2 = np.random.uniform(-1, 1, (outC, inC)).astype(np.float16)

mm = MatMul(inC, outC, batch, profile=False)

start_time = time.perf_counter()
result = mm.run(X1,X2)
end_time = time.perf_counter()
print(f"Intel NPU Acceleration Library Time: {end_time - start_time} * 1000:.6f millisecs")

start_time = time.perf_counter()
np_res = np.dot(X1, X2)
end_time = time.perf_counter()
print(f"Numpy Library Time: {end_time - start_time} * 1000:.6f millisecs")

print("NPU Result: ", result)
print("Numpy Result:", np_res)

Expected behavior
The NPU and NumPy results should match; instead, the outputs differ on every run:

>python matmul.py
Intel NPU Acceleration Library Time: 0.177700 ms
Numpy Library Time: 0.015700 ms
NPU Result:  [[ 1.178  -0.572   1.957  -0.4443 -0.549   0.7744 -0.2756 -0.997 ]]
Numpy Result: [[-0.1885    1.905    -1.779    -0.8945    0.866    -0.9365    1.248
  -0.006233]]

>python matmul.py
Intel NPU Acceleration Library Time: 0.189100 ms
Numpy Library Time: 0.016700 ms
NPU Result:  [[-0.3157  2.506   0.3499  0.631  -0.1031  0.5913 -1.599   1.001 ]]
Numpy Result: [[-1.137    0.09534  0.12366 -1.748    0.981    0.2004  -0.2607  -2.086  ]]

>python matmul.py
Intel NPU Acceleration Library Time: 0.177000 ms
Numpy Library Time: 0.013300 ms
NPU Result:  [[ 0.7153 -1.251   0.2106 -0.409  -0.4336  0.2329  1.653  -1.58  ]]
Numpy Result: [[-1.764   1.005  -0.837  -1.518   0.8794  0.427   0.1887 -1.153 ]]

Desktop (please complete the following information):

  • OS: Win11Enterprise
alessandropalla (Contributor) commented

Hi, mm.run(X1, X2) is equivalent to np.dot(X1, X2.T); if you use that, the math checks out.
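For a quick sanity check, here is a minimal sketch reusing the reporter's 8x8 setup; the rtol/atol tolerances are illustrative assumptions chosen loosely for float16 rounding:

from intel_npu_acceleration_library.backend import MatMul
import numpy as np

inC, outC, batch = 8, 8, 1
X1 = np.random.uniform(-1, 1, (batch, inC)).astype(np.float16)
X2 = np.random.uniform(-1, 1, (outC, inC)).astype(np.float16)

mm = MatMul(inC, outC, batch, profile=False)
npu_result = mm.run(X1, X2)

# mm.run(X1, X2) computes X1 @ X2.T, so compare against the transposed product.
reference = np.dot(X1, X2.T)
# Tolerances are assumptions, loose enough to absorb float16 rounding differences.
print("Match:", np.allclose(npu_result, reference, rtol=1e-2, atol=1e-3))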

Also, external AI accelerators (like GPUs and NPUs) are more effective when offloading large operations; you won't see a speedup from multiplying an 8x8 matrix by a vector (as explained very nicely here).

Here is an example with a medium-sized matrix-matrix operation that shows a significant speedup:

from intel_npu_acceleration_library.backend import MatMul
import numpy as np
import time

inC = 1024
outC = 1024
batch = 256

X1 = np.random.uniform(-1, 1, (batch, inC)).astype(np.float16)
X2 = np.random.uniform(-1, 1, (outC, inC)).astype(np.float16)

mm = MatMul(inC, outC, batch, profile=False)

start_time = time.perf_counter()
result = mm.run(X1,X2)
end_time = time.perf_counter()
print(f"Intel NPU Acceleration Library Time: {(end_time - start_time) * 1000:.6f} ms")

start_time = time.perf_counter()
np_res = np.dot(X1, X2.T)
end_time = time.perf_counter()
print(f"Numpy Library Time: {(end_time - start_time) * 1000:.6f} ms")

print("NPU Result: ", result)
print("Numpy Result:", np_res)

Attached is some code you can use for reference; it returns the following on an ARL machine:

Intel NPU Acceleration Library Time: 0.831700 ms
Numpy Library Time: 2027.385200 ms
NPU Result:  [[ 18.16   -12.62     6.703  ...   8.24     4.71    -5.74  ]
 [  5.82   -15.77   -11.47   ...  -6.914   11.86    -0.8833]
 [  3.604   -7.562    6.15   ...  14.19     8.88    13.63  ]
 ...
 [ -7.438   -8.72     0.1948 ...   5.684   -1.962   -0.7773]
 [  4.727   19.52   -13.34   ...   4.973   -4.89    18.12  ]
 [ -8.625    4.54     7.22   ... -11.734   14.914  -19.64  ]]
Numpy Result: [[ 18.16   -12.62     6.703  ...   8.24     4.71    -5.74  ]
 [  5.82   -15.77   -11.47   ...  -6.914   11.86    -0.8833]
 [  3.604   -7.562    6.15   ...  14.19     8.88    13.63  ]
 ...
 [ -7.438   -8.72     0.1948 ...   5.684   -1.962   -0.7773]
 [  4.727   19.52   -13.34   ...   4.973   -4.89    18.12  ]
 [ -8.625    4.54     7.22   ... -11.734   14.914  -19.64  ]]

Also, consider that the first time you compile, the timing results are skewed by first-inference latency.
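A timing sketch continuing the snippet above (the single warm-up call and the repeat count of 10 are illustrative assumptions) that excludes that first-inference cost:

import time

mm.run(X1, X2)  # warm-up: triggers compilation and the first inference

runs = 10
start_time = time.perf_counter()
for _ in range(runs):
    mm.run(X1, X2)
end_time = time.perf_counter()
print(f"Avg NPU time over {runs} runs: {(end_time - start_time) / runs * 1000:.6f} ms")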


kballeda commented Dec 9, 2024

Thank you, I will check this and confirm on my end.
