
Incorrect "slow" response message in Cache Configuration test #33190

Closed
H0llyW00dzZ opened this issue Jan 10, 2025 · 14 comments · Fixed by #33220

@H0llyW00dzZ

H0llyW00dzZ commented Jan 10, 2025

Description

When testing the Cache Configuration with Redis in the admin panel, the cache test keeps displaying the message: Cache test successful, but response is slow: 6.725281ms.

However, a response time of around 6ms should not be considered slow. The message seems to be misleading and may cause confusion for administrators.

Expected behavior:
The cache test should provide an accurate assessment of the response time and not label a response time of around 6ms as slow.

Gitea Version

v1.23.0

Can you reproduce the bug on the Gitea demo site?

No

Log Gist

No response

Screenshots

[Screenshot: 2025-01-10 09-34-01]

Git Version

2.47.1

Operating System

Linux

How are you running Gitea?

Running on Kubernetes

Database

SQLite

@wxiaoguang
Contributor

Related: Add cache test for admins #31265 @6543

You can safely ignore that message, and I still think that test should be removed from admin panel.

@H0llyW00dzZ
Author

H0llyW00dzZ commented Jan 10, 2025

Related: Add cache test for admins #31265 @6543

You can safely ignore that message, and I still think that test should be removed from admin panel.

@wxiaoguang It seems that the implementation in PR #31265 is incorrect. The test's slow threshold is set in microseconds (μs), which only makes sense for a purely local (in-memory) measurement. In real-world scenarios it should start from milliseconds (ms): any response time under 30ms, even if it resolves down to microseconds, can still be considered fast or low latency.

@TheFox0x7
Contributor

I can fix this, but the question remains how:

  1. Remove the test entirely per PR comments (Gitea won't start if the cache doesn't work so testing if it works is pointless)
  2. Replace it with pure information about the timing (so no warning).
  3. Change the pass threshold to 100ms so the test won't fail

Any other suggestions or ideas? I'd add a Prometheus histogram there, but then there's no way to get that information by default without enabling metrics and scraping (which would be an issue if switched to OTel). And self-querying the endpoint feels weird.
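
For context, a histogram of cache-operation latency in the metrics endpoint could look roughly like the sketch below. This is only a sketch assuming the standard prometheus/client_golang package; the package name, metric name, buckets, and helper function are illustrative and not part of Gitea's existing metrics.

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// cacheOpDuration is a hypothetical histogram of cache test round-trip times.
var cacheOpDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "gitea_cache_operation_duration_seconds", // illustrative name
	Help:    "Duration of cache test operations.",
	Buckets: prometheus.ExponentialBuckets(0.0001, 2, 12), // ~100µs up to ~200ms
})

func init() {
	prometheus.MustRegister(cacheOpDuration)
}

// ObserveCacheTest records one measured cache round trip in the histogram.
func ObserveCacheTest(elapsed time.Duration) {
	cacheOpDuration.Observe(elapsed.Seconds())
}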

@lunny
Member

lunny commented Jan 10, 2025

I prefer solution 3.

@H0llyW00dzZ
Author

I can fix this, but the question remains how:

  1. Remove the test entirely per PR comments (Gitea won't start if the cache doesn't work so testing if it works is pointless)
  2. Replace it with pure information about the timing (so no warning).
  3. Change the pass threshold to 100ms so the test won't fail

Any other suggestions or ideas? I'd add a Prometheus histogram there, but then there's no way to get that information by default without enabling metrics and scraping (which would be an issue if switched to OTel). And self-querying the endpoint feels weird.

@TheFox0x7
Nah, don't remove it. It's pretty useful for testing.

Another idea is to change the current implementation by adjusting the thresholds. For example, if the latency is under 30ms, it can still be considered fast or low latency. However, if the latency is around 50ms to 100ms, then it can be considered slow or high latency. This is because under 30ms is considered an average stable network latency.
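
As a rough sketch of that tiered idea (the function name is made up for illustration, and the cut-offs are just the 30ms/50ms figures mentioned above):

package cache

import "time"

// classifyCacheLatency is a hypothetical helper that maps a measured latency
// to a label using the tiers suggested above. Not part of Gitea's code.
func classifyCacheLatency(elapsed time.Duration) string {
	switch {
	case elapsed < 30*time.Millisecond:
		return "fast" // under 30ms: average stable network latency
	case elapsed < 50*time.Millisecond:
		return "moderate"
	default:
		return "slow" // around 50ms to 100ms and beyond: high latency
	}
}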

@wxiaoguang
Contributor

Removing it should be fine because:

  • Gitea instance already does cache check during startup, it doesn't make sense to "test" it again and again
  • For the latency, it should be handled by something like "tracing" and "profiling", but not just test once and report a delay.

@H0llyW00dzZ
Author

Removing it should be fine because:

  • Gitea instance already does cache check during startup, it doesn't make sense to "test" it again and again
  • For the latency, it should be handled by something like "tracing" and "profiling", but not just test once and report a delay.

Cache check during startup is different from cache testing repeatedly. When you press test again and again, the results will always vary, sometimes ranging from low to high latency.

For example:

[Screenshots: three repeated cache test runs showing latency varying between results]

This test provides a real-time check of the latency, which can be useful for monitoring the system's performance.

@wxiaoguang
Contributor

This test provides a real-time check of the latency, which can be useful for monitoring the system's performance.

But:

For the latency, it should be handled by something like "tracing" and "profiling", but not just test once and report a delay.

@H0llyW00dzZ
Author

This test provides a real-time check of the latency, which can be useful for monitoring the system's performance.

But:

For the latency, it should be handled by something like "tracing" and "profiling", but not just test once and report a delay.

No, you don't really need something like tracing and profiling in this case.

In PR #31265, the code is measuring latency, not just reporting a single delay. The Test() function performs a series of cache operations and measures the total time taken to complete them:

const (
	testCacheKey       = "DefaultCache.TestKey"
	SlowCacheThreshold = 100 * time.Microsecond
)

func Test() (time.Duration, error) {
	if defaultCache == nil {
		return 0, fmt.Errorf("default cache not initialized")
	}

	testData := fmt.Sprintf("%x", make([]byte, 500))

	start := time.Now()

	if err := defaultCache.Delete(testCacheKey); err != nil {
		return 0, fmt.Errorf("expect cache to delete data based on key if exist but got: %w", err)
	}
	if err := defaultCache.Put(testCacheKey, testData, 10); err != nil {
		return 0, fmt.Errorf("expect cache to store data but got: %w", err)
	}
	testVal, hit := defaultCache.Get(testCacheKey)
	if !hit {
		return 0, fmt.Errorf("expect cache hit but got none")
	}
	if testVal != testData {
		return 0, fmt.Errorf("expect cache to return same value as stored but got other")
	}

	return time.Since(start), nil
}

@wxiaoguang
Contributor

No, you don't really need something like tracing and profiling in this case.

In PR #31265, the code is measuring latency, not just reporting a single delay. The Test() function performs a series of cache operations and measures the total time taken to complete them:

Does it make sense? You run a test now and it shows "the cache is fast". Some time later there is a cache performance problem and users report that "the response is slow", so you go to the admin panel and click "test", but by then the cache has recovered and you still see "the cache is fast". Does it resolve any real-world problem?

@H0llyW00dzZ
Author

No, you don't really need something like tracing and profiling in this case.
In PR #31265, the code is measuring latency, not just reporting a single delay. The Test() function performs a series of cache operations and measures the total time taken to complete them:

Does it make sense? You run a test now and it shows "the cache is fast". Some time later there is a cache performance problem and users report that "the response is slow", so you go to the admin panel and click "test", but by then the cache has recovered and you still see "the cache is fast". Does it resolve any real-world problem?

Of course, if the response times are accurate. However, the message Cache test successful, but response is slow: 6.725281ms can be misleading.

A latency of 6ms is generally considered fast or low latency, and it's well within acceptable performance standards. Labeling a response time under 30ms as "slow" might not be appropriate in most cases, as it can create confusion.

@H0llyW00dzZ
Author

So the PR #31265 only needs to improve this part:

elapsed, err := cache.Test()
if err != nil {
    ctx.Data["CacheError"] = err
} else if elapsed > cache.SlowCacheThreshold {
    ctx.Data["CacheSlow"] = fmt.Sprint(elapsed)
}

Change the SlowCacheThreshold to:

const SlowCacheThreshold = 30 * time.Millisecond

Then it might look like this:

elapsed, err := cache.Test()
if err != nil {
    ctx.Data["CacheError"] = err
} else if elapsed > cache.SlowCacheThreshold {
    ctx.Data["CacheSlow"] = fmt.Sprintf("Cache response is slow: %v", elapsed)
} else {
    ctx.Data["CacheFast"] = fmt.Sprintf("Cache response is fast: %v", elapsed)
}

@TheFox0x7
Contributor

Does it make sense? You run a test now and it shows "the cache is fast". Some time later there is a cache performance problem and users report that "the response is slow", so you go to the admin panel and click "test", but by then the cache has recovered and you still see "the cache is fast". Does it resolve any real-world problem?

Of course, if the response times are accurate

But they aren't really. You measure a set of 3 operations in a one-off time measurement, and that's prone to outliers. To have it accurate, you'd need to measure the operations for a bit (say 10-100 times) and take an average, at which point you're just re-implementing metrics but without all the details and benefits of it.
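
As a rough sketch of what repeating the measurement could look like (purely illustrative; Test() is the existing function quoted above, while the averaging helper is hypothetical and assumed to live in the same package):

// averageTest runs the existing Test() function several times and returns the
// mean duration, smoothing out one-off outliers. Hypothetical helper only.
func averageTest(runs int) (time.Duration, error) {
	var total time.Duration
	for i := 0; i < runs; i++ {
		elapsed, err := Test()
		if err != nil {
			return 0, err
		}
		total += elapsed
	}
	return total / time.Duration(runs), nil
}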

@H0llyW00dzZ Maybe you can explain your use case better? Right now the longer I look at the feature the more I agree with @wxiaoguang that it's kind of pointless, so a real user scenario would show the usefulness of it better. In particular:

  • Do you believe the test is accurate when you use it?
  • How do you use it? once or refreshing the page/clicking the button multiple times?
  • Why wouldn't a histogram in the metrics endpoint cover this instead?

@H0llyW00dzZ
Author

Does it make sense? You run a test now and it shows "the cache is fast". Some time later there is a cache performance problem and users report that "the response is slow", so you go to the admin panel and click "test", but by then the cache has recovered and you still see "the cache is fast". Does it resolve any real-world problem?

Of course, if the response times are accurate

But they aren't really. You measure a set of 3 operations in a one-off time measurement, and that's prone to outliers. To have it accurate, you'd need to measure the operations for a bit (say 10-100 times) and take an average, at which point you're just re-implementing metrics but without all the details and benefits of it.

@H0llyW00dzZ Maybe you can explain your use case better? Right now the longer I look at the feature the more I agree with @wxiaoguang that it's kind of pointless, so a real user scenario would show the usefulness of it better. In particular:

  • Do you believe the test is accurate when you use it?
  • How do you use it? once or refreshing the page/clicking the button multiple times?
  • Why wouldn't a histogram in the metrics endpoint cover this instead?

  • Do you believe the test is accurate when you use it? Yes, because the code in the Test() function is correct for checking the latency of cache operations:

func Test() (time.Duration, error) {
	if defaultCache == nil {
		return 0, fmt.Errorf("default cache not initialized")
	}

	testData := fmt.Sprintf("%x", make([]byte, 500))

	start := time.Now()

	if err := defaultCache.Delete(testCacheKey); err != nil {
		return 0, fmt.Errorf("expect cache to delete data based on key if exist but got: %w", err)
	}
	if err := defaultCache.Put(testCacheKey, testData, 10); err != nil {
		return 0, fmt.Errorf("expect cache to store data but got: %w", err)
	}
	testVal, hit := defaultCache.Get(testCacheKey)
	if !hit {
		return 0, fmt.Errorf("expect cache hit but got none")
	}
	if testVal != testData {
		return 0, fmt.Errorf("expect cache to return same value as stored but got other")
	}

	return time.Since(start), nil
}

The function performs a series of cache operations (delete, put, and get) and measures the total time taken to complete them, providing an accurate measurement of the cache latency.

  • How do you use it? I only use it when there are many crawlers viewing the source code.

  • Why wouldn't a histogram in the metrics endpoint cover this instead? Do you mean the metrics endpoint from the Gitea configuration?

@lunny lunny closed this as completed in 8c6d707 Jan 11, 2025
GiteaBot pushed a commit to GiteaBot/gitea that referenced this issue Jan 11, 2025
Change SlowCacheThreshold to 30 milliseconds so it doesn't trigger on
non-memory cache

Closes: go-gitea#33190
Closes: go-gitea#32657
wxiaoguang pushed a commit that referenced this issue Jan 12, 2025
Backport #33220 by TheFox0x7

Change SlowCacheThreshold to 30 milliseconds so it doesn't trigger on
non-memory cache

Closes: #33190
Closes: #32657

Co-authored-by: TheFox0x7 <[email protected]>