-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve read timeout failing #9888
base: master
Are you sure you want to change the base?
Improve read timeout failing #9888
Conversation
👋 Hello! Thanks for contributing to our project. You can see the progress at the end of this page and at https://github.com/uyuni-project/uyuni/pull/9888/checks If you are unsure the failing tests are related to your code, you can check the "reference jobs". These are jobs that run on a scheduled time with code from master. If they fail for the same reason as your build, it means the tests or the infrastructure are broken. If they do not fail, but yours do, it means it is related to your code. Reference tests: KNOWN ISSUES Sometimes the build can fail when pulling new jar files from download.opensuse.org . This is a known limitation. Given this happens rarely, when it does, all you need to do is rerun the test. Sorry for the inconvenience. For more tips on troubleshooting, see the troubleshooting guide. Happy hacking! |
9bfbfcd
to
da7df49
Compare
da7df49
to
e45bc8c
Compare
e45bc8c
to
135e5a0
Compare
@@ -73,6 +74,10 @@ | |||
$beta_enabled = (ENV.fetch('BETA_ENABLED', 'False') == 'True') | |||
$api_protocol = ENV.fetch('API_PROTOCOL', nil) if ENV['API_PROTOCOL'] # force the API protocol to be used. You can use 'http' or 'xmlrpc' | |||
|
|||
# Define a global counter to track consecutive ReadTimeout errors | |||
$timeout_failure_count = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better if we set this counter using the add_context('timeout_failure_counter', 0)
and get_context('timeout_failure_counter')
@@ -73,6 +74,10 @@ | |||
$beta_enabled = (ENV.fetch('BETA_ENABLED', 'False') == 'True') | |||
$api_protocol = ENV.fetch('API_PROTOCOL', nil) if ENV['API_PROTOCOL'] # force the API protocol to be used. You can use 'http' or 'xmlrpc' | |||
|
|||
# Define a global counter to track consecutive ReadTimeout errors | |||
$timeout_failure_count = 0 | |||
$timeout_threshold = 10 # Set the threshold for stopping the suite |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value does not aim to change during the whole cucumber execution, so I would better define it as a constant than a global variable. Also, if we have it as a global variable, please move it up to the block of lines 45-59
# Dump feature code coverage into a Redis DB | ||
After do |scenario| | ||
next unless $code_coverage_mode | ||
next unless $feature_path != scenario.location.file | ||
|
||
process_code_coverage | ||
$feature_path = scenario.location.file | ||
end | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not mix different responsibilities please.
Instead of having all mixed in a unique After method, I think it's better to have it separated in multiple After methods each of them responsible of one thing.
That way, even if one of them fails, others can still pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having two after do |scenario|
seem to create an unstable behavior. I'm not sure having two after do |scenario|
and two before do |scenario|
is recommended. How do you know which one will be executed before an other. In the case of my fast failing, this duplication is creating a hanging process between two features.
# Method to check server response time (could be using an HTTP request or ping) | ||
def log_server_response_time | ||
begin | ||
start_time = Time.now | ||
uri = URI.parse("https://#{ENV.fetch('SERVER', nil)}") | ||
Net::HTTP.get_response(uri) | ||
response_time = Time.now - start_time | ||
log "Server response time: #{response_time} seconds" | ||
rescue => e | ||
warn "Error checking server response time: #{e.message}" | ||
end | ||
end | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function seems to fit better on features/support/system_monitoring.rb
feature_path = scenario.location.file | ||
$feature_filename = feature_path.split(%r{(\.feature|/)})[-2] | ||
next if get_context('user_created') == true | ||
|
||
# Create own user based on feature filename. Exclude core, reposync, finishing and build_validation features. | ||
unless feature_path.match?(/core|reposync|finishing|build_validation/) | ||
step %(I create a user with name "#{$feature_filename}" and password "linux") | ||
add_context('user_created', true) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are duplicating these steps in another Before method, please let's remove this from this Before method. We better have separated Before methods each of them having a single responsibility, it gonna be more readable and easier to maintain.
@@ -130,26 +135,80 @@ def capybara_register_driver | |||
# Define the current feature scope | |||
Before do |scenario| | |||
$feature_scope = scenario.location.file.split(%r{(\.feature|/)})[-2] | |||
if $timeout_failure_count >= $timeout_threshold | |||
warn "Timeout failure threshold reached, stopping test suite." | |||
exit(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about this https://cucumber.io/docs/cucumber/api/ ?
exit(1) | |
warn 'Timeout failure threshold reached, stopping test suite.' | |
Cucumber.wants_to_quit = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cucumber.wants_to_quit = true
seems to do nothing.
if scenario.failed? | ||
begin | ||
if web_session_is_active? | ||
handle_screenshot_and_relog(scenario, current_epoch) | ||
else | ||
warn 'There is no active web session; unable to take a screenshot or relog.' | ||
if scenario.exception.is_a?(Net::ReadTimeout) | ||
$timeout_failure_count += 1 | ||
warn "Net::ReadTimeout detected! Consecutive ReadTimeouts: #{$timeout_failure_count}" | ||
log_server_response_time | ||
else | ||
begin | ||
if web_session_is_active? | ||
handle_screenshot_and_relog(scenario, current_epoch) | ||
else | ||
warn 'There is no active web session; unable to take a screenshot or relog.' | ||
end | ||
ensure | ||
print_server_logs | ||
end | ||
ensure | ||
print_server_logs | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's a bit more readable like that?
Otherwise, maybe separating it in multiple functions.
After do |scenario|
current_epoch = Time.new.to_i
log "This scenario took: #{current_epoch - @scenario_start_time} seconds"
if $timeout_failure_count >= $timeout_threshold
warn 'Timeout failure threshold reached, stopping test suite.'
Cucumber.wants_to_quit = true
end
if scenario.failed?
case scenario.exception
when Net::ReadTimeout
$timeout_failure_count += 1
warn "Net::ReadTimeout detected! Consecutive ReadTimeouts: #{$timeout_failure_count}"
log_server_response_time
else
handle_screenshot_and_relog(scenario, current_epoch) if web_session_is_active?
print_server_logs
end
end
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By doing that, you remove the web_session_is_active?
from the after step. This check save lot of time if an api call is failing. (eg: sanity_check fail in 10 minutes with it but more than 2 hours without if issues happen during client checks)
warn "Timeout failure threshold reached, stopping test suite." | ||
exit(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about this https://cucumber.io/docs/cucumber/api/ ?
warn "Timeout failure threshold reached, stopping test suite." | |
exit(1) | |
warn 'Timeout failure threshold reached, stopping test suite.' | |
Cucumber.wants_to_quit = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't work on my side.
After do |scenario| | ||
# Handle timeout failure threshold | ||
if $timeout_failure_count >= $timeout_threshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's please don't put anything before logging the time this scenario took, otherwise we might be giving wrong timing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the condition is true, all next features are going to be skipped.
# Check if we need to dump code coverage into Redis DB | ||
next unless $code_coverage_mode | ||
next unless $feature_path != scenario.location.file | ||
|
||
process_code_coverage | ||
$feature_path = scenario.location.file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the code coverage process in a different After method as it was.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was creating a hanging process when running multiple features as in a same run.
I tried different solutions to stop the cucumber testsuite and none are working. I tried And the testsuite is still running.
The only solution I found is to exit all scenarios once we reach the limit. |
About having two before |scenario | and two after |scenario|. I have to regroup them to be sure of the order. |
What does this PR change?
Improve the failing the time when we have read:timeout for more than 10 times.
Currently, in the testsuite, when we have the read:timeout, the failing time is taking more than 18 hours.
This error is probably related to the server or network behind slow to response. See: https://github.com/SUSE/spacewalk/issues/26585
This change will show use the current time response of the server and fail fast the features.
GUI diff
No difference.
Before:
After:
Documentation
No documentation needed: add explanation. This can't be used if there is a GUI diff
No documentation needed: only internal and user invisible changes
Documentation issue was created: Link for SUSE Multi-Linux Manager contributors, Link for community contributors.
API documentation added: please review the Wiki page Writing Documentation for the API if you have any changes to API documentation.
(OPTIONAL) Documentation PR
DONE
Test coverage
ℹ️ If a major new functionality is added, it is strongly recommended that tests for the new functionality are added to the Cucumber test suite
No tests: add explanation
No tests: already covered
Unit tests were added
Cucumber tests were added
DONE
Links
Issue(s): #
Port(s): # add downstream PR(s), if any
Changelogs
Make sure the changelogs entries you are adding are compliant with https://github.com/uyuni-project/uyuni/wiki/Contributing#changelogs and https://github.com/uyuni-project/uyuni/wiki/Contributing#uyuni-projectuyuni-repository
If you don't need a changelog check, please mark this checkbox:
If you uncheck the checkbox after the PR is created, you will need to re-run
changelog_test
(see below)Re-run a test
If you need to re-run a test, please mark the related checkbox, it will be unchecked automatically once it has re-run:
Before you merge
Check How to branch and merge properly!