Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve read timeout failing #9888

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

maximenoel8
Copy link
Contributor

What does this PR change?

Improve the failing the time when we have read:timeout for more than 10 times.
Currently, in the testsuite, when we have the read:timeout, the failing time is taking more than 18 hours.
This error is probably related to the server or network behind slow to response. See: https://github.com/SUSE/spacewalk/issues/26585

This change will show use the current time response of the server and fail fast the features.

GUI diff

No difference.

Before:

After:

  • DONE

Documentation

Test coverage

ℹ️ If a major new functionality is added, it is strongly recommended that tests for the new functionality are added to the Cucumber test suite

  • No tests: add explanation

  • No tests: already covered

  • Unit tests were added

  • Cucumber tests were added

  • DONE

Links

Issue(s): #
Port(s): # add downstream PR(s), if any

  • DONE

Changelogs

Make sure the changelogs entries you are adding are compliant with https://github.com/uyuni-project/uyuni/wiki/Contributing#changelogs and https://github.com/uyuni-project/uyuni/wiki/Contributing#uyuni-projectuyuni-repository

If you don't need a changelog check, please mark this checkbox:

  • No changelog needed

If you uncheck the checkbox after the PR is created, you will need to re-run changelog_test (see below)

Re-run a test

If you need to re-run a test, please mark the related checkbox, it will be unchecked automatically once it has re-run:

  • Re-run test "changelog_test"
  • Re-run test "backend_unittests_pgsql"
  • Re-run test "java_pgsql_tests"
  • Re-run test "schema_migration_test_pgsql"
  • Re-run test "susemanager_unittests"
  • Re-run test "javascript_lint"
  • Re-run test "spacecmd_unittests"

Before you merge

Check How to branch and merge properly!

Copy link
Contributor

github-actions bot commented Mar 4, 2025

👋 Hello! Thanks for contributing to our project.
Acceptance tests will take some time (aprox. 1h), please be patient ☕

You can see the progress at the end of this page and at https://github.com/uyuni-project/uyuni/pull/9888/checks
Once tests finish, if they fail, you can check 👀 the cucumber report. See the link at the output of the action.
You can also check the artifacts section, which contains the logs at https://github.com/uyuni-project/uyuni/pull/9888/checks.

If you are unsure the failing tests are related to your code, you can check the "reference jobs". These are jobs that run on a scheduled time with code from master. If they fail for the same reason as your build, it means the tests or the infrastructure are broken. If they do not fail, but yours do, it means it is related to your code.

Reference tests:

KNOWN ISSUES

Sometimes the build can fail when pulling new jar files from download.opensuse.org . This is a known limitation. Given this happens rarely, when it does, all you need to do is rerun the test. Sorry for the inconvenience.

For more tips on troubleshooting, see the troubleshooting guide.

Happy hacking!
⚠️ You should not merge if acceptance tests fail to pass. ⚠️

@maximenoel8 maximenoel8 force-pushed the improve_read_timeout_failing branch 2 times, most recently from 9bfbfcd to da7df49 Compare March 4, 2025 03:18
@maximenoel8 maximenoel8 force-pushed the improve_read_timeout_failing branch from da7df49 to e45bc8c Compare March 4, 2025 03:19
@maximenoel8 maximenoel8 force-pushed the improve_read_timeout_failing branch from e45bc8c to 135e5a0 Compare March 4, 2025 03:20
@@ -73,6 +74,10 @@
$beta_enabled = (ENV.fetch('BETA_ENABLED', 'False') == 'True')
$api_protocol = ENV.fetch('API_PROTOCOL', nil) if ENV['API_PROTOCOL'] # force the API protocol to be used. You can use 'http' or 'xmlrpc'

# Define a global counter to track consecutive ReadTimeout errors
$timeout_failure_count = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better if we set this counter using the add_context('timeout_failure_counter', 0) and get_context('timeout_failure_counter')

@@ -73,6 +74,10 @@
$beta_enabled = (ENV.fetch('BETA_ENABLED', 'False') == 'True')
$api_protocol = ENV.fetch('API_PROTOCOL', nil) if ENV['API_PROTOCOL'] # force the API protocol to be used. You can use 'http' or 'xmlrpc'

# Define a global counter to track consecutive ReadTimeout errors
$timeout_failure_count = 0
$timeout_threshold = 10 # Set the threshold for stopping the suite
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value does not aim to change during the whole cucumber execution, so I would better define it as a constant than a global variable. Also, if we have it as a global variable, please move it up to the block of lines 45-59

Comment on lines -214 to -222
# Dump feature code coverage into a Redis DB
After do |scenario|
next unless $code_coverage_mode
next unless $feature_path != scenario.location.file

process_code_coverage
$feature_path = scenario.location.file
end

Copy link
Member

@srbarrios srbarrios Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not mix different responsibilities please.
Instead of having all mixed in a unique After method, I think it's better to have it separated in multiple After methods each of them responsible of one thing.

That way, even if one of them fails, others can still pass.

Copy link
Contributor Author

@maximenoel8 maximenoel8 Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having two after do |scenario| seem to create an unstable behavior. I'm not sure having two after do |scenario| and two before do |scenario| is recommended. How do you know which one will be executed before an other. In the case of my fast failing, this duplication is creating a hanging process between two features.

Comment on lines +199 to +211
# Method to check server response time (could be using an HTTP request or ping)
def log_server_response_time
begin
start_time = Time.now
uri = URI.parse("https://#{ENV.fetch('SERVER', nil)}")
Net::HTTP.get_response(uri)
response_time = Time.now - start_time
log "Server response time: #{response_time} seconds"
rescue => e
warn "Error checking server response time: #{e.message}"
end
end

Copy link
Member

@srbarrios srbarrios Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function seems to fit better on features/support/system_monitoring.rb

Comment on lines +142 to +150
feature_path = scenario.location.file
$feature_filename = feature_path.split(%r{(\.feature|/)})[-2]
next if get_context('user_created') == true

# Create own user based on feature filename. Exclude core, reposync, finishing and build_validation features.
unless feature_path.match?(/core|reposync|finishing|build_validation/)
step %(I create a user with name "#{$feature_filename}" and password "linux")
add_context('user_created', true)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are duplicating these steps in another Before method, please let's remove this from this Before method. We better have separated Before methods each of them having a single responsibility, it gonna be more readable and easier to maintain.

@@ -130,26 +135,80 @@ def capybara_register_driver
# Define the current feature scope
Before do |scenario|
$feature_scope = scenario.location.file.split(%r{(\.feature|/)})[-2]
if $timeout_failure_count >= $timeout_threshold
warn "Timeout failure threshold reached, stopping test suite."
exit(1)
Copy link
Member

@srbarrios srbarrios Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this https://cucumber.io/docs/cucumber/api/ ?

Suggested change
exit(1)
warn 'Timeout failure threshold reached, stopping test suite.'
Cucumber.wants_to_quit = true

Copy link
Contributor Author

@maximenoel8 maximenoel8 Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cucumber.wants_to_quit = true seems to do nothing.

Comment on lines 165 to 181
if scenario.failed?
begin
if web_session_is_active?
handle_screenshot_and_relog(scenario, current_epoch)
else
warn 'There is no active web session; unable to take a screenshot or relog.'
if scenario.exception.is_a?(Net::ReadTimeout)
$timeout_failure_count += 1
warn "Net::ReadTimeout detected! Consecutive ReadTimeouts: #{$timeout_failure_count}"
log_server_response_time
else
begin
if web_session_is_active?
handle_screenshot_and_relog(scenario, current_epoch)
else
warn 'There is no active web session; unable to take a screenshot or relog.'
end
ensure
print_server_logs
end
ensure
print_server_logs
end
end
Copy link
Member

@srbarrios srbarrios Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's a bit more readable like that?
Otherwise, maybe separating it in multiple functions.

After do |scenario|
  current_epoch = Time.new.to_i
  log "This scenario took: #{current_epoch - @scenario_start_time} seconds"

  if $timeout_failure_count >= $timeout_threshold
    warn 'Timeout failure threshold reached, stopping test suite.'
    Cucumber.wants_to_quit = true
  end
  
  if scenario.failed?
    case scenario.exception
    when Net::ReadTimeout
      $timeout_failure_count += 1
      warn "Net::ReadTimeout detected! Consecutive ReadTimeouts: #{$timeout_failure_count}"
      log_server_response_time
    else
      handle_screenshot_and_relog(scenario, current_epoch) if web_session_is_active?

      print_server_logs
    end
  end
end

Copy link
Contributor Author

@maximenoel8 maximenoel8 Mar 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By doing that, you remove the web_session_is_active? from the after step. This check save lot of time if an api call is failing. (eg: sanity_check fail in 10 minutes with it but more than 2 hours without if issues happen during client checks)

Comment on lines +157 to +158
warn "Timeout failure threshold reached, stopping test suite."
exit(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about this https://cucumber.io/docs/cucumber/api/ ?

Suggested change
warn "Timeout failure threshold reached, stopping test suite."
exit(1)
warn 'Timeout failure threshold reached, stopping test suite.'
Cucumber.wants_to_quit = true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't work on my side.

After do |scenario|
# Handle timeout failure threshold
if $timeout_failure_count >= $timeout_threshold
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's please don't put anything before logging the time this scenario took, otherwise we might be giving wrong timing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the condition is true, all next features are going to be skipped.

Comment on lines +183 to +188
# Check if we need to dump code coverage into Redis DB
next unless $code_coverage_mode
next unless $feature_path != scenario.location.file

process_code_coverage
$feature_path = scenario.location.file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the code coverage process in a different After method as it was.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was creating a hanging process when running multiple features as in a same run.

@maximenoel8
Copy link
Contributor Author

I tried different solutions to stop the cucumber testsuite and none are working. I tried Cucumber.wants_to_quit = true

And the testsuite is still running.

  Scenario: Bootstrap a y Linux 9 Salt minion             # features/test.feature:20
Timeout failure threshold reached, stopping test suite.
      This scenario ran at: 2025-03-04 20:55:14 +0100
    When I follow the left menu "Systems > Bootstrapping" # features/step_definitions/navigation_steps.rb:369
      Unable to find xpath "//aside/div[@id='nav']/nav/ul/li/div/a[contains(.,'Systems')]/parent::div/parent::li" (Capybara::ElementNotFound)
      ./features/step_definitions/navigation_steps.rb:390:in `block (2 levels) in <top (required)>'
      ./features/step_definitions/navigation_steps.rb:382:in `each'
      ./features/step_definitions/navigation_steps.rb:382:in `each_with_index'
      ./features/step_definitions/navigation_steps.rb:382:in `/^I follow the left menu "([^"]*)"$/'
      features/test.feature:21:in `I follow the left menu "Systems > Bootstrapping"'
    Then I should see a "Bootstrap Minions" text          # features/step_definitions/navigation_steps.rb:651
Reached the timeout failure threshold of 3 consecutive ReadTimeouts. Stopping the test suite.
There is no active web session; unable to take a screenshot or relog.
=> /var/log/rhn/rhn_web_ui.log

=> /var/log/rhn/rhn_web_api.log

      This scenario took: 10 seconds

  Scenario: Log inin user                         # features/test.feature:24
Timeout failure threshold reached, stopping test suite.
      This scenario ran at: 2025-03-04 20:55:45 +0100
    Given I am authorized for the "Admin" section # features/step_definitions/navigation_steps.rb:447
      Net::ReadTimeout with #<TCPSocket:(closed)> (Net::ReadTimeout)
      ./features/step_definitions/navigation_steps.rb:580:in `/^I am authorized as "([^"]*)" with password "([^"]*)"$/'
      ./features/step_definitions/navigation_steps.rb:450:in `/^I am authorized for the "([^"]*)" section$/'
      features/test.feature:25:in `I am authorized for the "Admin" section'
Reached the timeout failure threshold of 3 consecutive ReadTimeouts. Stopping the test suite.
Net::ReadTimeout detected! Consecutive ReadTimeouts: 4

The only solution I found is to exit all scenarios once we reach the limit.

@maximenoel8
Copy link
Contributor Author

About having two before |scenario | and two after |scenario|. I have to regroup them to be sure of the order.
This will require probably more testing but from my limited testing, having those two setup and teardown separate is creating a mess. For the approach of skipping every, they have to be together or have the timeout condition at the beginning of each. I have the feeling having those after and before duplicating is increasing the time in case the testsuite is failing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants