Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement WDA restart for unexpected crashes #2531

Draft
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

abhinvv1
Copy link

@abhinvv1 abhinvv1 commented Mar 10, 2025

  • With this PR we want to add the functionality of restarting WDA in case of crashes especially socket hangups while maintaining the same appium session, which will eventually prevent the user session from breaking
  • While in some cases the command under execution would break, but this can be handled client side and the following commands will not break
  • Planning to put this behind a capability, would request sharing initial thoughts on the intent and the approach

@abhinvv1 abhinvv1 changed the title feat: Implement WDA restarts feat: Implement WDA restart for unexpected crashes Mar 10, 2025
@KazuCocoa
Copy link
Member

KazuCocoa commented Mar 11, 2025

I assume you'll add more changes, so I haven't left any code review right now. (Btw, it would be appreciated to keep Draft PR state before your implementation is ready for review)

The approach looks good to me, although we should carefully review the recoverable conditions. e.g. initialDeeplinkUrl also probably should be considered. Perhaps settings API related materials as well. Documents in the capabilities table might be too small to describe this behavior, so we also need to put this somewhere, like troubleshooting? or a more proper place.

I was also wondering if adding recovery logic in https://github.com/appium/WebDriverAgent/blob/f2de1a6e4baed7cce9a28a08d9b0998383c19e65/lib/webdriveragent.js#L381 would be appropriate in some part so that the xcuitest driver layer doesn't need to take care so much. Some existing code in https://github.com/appium/WebDriverAgent/blob/f2de1a6e4baed7cce9a28a08d9b0998383c19e65/lib/webdriveragent.js such as https://github.com/appium/WebDriverAgent/blob/f2de1a6e4baed7cce9a28a08d9b0998383c19e65/lib/webdriveragent.js#L183-L185 might simplify some code in this pr change also. (Anyway, I haven't read this pr code details yet so this is just a comment)

@abhinvv1 abhinvv1 marked this pull request as draft March 11, 2025 05:27
@abhinvv1 abhinvv1 marked this pull request as ready for review March 11, 2025 20:23
@abhinvv1
Copy link
Author

abhinvv1 commented Mar 11, 2025

@KazuCocoa requesting review.

In the case of large-size DOM applications, we have observed that WDA is terminated due to excessive memory usage (exceeding 2-3GB) which is too much for a background process. This results in os killing WDA and socket hangups along with session disruptions for users. However, we have found that restarting WDA significantly reduces memory consumption as that memory is released.
With this pull request, our objective is to restart WDA following a socket hangup, while ensuring that the session remains intact.

We would appreciate your feedback on the intended approach and strategy, meanwhile working on Appium changes to introduce a new capability enableWdaRestart.

@@ -902,7 +906,7 @@ export class XCUITestDriver extends BaseDriver {
// In case the bundle id process start got failed because of
// auth popup in the device. Then, the bundle id process itself started. It is safe to stop it here.
await this.mobileKillApp(this.wda.bundleIdForXctest);
} catch {};
} catch { };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you keep spaces/indentations in current master?

* @param {Error|null|undefined} err - The error to check
* @returns {boolean} - True if the error is a WDA connection error
*/
isWdaConnectionError(err) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be replaced with this.wda.isRunning()?

Copy link
Author

@abhinvv1 abhinvv1 Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can, but we'll won't be able to check whether the error was specifically socket hang up or ECONNREFUSED, does this.wda.isRunning() ensures it's a socket hangup for sure?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code is https://github.com/appium/WebDriverAgent/blob/f2de1a6e4baed7cce9a28a08d9b0998383c19e65/lib/webdriveragent.js#L183 So, basically the this.wda.isRunning() returns false when /status endpoint gets an error. The endpoint is usually used as a WDA liveness check in XCUITest driver.
One case I could think of right now is when WDA process ends by a device disconnection in the middle. Then, it might return false while WDA process on the device is running, but the error is the same as ECONNREFUSED like you do right now (afaik)

* @returns {Promise<object>} - The WDA status
* @throws {Error} - If WDA does not become reachable within the timeout
*/
async reconnectWda() {
Copy link
Member

@KazuCocoa KazuCocoa Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this kind of method should be in https://github.com/appium/WebDriverAgent/blob/f2de1a6e4baed7cce9a28a08d9b0998383c19e65/lib/webdriveragent.js? Then, as this xcuitest driver, this kind of reconnect can be done with this.wda.reconnect().

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think, if you check the content of this method, we are more of just pinging wda and putting sleep if not running, which is more suitable on driver side, because driver should manage this logic of waiting time

* @returns {Promise<void>}
*/
async delay(ms) {
return new Promise((resolve) => setTimeout(resolve, ms));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

await B.delay(ms); instead?

}
await this.startWda();
} else {
this.log.info('WDA process is still running');
Copy link
Member

@KazuCocoa KazuCocoa Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What case did you expect for this flow?

isRunning checks /status endpoint call in WDA. So isRunning === false means WDA returns socket error etc already. I guess in this case, WDA is properly running but you'd like to call the this.startWdaSession below? Then, what kind of situation is expected? (just a question)

https://github.com/appium/WebDriverAgent/blob/f2de1a6e4baed7cce9a28a08d9b0998383c19e65/lib/webdriveragent.js#L184

this.log.info('WDA process is still running');
}

await this.startWdaSession(this.opts.bundleId, this.opts.processArguments);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this.startWda(); was called in the above line in if (!isRunning) {, this this.startWdaSession occurs twice because this.startWda(); also does the same steps. Is it expected?

* @private
* @returns {Promise<void>}
*/
async restoreContext() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to this context restore, should this reconnection idea needs to restore settings? https://appium.github.io/appium-xcuitest-driver/8.4/reference/settings/

https://github.com/appium/WebDriverAgent/blob/f2de1a6e4baed7cce9a28a08d9b0998383c19e65/lib/types.ts#L2-L29 these settings items might be lost after this reinitialization.

const maxAttempts = this.opts.wdaStartupRetries || 2;
let attempts = 0;

while (true) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
while (true) {
while (attempts < maxAttempts) {

(no much meanings, but just to avoid while (true))

// Verify session is still valid
// This is important because in case the WDA was not killed and it was only an intermittent connection issue,
// and the session is still valid then this would prevent us restarting it unnecesarilly
await this.proxyCommand(`/session/${this.sessionId}/url`, 'GET');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This endpoint call could be expensive in WebView context. Getting a timeout endpoint call could be lightweight.

Comment on lines +1276 to +1277
(err.message.includes('invalid session id') ||
err.message.includes('Session does not exist')),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These messages are managed in https://github.com/appium/WebDriverAgent (WDA code).

To keep error message management in a closed place, probably this kind of error check should be in https://github.com/appium/WebDriverAgent/tree/master/lib. So, as xcuitest driver, https://github.com/appium/appium-xcuitest-driver/pull/2531/files#diff-ed3c013edbc896bd87ef62539626fcdc8930a31f3cee7ecb37438e172af2c8e0R1291-R1303 these lines can be in the WDA JS side and XCTest driver could call it as this.wda.hasCrashedOrSessionInvalidError(). If true, xcuitest driver can do the recovery steps. This is similar to this.wda.isRunning()

@@ -132,6 +135,9 @@ export default {
if (err.message && err.message.match(/Condition unmet/)) {
// condition was not met setting res to empty array
els = [];
}
else if (err.message && err.message.match(/socket hang up/)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure how what difference this condition makes here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the error message in case of socket hangups for find element commands was getting overriden by NoSuchElementError which makes it undetectable in driver.js file, hence we are trying to bubble it up as socket hang up only to be able to identify on driver.js level and trigger the restart flow

@@ -122,7 +122,10 @@ export default {
els = /** @type {Element[]|undefined} */ (
await this.proxyCommand(endpoint, method, body)
);
} catch {
} catch (err) {
if (err.message && err.message.match(/socket hang up/)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the regexp should be moved to a global constant

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do something like SOCKET_HANG_UP_PATTERN.test(err.message) instead

}
},
enableWdaRestart: {
presence: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the default setting, so not needed

@@ -383,7 +383,11 @@ const desiredCapConstraints = /** @type {const} */ ({
pageLoadStrategy: {
isString: true,
inclusionCaseInsensitive: ['none', 'eager', 'normal']
}
},
enableWdaRestart: {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the capability name is too broad. I would rather use something like restartWdaOnCrash

return false;
}
const errorMessage = err.message || '';
const CONNECTION_ERRORS = ['ECONNREFUSED', 'socket hang up'];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be a module constant


this.log.info(`Attempting to reconnect to WDA with ${timeout}ms timeout`);

while (Date.now() - startTime < timeout) {
Copy link
Contributor

@mykola-mokhnach mykola-mokhnach Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use primitives from the asyncbox module for this purpose, e.g. waitForCondition

* @param {string} cmd
* @param {...any} args
* @returns {Promise<any>}
* Checks if an error is related to WDA connection issues
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these all new method overcomplicate the driver class. consider extracting them to a separate module and make to mixins, similar to other commands

@mykola-mokhnach
Copy link
Contributor

mykola-mokhnach commented Mar 12, 2025

I am not totally convinced by this idea. Here is why:

  • All client commands are going to fail while WDA is being restarted. Users don't know when it might happen and how long it is going to take, thus I don't see any reliable path to recover on the client side
  • I would like to know what's the reason for unexpected WDA crashes. Maybe these could be resolved within WDA, so we don't need to add such hacks to the driver
  • WebDriverAgent keeps the internal session state in memory. Currently there is no way to restore that state, which means the restored session won't be the same as the original one thus causing further potential issues. One example of such state is native video recording, another example is the state of the application under test.

Copy link
Contributor

@mykola-mokhnach mykola-mokhnach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment above

@abhinvv1
Copy link
Author

abhinvv1 commented Mar 12, 2025

@mykola-mokhnach

  1. While the WDA is being restarted we are holding the current appium command response and post restarting we are retrying the same command again and then sending the final response, hence only the command under execution might break if that command depends on the previous state of WDA somehow, which can be retried client side, and overall session execution won't break
  2. And because we are not sending any response for the current command that caused the WDA crash, the next command won't arrive, the client will still wait for the previous response.
  3. The time to restart WDA can be configured client-side depending on how much they want to wait if they expect socket hang-ups more often, upon testing this time is around <2mins to restart WDA successfully
  4. In all the cases I've tested only the command under execution gets an impact if it depends on the previous WDA state something like find_element/<id> and that id is not more present in WDA
  5. We have tried solving the increase in memory consumption on the WDA side but haven't been able to do so, it appears that with large-size DOM apps, we cannot control this memory increase on WDA due to leaks, hence restarting upon crash only seems a possible approach
  6. Can you please help me understand what state of the application under test is stored in WDA which might cause potential issues even if the command is retried from scratch?

@mykola-mokhnach
Copy link
Contributor

@mykola-mokhnach

  1. While the WDA is being restarted we are holding the current appium command response and post restarting we are retrying the same command again and then sending the final response, hence only the command under execution might break if that command depends on the previous state of WDA somehow, which can be retried client side, and overall session execution won't break
  2. And because we are not sending any response for the current command that caused the WDA crash, the next command won't arrive, the client will still wait for the previous response.
  3. The time to restart WDA can be configured client-side depending on how much they want to wait if they expect socket hang-ups more often, upon testing this time is around <2mins to restart WDA successfully
  4. In all the cases I've tested only the command under execution gets an impact if it depends on the previous WDA state something like find_element/<id> and that id is not more present in WDA
  5. We have tried solving the increase in memory consumption on the WDA side but haven't been able to do so, it appears that with large-size DOM apps, we cannot control this memory increase on WDA due to leaks, hence restarting upon crash only seems a possible approach
  6. Can you please help me understand what state of the application under test is stored in WDA which might cause potential issues even if the command is retried from scratch?

@abhinvv1 It is a large topic to discuss. Would it be possible for us to schedule a meeting? We could also communicate in appium slack.

@abhinvv1 abhinvv1 marked this pull request as draft March 14, 2025 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants