-
-
Notifications
You must be signed in to change notification settings - Fork 427
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Implement WDA restart for unexpected crashes #2531
base: master
Are you sure you want to change the base?
Conversation
abhinvv1
commented
Mar 10, 2025
•
edited
Loading
edited
- With this PR we want to add the functionality of restarting WDA in case of crashes especially socket hangups while maintaining the same appium session, which will eventually prevent the user session from breaking
- While in some cases the command under execution would break, but this can be handled client side and the following commands will not break
- Planning to put this behind a capability, would request sharing initial thoughts on the intent and the approach
I assume you'll add more changes, so I haven't left any code review right now. (Btw, it would be appreciated to keep Draft PR state before your implementation is ready for review) The approach looks good to me, although we should carefully review the recoverable conditions. e.g. I was also wondering if adding recovery logic in https://github.com/appium/WebDriverAgent/blob/f2de1a6e4baed7cce9a28a08d9b0998383c19e65/lib/webdriveragent.js#L381 would be appropriate in some part so that the xcuitest driver layer doesn't need to take care so much. Some existing code in https://github.com/appium/WebDriverAgent/blob/f2de1a6e4baed7cce9a28a08d9b0998383c19e65/lib/webdriveragent.js such as https://github.com/appium/WebDriverAgent/blob/f2de1a6e4baed7cce9a28a08d9b0998383c19e65/lib/webdriveragent.js#L183-L185 might simplify some code in this pr change also. (Anyway, I haven't read this pr code details yet so this is just a comment) |
@KazuCocoa requesting review. In the case of large-size DOM applications, we have observed that WDA is terminated due to excessive memory usage (exceeding 2-3GB) which is too much for a background process. This results in os killing WDA and socket hangups along with session disruptions for users. However, we have found that restarting WDA significantly reduces memory consumption as that memory is released. We would appreciate your feedback on the intended approach and strategy, meanwhile working on Appium changes to introduce a new capability |
@@ -902,7 +906,7 @@ export class XCUITestDriver extends BaseDriver { | |||
// In case the bundle id process start got failed because of | |||
// auth popup in the device. Then, the bundle id process itself started. It is safe to stop it here. | |||
await this.mobileKillApp(this.wda.bundleIdForXctest); | |||
} catch {}; | |||
} catch { }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you keep spaces/indentations in current master?
* @param {Error|null|undefined} err - The error to check | ||
* @returns {boolean} - True if the error is a WDA connection error | ||
*/ | ||
isWdaConnectionError(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be replaced with this.wda.isRunning()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can, but we'll won't be able to check whether the error was specifically socket hang up
or ECONNREFUSED
, does this.wda.isRunning()
ensures it's a socket hangup for sure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is https://github.com/appium/WebDriverAgent/blob/f2de1a6e4baed7cce9a28a08d9b0998383c19e65/lib/webdriveragent.js#L183 So, basically the this.wda.isRunning()
returns false when /status
endpoint gets an error. The endpoint is usually used as a WDA liveness check in XCUITest driver.
One case I could think of right now is when WDA process ends by a device disconnection in the middle. Then, it might return false while WDA process on the device is running, but the error is the same as ECONNREFUSED
like you do right now (afaik)
* @returns {Promise<object>} - The WDA status | ||
* @throws {Error} - If WDA does not become reachable within the timeout | ||
*/ | ||
async reconnectWda() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this kind of method should be in https://github.com/appium/WebDriverAgent/blob/f2de1a6e4baed7cce9a28a08d9b0998383c19e65/lib/webdriveragent.js? Then, as this xcuitest driver, this kind of reconnect can be done with this.wda.reconnect()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think, if you check the content of this method, we are more of just pinging wda and putting sleep if not running, which is more suitable on driver side, because driver should manage this logic of waiting time
* @returns {Promise<void>} | ||
*/ | ||
async delay(ms) { | ||
return new Promise((resolve) => setTimeout(resolve, ms)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
await B.delay(ms);
instead?
} | ||
await this.startWda(); | ||
} else { | ||
this.log.info('WDA process is still running'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What case did you expect for this flow?
isRunning
checks /status
endpoint call in WDA. So isRunning === false
means WDA returns socket error etc already. I guess in this case, WDA is properly running but you'd like to call the this.startWdaSession
below? Then, what kind of situation is expected? (just a question)
this.log.info('WDA process is still running'); | ||
} | ||
|
||
await this.startWdaSession(this.opts.bundleId, this.opts.processArguments); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this.startWda();
was called in the above line in if (!isRunning) {
, this this.startWdaSession
occurs twice because this.startWda();
also does the same steps. Is it expected?
* @private | ||
* @returns {Promise<void>} | ||
*/ | ||
async restoreContext() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to this context restore, should this reconnection idea needs to restore settings? https://appium.github.io/appium-xcuitest-driver/8.4/reference/settings/
https://github.com/appium/WebDriverAgent/blob/f2de1a6e4baed7cce9a28a08d9b0998383c19e65/lib/types.ts#L2-L29 these settings items might be lost after this reinitialization.
const maxAttempts = this.opts.wdaStartupRetries || 2; | ||
let attempts = 0; | ||
|
||
while (true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while (true) { | |
while (attempts < maxAttempts) { |
(no much meanings, but just to avoid while (true)
)
// Verify session is still valid | ||
// This is important because in case the WDA was not killed and it was only an intermittent connection issue, | ||
// and the session is still valid then this would prevent us restarting it unnecesarilly | ||
await this.proxyCommand(`/session/${this.sessionId}/url`, 'GET'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This endpoint call could be expensive in WebView context. Getting a timeout endpoint call could be lightweight.
(err.message.includes('invalid session id') || | ||
err.message.includes('Session does not exist')), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These messages are managed in https://github.com/appium/WebDriverAgent (WDA code).
To keep error message management in a closed place, probably this kind of error check should be in https://github.com/appium/WebDriverAgent/tree/master/lib. So, as xcuitest driver, https://github.com/appium/appium-xcuitest-driver/pull/2531/files#diff-ed3c013edbc896bd87ef62539626fcdc8930a31f3cee7ecb37438e172af2c8e0R1291-R1303 these lines can be in the WDA JS side and XCTest driver could call it as this.wda.hasCrashedOrSessionInvalidError()
. If true, xcuitest driver can do the recovery steps. This is similar to this.wda.isRunning()
@@ -132,6 +135,9 @@ export default { | |||
if (err.message && err.message.match(/Condition unmet/)) { | |||
// condition was not met setting res to empty array | |||
els = []; | |||
} | |||
else if (err.message && err.message.match(/socket hang up/)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure how what difference this condition makes here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the error message in case of socket hangups for find element commands was getting overriden by NoSuchElementError which makes it undetectable in driver.js file, hence we are trying to bubble it up as socket hang up only to be able to identify on driver.js level and trigger the restart flow
@@ -122,7 +122,10 @@ export default { | |||
els = /** @type {Element[]|undefined} */ ( | |||
await this.proxyCommand(endpoint, method, body) | |||
); | |||
} catch { | |||
} catch (err) { | |||
if (err.message && err.message.match(/socket hang up/)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the regexp should be moved to a global constant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would do something like SOCKET_HANG_UP_PATTERN.test(err.message)
instead
} | ||
}, | ||
enableWdaRestart: { | ||
presence: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the default setting, so not needed
@@ -383,7 +383,11 @@ const desiredCapConstraints = /** @type {const} */ ({ | |||
pageLoadStrategy: { | |||
isString: true, | |||
inclusionCaseInsensitive: ['none', 'eager', 'normal'] | |||
} | |||
}, | |||
enableWdaRestart: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the capability name is too broad. I would rather use something like restartWdaOnCrash
return false; | ||
} | ||
const errorMessage = err.message || ''; | ||
const CONNECTION_ERRORS = ['ECONNREFUSED', 'socket hang up']; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be a module constant
|
||
this.log.info(`Attempting to reconnect to WDA with ${timeout}ms timeout`); | ||
|
||
while (Date.now() - startTime < timeout) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use primitives from the asyncbox
module for this purpose, e.g. waitForCondition
* @param {string} cmd | ||
* @param {...any} args | ||
* @returns {Promise<any>} | ||
* Checks if an error is related to WDA connection issues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these all new method overcomplicate the driver class. consider extracting them to a separate module and make to mixins, similar to other commands
I am not totally convinced by this idea. Here is why:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment above
|
@abhinvv1 It is a large topic to discuss. Would it be possible for us to schedule a meeting? We could also communicate in appium slack. |