Skip to content

Fix failed during restart in Ambari 2.5 #29

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion package/scripts/presto_client.py
Original file line number Diff line number Diff line change
Expand Up @@ -165,7 +165,7 @@ def execute_query(self, sql, schema='sf1', catalog='tpch'):
try:
_LOGGER.info('Connecting to server at: ' + self.server +
':' + str(self.port) + ' as user ' + self.user)
conn = HTTPConnection(self.server, self.port, False,
conn = HTTPConnection(self.server, int(self.port), False,
URL_TIMEOUT_MS)
conn.request('POST', '/v1/statement', sql, headers)
response = conn.getresponse()
Expand Down
22 changes: 15 additions & 7 deletions package/scripts/presto_coordinator.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,14 +13,17 @@
# limitations under the License.

import uuid
import time
import os.path as path

from resource_management.libraries.script.script import Script
from resource_management.core.resources.system import Execute
from common import PRESTO_RPM_URL, PRESTO_RPM_NAME, create_connectors,\
from resource_management.core.exceptions import ExecutionFailed, ComponentIsNotRunning
from common import PRESTO_RPM_URL, PRESTO_RPM_NAME, create_connectors, \
delete_connectors
from presto_client import smoketest_presto, PrestoClient


class Coordinator(Script):
def install(self, env):
from params import java_home
Expand All @@ -39,17 +42,22 @@ def start(self, env):
Execute('{0} start'.format(daemon_control_script))
if 'presto_worker_hosts' in host_info.keys():
all_hosts = host_info['presto_worker_hosts'] + \
host_info['presto_coordinator_hosts']
host_info['presto_coordinator_hosts']
else:
all_hosts = host_info['presto_coordinator_hosts']
smoketest_presto(
PrestoClient('localhost','root',
config_properties['http-server.http.port']),
all_hosts)
# sleep some seconds for waiting coordinator start finished
time.sleep(30)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the sleep and the comment as they're unnecessary. The PrestoClient code has a retry with timeout mechanism to ensure the server has started.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I have seen it. To add this code just because some errors will be raised (just like ExecutionFailed mentioned blew) . And the retries will be interrupted, Never retry again.
Add some seconds for sleeping will skip the ExecutionFailed exception in most cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@petroav As I said above, I think I have fixed the root cause of the interruption of the error, so this code can now be removed. It may also be possible to keep this code, and the sleep time setting is shorter, just if there are other exceptions that are thrown and then interrupted? What is your opinion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that you fixed the underlying problem yes, I think this sleep should be removed.

smoketest_presto(PrestoClient('localhost', 'root', config_properties['http-server.http.port']), all_hosts)

def status(self, env):
from params import daemon_control_script
Execute('{0} status'.format(daemon_control_script))
try:
Execute('{0} status'.format(daemon_control_script))
except ExecutionFailed as ef:
if ef.code == 3:
raise ComponentIsNotRunning("ComponentIsNotRunning")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the advantage of doing this? Will Ambari recognize the exception and do something special as a result?

Copy link
Contributor Author

@icanfly icanfly Aug 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Ambari only accept zero value as normal return, any other return value will cause an ExecutionFailed exception thrown. As presto status will return the code: 3 .

Also In Ambari 2.5, Ambari will handle ComponentIsNotRunning exception as a normal logic procession during restart like below:
/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py (Ambari Libaray):

image

So. TO AVOID THE EXCEPTION

raise Fail("Stop command finished but process keep running.")

raise a ComponentIsNotRunning exception should be the right way.

I only test it in Ambari 2.5. I'm not sure is any older version play the same behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, that makes sense.

else:
raise ef

def configure(self, env):
from params import node_properties, jvm_config, config_properties, \
Expand Down
10 changes: 9 additions & 1 deletion package/scripts/presto_worker.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@

from resource_management.libraries.script.script import Script
from resource_management.core.resources.system import Execute
from resource_management.core.exceptions import ExecutionFailed, ComponentIsNotRunning
from common import PRESTO_RPM_URL, PRESTO_RPM_NAME, create_connectors, \
delete_connectors

Expand All @@ -39,7 +40,14 @@ def start(self, env):

def status(self, env):
from params import daemon_control_script
Execute('{0} status'.format(daemon_control_script))
try:
Execute('{0} status'.format(daemon_control_script))
except ExecutionFailed as ef:
if ef.code == 3:
raise ComponentIsNotRunning("ComponentIsNotRunning")
else:
raise ef


def configure(self, env):
from params import node_properties, jvm_config, config_properties, \
Expand Down
10 changes: 9 additions & 1 deletion tests/resource_management/core/exceptions.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,10 @@
class ClientComponentHasNoStatus(Exception):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the unit tests! The Ambari SDK doesn't make it easy to test your code so I had to resort to this hacky solution.

pass
pass


class ExecutionFailed(Exception):
pass


class ComponentIsNotRunning(Exception):
pass