Add StackHPC Ironic tunings #1011
base: stackhpc/2023.1
@@ -10,6 +10,7 @@ the various features provided.

   release-train
   host-images
   ironic
   lvm
   swap
   cephadm
@@ -0,0 +1,31 @@

======
Ironic
======

Cleaning
========

Storage
-------

Hardware-assisted secure erase, i.e. the ``erase_devices`` clean step, is
enabled by default. The priority of this step normally depends on the
`Hardware Manager
<https://docs.openstack.org/ironic-python-agent/latest/contributor/hardware_managers.html>`__
in use. For example, when using the GenericHardwareManager the priority would
be 10, whereas with the `ProliantHardwareManager
<https://docs.openstack.org/ironic/latest/admin/drivers/ilo.html#disk-erase-support>`__
it would be 0. The aim is to prevent the catastrophic case where data is
leaked to another tenant: you must explicitly relax this setting if that is a
risk you are willing to accept. This can be customised by editing the
following variables:

.. code-block:: ini
   :caption: $KAYOBE_CONFIG_PATH/kolla/config/ironic/ironic-conductor.conf

   [deploy]
   erase_devices_priority=10
   erase_devices_metadata_priority=0

See the `Ironic documentation
<https://docs.openstack.org/ironic/latest/admin/cleaning.html>`__ for more
details.
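The interaction between the configured priority and the hardware manager's default can be sketched as follows. This is a minimal illustration only; ``effective_priority`` is a hypothetical helper for explanation, not part of Ironic's code.

```python
def effective_priority(configured, hw_manager_default):
    """Pick the clean-step priority that will take effect.

    A priority set in ironic-conductor.conf overrides the hardware
    manager's default; None here means "defer to the hardware manager".
    """
    return hw_manager_default if configured is None else configured


# GenericHardwareManager defaults erase_devices to priority 10,
# ProliantHardwareManager to 0; setting the option pins the behaviour.
print(effective_priority(None, 10))  # manager default applies
print(effective_priority(10, 0))     # override forces secure erase on
print(effective_priority(0, 10))     # override explicitly disables it
```

Pinning an explicit value in the config avoids surprises when the deploy image ships a different hardware manager than expected.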
@@ -0,0 +1,13 @@

[DEFAULT]
timeout = 0
{% if "genericswitch" in kolla_neutron_ml2_mechanism_drivers %}
# We are increasing the RPC response timeout to 6 minutes due to the neutron
# generic switch driver, which synchronously applies switch configuration for
# each ironic port during node provisioning and tear down.
# The specific API calls that require this long timeout are:
# - Creation and deletion of VLAN networks.
# - Creation or update of ports, adding binding information.
# - Update of ports, removing binding information.
# - Deletion of ports.
rpc_response_timeout = 360
{% endif %}
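The Jinja guard above only renders the override when the generic switch driver appears in the ML2 mechanism driver list. A minimal sketch of that condition (the function name is hypothetical, for illustration only):

```python
def needs_long_rpc_timeout(ml2_mechanism_drivers):
    # Mirrors the Jinja condition: the 360 second timeout is only
    # applied when networking-generic-switch is in use, since only
    # that driver configures switches synchronously per port.
    return "genericswitch" in ml2_mechanism_drivers


print(needs_long_rpc_timeout(["openvswitch", "genericswitch"]))  # True
print(needs_long_rpc_timeout(["openvswitch"]))                   # False
```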
@@ -0,0 +1,6 @@

[DEFAULT]
# Avoid some timeouts of heartbeats and VIF deletes
rpc_response_timeout = 360

[neutron]
timeout = 300
@@ -0,0 +1,60 @@

[DEFAULT]
# Make direct deploy faster, transfer sparse qcow2 images
force_raw_images = False
# Avoid some RPC timeouts
rpc_response_timeout = 360

[conductor]
automated_clean=true
Review discussion on ``automated_clean``:

- I saw at some sites we have this cleaning tweak from the Ironic docs:
  I know cleaning is very site specific. Probably safer to leave the above
  out and default to scrub. Perhaps worth linking to the Ironic doc?
- I would actually like our default to be fail if secure erase fails, with
  a note on how to work around that.
- I explicitly set a value for erase_devices as the priority is dependent
  on the hardware manager in use. What do you think?
- Coming back to this, I think we should move to erase_devices_express, now
  I understand it better. I think we need to make the default good out of
  the box with our pre-built IPA.
# We have busy conductors failing to heartbeat
# Default is 10 secs
heartbeat_interval = 30
# Default is 60 seconds
heartbeat_timeout = 360
sync_local_state_interval = 360

# Normally this is 100. We see eventlet threads
# not making much progress, so for safety reduce
# this by half; remaining work should stay on the rabbit queue
workers_pool_size = 50
# Normally this is 8, keep it the same
period_max_workers = 8
# Increase power sync interval to reduce load
sync_power_state_interval = 120
power_failure_recovery_interval = 120
# Stop checking for orphan allocations for now
check_allocations_interval = 120

# Wait much longer before the provision timeout check, to reduce background load
# The default is 60 seconds
check_provision_state_interval = 120
check_rescue_state_interval = 120

[database]
# Usually this is 50, reduce to stop DB connection timeouts
# and instead just make eventlet threads wait a bit longer
max_overflow = 5
# By default this is 30 seconds, but as we reduce
# the pool overflow, some people will need to wait longer
pool_timeout = 60

[deploy]
# Force hardware-assisted secure erase by default.
erase_devices_priority=10
erase_devices_metadata_priority=0

[pxe]
# Increase cache size to 120GB and TTL to 28 hours
image_cache_size = 122880
image_cache_ttl = 100800

[neutron]
# Increase the neutron client timeout to allow for the slow management
# switches.
timeout = 300
request_timeout = 300

[glance]
# Retry image download at least once on failure
num_retries = 1
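The ``[pxe]`` cache values above can be sanity-checked with quick arithmetic. Note the TTL comment only works out to 28 hours if the option is read in seconds; in Ironic releases where ``image_cache_ttl`` is documented in minutes, 100800 would instead be 70 days, so it is worth checking the option reference for the deployed release.

```python
# image_cache_size is in MiB, so 122880 MiB is exactly 120 GiB.
size_mib = 122880
print(size_mib / 1024)   # 120.0 (GiB)

ttl = 100800
print(ttl / 3600)        # 28.0 hours, if the TTL is in seconds
print(ttl / 60 / 24)     # 70.0 days, if the TTL is in minutes
```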
@@ -0,0 +1,12 @@

[DEFAULT]
{% if kolla_enable_ironic | bool and "genericswitch" in kolla_neutron_ml2_mechanism_drivers %}
# We are increasing the RPC response timeout to 6 minutes due to the neutron
# generic switch driver, which synchronously applies switch configuration for
# each ironic port during node provisioning and tear down.
# The specific API calls that require this long timeout are:
# - Creation and deletion of VLAN networks.
# - Creation or update of ports, adding binding information.
# - Update of ports, removing binding information.
# - Deletion of ports.
rpc_response_timeout = 360
{% endif %}
@@ -1,2 +1,15 @@

[DEFAULT]
{% if kolla_enable_ironic | bool and "genericswitch" in kolla_neutron_ml2_mechanism_drivers %}
# We are increasing the RPC response timeout to 6 minutes due to the neutron
# generic switch driver, which synchronously applies switch configuration for
# each ironic port during node provisioning and tear down.
# The specific API calls that require this long timeout are:
# - Creation and deletion of VLAN networks.
# - Creation or update of ports, adding binding information.
# - Update of ports, removing binding information.
# - Deletion of ports.
rpc_response_timeout = 360
{% endif %}

[libvirt]
hw_machine_type = x86_64=q35
@@ -1,4 +1,17 @@

[DEFAULT]
{% if kolla_enable_ironic|bool and kolla_nova_compute_ironic_host is not none %}
host = {{ kolla_nova_compute_ironic_static_host_name | mandatory('You must set a static host name to help with service failover. See the operations documentation, Ironic section.') }}
{% endif %}
# Don't limit the number of concurrent builds for the nova ironic compute
# service.
max_concurrent_builds = 35

force_config_drive = True

[ironic]
# Ramp up maximum retries to allow time for baremetal node reboots and switch
# configuration.
api_max_retries = 720

[compute]
# Don't disable the compute service due to failed builds.
consecutive_build_service_disable_threshold = 0
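For context on ``api_max_retries``: assuming nova's default ``[ironic]/api_retry_interval`` of 2 seconds (an assumption, since the interval is not set in this file), 720 retries gives roughly a 24 minute window for a node to finish rebooting and for switch reconfiguration to complete:

```python
api_max_retries = 720
api_retry_interval = 2  # assumed nova default; not set in this file
total_wait_s = api_max_retries * api_retry_interval
print(total_wait_s, "seconds =", total_wait_s / 60, "minutes")
```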