fix(RHINENG-15555): Fix infinite export when a host is deleted #2236

fstavela · 2025-02-04T13:41:48Z

Overview

This PR is being created to address RHINENG-15555.

When we initiated an export and then deleted a host, the export would never finish and the status would say that it is "pending" forever. I found out that there were 2 reasons why this was happening:

When we delete a host while fetching them for export, SQLAlchemy raises a DeletedObjectError. We didn't catch the DB errors when fetching hosts for export, so when the DB error occurred, we didn't send the failure status to the export service. I'm fixing this by catching the DB errors in get_hosts_to_export function and raising InventoryException, which is caught and correctly handled by create_export function. I also added a unit test for this.
Recently we switched from db.session.query to sqlalchemy.select when fetching the hosts for export. It turns out that this is significantly slower (exporting 1000 in ephemeral hosts with db.session.query takes about 0.3s, while exporting 1000 hosts in ephemeral with select takes more than 5 seconds - more than 16 times slower), I'll try to explain why later. This means that there is a much higher chance of seeing DB errors (for example when a host is deleted), because the export takes much more time.

I looked at the DB queries we make with both db.session.query and select, and while db.session.query makes just a single SQL query, select makes 4 additional SQL queries for EVERY host that is being exported. These additional queries don't even make any sense, as they query hosts by null ids.

Here is what db.session.query does:

**** QUERY:

SELECT hbi.hosts.system_profile_facts[:param_1] AS anon_1, hbi.hosts.id AS hbi_hosts_id, hbi.hosts.account AS hbi_hosts_account, hbi.hosts.org_id AS hbi_hosts_org_id, hbi.hosts.display_name AS hbi_hosts_display_name, hbi.hosts.ansible_host AS hbi_hosts_ansible_host, hbi.hosts.created_on AS hbi_hosts_created_on, hbi.hosts.modified_on AS hbi_hosts_modified_on, hbi.hosts.facts AS hbi_hosts_facts, hbi.hosts.tags AS hbi_hosts_tags, hbi.hosts.canonical_facts AS hbi_hosts_canonical_facts, hbi.hosts.system_profile_facts AS hbi_hosts_system_profile_facts, hbi.hosts.groups AS hbi_hosts_groups, hbi.hosts.stale_timestamp AS hbi_hosts_stale_timestamp, hbi.hosts.reporter AS hbi_hosts_reporter, hbi.hosts.per_reporter_staleness AS hbi_hosts_per_reporter_staleness
FROM hbi.hosts LEFT OUTER JOIN hbi.hosts_groups ON hbi.hosts.id = hbi.hosts_groups.host_id LEFT OUTER JOIN hbi.groups ON hbi.groups.id = hbi.hosts_groups.group_id
WHERE hbi.hosts.org_id = :org_id_1 AND (hbi.hosts.system_profile_facts[:system_profile_facts_1] = :param_2 AND (hbi.hosts.modified_on > :modified_on_1 OR hbi.hosts.modified_on > :modified_on_2 AND hbi.hosts.modified_on <= :modified_on_3 OR hbi.hosts.modified_on > :modified_on_4 AND hbi.hosts.modified_on <= :modified_on_5) OR hbi.hosts.system_profile_facts[:system_profile_facts_2] IS NULL AND (hbi.hosts.modified_on > :modified_on_6 OR hbi.hosts.modified_on > :modified_on_7 AND hbi.hosts.modified_on <= :modified_on_8 OR hbi.hosts.modified_on > :modified_on_9 AND hbi.hosts.modified_on <= :modified_on_10))

**** PARAMETERS:
{
    "modified_on_1": "2025-02-05 10:48:19.119217+00:00",
    "modified_on_10": "2025-01-31 10:48:19.120059+00:00",
    "modified_on_2": "2024-08-11 10:48:19.119217+00:00",
    "modified_on_3": "2025-02-05 10:48:19.119217+00:00",
    "modified_on_4": "2023-02-08 10:48:19.119217+00:00",
    "modified_on_5": "2024-08-11 10:48:19.119217+00:00",
    "modified_on_6": "2025-02-06 05:48:19.120059+00:00",
    "modified_on_7": "2025-01-31 10:48:19.120059+00:00",
    "modified_on_8": "2025-02-06 05:48:19.120059+00:00",
    "modified_on_9": "2025-01-24 10:48:19.120059+00:00",
    "org_id_1": "3340851",
    "param_1": "host_type",
    "param_2": "edge",
    "system_profile_facts_1": "host_type",
    "system_profile_facts_2": "host_type"
}
****************

And here is what select does:

**** QUERY:

SELECT hbi.hosts.system_profile_facts[:param_1] AS anon_1, hbi.hosts.id, hbi.hosts.account, hbi.hosts.org_id, hbi.hosts.display_name, hbi.hosts.created_on, hbi.hosts.modified_on, hbi.hosts.facts, hbi.hosts.system_profile_facts, hbi.hosts.groups, hbi.hosts.reporter
FROM hbi.hosts LEFT OUTER JOIN hbi.hosts_groups ON hbi.hosts.id = hbi.hosts_groups.host_id LEFT OUTER JOIN hbi.groups ON hbi.groups.id = hbi.hosts_groups.group_id
WHERE hbi.hosts.org_id = :org_id_1 AND (hbi.hosts.system_profile_facts[:system_profile_facts_1] = :param_2 AND (hbi.hosts.modified_on > :modified_on_1 OR hbi.hosts.modified_on > :modified_on_2 AND hbi.hosts.modified_on <= :modified_on_3 OR hbi.hosts.modified_on > :modified_on_4 AND hbi.hosts.modified_on <= :modified_on_5) OR hbi.hosts.system_profile_facts[:system_profile_facts_2] IS NULL AND (hbi.hosts.modified_on > :modified_on_6 OR hbi.hosts.modified_on > :modified_on_7 AND hbi.hosts.modified_on <= :modified_on_8 OR hbi.hosts.modified_on > :modified_on_9 AND hbi.hosts.modified_on <= :modified_on_10))

**** PARAMETERS:
{
    "modified_on_1": "2025-02-05 10:39:08.785591+00:00",
    "modified_on_10": "2025-01-31 10:39:08.786155+00:00",
    "modified_on_2": "2024-08-11 10:39:08.785591+00:00",
    "modified_on_3": "2025-02-05 10:39:08.785591+00:00",
    "modified_on_4": "2023-02-08 10:39:08.785591+00:00",
    "modified_on_5": "2024-08-11 10:39:08.785591+00:00",
    "modified_on_6": "2025-02-06 05:39:08.786155+00:00",
    "modified_on_7": "2025-01-31 10:39:08.786155+00:00",
    "modified_on_8": "2025-02-06 05:39:08.786155+00:00",
    "modified_on_9": "2025-01-24 10:39:08.786155+00:00",
    "org_id_1": "3340851",
    "param_1": "host_type",
    "param_2": "edge",
    "system_profile_facts_1": "host_type",
    "system_profile_facts_2": "host_type"
}
****************

**** QUERY:

SELECT hbi.hosts.canonical_facts AS hbi_hosts_canonical_facts
FROM hbi.hosts
WHERE hbi.hosts.id = :pk_1

**** PARAMETERS:
{
    "pk_1": null
}

****************

**** QUERY:

SELECT hbi.hosts.ansible_host AS hbi_hosts_ansible_host
FROM hbi.hosts
WHERE hbi.hosts.id = :pk_1

**** PARAMETERS:
{
    "pk_1": null
}

****************

**** QUERY:

SELECT hbi.hosts.per_reporter_staleness AS hbi_hosts_per_reporter_staleness
FROM hbi.hosts
WHERE hbi.hosts.id = :pk_1

**** PARAMETERS:
{
    "pk_1": null
}

****************

**** QUERY:

SELECT hbi.hosts.tags AS hbi_hosts_tags
FROM hbi.hosts
WHERE hbi.hosts.id = :pk_1

**** PARAMETERS:
{
    "pk_1": null
}

****************
.
.
.

The initial query looks exactly the same, but for some reason, it makes these additional 4 queries for every single exported host (this is not a complete log, the logs go like this for a long time). I have no idea why this is happening. Everything I found on the internet suggests that these 2 methods should do essentially the same thing, it's just a different way how to construct the DB query in Python. If anyone knows what might cause this please share, I think that this can be fixed in another Jira, but for now, I just returned back to using db.session.query to fix the bug.

There were a few other issues that I noticed while working on this:

When we sent a failure status report to the export service in _handle_export_error, it returned HTTP error 400 and complained that the message was empty. This was happening because the InventoryException didn't have a __str__ method, so doing str(e) returned an empty string. I added the __str__ method to the InventoryException and made it return a string out of to_json method.
When I added type annotations to the exceptions, mypy complained about sending a string value to an int parameter in _handle_export_response. This was caused by sending a response text into the status code parameter. This is a nice example of how type-checking can help us prevent from creating bugs.
The number of hosts to export log didn't show the actual number of hosts, it logged just the SQL query, because this query was never executed.
I noticed that all logs (not only when exporting) are doubled: RHINENG-15778
sqldumps function decorator didn't work for me: RHINENG-15779
The SQLDump logs don't look very nice when we use logger to write the messages: RHINENG-15874
I noticed that the staleness queries are not very efficient, maybe there is an opportunity to improve performance: RHINENG-15780

PR Checklist

Secure Coding Practices Documentation Reference

You can find documentation on this checklist here.

Secure Coding Checklist

rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED

thearifismail · 2025-02-11T14:12:22Z

My question is, is it a valid usecase? If an export has been initiated, should the host deletion be prevented until the export has completed in success or failure?

fstavela · 2025-02-11T15:17:35Z

My question is, is it a valid usecase? If an export has been initiated, should the host deletion be prevented until the export has completed in success or failure?

I don't think the host deletion should be prevented. But, when we initiate an export and then delete a host, the host should either be in the final export results (if there has been a reasonable time between these 2 events), or the export should fail, but the export shouldn't say that it's "pending" forever.

fstavela force-pushed the fix-export-delete branch 2 times, most recently from 9328692 to b102389 Compare February 6, 2025 17:52

fstavela added 2 commits February 11, 2025 09:03

fix(RHINENG-15555): Fix infinite export when a host is deleted

300b3c6

rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED

Catch export DB error + other fixes

8a3c10f

rh-pre-commit.version: 2.3.2 rh-pre-commit.check-secrets: ENABLED

fstavela force-pushed the fix-export-delete branch from b102389 to 8a3c10f Compare February 11, 2025 08:10

fstavela marked this pull request as ready for review February 11, 2025 09:15

fstavela requested a review from a team as a code owner February 11, 2025 09:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(RHINENG-15555): Fix infinite export when a host is deleted #2236

fix(RHINENG-15555): Fix infinite export when a host is deleted #2236

fstavela commented Feb 4, 2025 •

edited

Loading

thearifismail commented Feb 11, 2025

fstavela commented Feb 11, 2025

fix(RHINENG-15555): Fix infinite export when a host is deleted #2236

Are you sure you want to change the base?

fix(RHINENG-15555): Fix infinite export when a host is deleted #2236

Conversation

fstavela commented Feb 4, 2025 • edited Loading

Overview

PR Checklist

Secure Coding Practices Documentation Reference

Secure Coding Checklist

thearifismail commented Feb 11, 2025

fstavela commented Feb 11, 2025

fstavela commented Feb 4, 2025 •

edited

Loading