Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(RHINENG-15555): Fix infinite export when a host is deleted #2236

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

fstavela
Copy link
Contributor

@fstavela fstavela commented Feb 4, 2025

Overview

This PR is being created to address RHINENG-15555.

When we initiated an export and then deleted a host, the export would never finish and the status would say that it is "pending" forever. I found out that there were 2 reasons why this was happening:

  • When we delete a host while fetching them for export, SQLAlchemy raises a DeletedObjectError. We didn't catch the DB errors when fetching hosts for export, so when the DB error occurred, we didn't send the failure status to the export service. I'm fixing this by catching the DB errors in get_hosts_to_export function and raising InventoryException, which is caught and correctly handled by create_export function. I also added a unit test for this.
  • Recently we switched from db.session.query to sqlalchemy.select when fetching the hosts for export. It turns out that this is significantly slower (exporting 1000 in ephemeral hosts with db.session.query takes about 0.3s, while exporting 1000 hosts in ephemeral with select takes more than 5 seconds - more than 16 times slower), I'll try to explain why later. This means that there is a much higher chance of seeing DB errors (for example when a host is deleted), because the export takes much more time.

I looked at the DB queries we make with both db.session.query and select, and while db.session.query makes just a single SQL query, select makes 4 additional SQL queries for EVERY host that is being exported. These additional queries don't even make any sense, as they query hosts by null ids.

Here is what db.session.query does:

**** QUERY:

SELECT hbi.hosts.system_profile_facts[:param_1] AS anon_1, hbi.hosts.id AS hbi_hosts_id, hbi.hosts.account AS hbi_hosts_account, hbi.hosts.org_id AS hbi_hosts_org_id, hbi.hosts.display_name AS hbi_hosts_display_name, hbi.hosts.ansible_host AS hbi_hosts_ansible_host, hbi.hosts.created_on AS hbi_hosts_created_on, hbi.hosts.modified_on AS hbi_hosts_modified_on, hbi.hosts.facts AS hbi_hosts_facts, hbi.hosts.tags AS hbi_hosts_tags, hbi.hosts.canonical_facts AS hbi_hosts_canonical_facts, hbi.hosts.system_profile_facts AS hbi_hosts_system_profile_facts, hbi.hosts.groups AS hbi_hosts_groups, hbi.hosts.stale_timestamp AS hbi_hosts_stale_timestamp, hbi.hosts.reporter AS hbi_hosts_reporter, hbi.hosts.per_reporter_staleness AS hbi_hosts_per_reporter_staleness
FROM hbi.hosts LEFT OUTER JOIN hbi.hosts_groups ON hbi.hosts.id = hbi.hosts_groups.host_id LEFT OUTER JOIN hbi.groups ON hbi.groups.id = hbi.hosts_groups.group_id
WHERE hbi.hosts.org_id = :org_id_1 AND (hbi.hosts.system_profile_facts[:system_profile_facts_1] = :param_2 AND (hbi.hosts.modified_on > :modified_on_1 OR hbi.hosts.modified_on > :modified_on_2 AND hbi.hosts.modified_on <= :modified_on_3 OR hbi.hosts.modified_on > :modified_on_4 AND hbi.hosts.modified_on <= :modified_on_5) OR hbi.hosts.system_profile_facts[:system_profile_facts_2] IS NULL AND (hbi.hosts.modified_on > :modified_on_6 OR hbi.hosts.modified_on > :modified_on_7 AND hbi.hosts.modified_on <= :modified_on_8 OR hbi.hosts.modified_on > :modified_on_9 AND hbi.hosts.modified_on <= :modified_on_10))

**** PARAMETERS:
{
    "modified_on_1": "2025-02-05 10:48:19.119217+00:00",
    "modified_on_10": "2025-01-31 10:48:19.120059+00:00",
    "modified_on_2": "2024-08-11 10:48:19.119217+00:00",
    "modified_on_3": "2025-02-05 10:48:19.119217+00:00",
    "modified_on_4": "2023-02-08 10:48:19.119217+00:00",
    "modified_on_5": "2024-08-11 10:48:19.119217+00:00",
    "modified_on_6": "2025-02-06 05:48:19.120059+00:00",
    "modified_on_7": "2025-01-31 10:48:19.120059+00:00",
    "modified_on_8": "2025-02-06 05:48:19.120059+00:00",
    "modified_on_9": "2025-01-24 10:48:19.120059+00:00",
    "org_id_1": "3340851",
    "param_1": "host_type",
    "param_2": "edge",
    "system_profile_facts_1": "host_type",
    "system_profile_facts_2": "host_type"
}
****************

And here is what select does:

**** QUERY:

SELECT hbi.hosts.system_profile_facts[:param_1] AS anon_1, hbi.hosts.id, hbi.hosts.account, hbi.hosts.org_id, hbi.hosts.display_name, hbi.hosts.created_on, hbi.hosts.modified_on, hbi.hosts.facts, hbi.hosts.system_profile_facts, hbi.hosts.groups, hbi.hosts.reporter
FROM hbi.hosts LEFT OUTER JOIN hbi.hosts_groups ON hbi.hosts.id = hbi.hosts_groups.host_id LEFT OUTER JOIN hbi.groups ON hbi.groups.id = hbi.hosts_groups.group_id
WHERE hbi.hosts.org_id = :org_id_1 AND (hbi.hosts.system_profile_facts[:system_profile_facts_1] = :param_2 AND (hbi.hosts.modified_on > :modified_on_1 OR hbi.hosts.modified_on > :modified_on_2 AND hbi.hosts.modified_on <= :modified_on_3 OR hbi.hosts.modified_on > :modified_on_4 AND hbi.hosts.modified_on <= :modified_on_5) OR hbi.hosts.system_profile_facts[:system_profile_facts_2] IS NULL AND (hbi.hosts.modified_on > :modified_on_6 OR hbi.hosts.modified_on > :modified_on_7 AND hbi.hosts.modified_on <= :modified_on_8 OR hbi.hosts.modified_on > :modified_on_9 AND hbi.hosts.modified_on <= :modified_on_10))

**** PARAMETERS:
{
    "modified_on_1": "2025-02-05 10:39:08.785591+00:00",
    "modified_on_10": "2025-01-31 10:39:08.786155+00:00",
    "modified_on_2": "2024-08-11 10:39:08.785591+00:00",
    "modified_on_3": "2025-02-05 10:39:08.785591+00:00",
    "modified_on_4": "2023-02-08 10:39:08.785591+00:00",
    "modified_on_5": "2024-08-11 10:39:08.785591+00:00",
    "modified_on_6": "2025-02-06 05:39:08.786155+00:00",
    "modified_on_7": "2025-01-31 10:39:08.786155+00:00",
    "modified_on_8": "2025-02-06 05:39:08.786155+00:00",
    "modified_on_9": "2025-01-24 10:39:08.786155+00:00",
    "org_id_1": "3340851",
    "param_1": "host_type",
    "param_2": "edge",
    "system_profile_facts_1": "host_type",
    "system_profile_facts_2": "host_type"
}
****************

**** QUERY:

SELECT hbi.hosts.canonical_facts AS hbi_hosts_canonical_facts
FROM hbi.hosts
WHERE hbi.hosts.id = :pk_1

**** PARAMETERS:
{
    "pk_1": null
}

****************

**** QUERY:

SELECT hbi.hosts.ansible_host AS hbi_hosts_ansible_host
FROM hbi.hosts
WHERE hbi.hosts.id = :pk_1

**** PARAMETERS:
{
    "pk_1": null
}

****************

**** QUERY:

SELECT hbi.hosts.per_reporter_staleness AS hbi_hosts_per_reporter_staleness
FROM hbi.hosts
WHERE hbi.hosts.id = :pk_1

**** PARAMETERS:
{
    "pk_1": null
}

****************

**** QUERY:

SELECT hbi.hosts.tags AS hbi_hosts_tags
FROM hbi.hosts
WHERE hbi.hosts.id = :pk_1

**** PARAMETERS:
{
    "pk_1": null
}

****************
.
.
.

The initial query looks exactly the same, but for some reason, it makes these additional 4 queries for every single exported host (this is not a complete log, the logs go like this for a long time). I have no idea why this is happening. Everything I found on the internet suggests that these 2 methods should do essentially the same thing, it's just a different way how to construct the DB query in Python. If anyone knows what might cause this please share, I think that this can be fixed in another Jira, but for now, I just returned back to using db.session.query to fix the bug.

There were a few other issues that I noticed while working on this:

  • When we sent a failure status report to the export service in _handle_export_error, it returned HTTP error 400 and complained that the message was empty. This was happening because the InventoryException didn't have a __str__ method, so doing str(e) returned an empty string. I added the __str__ method to the InventoryException and made it return a string out of to_json method.
  • When I added type annotations to the exceptions, mypy complained about sending a string value to an int parameter in _handle_export_response. This was caused by sending a response text into the status code parameter. This is a nice example of how type-checking can help us prevent from creating bugs.
  • The number of hosts to export log didn't show the actual number of hosts, it logged just the SQL query, because this query was never executed.
  • I noticed that all logs (not only when exporting) are doubled: RHINENG-15778
  • sqldumps function decorator didn't work for me: RHINENG-15779
  • The SQLDump logs don't look very nice when we use logger to write the messages: RHINENG-15874
  • I noticed that the staleness queries are not very efficient, maybe there is an opportunity to improve performance: RHINENG-15780

PR Checklist

  • Keep PR title short, ideally under 72 characters
  • Descriptive comments provided in complex code blocks
  • Include raw query examples in the PR description, if adding/modifying SQL query
  • Tests: validate optimal/expected output
  • Tests: validate exceptions and failure scenarios
  • Tests: edge cases
  • Recovers or fails gracefully during potential resource outages (e.g. DB, Kafka)
  • Uses type hinting, if convenient
  • Documentation, if this PR changes the way other services interact with host inventory
  • Links to related PRs

Secure Coding Practices Documentation Reference

You can find documentation on this checklist here.

Secure Coding Checklist

  • Input Validation
  • Output Encoding
  • Authentication and Password Management
  • Session Management
  • Access Control
  • Cryptographic Practices
  • Error Handling and Logging
  • Data Protection
  • Communication Security
  • System Configuration
  • Database Security
  • File Management
  • Memory Management
  • General Coding Practices

@fstavela fstavela force-pushed the fix-export-delete branch 2 times, most recently from 9328692 to b102389 Compare February 6, 2025 17:52
rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
rh-pre-commit.version: 2.3.2
rh-pre-commit.check-secrets: ENABLED
@fstavela fstavela marked this pull request as ready for review February 11, 2025 09:15
@fstavela fstavela requested a review from a team as a code owner February 11, 2025 09:15
@thearifismail
Copy link
Contributor

My question is, is it a valid usecase? If an export has been initiated, should the host deletion be prevented until the export has completed in success or failure?

@fstavela
Copy link
Contributor Author

My question is, is it a valid usecase? If an export has been initiated, should the host deletion be prevented until the export has completed in success or failure?

I don't think the host deletion should be prevented. But, when we initiate an export and then delete a host, the host should either be in the final export results (if there has been a reasonable time between these 2 events), or the export should fail, but the export shouldn't say that it's "pending" forever.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants