Syncing from upstream odoo/runbot (17.0) #802

Open · wants to merge 23 commits into base: 17.0
Conversation

@bt-admin commented Feb 7, 2025

bt_gitbot

- Use f-strings where appropriate.
- Get the repo url from github instead of composing it by hand
- Resolve URI template for the contents endpoint instead of
  hand-building the URL. URI template implementation is janky AF.
Also use the fork's response for the fork's own URL instead of
composing it by hand.
This should limit the risk of cases where the fork contains outdated
versions of the reference branches and we end up with an outdated
basis for branches and updates, which can lead to confusing
situations.
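For illustration, a minimal sketch of what resolving the repository's
`contents_url` URI template might look like (the session handling and the
regex-based expansion are assumptions of this example, not the actual
implementation):

```python
import re

import requests


def contents_url(session: requests.Session, owner: str, repo: str, path: str) -> str:
    # fetch the repository object and use the URL github reports rather
    # than composing it by hand
    r = session.get(f"https://api.github.com/repos/{owner}/{repo}")
    r.raise_for_status()
    template = r.json()["contents_url"]  # e.g. ".../contents/{+path}"
    # minimal (and admittedly janky) URI template expansion: only the
    # `{+path}` form used by this endpoint is handled
    return re.sub(r"\{\+path\}", path, template)
```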
For historical reasons, pretty much all tests used the contexts
legal/cla and ci/runbot. A few tests genuinely need the interaction of
multiple contexts and that makes sense, but for the vast majority it is
just extra traffic and noise in the test (from having to send multiple
statuses unnecessarily).

In fact, on the average PR where everything passes by default we could
even remove the required statuses entirely...
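For context, posting a status with a single context in a test might look
roughly like the following (the helper and session are hypothetical; the
endpoint is the standard GitHub statuses API):

```python
import requests


def post_status(session: requests.Session, owner: str, repo: str, sha: str,
                state: str = "success", context: str = "ci/runbot") -> None:
    # one status, one context: enough for most tests
    r = session.post(
        f"https://api.github.com/repos/{owner}/{repo}/statuses/{sha}",
        json={"state": state, "context": context},
    )
    r.raise_for_status()
```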
Looks like `--force-with-lease` works for multi-head updates, which
should allow a single push per repository for the entire sequence of
updates, committing the whole thing only if it succeeds.
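A rough sketch of such a multi-head push, assuming a mapping of local
commits to remote branches (this is not the mergebot's actual code):

```python
import subprocess


def push_all(repo_dir: str, remote: str, updates: dict[str, str]) -> None:
    """Push every sha -> branch update in a single `git push`, each ref
    guarded by --force-with-lease so nothing is overwritten if the
    remote branch moved under us."""
    leases = [f"--force-with-lease=refs/heads/{branch}" for branch in updates.values()]
    refspecs = [f"{sha}:refs/heads/{branch}" for sha, branch in updates.items()]
    subprocess.run(
        ["git", "-C", repo_dir, "push", *leases, remote, *refspecs],
        check=True,
    )
```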
@bt-admin added the 17.0 label Feb 7, 2025
xmo-odoo and others added 18 commits February 11, 2025 14:27
This is a bit of an odd case which was only noticed because of
persistent forwardport.batches, which ended up producing a ton of
related tracebacks in the logs (the mergebot kept trying to create
forward ports from Jan 27th to Feb 10th; thankfully the errors happened
in git so did not seem to eat through our API rate limiting).

The issue was triggered by the addition of odoo/enterprise#77876 to
odoo/odoo#194818. This triggered a completion job which led to the
creation of odoo/enterprise#77877 to odoo/enterprise#77880, so far so
good.

Except the bit of code responsible for creating completion jobs only
checked if the PR was being added to a batch with a descendant. That
is the case of odoo/enterprise#77877 to odoo/enterprise#77879 (not
odoo/enterprise#77880 because that's the end of the line). As a
result, those triggered 3 more completion jobs, which kept failing in
a loop because they tried pushing different commits to their
next-siblings (without forcing, leading git to reject the non-ff push,
hurray).

A completion request should only be triggered by the addition of a
new *source* (a PR without a source) to an existing batch with
descendants, so add that to the check. This requires updating
`_from_json` to create PRs in a single step (rather than one step to
create based on github's data, and another one for the hierarchical
tracking), as we need the source to be set during `create`, not as a
post-action.
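In pseudo-ORM form, the adjusted condition is roughly the following
(field and method names are illustrative, not the actual schema):

```python
def should_trigger_completion(pr, batch) -> bool:
    # only a *source* PR (one without a source of its own) being added to
    # a batch which already has descendants warrants a completion job
    return not pr.source_id and bool(batch.descendants)
```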

Although there was a test which could have triggered this issue, it
only had 3 branches, which was not enough to surface it:

- Initial PR 1 on branch A merged then forward-ported to B and C.
- Sibling PR 2 added to the batch in B.
- Completed to C.
- Ending there as C(1) has no descendant batch, leading to no further
  completion request.

Adding a 4th branch did surface the issue by providing space for a new
completion request from the creation of C(2). Interestingly, even
though the test harness attempts to run all triggered crons to
completion, there can be misses, so the test can fail in two different
ways (both now checked for):

- there's a leftover forwardport.batch after we've created all our
  forwardports
- there's an extra PR targeting D, descending from C(2)
- in almost every case there's also a traceback in the logs, which
  does fail the build thanks to the `env` fixture's check
That's a bit of a weird one: apparently the boolean_toggle widget has
an `autosave` option which should be `true` by default, saving the row
as soon as the toggle is toggled[^1]. But in 15.0 and 18.0 it seems to
have no effect: the `boolean_toggle` always just stores the change in
the parent form, and that gets "committed" on save.

In 16.0 and 17.0, however, it does have an effect: toggling the control
immediately saves its value *without going through the parent form*.
As a result the override to `Project.write` managing new/existing
branches is not called, which means `Project_followup_prs` is not
called either, and ultimately the followup forward ports are not
created.

After contacting AAB to get more info (and grepping a bit):

- autosave was added (enabled by default) in 16.0 after the owl
  rewrite (odoo/odoo@28e6b7e)
- toggle was added in 17.0
  (odoo/odoo@a449b05)
- feature was removed in 18.0
  (odoo/odoo@6bd2c1f)

Which explains why the issue occurs in 16.0 and 17.0, and not in 15.0
or 18.0.

Fixes #1051

[^1]: but apparently not going through the parent form...
When an fw batch fails, log a message to its chatter (so the reason
for the failure doesn't necessarily have to be hunted down in the
logs, although depending on the quality of the error that might still
be an issue).

Also, if a batch keeps failing (for more than a day at a retry every
hour -- increased from the previous 30mn), stop retrying it and flag it
in the list view: it's clearly not going to go through.

This is because we hit an issue with a completion fw batch created on
January 27th which kept failing until it was deleted on February
10th. Thankfully it failed on `git push`, and git operations apparently
are not rate limited at all, but it's still not great stewardship to
keep retrying a forwardport which keeps failing. Retrying in case of
transient failure makes sense, but after 24 attempts over a day it's
either not transient, or it's not working because github is down and
hammering it won't help.
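A minimal sketch of that behaviour, assuming hypothetical `retry_count`
and `blocked` fields on the batch (not the actual model):

```python
MAX_RETRIES = 24  # one retry per hour, for about a day


def _on_failure(batch, error: str) -> None:
    # surface the reason in the chatter instead of only in the server logs
    batch.message_post(body=f"Forward-port failed: {error}")
    batch.retry_count += 1
    if batch.retry_count >= MAX_RETRIES:
        # clearly not transient (or github is down): flag it and stop retrying
        batch.blocked = True
```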
In the case where multiple batch completions are triggered (see 3
commits earlier) one of the issues is that the "already ported" check
is incorrect, as it checks the (source, target) pair against the PR
used to trigger the completion check. That PR *should* be the one PR
which was just added to a batch, but if one of its descendants is used
to re-trigger a port then the dupe check fails and we try to port the
thing again.

Remember to deref' the subject's source in order to perform the dupe
check, as well as set the source for the forwardport we create.
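Roughly, the corrected dupe check looks like this (model and field
names are illustrative):

```python
def already_ported(env, subject_pr, target_branch) -> bool:
    # deref to the root source before checking the (source, target) pair
    source = subject_pr.source_id or subject_pr
    return bool(env['runbot_merge.pull_requests'].search_count([
        ('source_id', '=', source.id),
        ('target', '=', target_branch.id),
    ]))
```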

Fixes #1052
Hit issues where this didn't seem to work after deactivating 17.2 and
trying to recover from the autoport being broken (see previous
commits), but apparently this says everything works just fine so
╮( ̄▽ ̄"")╭

Fixes #1052
Since forever, if a PR gets added to an existing forward-ported batch
it will be forward ported immediately, leading to a forwardport having
a source with no merge date.

Since the addition of the skipmerge feature, this is even more likely
than before.

As a result, the `outstanding` listing should not assume the
merge_date of a source is always set.
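A minimal sketch of the guard, assuming a falsy merge_date simply means
the source was never actually merged (the helper below is
hypothetical):

```python
from datetime import datetime, timedelta


def is_outstanding(source, now: datetime, cutoff: timedelta = timedelta(weeks=1)) -> bool:
    if not source.merge_date:
        # added to an existing forward-ported batch: no merge date to age against
        return False
    return now - source.merge_date > cutoff
```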
This can yield odd effects when playing around, e.g. opening freeze
wizards then cancelling them. There is a check that there are no
wizards left, but if a freeze is performed, then another wizard is
created and cancelled, it will immediately re-enable forward ports,
unexpectedly.
"Run manually" is a bit meh, as it runs the cron synchronously (so you
have to wait for it, and hope it doesn't run for longer than the
request timeout which may be smaller than the cron timeout) and it can
run in a subtly different environment than normal, which can lead to
different behaviour.

Instead, add a button to enqueue a cron trigger: now that those exist,
that's much closer to what we actually want, and it does run the cron
in a normal / expected environment.
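The button's method could be as small as the sketch below;
`ir.cron._trigger()` exists in recent Odoo versions, but the cron XML
id and the wiring around it are placeholders:

```python
def action_trigger_cron(self):
    # enqueue a trigger instead of running the cron synchronously
    cron = self.env.ref('forwardport.port_forward', raise_if_not_found=False)
    if cron:
        cron._trigger()  # picked up by a regular cron worker, in a normal environment
```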
`groupby` returns an iterator of `(key, list[value])`, not a
recordset. So `BaseModel.__or__` is invalid.
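For illustration, the failure mode and one way to rebuild a recordset
from the grouped values (using plain itertools and an illustrative
grouping key):

```python
from itertools import groupby
from operator import attrgetter


def group_prs(prs):
    prs = prs.sorted('reviewed_by')
    for reviewer, items in groupby(prs, key=attrgetter('reviewed_by')):
        # `items` is a plain list/iterator, not a recordset: `|` cannot be
        # applied to it directly, so rebuild a recordset explicitly
        yield reviewer, prs.browse().concat(*items)
```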

I missed that bit because I didn't bother writing a test where a PR is
more than 6 months old and triggers generating / sending emails, but it
turns out we do have those in the production database, and as a result
the mergebot hasn't sent an outstanding fw notification since I
deployed 892fce4 in early February (oops).

Thanks to @clbr-odoo for the notification / report of the issue.
Mostly for tests: it can be really difficult to correlate issues as
there are 3 different processes involved (the test runner, the odoo
being tested, and dummy-central (as github)) and the intermixing of
logs between the test driver and the odoo being tested is
not *amazing*.

- `pytest-opentelemetry`'s `--export-traces` is the signal for running
  tests with tracing enabled; that way merely having
  `pytest-opentelemetry` installed does not do anything untoward.
- Until chrisguidry/pytest-opentelemetry#34 is merged, we should
  probably use the linked branch, as the default / base mode of a
  single trace for the entire test suite is not great when there are
  460 tests, especially as local clients (opentelemetry-desktop-viewer,
  venator) mostly work on traces and aren't very good at zooming in on
  *spans*, at least currently.
- Additionally, the conftest plugin adds hooks for propagating through
  the xmlrpc client (communications with odoo) and enables the
  requests instrumentor from the opentelemetry python contribs.
- The dummy `saas_worker` was moved into the filesystem, which makes it
  easier to review and update.
- A second such module was added for the otel instrumentation *of the
  odoo under test*; it instruments psycopg2, requests, wsgi, and the
  server side of xmlrpc.
- The git ops were instrumented directly in runbot_merge, as I've
  tried to keep `saas_tracing` relatively generic, in case it could be
  moved to community or used by internal eventually.

Some typing was added to the conftest hooks and fixtures, and they
were migrated from tmpdir (py.path) to tmp_path (pathlib) for
consistency, even though technically the `mkdir` of pathlib is an
annoying downgrade.
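As an illustration of the kind of wiring involved when
`--export-traces` is passed, a minimal tracer setup with the requests
instrumentor might look like this (the exporter choice is an
assumption, and the xmlrpc propagation hook is omitted):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor


def setup_tracing() -> None:
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    # every outgoing `requests` call (e.g. to dummy-central) now gets a span
    RequestsInstrumentor().instrument()
```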

Fixes #835
If a status is defined as `optional`, then the PR is considered valid
if the status is never sent, but *if* the status is sent then it
becomes required.

Note that as ever this is a per-commit requirement, so it's mostly
useful for conditional statuses.
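A small sketch of the rule, with illustrative data structures rather
than the mergebot's actual models:

```python
def commit_ok(requirements: dict[str, str], statuses: dict[str, str]) -> bool:
    """requirements maps context -> 'required' or 'optional',
    statuses maps context -> last state reported for the commit."""
    for context, req in requirements.items():
        state = statuses.get(context)
        if state is None:
            if req == 'optional':
                continue  # never sent: optional statuses are ignored
            return False  # required status missing
        if state != 'success':
            return False  # once sent, even an optional status must pass
    return True
```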

Fixes #1062
Verify that the tree is different before and after applying the patch:
otherwise, if a mistake is made (e.g. a script does not check that it
has content and requests a patch applying an existing commit, which is
how odoo/enterprise@612a9cf3cadba64e4b18d535ca0ac7e3f4429a08 occurred)
we end up with a completely empty commit and a duplicated commit
message.

Fixes #1063

Note: using `rev-parse` to retrieve the commit's tree would be 50%
faster, but we're talking 3.2 versus 2.4ms and it requires string
formatting instead of nice clean arguments so it's a bit meh.
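An illustrative version of the check (not the exact implementation):

```python
import subprocess


def tree_of(repo_dir: str, rev: str) -> str:
    # print only the commit's tree hash
    return subprocess.run(
        ["git", "-C", repo_dir, "show", "--no-patch", "--format=%T", rev],
        capture_output=True, text=True, check=True,
    ).stdout.strip()


def check_patch_changed_something(repo_dir: str, before: str, after: str) -> None:
    if tree_of(repo_dir, before) == tree_of(repo_dir, after):
        raise ValueError("patch did not change the tree (would create an empty commit)")
```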
It's shorter, it's simpler (kinda), and it's 70% faster (although
that's unlikely to be any sort of bottleneck given applying patches
involves multiple network roundtrips).
Commits can take some time to propagate through the network (I guess),
or human error can lead to the wrong commit being set.

Either way, because the entire thing was done using a single fetch with
`check=True`, the cron job would fail entirely if any of the patch
commits was not yet available.

Update the updater to:

- fallback on individual fetches
- remove the patch from the set of applicable patches if we (still)
  can't find its commit

I'd have hoped `fetch` could retrieve whatever it found, but
apparently the server just crashes out when it doesn't find the commit
we ask for, and `fetch` doesn't update anything.
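A sketch of the fallback, with illustrative patch objects and
repository handling:

```python
import subprocess


def fetch_patch_commits(repo_dir: str, remote: str, patches: list) -> list:
    """Try one batched fetch, then fall back to per-commit fetches,
    dropping patches whose commit can't be found (yet)."""
    shas = [p.commit for p in patches]
    r = subprocess.run(["git", "-C", repo_dir, "fetch", remote, *shas])
    if r.returncode == 0:
        return patches

    available = []
    for p in patches:
        r = subprocess.run(["git", "-C", repo_dir, "fetch", remote, p.commit])
        if r.returncode == 0:
            available.append(p)
        # else: the commit isn't reachable yet, leave the patch for a later run
    return available
```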

No linked issue because I apparently forgot to jot it down (and only
remembered about this issue with the #1063 patching issue) but this
was reported by mat last week (2025-02-21) when they were wondering
why one of their patches was taking a while:

- At 0832 patch was created by automated script.
- At 0947, an attempt to apply was made, the commit was not found.
- At 1126, a second attempt was made, but another patch had been
  created whose commit was not found, failing both.
- At 1255, there was a concurrency error ("cannot lock ref" on the
  target branch).
- Finally at 1427 the patch was applied.

All in all it took 6 hours to apply the patch, which is 3-4 staging
cycles.
This used to be done over RPC, but the goal is to be able to call it
from another server without relying on a user account.