Missed pairing opportunities on a weekly basis #55

cceckman · 2024-07-30T17:38:09Z

I'm signed up for pairing Tuesday through Friday, but the last couple of weeks Pairing Bot has skipped Tuesday and Thursday without any message.

A status message from Ian shows T/W/R/F pairing schedule.

@stillgreenmoss notes that we may need to instrument for more information (i.e. log more) to effectively debug this.

The text was updated successfully, but these errors were encountered:

cceckman · 2024-07-30T17:43:51Z

My thoughts on "how to investigate", having done no actual investigation yet:

Where is the intent to match Ian getting dropped?
- Is there a way to run the pairing algorithm locally, as f(real preferences, simulated date), and see what pairings it spits out? What messages?
- What happens if a message doesn't get sent? Is there a (noisy) error? What if a message does get sent -- do we have a confirmation?

A little investigation: in http://console.cloud.google.com/logs/query,

resource.type="gae_app"
protoPayload.userAgent="AppEngine-Google; (+http://code.google.com/appengine)"

makes a good starting point for finding the /match invocation.

cceckman · 2024-07-30T18:02:52Z

A red herring path I went down: timezones.

The crontab says to perform matches at 4:00. Docs say if unspecified, the timezone is UTC.

In the winter, 4:00 UTC is 11:00 EST (EST is UTC - 5:00). In the summer, 4:00 UTC is 00:00 EDT (EDT is UTC - 4:00). So in the winter the pairing bot runs "the day before", computing "tomorrow"'s matches; in the summer, it runs "the same day".

But ListPairingTomorrow does the right thing: it uses GAE's time zone, UTC, and picks up the same day. That's correct regardless of what Eastern time is doing.

cceckman · 2024-07-30T18:25:42Z

OK, I see something odd in the logs, at least.

A lot of the pairings generated wind up with a BAD_REQUEST message attached. In this case, there's two adjacent messages:

2024/07/30 04:00:02 {"result":"error","msg":"Invalid email '[redacted A]'","code":"BAD_REQUEST"}
2024/07/30 04:00:02 [redacted A] was matched with [redacted B]

"Invalid email" is one of the messages, for several e-mail addresses. There's also one report of [email] is no longer using Zulip.

It looks like this error isn't showing up in the match function; the prefix "Error when trying to send..." isn't showing up in the logs.

sendUserMessage is logging the response body with no additional annotation; but it looks like nothing is classifying that Zulip response with an error. :(

At least for today, I think this is a sufficient explanation for why Ian didn't get a pairing. The bot sent a message to multiple recipients, but one of those recipients was invalid; the message as a whole didn't send. (I haven't check to see if this was also the problem on other days.)

I suspect this could be responsible for a few different dropped pairings.

Some changes, maybe in priority order:

Configure the Golang logger to log the file name and file number of the log statement. This is Immensely Useful for debugging -- it doesn't quite turn the logs into a trace, but it makes it a lot easier to line things up.
A non-2xx HTTP response doesn't cause an error; we should at least check the HTTP response code for the Zulip responses.
Check the Zulip API, and if needed, perform more-complete error checking for Zulip replies. Do we expect Zulip to return a non-2xx error for these (BAD_REQUEST)? Or do we need to check the body?
Alerting! Proactive telemetry, not reactive log-looking; if the bot fails, or fails repeatedly over some amount of time, there should be something that says "hey go look at the logs pls?".
Data cleanup: it looks like there are some users that Zulip considers invalid. Why? How can/should we know, so we can remove them from the pairing set?

Per [issue 55], Zulip will occasionally refuse a message. We weren't passing this further up the stack, or even logging the status code. Log the status code along with the body, and return an error up the stack, if the Zulip request gets a user error (4xx) or server error (5xx). [issue 55]: #55

cceckman · 2024-07-30T20:13:19Z

Zulip documentation confirms that errors will return 4xx/5xx errors, as well as error information in the body.

Neither the "Invalid e-mail" nor the "...is no longer using Zulip" messages have the relevant e-mail as structured data. The API docs have a "tip" saying to start a conversation on the Zulip Zulip instance as to whether there are specific key/value pairs that would be relevant.

Consider following up with Zulip: can the invalid e-mail(s) be returned as a structured error response?

This would be one way (not the only way) for us to do more graceful error handling; we could punt the user that is valid down the list and try to pair them again.
An alternative: check each user before pairing them / filter down the list of pairing candidates to "current active users", based on querying Zulip.

(It is possible, but unlikely, that the user's e-mail could change between the check query and the send-message.)
Do we need to use e-mails at all, or could we sent to Zulip users by-ID? The above points still apply - Zulip users could disappear - but it seems more stable.

It looks like the RC API includes the Zulip ID...

cceckman · 2024-07-30T20:28:02Z

@stillgreenmoss spotted the underlying issue:

Pairing Bot uses the Zulip ID as the lookup key.
...but conducts pairing, "left batch", messages, etc. based on the e-mail address.
The e-mail address is mutable on the Zulip side without syncing to RC (?), and is only updated in the pairing DB when the user interacts.

So a "real fix" here:

Use the Zulip ID consistently: as the pairing DB key, as the message key, and as the key to join those databases with the RC API.

Easy tweaks for #55

Per [issue 55], Zulip will occasionally refuse a message. We weren't passing this further up the stack, or even logging the status code. Log the status code along with the body, and return an error up the stack, if the Zulip request gets a user error (4xx) or server error (5xx). [issue 55]: #55

cceckman · 2024-08-06T12:07:57Z

Current status: I think testbot is using Zulip IDs instead of e-mail for all identity purposes. Including talking to the RC API:

User: [redacted] was at RC last week: false and is at RC this week: true

The one remaining use of e-mail is for review identity. I think that's OK to leave for now; I'll file a follow-up "enhancement".

cceckman · 2024-08-06T15:19:54Z

OK! #62 is in and deployed to prod Pairing Bot. I tagged the original commits in issue55 if anyone wants to see the history in more detail.

I'm going to leave this open until we've seen all the crons run and take action:

Daily match-making
End-of-batch: should run this Saturday, and @jdkaplan should get a message
Status report: Thursday (2024-08-08)
Welcome: second Tuesday of the next batch (2024-08-20)

which we'll see trickle in over the next couple weeks.

jdkaplan · 2024-08-14T16:32:46Z

Potentially related: #70

In cases where someone had chosen to skip their next pairing (by sending a skip command to PB), the matching job would panic when "unskipping" them for the next day. That would prevent the matching job from sending out pairs.

So far, I've seen daily matching, checkin, and end-of-batch succeed (after some fixes to each), so I'll check those boxes now.

jdkaplan · 2024-08-20T18:34:33Z

Here to confirm the last cron job run! The welcome message was sent out a few minutes ago. Logs (maintainers-only link), Zulip message

I haven't seen any more reports of pairing issues in the last week, so I think the Zulip ID and unskip fixes were all we needed. That said, I wasn't watching the logs particularly closely, so (anyone) please feel free to open a new issue if this happens again!

cceckman added the bug Something isn't working label Jul 30, 2024

cceckman mentioned this issue Jul 30, 2024

Easy tweaks for issue 55 #56

Merged

cceckman added a commit that referenced this issue Jul 30, 2024

Merge pull request #56 from recursecenter/issue55

fda00b4

Easy tweaks for #55

cceckman mentioned this issue Jul 31, 2024

Merge dev to prod #57

Merged

cceckman mentioned this issue Aug 5, 2024

Alert on error #61

Open

cceckman mentioned this issue Aug 6, 2024

Remove e-mail usage #63

Open

cceckman assigned jdkaplan Aug 6, 2024

jdkaplan closed this as completed Aug 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Missed pairing opportunities on a weekly basis #55

Missed pairing opportunities on a weekly basis #55

cceckman commented Jul 30, 2024

cceckman commented Jul 30, 2024

cceckman commented Jul 30, 2024

cceckman commented Jul 30, 2024 •

edited

Loading

cceckman commented Jul 30, 2024 •

edited

Loading

cceckman commented Jul 30, 2024 •

edited

Loading

cceckman commented Aug 6, 2024

cceckman commented Aug 6, 2024 •

edited by jdkaplan

Loading

jdkaplan commented Aug 14, 2024 •

edited

Loading

jdkaplan commented Aug 20, 2024

Missed pairing opportunities on a weekly basis #55

Missed pairing opportunities on a weekly basis #55

Comments

cceckman commented Jul 30, 2024

cceckman commented Jul 30, 2024

cceckman commented Jul 30, 2024

cceckman commented Jul 30, 2024 • edited Loading

cceckman commented Jul 30, 2024 • edited Loading

cceckman commented Jul 30, 2024 • edited Loading

cceckman commented Aug 6, 2024

cceckman commented Aug 6, 2024 • edited by jdkaplan Loading

jdkaplan commented Aug 14, 2024 • edited Loading

jdkaplan commented Aug 20, 2024

cceckman commented Jul 30, 2024 •

edited

Loading

cceckman commented Jul 30, 2024 •

edited

Loading

cceckman commented Jul 30, 2024 •

edited

Loading

cceckman commented Aug 6, 2024 •

edited by jdkaplan

Loading

jdkaplan commented Aug 14, 2024 •

edited

Loading