Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missed pairing opportunities on a weekly basis #55

Closed
cceckman opened this issue Jul 30, 2024 · 9 comments
Closed

Missed pairing opportunities on a weekly basis #55

cceckman opened this issue Jul 30, 2024 · 9 comments
Assignees
Labels
bug Something isn't working

Comments

@cceckman
Copy link
Collaborator

@iafisher reports:

I'm signed up for pairing Tuesday through Friday, but the last couple of weeks Pairing Bot has skipped Tuesday and Thursday without any message.

A status message from Ian shows T/W/R/F pairing schedule.

@stillgreenmoss notes that we may need to instrument for more information (i.e. log more) to effectively debug this.

@cceckman cceckman added the bug Something isn't working label Jul 30, 2024
@cceckman
Copy link
Collaborator Author


My thoughts on "how to investigate", having done no actual investigation yet:

  • Where is the intent to match Ian getting dropped?
    • Is there a way to run the pairing algorithm locally, as f(real preferences, simulated date), and see what pairings it spits out? What messages?
    • What happens if a message doesn't get sent? Is there a (noisy) error? What if a message does get sent -- do we have a confirmation?

A little investigation: in http://console.cloud.google.com/logs/query,

resource.type="gae_app"
protoPayload.userAgent="AppEngine-Google; (+http://code.google.com/appengine)"

makes a good starting point for finding the /match invocation.

@cceckman
Copy link
Collaborator Author

A red herring path I went down: timezones.

The crontab says to perform matches at 4:00. Docs say if unspecified, the timezone is UTC.

In the winter, 4:00 UTC is 11:00 EST (EST is UTC - 5:00). In the summer, 4:00 UTC is 00:00 EDT (EDT is UTC - 4:00). So in the winter the pairing bot runs "the day before", computing "tomorrow"'s matches; in the summer, it runs "the same day".

But ListPairingTomorrow does the right thing: it uses GAE's time zone, UTC, and picks up the same day. That's correct regardless of what Eastern time is doing.

@cceckman
Copy link
Collaborator Author

cceckman commented Jul 30, 2024

OK, I see something odd in the logs, at least.

A lot of the pairings generated wind up with a BAD_REQUEST message attached. In this case, there's two adjacent messages:

2024/07/30 04:00:02 {"result":"error","msg":"Invalid email '[redacted A]'","code":"BAD_REQUEST"}
2024/07/30 04:00:02 [redacted A] was matched with [redacted B]

"Invalid email" is one of the messages, for several e-mail addresses. There's also one report of [email] is no longer using Zulip.


It looks like this error isn't showing up in the match function; the prefix "Error when trying to send..." isn't showing up in the logs.

sendUserMessage is logging the response body with no additional annotation; but it looks like nothing is classifying that Zulip response with an error. :(

At least for today, I think this is a sufficient explanation for why Ian didn't get a pairing. The bot sent a message to multiple recipients, but one of those recipients was invalid; the message as a whole didn't send. (I haven't check to see if this was also the problem on other days.)


I suspect this could be responsible for a few different dropped pairings.

Some changes, maybe in priority order:

  • Configure the Golang logger to log the file name and file number of the log statement. This is Immensely Useful for debugging -- it doesn't quite turn the logs into a trace, but it makes it a lot easier to line things up.

  • A non-2xx HTTP response doesn't cause an error; we should at least check the HTTP response code for the Zulip responses.

  • Check the Zulip API, and if needed, perform more-complete error checking for Zulip replies. Do we expect Zulip to return a non-2xx error for these (BAD_REQUEST)? Or do we need to check the body?

  • Alerting! Proactive telemetry, not reactive log-looking; if the bot fails, or fails repeatedly over some amount of time, there should be something that says "hey go look at the logs pls?".

  • Data cleanup: it looks like there are some users that Zulip considers invalid. Why? How can/should we know, so we can remove them from the pairing set?

cceckman added a commit that referenced this issue Jul 30, 2024
Per [issue 55], Zulip will occasionally refuse a message. We weren't
passing this further up the stack, or even logging the status code.

Log the status code along with the body, and return an error up the
stack, if the Zulip request gets a user error (4xx) or server error
(5xx).

[issue 55]: #55
@cceckman
Copy link
Collaborator Author

cceckman commented Jul 30, 2024

Zulip documentation confirms that errors will return 4xx/5xx errors, as well as error information in the body.

Neither the "Invalid e-mail" nor the "...is no longer using Zulip" messages have the relevant e-mail as structured data. The API docs have a "tip" saying to start a conversation on the Zulip Zulip instance as to whether there are specific key/value pairs that would be relevant.

  • Consider following up with Zulip: can the invalid e-mail(s) be returned as a structured error response?

    This would be one way (not the only way) for us to do more graceful error handling; we could punt the user that is valid down the list and try to pair them again.

  • An alternative: check each user before pairing them / filter down the list of pairing candidates to "current active users", based on querying Zulip.

    (It is possible, but unlikely, that the user's e-mail could change between the check query and the send-message.)

  • Do we need to use e-mails at all, or could we sent to Zulip users by-ID? The above points still apply - Zulip users could disappear - but it seems more stable.

    It looks like the RC API includes the Zulip ID...

@cceckman
Copy link
Collaborator Author

cceckman commented Jul 30, 2024

@stillgreenmoss spotted the underlying issue:

  • Pairing Bot uses the Zulip ID as the lookup key.
  • ...but conducts pairing, "left batch", messages, etc. based on the e-mail address.
  • The e-mail address is mutable on the Zulip side without syncing to RC (?), and is only updated in the pairing DB when the user interacts.

So a "real fix" here:

  • Use the Zulip ID consistently: as the pairing DB key, as the message key, and as the key to join those databases with the RC API.

cceckman added a commit that referenced this issue Jul 30, 2024
cceckman added a commit that referenced this issue Jul 31, 2024
Per [issue 55], Zulip will occasionally refuse a message. We weren't
passing this further up the stack, or even logging the status code.

Log the status code along with the body, and return an error up the
stack, if the Zulip request gets a user error (4xx) or server error
(5xx).

[issue 55]: #55
@cceckman
Copy link
Collaborator Author

cceckman commented Aug 6, 2024

Current status: I think testbot is using Zulip IDs instead of e-mail for all identity purposes. Including talking to the RC API:

User: [redacted] was at RC last week: false and is at RC this week: true

The one remaining use of e-mail is for review identity. I think that's OK to leave for now; I'll file a follow-up "enhancement".

@cceckman
Copy link
Collaborator Author

cceckman commented Aug 6, 2024

OK! #62 is in and deployed to prod Pairing Bot. I tagged the original commits in issue55 if anyone wants to see the history in more detail.

I'm going to leave this open until we've seen all the crons run and take action:

  • Daily match-making
  • End-of-batch: should run this Saturday, and @jdkaplan should get a message
  • Status report: Thursday (2024-08-08)
  • Welcome: second Tuesday of the next batch (2024-08-20)

which we'll see trickle in over the next couple weeks.

@jdkaplan
Copy link
Collaborator

jdkaplan commented Aug 14, 2024

Potentially related: #70

In cases where someone had chosen to skip their next pairing (by sending a skip command to PB), the matching job would panic when "unskipping" them for the next day. That would prevent the matching job from sending out pairs.


So far, I've seen daily matching, checkin, and end-of-batch succeed (after some fixes to each), so I'll check those boxes now.

@jdkaplan
Copy link
Collaborator

Here to confirm the last cron job run! The welcome message was sent out a few minutes ago. Logs (maintainers-only link), Zulip message

I haven't seen any more reports of pairing issues in the last week, so I think the Zulip ID and unskip fixes were all we needed. That said, I wasn't watching the logs particularly closely, so (anyone) please feel free to open a new issue if this happens again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants