Commit eccd43c

Status report on evaluation function outage
1 parent 408803d commit eccd43c

1 file changed: +24 -0 lines changed

docs/releases/status.md

Lines changed: 24 additions & 0 deletions
@@ -4,6 +4,30 @@ This page contains information about any known incidents where service was inter

The severity of an incident is the product of three factors: the number of users affected, N (scaled so that 100 users gives N = 1); the magnitude of the effect (on a scale of 1 to 5, from workable to no service); and the duration (in hours). Severity below 1 is LOW, between 1 and 100 is SIGNIFICANT, and above 100 is HIGH. The severity is used to decide how much we invest in preventative measures, detection, mitigation plans, and rehearsals.
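As a minimal sketch of the severity calculation described above, the following Python computes the product and maps it to a label. The function names and the example incident are hypothetical, and treating scores of exactly 1 or exactly 100 as SIGNIFICANT is an assumption, since the text does not say which side of each boundary is inclusive.

```python
def severity_score(users_affected: int, magnitude: int, duration_hours: float) -> float:
    """Product of N (users / 100), magnitude (1-5) and duration in hours."""
    if not 1 <= magnitude <= 5:
        raise ValueError("magnitude must be between 1 and 5")
    n = users_affected / 100  # 100 users corresponds to N = 1
    return n * magnitude * duration_hours


def severity_label(score: float) -> str:
    """Map a score to LOW, SIGNIFICANT or HIGH.

    Assumption: scores of exactly 1 or exactly 100 count as SIGNIFICANT.
    """
    if score < 1:
        return "LOW"
    if score <= 100:
        return "SIGNIFICANT"
    return "HIGH"


# Hypothetical incident: 250 users, magnitude 3, lasting 2 hours.
score = severity_score(250, 3, 2.0)
print(score, severity_label(score))  # 15.0 SIGNIFICANT
```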

## 2025 August 27th: Evaluation functions temporarily unavailable (Severity: LOW)

The app itself remained available and fully functional throughout, and calls to external evaluation functions succeeded. The evaluation functions managed by the Lambda Feedback team (currently most of them) became unavailable because their API gateway was modified incorrectly. During the outage, users submitting an answer in the app received an error message.
### Timeline

2025/08/26 17:54 Evaluation functions became unavailable due to a deployment error.
2025/08/26 18:21 Message added to the home page. Development and testing of a fix began.
2025/08/26 21:51 Fix completed and home page updated.

Estimated number of users affected: one. This low number was due to a quiet period in the academic year and the rapid response to the problem.
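The timeline arithmetic can be checked with a short sketch. It derives the time until the home page message was posted and the total outage duration from the three timestamps, then applies the severity product (N × magnitude × duration in hours) defined at the top of the page. The report does not state the magnitude used, so the worst case of 5 is assumed here; the variable names are illustrative only.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"

# Timestamps taken from the timeline above.
outage_start = datetime.strptime("2025/08/26 17:54", FMT)
notice_posted = datetime.strptime("2025/08/26 18:21", FMT)
fix_complete = datetime.strptime("2025/08/26 21:51", FMT)

time_to_notice = notice_posted - outage_start
outage_duration = fix_complete - outage_start
duration_hours = outage_duration.total_seconds() / 3600

# Severity = N * magnitude * duration, where N = users affected / 100.
users_affected = 1
worst_case_magnitude = 5  # assumption: the report does not state the magnitude
worst_case_severity = (users_affected / 100) * worst_case_magnitude * duration_hours

print(time_to_notice)   # 0:27:00
print(outage_duration)  # 3:57:00
print(worst_case_severity < 1)  # True
```

Under these assumptions the score stays below 1 even at the highest magnitude, which is consistent with the LOW rating.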
### Analysis

- The evaluation functions had only one environment ('staging'), which was used by both the STAGING and PROD versions of the app, so changes to the staging gateway affected the production application.
- The error itself occurred because an infrastructure update included changes made by a different developer, which the developer pushing the update did not notice.
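The coupling described in the first point can be illustrated with a small sketch. It models which evaluation-function environment each version of the app calls and reports any environment shared by more than one app version. The mapping and the function name are hypothetical and are not taken from the project's configuration.

```python
from collections import defaultdict


def shared_backends(app_to_backend: dict[str, str]) -> dict[str, list[str]]:
    """Return each backend environment that serves more than one app version."""
    users = defaultdict(list)
    for app_env, backend in app_to_backend.items():
        users[backend].append(app_env)
    return {b: sorted(envs) for b, envs in users.items() if len(envs) > 1}


# Before the fix: both app versions call the single 'staging' environment.
before_fix = {"STAGING": "staging", "PROD": "staging"}
# After the fix (assumed naming): each app version has its own environment.
after_fix = {"STAGING": "staging", "PROD": "prod"}

print(shared_backends(before_fix))  # {'staging': ['PROD', 'STAGING']}
print(shared_backends(after_fix))   # {}
```

A check of this kind could run in a configuration test, so that a change which points PROD back at a shared environment is flagged before deployment.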
### Lessons learned

- Implement independent staging and prod environments for evaluation functions (DONE as part of the fix).
- Before starting work on infrastructure changes, always run Pulumi preview to see whether any changes are already awaiting push.
- Don't push infrastructure changes when no other developers are available to help with any issues.
- Create a feature in the app that lets admins optionally declare a base URL for evaluation functions, so that groups of evaluation functions can be rapidly redirected.
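The Pulumi preview check in the second point could be scripted roughly as follows. This is a sketch under stated assumptions, not the team's tooling: it assumes the `pulumi` CLI is installed, that `pulumi preview --json` emits a top-level `changeSummary` object keyed by operation type (worth verifying against the installed version), and the helper names and example stack name are hypothetical.

```python
import json
import subprocess


def pending_changes(change_summary: dict[str, int]) -> dict[str, int]:
    """Return the non-zero operations other than 'same' (unchanged resources)."""
    return {op: n for op, n in change_summary.items() if op != "same" and n > 0}


def preview_change_summary(stack: str) -> dict[str, int]:
    """Run `pulumi preview --json` for a stack and return its change summary.

    Assumption: the JSON output contains a top-level "changeSummary" object.
    """
    result = subprocess.run(
        ["pulumi", "preview", "--json", "--stack", stack],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout).get("changeSummary", {})


# Example with a hand-written summary; no Pulumi call is made here.
example = {"same": 12, "update": 1}
print(pending_changes(example))  # {'update': 1}
```

Calling `pending_changes(preview_change_summary("staging"))` before editing any infrastructure code would return a non-empty result if someone else's changes are still waiting to be pushed.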
## 2025 March 28th: access blocked within a particular organisation's WiFi (Severity: SIGNIFICANT)

The URL lambdafeedback.com is served by a content delivery network (CDN) that was blocked by a particular organisation's WiFi. During this period, users on that WiFi couldn't access the site.
