---------- Email 1 ----------
From: IT Helpdesk
To: All Staff
Date: Monday, February 24, 2025 7:58 AM
Subject: [RESOLVED] M365 Copilot Unavailable — 6:00 AM to 7:45 AM

Copilot services have been restored as of 7:45 AM. We apologize for the disruption. A root cause investigation is underway.

— IT Operations

---------- Email 2 ----------
From: Marcus Webb
To: IT Operations; Sandra Lim
Date: Monday, February 24, 2025 8:22 AM
Subject: RE: [RESOLVED] M365 Copilot Unavailable — 6:00 AM to 7:45 AM

Sandra — this is the second outage in three weeks. We had an exec briefing at 6:30 this morning and four of the attendees couldn't access their Copilot meeting prep summaries. What happened?

Marcus Webb
Chief of Staff

---------- Email 3 ----------
From: Sandra Lim
To: Marcus Webb; IT Operations
Date: Monday, February 24, 2025 9:47 AM
Subject: RE: [RESOLVED] M365 Copilot Unavailable — 6:00 AM to 7:45 AM

Marcus,

Here is what we know so far:

At 5:58 AM, an automated certificate renewal job ran against our Azure AD tenant. The job was scheduled for 3:00 AM but was delayed due to a queue backlog from Sunday's backup window. The certificate renewal process temporarily invalidated the authentication tokens used by Copilot's service-to-service connections.

The on-call engineer was alerted at 6:12 AM but was responding to a separate storage alert and didn't triage the Copilot alert until 6:31 AM. Manual certificate re-issuance took approximately 70 minutes because the runbook required two-person approval and the secondary approver was unavailable until 7:15 AM.

Root factors:
1. Certificate job was not isolated from backup scheduling windows — both are managed by the same queue
2. On-call alert priority for Copilot auth failures was set to P2 (non-urgent) rather than P1
3. Two-person approval requirement for cert re-issuance has no escalation path for out-of-hours incidents
4. No automated rollback capability exists for this cert type

This is the same underlying scheduling conflict that caused the January 31st outage. The January fix only patched the specific cert that failed that day, rather than addressing the queue isolation issue.

I'm scheduling a post-incident review for Wednesday. I'd like to bring in the Azure team to assess the queue fix.

Sandra Lim
Director, IT Operations

---------- Email 4 ----------
From: James Ochoa
To: Sandra Lim; Marcus Webb
Date: Monday, February 24, 2025 11:03 AM
Subject: RE: [RESOLVED] M365 Copilot Unavailable — 6:00 AM to 7:45 AM

Sandra,

Two things from the Azure team's side:

The queue isolation fix is straightforward — estimated 1 day of engineering work. We can have it done by Thursday if we get sign-off from change management today.

On the runbook approval process — that two-person rule was put in place after the 2023 accidental tenant lockout. It can't be removed, but we can pre-authorize a list of break-glass approvers who can be called at any hour. We never set that up. That's on us.

I can have both proposals documented for the Wednesday review.

James Ochoa
Azure Platform Engineering

---------- Email 5 ----------
From: Sandra Lim
To: Marcus Webb; James Ochoa; IT Operations
Date: Monday, February 24, 2025 2:15 PM
Subject: RE: [RESOLVED] M365 Copilot Unavailable — 6:00 AM to 7:45 AM

Marcus — please treat today's summary as the preliminary RCA. I'll have a formal write-up ready before the Wednesday session.

James — approved to proceed with the queue isolation fix. Submitting the change request now.

Action items before Wednesday:
- James: Queue isolation fix (Thu) + break-glass approver list (Tue)
- Sandra: Escalate Copilot alert priority to P1 in PagerDuty (today)
- Sandra: Formal RCA document for Wednesday review
- All: Post-incident review Wednesday 2:00 PM, Conference Room B / Teams

Sandra