Tech:Incidents/2018-11-04-all-wikis-down

From Miraheze Meta, Miraheze's central coordination wiki

Summary

Provide a summary of the incident:

  • What services were affected?
    • MediaWiki
  • How long was there a visible outage?
    • 8 minutes.
  • Was it caused by human error, supply/demand issues or something currently unknown?
    • Human error: the outage was caused by a syntax error in a committed change.
  • Was the incident aggravated by human contact, users or investigating?
    • No.

Timeline

Provide a timeline of everything that happened from the first reports to the resolution of the incident. If the time of the very first incident is known (previous incident, the time the service failed, the time a patch was applied), include it as well. Times should be in 24-hour format, in UTC.

  • 21:27 paladox committed 6a1381f - Testing CentralNotice opt out on test1wiki
  • 21:33 icinga-miraheze PROBLEM - cp2 Varnish Backends on cp2 is CRITICAL: 3 backends are down. mw1 mw2 mw3
  • 21:36 John reverts commit 19a30d8 - Revert "Testing CentralNotice opt out on test1wiki" This reverts commit 6a1381f99902009f30f8c6211ae014b6d0f4510a.
  • 21:38 icinga-miraheze RECOVERY - cp2 HTTPS on cp2 is OK: HTTP OK: HTTP/1.1 200 OK - 23567 bytes in 0.501 second response time

Quick facts

Provide any quick facts that may be relevant:

  • Are there any known issues with the service in production?
    • Nope.
  • Was the cause preventable by us?
    • Yes, by looking over the change before merging, or by telling someone to keep an eye out when stepping away.
  • Have there been any similar incidents?
    • Definitely

Conclusions

Provide conclusions that have been drawn from this incident only:

  • Was the incident preventable? If so, how?
    • Yes. When stepping away after merging, tell someone so they can monitor the rollout.
  • Is the issue rooted in our infrastructure design?
    • Nope.
  • State any weaknesses and how they can be addressed.
    • None.
  • State any strengths and how they prevented or assisted in investigating the incident.
    • John was quick to revert my patch.
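
The "look over your change before merging" step could be partly automated with a syntax check. The following is only a minimal sketch, not an existing Miraheze tool: it assumes the changed files are PHP (MediaWiki is written in PHP, but this report does not name the file involved) and that the `php` command-line binary is installed. The function name `find_syntax_errors` and its `lint_cmd` parameter are hypothetical.

```python
import subprocess
from typing import Iterable, List, Sequence


def find_syntax_errors(paths: Iterable[str],
                       lint_cmd: Sequence[str] = ("php", "-l")) -> List[str]:
    """Return the paths for which the lint command exits non-zero.

    The default command, `php -l`, only parses a file (it does not run it)
    and exits with a non-zero status when the file has a syntax error.
    Assumption: the files being checked are PHP and `php` is on the PATH.
    """
    failed = []
    for path in paths:
        # Run the linter on one file at a time and discard its output;
        # only the exit status matters for this check.
        result = subprocess.run([*lint_cmd, path],
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failed.append(path)
    return failed
```

Called with the list of files touched by a change, the function returns an empty list when every file parses cleanly. A non-empty result would be a reason to hold the merge. It catches syntax errors only, so it complements rather than replaces having someone monitor the rollout.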

Meta

  • Who responded to this incident?
    • John
  • What services were affected?
    • MediaWiki
  • Who, therefore, needs to review this report?
    • John
  • Timestamp.
    • 22:38, 4 November 2018 (UTC)