Tech:Incidents/2017-10-28-Database

db2's disk space was critical and eventually reached 0 MB, which took down everything that depends on it (MediaWiki, Puppet) and left the site down (503s). This is a recurrence of the issue in Tech:Incidents/2017-10-04-Database.

Summary

What services were affected?
- All services that were dependent on db2 (MediaWiki and Puppet)
How long was there a visible outage?
- 2017-10-28 09:07 UTC until 18:30 UTC (9 hours 23 minutes)
What was/were the response times by each Operations member?
- Reception123 responded at 09:23 on IRC, emailed staff notifying them of the site being down and posted about the error on Twitter.
- NDKilla deleted binary logs on db2 at 18:30, which resolved the issue.
Was it caused by human error, supply/demand issues or something unknown currently?
- Caused by disk space getting to 0 MB.
Was the incident aggravated by human contact, users or investigating?
- Does not seem to be aggravated in any way.
How could response time be improved?
- Response time was better than previously, with the error being fixed in about 9 hours. It could be improved further if current sysadmins were notified of downtime more quickly and were able to act on it.

Timeline

All times are in UTC.

09:07: The backends are marked as sick and all wikis go down with a "503 Backend Fetch Failed" error.
09:23: Reception123 notifies sysadmins via IRC and email about the error.
18:30: NDKilla deletes binary logs, and the wikis come back up.

Quick facts

db2 had been close to critical for a long time, but it then dropped suddenly from about 1.5 GB free to 0.

Conclusions

The incident could have been prevented if binary logs had been deleted and wikis had been moved to db3 before db2 reached 0 MB.

Reporting

What services/sites were used to report the downtime?
- Twitter, IRC (topic)
What other services/sites were available for reporting, but were not used?
- Facebook

Actionables

Permanent fix

Allow CreateWiki to create databases on servers other than db2.
Store binlogs for a shorter amount of time. (Done: changed from 14 to 5 days.)
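
The report does not say how the retention period was changed. On MySQL and MariaDB it is commonly controlled by the `expire_logs_days` server setting, but that is an assumption about db2's setup rather than something recorded here. As a rough illustration of what moving from 14 to 5 days means, the Python sketch below selects the log files that fall outside a retention window. The function name, file names and ages are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expired_logs(logs, retention_days, now):
    """Return names of logs last modified before the retention cutoff.

    `logs` is a list of (name, last_modified) tuples. This structure is
    an assumption made for the example, not how the database tracks logs.
    """
    cutoff = now - timedelta(days=retention_days)
    return [name for name, modified in logs if modified < cutoff]


# Hypothetical reference time and log ages (not real data from db2).
now = datetime(2017, 10, 28, 18, 30, tzinfo=timezone.utc)
logs = [
    ("binlog.000101", now - timedelta(days=13)),
    ("binlog.000102", now - timedelta(days=8)),
    ("binlog.000103", now - timedelta(days=3)),
]

print(expired_logs(logs, 14, now))  # no log is older than 14 days
print(expired_logs(logs, 5, now))   # the 13- and 8-day-old logs qualify
```

With the shorter window, logs that would previously have stayed on disk for up to 14 days become eligible for removal after 5, which is where the disk space saving comes from.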

Others

Improve the response time of Operations.
Have more volunteers/Operations members available to respond to these situations and to monitor the servers so that they do not reach this point. (Done: added Reception123 to Operations.)
Manually move more wikis to db3 until the first permanent fix above (CreateWiki) is resolved. (Done: moved a few other large wikis to db3.)
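
To make the monitoring actionable concrete, here is a minimal Python sketch of a free-space check with warning and critical thresholds. It is not the monitoring actually in use. The threshold values, function names and the checked path are all assumptions chosen for illustration. The thresholds are deliberately set above the roughly 1.5 GB at which db2 sat before dropping to 0, so that such a state would already be reported as critical.

```python
import shutil

# Hypothetical thresholds; the report does not state the real ones.
WARNING_FREE_BYTES = 5 * 1024**3   # 5 GiB
CRITICAL_FREE_BYTES = 2 * 1024**3  # 2 GiB


def classify_free_space(free_bytes,
                        warning=WARNING_FREE_BYTES,
                        critical=CRITICAL_FREE_BYTES):
    """Map a free-space figure in bytes to OK, WARNING or CRITICAL."""
    if free_bytes <= critical:
        return "CRITICAL"
    if free_bytes <= warning:
        return "WARNING"
    return "OK"


def check_path(path="/"):
    """Classify the free space on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return classify_free_space(usage.free)


if __name__ == "__main__":
    # "/" is a placeholder; a real check would point at the database volume.
    print(check_path("/"))
```

A check like this only helps if someone is alerted when it returns WARNING, which ties back to the point about notifying sysadmins more quickly.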

Meta

Who responded to this incident?
- Reception123, NDKilla
What services were affected?
- All services that were dependent on db2 (MediaWiki and Puppet)
Who, therefore, needs to review this report?
- All Operations members
Timestamp: ...