Tech:Incidents/2017-10-28-Database
Disk space on db2 was critical and eventually reached 0 MB, which took down all of its dependents (MediaWiki, Puppet) and led to the site being down (503s). This is the same recurring issue as Tech:Incidents/2017-10-04-Database.
Summary
- What services were affected?
- All services dependent on db2 (MediaWiki and Puppet)
- How long was there a visible outage?
- 2017-10-28 09:07 UTC until 18:30 UTC (9 hours 23 minutes)
- What was/were the response times by each Operations member?
- Reception123 responded at 09:23 on IRC, emailed staff notifying them of the site being down and posted about the error on Twitter.
- NDKilla deleted binary logs on db2 at 18:30, which resolved the issue.
- Was it caused by human error, supply/demand issues or something unknown currently?
- Caused by free disk space on db2 reaching 0 MB.
- Was the incident aggravated by human contact, users or investigating?
- Does not seem to be aggravated in any way.
- How could response time be improved?
- Response time was better than previously, with the error being fixed in about 9 hours. It could be improved further if current sysadmins were notified of the downtime more quickly and were able to act on it.
Timeline
All times are in UTC.
- 09:07: The backends are sick and all wikis go down with a 503 Backend Fetch Failed error.
- 09:23: Reception123 notifies sysadmins via IRC and email about the error.
- 18:30: NDKilla deletes binary logs, and the wikis come back up.
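The durations implied by this timeline can be checked with a few lines of arithmetic. The sketch below uses only the three timestamps listed above; the variable and function names are illustrative and not part of any Operations tooling.

```python
from datetime import datetime, timedelta

# Timestamps taken from the timeline (all UTC, 2017-10-28).
outage_start = datetime(2017, 10, 28, 9, 7)
first_response = datetime(2017, 10, 28, 9, 23)
resolved = datetime(2017, 10, 28, 18, 30)


def fmt(delta: timedelta) -> str:
    """Format a timedelta as 'H hours M minutes'."""
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


# Time from the start of the outage to the first human response.
time_to_first_response = fmt(first_response - outage_start)
# Total length of the visible outage.
total_outage = fmt(resolved - outage_start)

print(time_to_first_response)
print(total_outage)
```

Most of the outage falls between the first response and the fix, which is consistent with the point made in the summary about notifying sysadmins who can act.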
Quick facts
- db2 had been close to critical for a long time, but free space then dropped from about 1.5 GB to 0 very quickly.
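Because the last 1.5 GB disappeared so quickly, a free-space check with a generous critical threshold would leave more time to react. The following is a minimal sketch of such a check, not the monitoring actually used; the threshold values, the function names and the checked path are all assumptions chosen for illustration.

```python
import shutil

GIB = 1024 ** 3

# Hypothetical thresholds; the report does not state the real ones.
WARN_BYTES = 10 * GIB
CRIT_BYTES = 2 * GIB


def classify_free_space(free_bytes: int,
                        warn_bytes: int = WARN_BYTES,
                        crit_bytes: int = CRIT_BYTES) -> str:
    """Map a free-space figure to 'ok', 'warning' or 'critical'."""
    if free_bytes <= crit_bytes:
        return "critical"
    if free_bytes <= warn_bytes:
        return "warning"
    return "ok"


def check_path(path: str = "/") -> str:
    """Classify the free space of the filesystem containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return classify_free_space(usage.free)


if __name__ == "__main__":
    print(check_path("/"))
```

With these example thresholds, the roughly 1.5 GB that db2 had left before the outage would already be reported as critical.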
Conclusions
- The incident could have been prevented if binary logs had been deleted and wikis had been moved to db3 before db2 reached 0 MB.
Reporting
- What services/sites were used to report the downtime?
- Twitter, IRC (topic)
- What other services/sites were available for reporting, but were not used?
Actionables
- Permanent fix
- Allow CreateWiki to create databases on servers other than db2.
- Store binlogs for a shorter amount of time (Done, changed from 14 to 5 days)
- Others
- Improve response time of Operations
- Have more volunteers/Operations members able to respond to these situations, and monitor the servers so that they do not reach this point. (Done, added Reception123 to Operations)
- Manually move more wikis to db3 until item 1 under Permanent fix is resolved (Done, moved a few other large wikis to db3)
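As a rough illustration of the binlog retention change, the sketch below builds the SQL statements that correspond to a given retention period. It assumes db2 runs a MySQL or MariaDB version that supports the `expire_logs_days` variable and the `PURGE BINARY LOGS BEFORE` statement; the report does not say how the logs were actually deleted or how the setting was changed, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta


def purge_statement(now: datetime, retention_days: int) -> str:
    """Build a statement that removes binlogs older than the retention period."""
    cutoff = now - timedelta(days=retention_days)
    stamp = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    return f"PURGE BINARY LOGS BEFORE '{stamp}';"


def retention_statement(retention_days: int) -> str:
    """Build a statement that sets automatic binlog expiry at runtime.

    Assumption: a runtime change is not kept across a server restart,
    so the same value would also need to be set in the server config.
    """
    return f"SET GLOBAL expire_logs_days = {retention_days};"


if __name__ == "__main__":
    now = datetime(2017, 10, 28, 18, 30)  # time of the fix, UTC
    print(purge_statement(now, 5))
    print(retention_statement(5))
```

Going from 14 to 5 days of retention means roughly nine fewer days of binary logs are kept on disk at any time; how much space that frees depends on the write volume, which the report does not give.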
Meta
- Who responded to this incident?
- Reception123, NDKilla
- What services were affected?
- All services dependent on db2 (MediaWiki and Puppet)
- Who, therefore, needs to review this report?
- All Operations members
- Timestamp: ...