Time: Aug 20th 2016, 0:00 a.m. - Aug 21st 2016, 6:00 p.m.
Affected: all GWDG users
Impact: Malfunction of services
During the night of Saturday, Aug 20th, 2016, problems at the Göttingen-Weende substation caused power outages in several districts of Göttingen and on the Max Planck Campus in Göttingen-Nikolausberg, which also affected the GWDG. Three power outages, from 0:07 a.m. to 1:13 a.m., 1:31 a.m. to 1:38 a.m. and 1:38 a.m. to 1:48 a.m., led to problems with and/or damage to multiple storage systems, network components and servers in particular. Besides the regular uninterruptible power supplies (UPS), the main components are additionally supplied by a diesel generator UPS. Unfortunately, one of the three phases failed because of a fuse that blew for unknown reasons, so that parts of these key components failed unexpectedly as well.
Many GWDG services were affected. The most important ones are listed below, ordered by the time at which they were back in operation:
Central network components until Aug 20th, 2016, around 8 a.m.
Virtual Server (vSphere) until Aug 20th, 2016, around 10 a.m.
E-mail Service (Exchange) until Aug 20th, 2016, around 10 a.m.
GWDG-Cloud Share and Aleph library system until Aug 20th, 2016, about 1:30 p.m.
Extranet of MPG General Administration until Aug 20th, 2016, about 6 p.m.
FTP server until Aug 21st, 2016, around 12 a.m.
Virtual Web servers until Aug 21st, 2016, about 1 p.m.
File Service (personal and group drives, UNIX file systems) until Aug 21st, 2016, about 2 p.m.
HPC systems and virtual machines in the GWDG cloud until Aug 21st, 2016, about 2 p.m.
A power failure lasting more than 10 to 15 minutes cannot be compensated for by UPS systems, because the lack of air conditioning causes significant heat problems. The heat input of about 20 kW from the systems supplied by the diesel generator alone already leads to room temperatures of about 30 degrees Celsius. The missing phase caused by the blown fuse led to the failure of central network components, so that connections to the redundancy locations in the telecommunications headquarters and the State and University Library were disrupted and the fail-over of appropriately configured services worked only partially.
The long outage of the general storage area (the high-availability storage NetApp Metro Cluster failed over correctly to the SUB site) was caused by its conceptual structure. GWDG operates about 80 storage systems, which are abstracted by virtualization. For reasons of performance, scaling, and licensing, the file systems are distributed across all systems. Five of the 80 systems, with a total capacity of about 150 TB of data, failed on initialization after the power outages. Repair work lasted until the night hours of Saturday. After the subsequent, absolutely necessary verification of the storage virtualization, the storage environment was available again on Sunday around noon. Thereafter, file servers, file systems and their dependent services could be put back into operation with less effort.