Time: since 25.01.2016, in particular on 28.01.2016 from 01:22 pm to 01:38 pm
Affected: Computing center services; on 28.01. in particular the vweb servers and some storage systems and, as a consequence, Aleph and GWDG Cloud Share, among others
Impact: Intermittent delays when accessing GWDG services; on 28.01. a complete failure of the vweb servers and of some of GWDG's storage systems for around 15 minutes. Due to the storage system failures, the Aleph library servers and the Cloud Share service failed as well.
Beginning on Monday, 25th of January, intermittent and vague delays or dropouts were reported when accessing GWDG servers. Because many different systems seem to be affected, a problem in the data network is suspected. Because the problem is spread across different systems and occurs only sporadically, investigating it is regrettably very difficult.
Currently the investigation focuses on the redundant connection between the data center routers at GWDG and SUB and the GÖNET backbone routers at GWDG and FMZ.
On the 28th of January between 1:22 pm and 1:38 pm, a complete failure of a subset of the network connections in the computing center occurred when the data center router at the redundant data center at SUB was removed from the setup in order to reduce its complexity. The cause of this failure seems to be a bug in the router software, which can be triggered by the special setup at GWDG and leads to a shutdown of network connections under certain circumstances. The failure was resolved by reconnecting the router at SUB.
On Friday afternoon (29th of January) and on Monday morning (1st of February), the complexity of the redundant setup was reduced by other measures in order to avoid the software bug. The first step on Friday did not seem to improve the situation substantially. The change on Monday morning seems to have led to an improvement, but the situation is still being monitored and investigated.
We apologize for any inconvenience.