Malfunction: Partial outage of Jupyter-Cloud from 20th 4pm to 21st 10.30am

Message-Id: 202201211748
Time: 20th 4pm to 21st 10.30am
Affected: Jupyter-Cloud
Impact: Malfunction

Yesterday, on the 20th at around 4pm, the Jupyter-Cloud service suffered a partial outage, which was corrected today at around 10.30am.

As a result, some users were able to log into the service, but starting their notebook servers failed with an error message.

The problem affected only some users who are scheduled to the newer, ESX-based part of the service; most new users are scheduled there.

To avoid side effects of the problem, all notebook servers were stopped once after the resolution was implemented. Users currently logged into the server were asked to restart their notebooks through the website of the service.

The problem was caused by the exhaustion of an index in the server’s kernel that maintains keys for authentication, encryption and similar data on a server. In the newer part of the service, notebooks run in a Docker service that operates in a restricted context, unlike the unrestricted context that is the Docker default. A restricted context for applications also has a much smaller size limit for such indexes, which our service quickly exceeded. Permanently enlarging this index has resolved the problem.