Information: ownCloud Clients lose their authorization – a post mortem analysis

Message-Id: 201803061809
Time: since 01.03.2018
Affected: ownCloud users using desktop and mobile clients
Impact: Clients repeatedly lost access

For the past five days, ownCloud clients have occasionally, at irregular intervals, lost their authorization. Because the GWDG ownCloud team made no further changes to the system during this time, the cause was difficult to pin down.

For all those who are not interested in the technical details of the problem, here is a short summary:

Increased latency between the database servers caused the behavior described above. An interim solution has been implemented, and the clean solution has been rolled out on the test system.

Now for the technical details:

For the ownCloud instance we use a Galera Cluster as the database server, with MaxScale as the connection router. The database cluster consists of three nodes that are redundantly distributed over three locations. The connection router uses read-write splitting, which always sends write accesses to one server and distributes read accesses across the other two. The advantage is that a read-heavy application like ownCloud benefits from multiple servers handling reads. So far, so good.
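To illustrate the idea of read-write splitting, here is a minimal sketch in Python. It is not how MaxScale is implemented; the node names and the keyword-based classification are assumptions made purely for illustration.

```python
from itertools import cycle

# Hypothetical node names; the real host names are not given in this notice.
PRIMARY = "db-node-1"
REPLICAS = ["db-node-2", "db-node-3"]

WRITE_KEYWORDS = {"INSERT", "UPDATE", "DELETE", "REPLACE", "CREATE", "ALTER", "DROP"}


class ReadWriteSplitter:
    """Toy router: writes go to one node, reads rotate over the others."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = cycle(replicas)

    def route(self, statement):
        # Classify by the first keyword only; a real router parses the statement.
        words = statement.split()
        keyword = words[0].upper() if words else ""
        if keyword in WRITE_KEYWORDS:
            return self.primary
        return next(self._replicas)


router = ReadWriteSplitter(PRIMARY, REPLICAS)
targets = [
    router.route("INSERT INTO tokens VALUES ('abc')"),
    router.route("SELECT * FROM tokens"),
    router.route("SELECT * FROM tokens"),
    router.route("SELECT * FROM tokens"),
]
print(targets)
```

The write lands on the single write node, while the three reads alternate between the two remaining nodes.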

The ownCloud clients have used OAuth authorization since version 10. In principle, a so-called “Refresh Token” is stored once in the client and is used to retrieve an “Auth Token” from the server. The “Auth Token” is then used for all subsequent logins.

Users notice the creation of the “Refresh Token”: they have to confirm once in the browser that the client is allowed to access their ownCloud account. When the “Auth Token” that was fetched with the “Refresh Token” expires, the client gets a new “Auth Token”. This happens once every hour and should go unnoticed by the user.
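The token handling on the client side can be pictured roughly as follows. This is a simplified sketch, not the actual ownCloud client code: the class, the `fetch_auth_token` callable (a stand-in for the request to the server) and the token strings are hypothetical, and only the one-hour lifetime is taken from the description above.

```python
import itertools

TOKEN_LIFETIME = 3600  # seconds; a refresh happens about once every hour


class ClientSession:
    """Keeps one long-lived Refresh Token and a short-lived Auth Token."""

    def __init__(self, refresh_token, fetch_auth_token, clock):
        self.refresh_token = refresh_token
        self._fetch = fetch_auth_token  # hypothetical stand-in for the server call
        self._clock = clock
        self._auth_token = None
        self._expires_at = 0.0

    def auth_token(self):
        # Fetch a new Auth Token only if none is cached or the cached one expired.
        if self._auth_token is None or self._clock() >= self._expires_at:
            self._auth_token = self._fetch(self.refresh_token)
            self._expires_at = self._clock() + TOKEN_LIFETIME
        return self._auth_token


# A fake clock and a counting token source keep the example deterministic.
counter = itertools.count(1)
fake_now = [0.0]
session = ClientSession(
    refresh_token="refresh-123",
    fetch_auth_token=lambda refresh: f"auth-{next(counter)}",
    clock=lambda: fake_now[0],
)

first = session.auth_token()   # no token cached yet, so one is fetched
fake_now[0] = 1800.0
second = session.auth_token()  # still valid, served from the cache
fake_now[0] = 3600.0
third = session.auth_token()   # expired, so a new one is fetched
print(first, second, third)
```

The user is only involved when the Refresh Token itself is created; every later exchange happens silently in the background.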

If a client logs on to the server shortly afterwards with the newly received Auth Token, the token may not yet have arrived on the reading servers because replication in the database cluster is too slow. The authorization then fails, and the user is prompted again to approve the creation of a new Refresh Token.
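The race between token creation and replication can be reproduced in a small simulation. The following sketch uses a logical clock instead of real time; the class, the lag of three ticks and the token name are assumptions chosen for illustration and do not reflect measured values from our cluster.

```python
class LaggingReplica:
    """Replica that applies each write only after `lag` ticks of a logical clock."""

    def __init__(self, lag):
        self.lag = lag
        self.data = set()
        self._pending = []  # (tick at which the write becomes visible, token)

    def replicate(self, token, now):
        self._pending.append((now + self.lag, token))

    def has_token(self, token, now):
        # Apply every pending write whose delay has elapsed.
        still_pending = []
        for visible_at, pending_token in self._pending:
            if visible_at <= now:
                self.data.add(pending_token)
            else:
                still_pending.append((visible_at, pending_token))
        self._pending = still_pending
        return token in self.data


replica = LaggingReplica(lag=3)

# Tick 0: the server issues a new token. It is written to the write node
# and handed to the client immediately.
primary_tokens = {"new-auth-token"}
replica.replicate("new-auth-token", now=0)

# Tick 1: the client uses the token right away; the read lands on the replica.
early_check = replica.has_token("new-auth-token", now=1)

# Tick 5: the same read after replication has caught up.
late_check = replica.has_token("new-auth-token", now=5)

print(early_check, late_check)
```

The early check fails although the token is valid on the write node, which corresponds to the failed authorization seen by the clients.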

As a short-term interim measure, read-write splitting was disabled in all production systems, and all servers were configured to use a single database server.

We are currently testing a filter in the test systems that, for a certain interval after a write access, routes all related read accesses to the same database server as the write. If it proves stable this week, the filter will also be deployed in the production systems.
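The principle behind such a filter can be sketched as follows. This is an illustration only, not the filter under test: tracking writes per session, the five-second window and all names are assumptions.

```python
import time


class StickyRouter:
    """Sends reads to the write node for `window` seconds after a session's last write."""

    def __init__(self, primary, replicas, window, clock=time.monotonic):
        self.primary = primary
        self.replicas = replicas
        self.window = window
        self.clock = clock
        self._last_write = {}  # session id -> time of the last write
        self._next_replica = 0

    def route_write(self, session):
        self._last_write[session] = self.clock()
        return self.primary

    def route_read(self, session):
        last = self._last_write.get(session)
        if last is not None and self.clock() - last < self.window:
            # A recent write exists: read from the node that certainly has it.
            return self.primary
        target = self.replicas[self._next_replica % len(self.replicas)]
        self._next_replica += 1
        return target


# A fake clock keeps the example deterministic.
now = [0.0]
sticky = StickyRouter(
    "db-node-1", ["db-node-2", "db-node-3"], window=5.0, clock=lambda: now[0]
)

write_target = sticky.route_write("client-a")
now[0] = 1.0
read_soon = sticky.route_read("client-a")   # within the window
read_other = sticky.route_read("client-b")  # this session has not written
now[0] = 6.0
read_later = sticky.route_read("client-a")  # window has passed
print(write_target, read_soon, read_other, read_later)
```

Reads that directly follow a write see consistent data, while all other reads still benefit from the reading servers. The interval has to be longer than the worst replication delay expected between the nodes.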