Privacy firm Proton suffered a massive worldwide outage today, taking down most services, with Proton Mail and Calendar users still unable to connect to their accounts. […]

  • gaiety@lemmy.blahaj.zone
    13 hours ago

    I noticed this, but it felt like a blip. I’m a paying customer, and their status page was kept up to date. I still trust Proton to maintain better services than I can self-host, with better values than something with insignificantly more uptime like Gmail, which has had its own share of outages from time to time.

  • lIlIllIlIIIllIlIlII@lemmy.zip
    13 hours ago

    From https://status.proton.me/

    Earlier today at around 4PM Zurich, the number of new connections to Proton’s database servers increased sharply globally across Proton’s infrastructure.

    This overloaded Proton’s infrastructure, and made it impossible for us to serve all customer connections. While Proton VPN, Proton Pass, Proton Drive/Docs, and Proton Wallet were recovered quickly, issues persisted for longer on Proton Mail and Proton Calendar. For those services, during the incident, approximately 50% of requests failed, leading to intermittent service unavailability for some users (the service would look to be alternating between up and down from minute to minute).

    Normally, Proton would have sufficient extra capacity to absorb this load while we debug the problem, but in recent months, we have been migrating our entire infrastructure to a new one based on Kubernetes. This requires us to run two parallel infrastructures at the same time, without having the ability to easily move load between the two very different infrastructures. While all other services have been migrated to the new infrastructure, Proton Mail is still in the middle of the migration process.

    Because of this, we were not able to automatically scale capacity to handle the massive increase in load. In total, it took us approximately 2 hours to get back to the state where we could service 100% of requests, with users experiencing degraded performance until then. The service was available, but only intermittently, with performance being substantially improved during the second hour of the incident, but requiring an additional hour to fully resolve.

    A parallel investigation by our site reliability engineering team identified a software change that we suspected was responsible for the initial load spike. After this change was rolled back, database load returned to normal. This change was not initially suspected because a long period of time had elapsed between when this change was introduced and when the problem manifested itself, and an initial analysis of the code suggested that it should have no impact on the number of database connections. A deeper analysis will be done as part of our post-mortem process to understand this better.
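To put the roughly 50% failure rate quoted above in perspective, here is a minimal sketch of how retries would change a user's odds of getting through. It assumes each request fails independently with the same probability, which the status report does not state and which is unlikely to hold exactly during a real overload. The function name `success_probability` is made up for this illustration.

```python
def success_probability(failure_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` tries succeeds.

    Assumption (not from the status report): every attempt fails
    independently with the same probability `failure_rate`.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # All attempts fail with probability failure_rate ** attempts,
    # so at least one succeeds with the complement of that.
    return 1.0 - failure_rate ** attempts


if __name__ == "__main__":
    # 0.5 is the approximate failure rate quoted for Mail and Calendar.
    for attempts in (1, 2, 3, 5):
        p = success_probability(0.5, attempts)
        print(f"{attempts} attempt(s): {p:.1%} chance of getting through")
```

Under that assumption a single request is a coin flip, while a client that retries a few times usually gets through, which fits the description of the service appearing to alternate between up and down. Retries also add load, so this is not a suggestion for how clients should behave during an overload.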

  • calmluck9349@infosec.pub
    14 hours ago

    It’s back online now, and they have a very detailed report about it on Proton’s status page. I did not notice the outage.