Boards PROD02 login issue
Incident Report for Diligent
Postmortem

Systems Impacted: The Identity Service is used during the login process for multiple applications. During this incident, logins to the Boards, Messenger, and Secure File Share applications were all impacted.

What happened: On January 19th, 2023, Diligent experienced a failure of the Identity Login Service. This is a critical service that functions as the backend authentication framework for multiple Diligent applications. During the incident, Diligent’s SRE team discovered that multiple instances of the Identity Service were hung in a startup state. Even though the Identity Service has multiple redundancies, the service required manual intervention to recover.

Technical details: Diligent uses the Kubernetes platform to manage the availability and scalability of its software. Kubernetes automatically fails workloads over between cluster nodes to rebalance computing load, scale resources, and recover crashed services. During this incident, our engineers determined that the Identity Service did not crash. As multiple Identity Service pods were rescheduled to different Kubernetes nodes, they could not come online because they were unable to connect to our Artifactory repository backend to pull their container images. The SRE team discovered that both the primary Artifactory cluster in the local datacenter and the backup (main) Artifactory cluster in the remote datacenter were unavailable. The solution was to restart the main Artifactory server and then restart the Identity Service pods to restore the ability to log in.
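
For context, the sketch below shows how a deployment of this kind might reference a container image hosted in an internal Artifactory-backed registry. The registry hostname, image name, and labels are illustrative assumptions rather than our actual configuration. When Kubernetes reschedules such a pod onto a node that does not already have the image cached, the kubelet must pull the image from the registry before the pod can start, which is why a registry outage can leave rescheduled pods stuck in a startup state.

```yaml
# Illustrative sketch only - names and registry host are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: identity-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: identity-service
  template:
    metadata:
      labels:
        app: identity-service
    spec:
      containers:
        - name: identity-service
          # Image hosted in an internal Artifactory-backed registry (hypothetical host).
          # If the node does not already have this image cached, the kubelet must
          # reach the registry before the pod can start.
          image: artifactory.example.internal/docker/identity-service:1.4.2
          ports:
            - containerPort: 8443
```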

Corrective and preventive steps: Diligent applications are designed with high availability and redundancy in mind. We treat every incident as an opportunity to learn lessons and further improve our systems and processes. The following improvements will be implemented:

  1. Improvements to Artifactory: better monitoring, and a review of the primary and secondary cluster deployments for redundancy.
  2. Improvements to the Kubernetes pod startup process: the pods should not have halted when Artifactory was unavailable. A better process would stop trying to pull new images and instead use the images already cached on the local node (see the first sketch after this list).
  3. Improvements to Kubernetes disk pressure handling: the Kubernetes node should never have reached 80% capacity. A review of log rotation and garbage collector clean-up is underway (see the second sketch after this list).
  4. Backend improvements: is Artifactory the right tool? Why did it halt? Is there a better cloud-based alternative?
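
As referenced in item 2, one possible mitigation, sketched below under assumed names, is to pin a specific image tag and set imagePullPolicy to IfNotPresent, so the kubelet starts a pod from an image already cached on the node instead of contacting the registry on every start. A pod rescheduled to a node that has never cached the image still needs a successful pull, so this reduces, rather than removes, the dependency on Artifactory.

```yaml
# Illustrative container spec fragment - names and registry host are assumptions.
containers:
  - name: identity-service
    image: artifactory.example.internal/docker/identity-service:1.4.2  # pinned tag, not :latest
    # IfNotPresent: only pull when the image is missing from the node's local cache,
    # so a registry outage does not block restarts on nodes that already hold the image.
    imagePullPolicy: IfNotPresent
```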
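
For item 3, the sketch below shows the kind of kubelet settings that govern disk-pressure eviction, container image garbage collection, and container log rotation. The thresholds shown are illustrative assumptions, not our production values.

```yaml
# Illustrative KubeletConfiguration fragment - thresholds are assumptions.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Evict pods before the node's filesystems fill up completely.
evictionHard:
  nodefs.available: "10%"
  imagefs.available: "15%"
# Garbage-collect unused container images once disk usage crosses the high
# threshold, freeing space back down to the low threshold.
imageGCHighThresholdPercent: 75
imageGCLowThresholdPercent: 60
# Rotate container logs so they cannot grow without bound.
containerLogMaxSize: "50Mi"
containerLogMaxFiles: 5
```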
Posted Feb 07, 2023 - 11:58 UTC

Resolved
The service disruption has been resolved.
Posted Jan 19, 2023 - 15:19 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 19, 2023 - 14:54 UTC
Update
Systems are now operational again. Further investigation is underway and a root cause will be posted here soon.
Posted Jan 19, 2023 - 14:53 UTC
Investigating
We are currently experiencing a service disruption.

Our Site Reliability team is working to identify the root cause and implement a solution.

Users may be experiencing difficulties accessing Boards materials or Messenger communications.
Posted Jan 19, 2023 - 13:59 UTC
This incident affected: Board & Leadership - Boards (Boards (North America - US)).