Systems Impacted: The Identity Service is used during the login process of multiple applications. During this incident, login to the Boards, Messenger, and Secure File Share applications was impacted.
What happened: On January 19th, 2023, Diligent experienced a failure of the Identity Login Service. This critical service functions as the backend authentication framework for multiple Diligent applications. During the incident, Diligent’s SRE team discovered that multiple instances of the Identity Service were hung in a startup state. Although the Identity Service has multiple redundancies, it required manual intervention to recover.
Technical details: Diligent uses the Kubernetes platform to manage the availability and scalability of its software. Kubernetes automatically fails over between cluster nodes to rebalance computing workloads, scale resources, and recover crashed services. During this incident, our engineers determined that the Identity Service did not crash. Instead, as multiple Identity Service pods were moved to different Kubernetes nodes, they could not come online because they were unable to connect to our Artifactory repository backend. The SRE team discovered that both the primary Artifactory cluster in the local datacenter and the backup (main) Artifactory cluster in the remote datacenter were unavailable. The fix was to restart the main Artifactory server and then restart the Identity Service pods, which restored the ability to log in.
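As an illustration of how pods in this state might be spotted, the sketch below scans pod status data for containers that are waiting on a failed image pull. It assumes the hung pods were blocked pulling container images from Artifactory, which the details above do not state explicitly. The function name and the sample pod names are hypothetical; the input shape follows the JSON produced by `kubectl get pods -o json`.

```python
# Waiting reasons Kubernetes reports when a container image cannot be pulled.
IMAGE_PULL_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def pods_stuck_on_image_pull(pod_list):
    """Return the names of pods with a container waiting on a failed image pull.

    `pod_list` is the parsed JSON document from `kubectl get pods -o json`.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for container in statuses:
            waiting = container.get("state", {}).get("waiting")
            if waiting and waiting.get("reason") in IMAGE_PULL_REASONS:
                stuck.append(pod["metadata"]["name"])
                break  # one stuck container is enough to flag the pod
    return stuck


# Hypothetical sample data; pod names and states are illustrative only.
sample = {
    "items": [
        {
            "metadata": {"name": "identity-1"},
            "status": {
                "containerStatuses": [
                    {"state": {"waiting": {"reason": "ImagePullBackOff"}}}
                ]
            },
        },
        {
            "metadata": {"name": "identity-2"},
            "status": {"containerStatuses": [{"state": {"running": {}}}]},
        },
    ]
}

print(pods_stuck_on_image_pull(sample))
```

A check like this does not flag a crashed service. It only surfaces pods that never finished starting, which is the condition described in this incident.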
Corrective and preventive steps: Diligent applications are designed with high availability and redundancy in mind. We treat every incident as an opportunity to learn lessons that further improve our systems and processes. The following improvements will be implemented: