Disaster Recovery Setup of a Live Production System for a Threat Exposure Management Company

Developed a full disaster recovery setup for a live production environment on AWS

Project Overview

Designed and implemented a full disaster recovery (DR) strategy for a live production system running on AWS for a threat exposure management company. The solution combined multi-region failover, data replication, and automated recovery processes to ensure business continuity. Critical stateful services, including RabbitMQ (Amazon MQ), Redis (AWS ElastiCache), and MongoDB Atlas, were integrated into the DR plan. Kubernetes workloads were made highly available with EFS-backed persistent volumes and GitOps-driven deployments via Helm and ArgoCD. Security and reliability were strengthened with AWS WAFv2 and Datadog observability, while infrastructure provisioning was automated with Terraform and Pulumi.
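To illustrate the multi-region failover idea described above, here is a minimal Python sketch of a health-based routing decision. It is not the project's actual implementation: the `RegionHealth` type, the `choose_active_region` function, the region names, and the failure threshold of 3 are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionHealth:
    """Health summary for one region (hypothetical structure)."""
    name: str
    consecutive_failures: int  # failed health checks in a row


def choose_active_region(primary: RegionHealth,
                         standby: RegionHealth,
                         failure_threshold: int = 3) -> str:
    """Return the name of the region that should serve traffic.

    The primary keeps traffic until it has failed `failure_threshold`
    checks in a row. Traffic moves to the standby only if the standby
    is itself healthy; if both are unhealthy, stay on the primary to
    avoid flapping between two broken regions.
    """
    if primary.consecutive_failures < failure_threshold:
        return primary.name
    if standby.consecutive_failures < failure_threshold:
        return standby.name
    return primary.name


# Example region names are assumptions, not taken from the project.
active = choose_active_region(
    RegionHealth("us-east-1", consecutive_failures=4),
    RegionHealth("us-west-2", consecutive_failures=0),
)
print(active)
```

In a real setup this decision is usually delegated to DNS or load-balancer health checks rather than application code; the sketch only shows the logic such a check encodes.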

Key Outcomes

Delivered a fully tested disaster recovery environment with defined RTO/RPO objectives

Enabled multi-region failover for Kubernetes workloads and stateful services

Improved MongoDB Atlas availability and durability during a disaster

Implemented automated infrastructure recovery workflows with Terraform and Pulumi

Enhanced system reliability, security, and monitoring with WAFv2, EFS, and Datadog
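As a rough illustration of what testing against defined RTO/RPO objectives can look like, the following Python sketch scores a single DR drill. The function name `evaluate_drill`, the `DrillResult` type, and every timestamp and objective value are assumptions made for this example; the project's actual targets are not stated here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    data_loss_window: timedelta  # time since the last replicated write
    downtime: timedelta          # time from outage start to restoration
    rpo_met: bool
    rto_met: bool


def evaluate_drill(last_replicated_at: datetime,
                   outage_started_at: datetime,
                   service_restored_at: datetime,
                   rpo: timedelta,
                   rto: timedelta) -> DrillResult:
    """Compare one drill's measurements against the RPO and RTO targets."""
    # Data written after the last replication point is what could be lost.
    data_loss_window = max(outage_started_at - last_replicated_at,
                           timedelta(0))
    downtime = service_restored_at - outage_started_at
    return DrillResult(
        data_loss_window=data_loss_window,
        downtime=downtime,
        rpo_met=data_loss_window <= rpo,
        rto_met=downtime <= rto,
    )


# Illustrative values only: a 5-minute RPO and a 30-minute RTO.
result = evaluate_drill(
    last_replicated_at=datetime(2024, 1, 1, 11, 58),
    outage_started_at=datetime(2024, 1, 1, 12, 0),
    service_restored_at=datetime(2024, 1, 1, 12, 20),
    rpo=timedelta(minutes=5),
    rto=timedelta(minutes=30),
)
print(result.rpo_met, result.rto_met)
```

Recording these two measurements for every drill makes it straightforward to show whether the recovery environment still meets its objectives as the system changes.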

Project Toolstack

AWS, Terraform, Pulumi, Kubernetes, RabbitMQ, Amazon MQ, AWS ElastiCache, Redis, MongoDB Atlas, Datadog, AWS EFS, WAFv2, S3, AWS ClientVPN, Helm, ArgoCD
