One of the great (and lesser-known) features of VMware Cloud on AWS (VMC) is its ability to handle infrastructure problems and failures. The feature is known as Auto-Remediation.

The Auto-Remediation service within VMC monitors the health of your environment. This feature allows customers to build a resilient, highly available software-defined data center (SDDC) to run their applications on.

Hardware NEVER FAILS!!!! Right???

AWS infrastructure is very reliable, but failures are inevitable. Just like your on-premises hardware, cloud hardware can have similar issues: failed disks, failed hosts, or even more widespread failures. The benefit of AWS is that it has built frameworks to help customers protect themselves from failures.

The AWS Well-Architected Framework helps customers build the most secure, high-performing, resilient, and efficient infrastructure possible for their applications.

Using Dynatrace to master the 5 pillars of the AWS Well-Architected  Framework (Part 1) | LaptrinhX

One of the 5 pillars above is Reliability, which covers 5 design principles for reliability in the cloud. These are:

(Note: I have copied/pasted the five points below from the above link)

  • Automatically recover from failure: By monitoring a workload for key performance indicators (KPIs), you can trigger automation when a threshold is breached. These KPIs should be a measure of business value, not of the technical aspects of the operation of the service. This allows for automatic notification and tracking of failures, and for automated recovery processes that work around or repair the failure. With more sophisticated automation, it’s possible to anticipate and remediate failures before they occur.
  • Test recovery procedures: In an on premises environment, testing is often conducted to prove that the workload works in a particular scenario. Testing is not typically used to validate recovery strategies. In the cloud, you can test how your workload fails, and you can validate your recovery procedures. You can use automation to simulate different failures or to recreate scenarios that led to failures before. This approach exposes failure pathways that you can test and fix before a real failure scenario occurs, thus reducing risk.
  • Scale horizontally to increase aggregate workload availability: Replace one large resource with multiple small resources to reduce the impact of a single failure on the overall workload. Distribute requests across multiple, smaller resources to ensure that they don’t share a common point of failure.
  • Stop guessing capacity: A common cause of failure in on premises workloads is resource saturation, when the demands placed on a workload exceed the capacity of that workload (this is often the objective of denial of service attacks). In the cloud, you can monitor demand and workload utilization, and automate the addition or removal of resources to maintain the optimal level to satisfy demand without over- or under-provisioning. There are still limits, but some quotas can be controlled and others can be managed (see Manage Service Quotas and Constraints).
  • Manage change in automation: Changes to your infrastructure should be made using automation. The changes that need to be managed include changes to the automation, which then can be tracked and reviewed.

How does this relate to VMC? Well, VMC has been designed in line with the above 5 principles, which means our customers have a highly automated and protected platform to run their applications.

The best thing about the auto-remediation service is that most of the action happens in the background, without affecting your existing workloads.

Auto-Remediation goes into action when a problem is detected. Once a problem is detected, the service inserts new AWS hardware into the VMC SDDC and removes the degraded or failed hardware. If a host fails, a new AWS host is added to the VMC SDDC, and the workloads that were on the failed host benefit from vSphere HA, which automatically restarts any VMs that were running on the failed server.

Auto-remediation runs in the auto-scaler service and receives alerts from two sources:

  • SDDC: Every SDDC runs a monitoring service that checks the host health.
  • AWS: AWS host planned maintenance events are sent to the auto-scaler.

At a high level, the auto-remediation architecture works as follows:

  • A monitoring service at the SDDC level receives notifications from the underlying components of the system.
  • AWS sends VMware host-level information, most notably AWS Planned Maintenance events. The auto-scaler service receives these notifications and automatically remediates any issues within the SDDC.
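To make the two alert paths a little more concrete, here is a minimal Python sketch of how alerts from both sources could land in one place. Everything in it (the `Alert`, `AlertSource` and `AutoScaler` names, the host IDs and the alert text) is made up for illustration. VMware does not publish the internals of the auto-scaler, so treat this as a mental model rather than the real implementation.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import List


class AlertSource(Enum):
    SDDC_MONITOR = "sddc-monitor"        # per-SDDC host health checks
    AWS_MAINTENANCE = "aws-maintenance"  # AWS planned maintenance events


@dataclass(frozen=True)
class Alert:
    source: AlertSource
    host_id: str
    detail: str


class AutoScaler:
    """Hypothetical stand-in for the auto-scaler's alert intake."""

    def __init__(self) -> None:
        self._pending = deque()

    def receive(self, alert: Alert) -> None:
        # Both sources feed the same queue; remediation consumes it later.
        self._pending.append(alert)

    def drain(self) -> List[Alert]:
        # Hand back all queued alerts in arrival order and empty the queue.
        alerts = list(self._pending)
        self._pending.clear()
        return alerts


scaler = AutoScaler()
scaler.receive(Alert(AlertSource.SDDC_MONITOR, "host-03", "host unreachable"))
scaler.receive(Alert(AlertSource.AWS_MAINTENANCE, "host-07", "planned maintenance"))
for alert in scaler.drain():
    print(f"{alert.source.value}: {alert.host_id} ({alert.detail})")
```

The point of the single queue is that the remediation logic does not need to care where an alert came from, only which host it concerns.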

The Auto-Remediation service runs in the background, so you won’t really see much action. It just does its thing and works. But most technical people want to know the process, so here it is:

  1. Monitor – The VMC platform continuously monitors the system health of all your SDDCs. When a failure is detected, an event is sent to auto-remediation. In summary, the platform:
    • Monitors hardware and software faults
    • Provisions hardware automatically when a failure is detected
    • Automatically remediates failures when possible
    • Escalates to an SRE for manual intervention when an automatic resolution is unavailable
  2. Wait for transient events – Some of the detected failures can be temporary. For example, the monitoring system cannot reach a host due to a temporary connectivity issue. Auto-remediation waits for 5 minutes to determine if the problem is temporary. If the problem clears, auto-remediation returns without taking any action.
  3. Add a Host – If the error does not resolve after 5 minutes, auto-remediation begins adding a host to the SDDC, even though it is not yet known whether a replacement host will be required. Adding a host now ensures that it is available when required. Note that you are not billed for this host until it replaces a faulty host in your SDDC.
  4. Determine a failure type and take action – Hosts can fail for different reasons and require different actions. For example, a vSAN disk failure on a host that is still connected to a vCenter Server can be remediated through a soft reboot, whereas a PSOD host requires a hard reboot. The auto-remediation logic for this is complex and constantly evolving, but the aim is to review the error and take the least intrusive action. Auto-Remediation is an internal process, and customers have no access to the logic. If you encounter any issues, you can contact VMware support.
  5. Check Host Health – The next step is to check if the remediation action has fixed the host. If the failed host is healthy again after a soft or hard reboot, auto-remediation avoids further disruption to the SDDC. It takes any other necessary actions and removes the new host that was added pre-emptively in Step 3.
  6. Replace Host – If the failed host cannot be revived, the auto-scaler removes the failed host and replaces it with the host that was added in Step 3. vSphere HA and vSAN are triggered, and compute policy tags are attached to the new host.
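The steps above can be sketched as a small decision flow. This is only an illustration in Python, under some loud assumptions: the function and enum names are invented, the `still_failing` and `healthy_after` callbacks stand in for health checks that VMware does not expose, and the failure-to-action mapping covers just the two examples mentioned in step 4. The real logic is internal to VMware and more involved.

```python
import time
from enum import Enum
from typing import Callable, List

TRANSIENT_WAIT_SECONDS = 5 * 60  # the 5-minute wait from step 2


class Failure(Enum):
    VSAN_DISK = "vsan-disk"  # host still connected to vCenter Server
    PSOD = "psod"            # host has crashed


class Action(Enum):
    SOFT_REBOOT = "soft-reboot"
    HARD_REBOOT = "hard-reboot"


def choose_action(failure: Failure) -> Action:
    # Assumption: only the two examples from the text are modelled here.
    if failure is Failure.VSAN_DISK:
        return Action.SOFT_REBOOT
    return Action.HARD_REBOOT


def remediate(
    hosts: List[str],
    failed_host: str,
    failure: Failure,
    still_failing: Callable[[], bool],
    healthy_after: Callable[[Action], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Walk through steps 2 to 6 and return the steps that were taken."""
    steps: List[str] = []

    sleep(TRANSIENT_WAIT_SECONDS)          # step 2: wait out transient events
    if not still_failing():
        return ["transient: no action"]

    spare = f"{failed_host}-replacement"   # step 3: add a host pre-emptively
    hosts.append(spare)
    steps.append("add spare host")

    action = choose_action(failure)        # step 4: least intrusive action
    steps.append(action.value)

    if healthy_after(action):              # step 5: host recovered, drop spare
        hosts.remove(spare)
        steps.append("remove spare host")
    else:                                  # step 6: swap in the spare
        hosts.remove(failed_host)
        steps.append("replace failed host")
    return steps


# Example: a PSOD host that does not recover after the hard reboot.
hosts = ["host-01", "host-02", "host-03"]
steps = remediate(
    hosts,
    "host-02",
    Failure.PSOD,
    still_failing=lambda: True,
    healthy_after=lambda action: False,
    sleep=lambda seconds: None,  # skip the real 5-minute wait in this demo
)
print(steps)
print(hosts)
```

Passing `sleep` in as a parameter is just a convenience so the example runs instantly. Leaving it at the default would make the function really wait the five minutes.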

While this is a feature you don’t get to play with, or really see, it is essential in providing VMC customers with an automated and highly available platform that can keep their applications running during failures.