Health Check and Self-Healing Patterns in Java Microservices – Ensure Availability and Resilience


Introduction

Modern Java microservices need to be always-on, self-aware, and resilient. But what happens when a service hangs, a DB goes down, or a thread pool is exhausted?

Enter the Health Check and Self-Healing Patterns—two foundational resilience strategies that help detect failures and recover from them automatically, often without human intervention.

In this tutorial, you'll learn how to implement both patterns in Java applications using Spring Boot, Kubernetes, and custom logic.


🧠 What Are Health Check and Self-Healing Patterns?

Health Check Pattern

Provides an endpoint or mechanism to assess the status of critical components (DB, cache, memory, disk, etc.).

Self-Healing Pattern

Triggers automated actions (like restart, scale, reroute) when the system detects unhealthy behavior.


UML Diagram (Conceptual)

[Client or Load Balancer]
       |
       |---> /actuator/health --> [Service A] --> OK
       |                                 |
       |<--- Auto-restart (K8s probe fails)

👥 Core Participants

  • Health Indicator: Checks a component's health.
  • Monitoring System: Polls health endpoints (e.g., Prometheus, K8s).
  • Recovery Agent: Triggers actions like restart or fallback.
  • Self-Healing Logic: Built-in recovery routines within services.
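
To make these roles concrete, here is a minimal sketch in plain Java, with no framework involved. The class name SelfHealingCheck and its methods are hypothetical and exist only for illustration: the BooleanSupplier plays the health indicator, the Runnable plays the recovery agent, and runOnce() stands in for one polling cycle of a monitoring system.

```java
import java.util.function.BooleanSupplier;

/** Minimal sketch: one health indicator wired to one recovery action. */
class SelfHealingCheck {
    private final String component;
    private final BooleanSupplier healthIndicator; // returns true when healthy
    private final Runnable recoveryAgent;          // e.g. reconnect or restart
    private int recoveryCount = 0;

    SelfHealingCheck(String component, BooleanSupplier healthIndicator, Runnable recoveryAgent) {
        this.component = component;
        this.healthIndicator = healthIndicator;
        this.recoveryAgent = recoveryAgent;
    }

    /** Runs one monitoring cycle; returns true if the component ends the cycle healthy. */
    boolean runOnce() {
        if (isHealthy()) {
            return true;
        }
        recoveryCount++;
        System.out.println(component + " unhealthy, recovery attempt " + recoveryCount);
        try {
            recoveryAgent.run();
        } catch (RuntimeException e) {
            System.out.println(component + " recovery failed: " + e.getMessage());
            return false;
        }
        return isHealthy(); // re-check so callers know whether healing worked
    }

    int recoveryCount() {
        return recoveryCount;
    }

    // A check that throws is treated as unhealthy rather than crashing the monitor.
    private boolean isHealthy() {
        try {
            return healthIndicator.getAsBoolean();
        } catch (RuntimeException e) {
            return false;
        }
    }
}
```

Re-checking health after the recovery action is a deliberate choice here: it lets the caller distinguish "healed" from "still broken" and decide whether to escalate.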

🌍 Real-World Use Cases

  • Restarting crashed services in Kubernetes.
  • Re-initializing broken Kafka consumers.
  • Refreshing DB connections when the pool's connections go stale or fail validation.
  • Scaling pods when latency spikes.

🧰 Implementation Strategies in Java

1. Spring Boot Health Checks (Actuator)

Maven Dependency

<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

application.yml

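management:
  endpoints:
    web:
      exposure:
        include: health, info
  endpoint:
    health:
      probes:
        enabled: true   # exposes /actuator/health/liveness and /readiness when not running on Kubernetes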

Custom Health Indicator

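// Package names below are for Spring Boot 2.x/3.x.
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class RedisHealthIndicator implements HealthIndicator {

    @Override
    public Health health() {
        if (checkRedis()) {
            return Health.up().build();
        }
        // Details appear in the health response when show-details is enabled.
        return Health.down().withDetail("error", "Redis unreachable").build();
    }

    private boolean checkRedis() {
        // Placeholder: replace with a real PING or connection check.
        return false; // always reports DOWN until implemented
    }
}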
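
The checkRedis() placeholder above could be filled in many ways. One possible sketch, using only the JDK, opens a socket and sends Redis's inline PING command, then expects a +PONG reply. The class name RedisPing and its parameters are assumptions for illustration. It also assumes a Redis server that accepts unauthenticated connections: a server that requires AUTH answers with an error line, which this sketch reports as down. In a real application the Redis client you already use would normally perform this check.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Sketch of a Redis reachability probe using the inline PING command. */
class RedisPing {
    /** Returns true only if the server answers PING with +PONG within the timeout. */
    static boolean ping(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis); // bound the read as well as the connect
            OutputStream out = socket.getOutputStream();
            out.write("PING\r\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            String reply = in.readLine();
            return "+PONG".equals(reply);
        } catch (IOException e) {
            return false; // refused, timed out, or dropped: report as down
        }
    }
}
```

Bounding both the connect and the read with a timeout matters for health checks: a probe that hangs is worse than one that reports DOWN.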

2. Kubernetes Liveness and Readiness Probes

K8s YAML

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
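
To show what these two probes are actually asking, the following sketch serves separate liveness and readiness endpoints with the JDK's built-in HTTP server. It is not how Actuator is implemented; the class name ProbeEndpoints, the /health/... paths, and the two flags are assumptions for illustration. The point is that the two answers are independent: a service can be alive (no restart needed) while not yet ready (no traffic wanted).

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch: separate liveness and readiness endpoints backed by independent flags. */
class ProbeEndpoints {
    final AtomicBoolean live = new AtomicBoolean(true);   // false => process should be restarted
    final AtomicBoolean ready = new AtomicBoolean(false); // false => keep traffic away for now
    private final HttpServer server;

    ProbeEndpoints(int port) throws IOException {
        server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/health/liveness", exchange -> respond(exchange, live.get()));
        server.createContext("/health/readiness", exchange -> respond(exchange, ready.get()));
    }

    void start() {
        server.start();
    }

    void stop() {
        server.stop(0);
    }

    int port() {
        return server.getAddress().getPort();
    }

    // HTTP probes treat 2xx/3xx as success, so an unhealthy state maps to 503.
    private static void respond(HttpExchange exchange, boolean ok) throws IOException {
        byte[] body = (ok ? "{\"status\":\"UP\"}" : "{\"status\":\"DOWN\"}")
                .getBytes(StandardCharsets.UTF_8);
        exchange.getResponseHeaders().add("Content-Type", "application/json");
        exchange.sendResponseHeaders(ok ? 200 : 503, body.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
        }
    }
}
```

A service would typically flip ready to true once caches are warm and dependencies are reachable, and set live to false only when the process cannot recover without a restart.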

3. Self-Healing via Recovery Code

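// Requires @EnableScheduling on a configuration class.
// isHealthy() and reconnect() are not part of javax.sql.DataSource; dataSource
// here stands for an application-specific wrapper that provides them.
@Scheduled(fixedRate = 30000) // every 30 seconds
public void autoReconnectIfDbDown() {
    if (!dataSource.isHealthy()) {
        logger.warn("DB unhealthy. Re-initializing...");
        dataSource.reconnect();
    }
}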

✅ Pros and Cons

Pros | Cons
Detects failure before users are impacted | May restart during transient spikes
Enables automatic recovery and uptime | Requires careful configuration
Improves observability and SLA enforcement | May mask underlying root causes

❌ Anti-Patterns and Misuse

  • Not customizing health checks (only default checks)
  • Using the same endpoint for liveness and readiness
  • Over-restarting due to aggressive probe settings
  • Ignoring service logs during healing
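
One common guard against over-restarting is to act only after several consecutive failed checks, which is the same idea as the failureThreshold field on a Kubernetes probe. The sketch below applies it to in-process recovery logic. The class name FailureThreshold and its record method are hypothetical, and the threshold of 3 used in the usage lines is an arbitrary illustration.

```java
/** Sketch: require several consecutive failures before triggering recovery. */
class FailureThreshold {
    private final int threshold;
    private int consecutiveFailures = 0;

    FailureThreshold(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /** Records one check result; returns true when recovery should fire. */
    synchronized boolean record(boolean healthy) {
        if (healthy) {
            consecutiveFailures = 0; // any success resets the streak
            return false;
        }
        consecutiveFailures++;
        if (consecutiveFailures >= threshold) {
            consecutiveFailures = 0; // start a fresh streak after acting
            return true;
        }
        return false;
    }
}
```

With new FailureThreshold(3), two failures followed by a success trigger nothing; only three failures in a row return true, so a single transient spike does not cause a restart.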

Related Patterns

Pattern | Purpose
Health Check | Detect component status
Self-Healing | Take automated recovery actions
Circuit Breaker | Stop calling failing services
Retry Pattern | Retry operations on failure
Failover | Switch to backup system

💻 Java Code – Health Check + Recovery

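import java.util.concurrent.TimeUnit;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class KafkaHealthIndicator implements HealthIndicator {

    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;

    @Override
    public Health health() {
        try {
            // send() is asynchronous; wait for the broker's acknowledgement,
            // otherwise delivery failures never reach the catch block.
            kafkaTemplate.send("health-check", "ping").get(2, TimeUnit.SECONDS);
            return Health.up().build();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return Health.down(e).build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}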

Recovery Code

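// kafkaHealthy() and restartKafkaClient() are application-specific helpers not shown here.
// Keep a reference to the executor so it can be shut down when the application stops.
private final ScheduledExecutorService healthWatcher = Executors.newSingleThreadScheduledExecutor();

@EventListener(ApplicationReadyEvent.class)
public void startHealthWatcher() {
    healthWatcher.scheduleAtFixedRate(() -> {
        try {
            if (!kafkaHealthy()) {
                restartKafkaClient();
            }
        } catch (Exception e) {
            // An uncaught exception would silently cancel all future runs.
            logger.error("Kafka health watcher failed", e);
        }
    }, 0, 30, TimeUnit.SECONDS);
}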

🔧 Refactoring Legacy Code

Before

  • No health endpoint
  • No recovery logic

After

  • Add /actuator/health
  • Add liveness/readiness probe support
  • Add periodic checks to restart flaky services

🌟 Best Practices

  • Separate liveness and readiness checks.
  • Customize health indicators for all critical components.
  • Use exponential backoff in retries.
  • Add alerting on repeated self-healing.
  • Document what constitutes an “unhealthy” state.
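
Two of these practices, exponential backoff and alerting on repeated self-healing, can be sketched together in plain Java. The class name BackoffHealer, the heal and delayMillis methods, and the stderr "ALERT" line are assumptions for illustration; a real service would raise the alert through its monitoring stack and would usually add random jitter to the delays.

```java
import java.util.function.BooleanSupplier;

/** Sketch: retry a recovery action with capped exponential backoff. */
class BackoffHealer {
    /** Delay before retrying after the given 0-based attempt: base * 2^attempt, capped at max. */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            // Double the delay, jumping straight to the cap to avoid overflow.
            delay = delay > maxMillis / 2 ? maxMillis : delay * 2;
        }
        return Math.min(delay, maxMillis);
    }

    /**
     * Runs the recovery action until it reports success or attempts run out.
     * Returns the number of attempts used, or -1 if every attempt failed.
     */
    static int heal(BooleanSupplier recovery, int maxAttempts, long baseMillis, long maxMillis)
            throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (recovery.getAsBoolean()) {
                return attempt + 1;
            }
            if (attempt < maxAttempts - 1) {
                Thread.sleep(delayMillis(attempt, baseMillis, maxMillis));
            }
        }
        // Repeated failed healing is itself a signal worth alerting on.
        System.err.println("ALERT: recovery failed after " + maxAttempts + " attempts");
        return -1;
    }
}
```

With a base of 100 ms and a cap of 1000 ms, the waits grow as 100, 200, 400, 800 and then stay at 1000, so a struggling dependency is not hammered with reconnect attempts.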

🧠 Real-World Analogy

Think of your car’s dashboard lights as health checks: when something is wrong, they tell you. Modern cars also heal themselves, for example by switching to eco-mode or rerouting power when tire pressure drops. That’s self-healing.


☕ Java Feature Relevance

  • Spring Boot Actuator: Easily expose health metrics.
  • @Scheduled: Run periodic recovery checks.
  • Records/Sealed Types: Model recovery response objects.
  • CompletableFuture: Retry or parallel healing logic.
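
As one illustration of the CompletableFuture point, the sketch below runs several health checks in parallel and treats any check that throws or exceeds a timeout as down. The class name ParallelHealthChecks and the runAll method are hypothetical. It needs Java 9 or later for completeOnTimeout, and a timed-out check keeps running in the background, so the checks themselves should still be bounded.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Sketch: run independent health checks concurrently with a shared timeout. */
class ParallelHealthChecks {
    /** A check that throws or exceeds the timeout is reported as false (down). */
    static Map<String, Boolean> runAll(Map<String, Supplier<Boolean>> checks, long timeoutMillis) {
        Map<String, CompletableFuture<Boolean>> futures = new LinkedHashMap<>();
        // Start every check first so they overlap instead of running one by one.
        checks.forEach((name, check) -> futures.put(name,
                CompletableFuture.supplyAsync(check)
                        .completeOnTimeout(false, timeoutMillis, TimeUnit.MILLISECONDS)
                        .exceptionally(error -> false)));
        Map<String, Boolean> results = new LinkedHashMap<>();
        futures.forEach((name, future) -> results.put(name, future.join()));
        return results;
    }
}
```

Running checks in parallel keeps the overall health response time close to the slowest single check (or the timeout) rather than the sum of all checks.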

🔚 Conclusion & Key Takeaways

Health checks detect problems. Self-healing fixes them.

Together, they form the backbone of microservice resilience, ensuring systems recover fast, scale smartly, and stay available even in partial failure scenarios.

✅ Summary

  • Use /actuator/health with custom indicators.
  • Integrate with Kubernetes probes.
  • Implement self-healing with scheduled logic.
  • Monitor, test, and refine frequently.

❓ FAQ – Health Check & Self-Healing in Java

1. What’s the difference between liveness and readiness?

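Liveness: Is the app running? If this check fails, Kubernetes restarts the container.
Readiness: Is the app ready to receive traffic? If this check fails, Kubernetes stops routing requests to the pod but does not restart it.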

2. Can I customize Spring Boot health checks?

Yes. Implement HealthIndicator.

3. Should I restart on all failures?

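No. Restart only when the process itself is broken, such as a deadlock or an unrecoverable state. For dependency outages, fail readiness, retry, or fall back instead.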

4. What’s a good retry interval?

Start with 30 seconds and monitor.

5. Can self-healing mask real issues?

Yes. Add alerting to monitor recovery attempts.

6. What tools integrate with health checks?

Kubernetes, Prometheus, Grafana, ELK.

7. Do all services need readiness probes?

Any service that receives traffic benefits from one. They matter most for services with long startup processes or external dependencies.

8. How can I simulate failure locally?

Kill the DB connection or simulate high CPU load, then watch how the liveness and readiness endpoints respond.

9. What about multi-region failover?

Use readiness checks to direct traffic only to healthy regions.

10. Can I use retries and self-healing together?

Yes. Retry short failures, self-heal persistent issues.