Health Check and Self-Healing Patterns in Java Microservices – Ensure Availability and Resilience

Introduction

Modern Java microservices need to be always-on, self-aware, and resilient. But what happens when a service hangs, a DB goes down, or a thread pool is exhausted?

Enter the Health Check and Self-Healing Patterns, two foundational resilience strategies that detect failures and recover from them automatically, often without human intervention.

In this tutorial, you'll learn how to implement both patterns in Java applications using Spring Boot, Kubernetes, and custom logic.

🧠 What Are Health Check and Self-Healing Patterns?

Health Check Pattern

Provides an endpoint or mechanism to assess the status of critical components (DB, cache, memory, disk, etc.).

Self-Healing Pattern

Triggers automated actions (like restart, scale, reroute) when the system detects unhealthy behavior.

UML Diagram (Conceptual)

[Client or Load Balancer]
       |
       |---> /actuator/health --> [Service A] --> OK
       |                                 |
       |<--- Auto-restart (K8s probe fails)

👥 Core Participants

Health Indicator: Checks a component's health.
Monitoring System: Polls health endpoints (e.g., Prometheus, K8s).
Recovery Agent: Triggers actions like restart or fallback.
Self-Healing Logic: Built-in recovery routines within services.
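
As a rough illustration, the roles above can be sketched in plain Java without any framework. The names ParticipantsSketch and Supervisor are hypothetical and chosen only for this example; a real setup would usually split the roles between Actuator, the orchestrator, and application code.

```java
import java.util.function.BooleanSupplier;

/** Minimal sketch of the participants: indicator, monitor, and recovery agent. */
class ParticipantsSketch {

    /** Plays the monitoring system and recovery agent: polls an indicator and heals on failure. */
    static final class Supervisor {
        private final BooleanSupplier healthIndicator; // returns true when the component is healthy
        private final Runnable recoveryAction;         // e.g. reconnect, restart, fall back
        private int recoveries = 0;

        Supervisor(BooleanSupplier healthIndicator, Runnable recoveryAction) {
            this.healthIndicator = healthIndicator;
            this.recoveryAction = recoveryAction;
        }

        /** Runs one poll cycle; returns true if the component was healthy. */
        boolean pollOnce() {
            if (healthIndicator.getAsBoolean()) {
                return true;
            }
            recoveries++;
            recoveryAction.run();
            return false;
        }

        int recoveries() {
            return recoveries;
        }
    }

    public static void main(String[] args) {
        boolean[] connected = {false}; // stands in for a real dependency
        Supervisor supervisor = new Supervisor(() -> connected[0], () -> connected[0] = true);
        System.out.println(supervisor.pollOnce()); // false: unhealthy, recovery runs
        System.out.println(supervisor.pollOnce()); // true: recovery restored the component
    }
}
```

Counting recoveries gives the monitoring side something to alert on when the same component keeps needing repair.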

🌍 Real-World Use Cases

Restarting crashed services in Kubernetes.
Re-initializing broken Kafka consumers.
Reconnecting to DB if a pool is exhausted.
Scaling pods when latency spikes.

🧰 Implementation Strategies in Java

1. Spring Boot Health Checks (Actuator)

Maven Dependency

<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

application.yml

management:
  endpoints:
    web:
      exposure:
        include: health, info

Custom Health Indicator

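// Package names as of Spring Boot 2.x and 3.x
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class RedisHealthIndicator implements HealthIndicator {

    @Override
    public Health health() {
        // Report UP when Redis answers, otherwise DOWN with a detail shown in the health response
        if (checkRedis()) {
            return Health.up().build();
        }
        return Health.down().withDetail("error", "Redis unreachable").build();
    }

    private boolean checkRedis() {
        // Placeholder: ping Redis or check the connection here
        return false; // simulates Redis being down
    }
}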
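
The checkRedis() method above is a stub. One simple way to fill it in, without assuming any particular Redis client, is a TCP connect check. The sketch below is only that: the class name TcpReachability, the host, and the timeout are assumptions for illustration, and an open port does not prove that Redis itself answers commands.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class TcpReachability {

    /** Returns true if a TCP connection to host:port opens within timeoutMillis. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unresolved: treat as down
            return false;
        }
    }

    public static void main(String[] args) {
        // 6379 is the default Redis port; localhost is an assumption for local testing
        System.out.println("Redis reachable: " + isReachable("localhost", 6379, 500));
    }
}
```

A short timeout matters here: a health check that blocks for a long time can itself make the service look hung to whatever is polling it.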

2. Kubernetes Liveness and Readiness Probes

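K8s YAML

The liveness and readiness paths below are Spring Boot health groups (available since Spring Boot 2.3). Spring Boot enables them automatically when it detects a Kubernetes environment; elsewhere, set management.endpoint.health.probes.enabled=true.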

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
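
To make the contract behind these probes concrete, here is a framework-free sketch that serves separate liveness and readiness endpoints from the JDK's built-in HTTP server. It is not how Actuator is implemented; the class name ProbeEndpoints, the paths, and port 8081 are assumptions for illustration. The kubelet treats any status code from 200 to 399 as success, so 200 and 503 are enough to signal both states.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch: separate liveness and readiness endpoints on the JDK's built-in HTTP server. */
class ProbeEndpoints {
    // Liveness: the process is running and not stuck. Readiness: it can take traffic now.
    final AtomicBoolean live = new AtomicBoolean(true);
    final AtomicBoolean ready = new AtomicBoolean(false); // not ready until startup work finishes

    /** Starts the server; passing port 0 lets the OS pick a free port. */
    HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        register(server, "/health/liveness", live);
        register(server, "/health/readiness", ready);
        server.start();
        return server;
    }

    private static void register(HttpServer server, String path, AtomicBoolean flag) {
        server.createContext(path, exchange -> {
            // 200 reports success to the prober, 503 reports failure; -1 means no response body
            exchange.sendResponseHeaders(flag.get() ? 200 : 503, -1);
            exchange.close();
        });
    }

    public static void main(String[] args) throws IOException {
        ProbeEndpoints probes = new ProbeEndpoints();
        HttpServer server = probes.start(8081); // 8081 is an arbitrary port for this demo
        probes.ready.set(true); // flip once caches are warm, connections are open, etc.
        System.out.println("Probes listening on port " + server.getAddress().getPort());
    }
}
```

Keeping the two flags independent is the point: a service that is alive but waiting on a dependency should fail readiness (and stop receiving traffic) without failing liveness (and being restarted).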

3. Self-Healing via Recovery Code

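// Sketch: assumes an injected javax.sql.DataSource named dataSource and a logger.
@Scheduled(fixedRate = 30000) // every 30 seconds; requires @EnableScheduling
public void autoReconnectIfDbDown() {
    // Connection.isValid is standard JDBC; 2 is the timeout in seconds
    try (Connection connection = dataSource.getConnection()) {
        if (connection.isValid(2)) {
            return;
        }
    } catch (SQLException e) {
        logger.warn("DB health probe failed", e);
    }
    logger.warn("DB unhealthy. Re-initializing...");
    reinitializePool(); // hypothetical helper: the reset step depends on your connection pool
}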

✅ Pros and Cons

Pros	Cons
Detects failure before users are impacted	May restart during transient spikes
Enables automatic recovery and uptime	Requires careful configuration
Improves observability and SLA enforcement	May mask underlying root causes

❌ Anti-Patterns and Misuse

Not customizing health checks (only default checks)
Using the same endpoint for liveness and readiness
Over-restarting due to aggressive probe settings
Ignoring service logs during healing
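
One common guard against over-restarting is to act only after several consecutive failed checks, so a single transient blip does not trigger recovery. Kubernetes probes offer a comparable failureThreshold setting. The sketch below shows the idea in plain Java; the class name FailureThreshold and the threshold value are assumptions for illustration.

```java
/** Sketch: report "unhealthy" only after N consecutive failed checks. */
class FailureThreshold {
    private final int threshold;
    private int consecutiveFailures = 0;

    FailureThreshold(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
    }

    /** Records one check result; returns true once the threshold of consecutive failures is reached. */
    synchronized boolean record(boolean checkPassed) {
        // Any successful check resets the streak
        consecutiveFailures = checkPassed ? 0 : consecutiveFailures + 1;
        return consecutiveFailures >= threshold;
    }

    public static void main(String[] args) {
        FailureThreshold gate = new FailureThreshold(3);
        System.out.println(gate.record(false)); // false: 1 failure
        System.out.println(gate.record(false)); // false: 2 failures
        System.out.println(gate.record(false)); // true: 3 in a row, time to act
    }
}
```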

Related Patterns

Pattern	Purpose
Health Check	Detect component status
Self-Healing	Take automated recovery actions
Circuit Breaker	Stop calling failing services
Retry Pattern	Retry operations on failure
Failover	Switch to backup system

💻 Java Code – Health Check + Recovery

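@Component
public class KafkaHealthIndicator implements HealthIndicator {

    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;

    @Override
    public Health health() {
        try {
            // send() is asynchronous, so wait for the broker acknowledgement;
            // without get(), delivery failures would never reach the catch block.
            // Note: this publishes a real record to the "health-check" topic on every check.
            kafkaTemplate.send("health-check", "ping").get(2, TimeUnit.SECONDS);
            return Health.up().build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}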

Recovery Code

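@EventListener(ApplicationReadyEvent.class)
public void startHealthWatcher() {
    // kafkaHealthy() and restartKafkaClient() are application-specific helpers not shown here
    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
        try {
            if (!kafkaHealthy()) {
                restartKafkaClient();
            }
        } catch (RuntimeException e) {
            // an uncaught exception would cancel all future runs of this task
            logger.warn("Kafka recovery attempt failed", e);
        }
    }, 0, 30, TimeUnit.SECONDS);
}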
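
A framework-free variant of such a watcher might look like the sketch below. It differs from the snippet above in a few deliberate ways: it uses scheduleWithFixedDelay instead of scheduleAtFixedRate so slow cycles never pile up, runs on a daemon thread, can be shut down, and counts recovery attempts so they can be alerted on. The class name HealthWatcher and its method names are assumptions for illustration.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Sketch: periodic health watcher that triggers a recovery action and can be stopped. */
class HealthWatcher implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "health-watcher");
                thread.setDaemon(true); // do not keep the JVM alive just for the watcher
                return thread;
            });
    private final AtomicInteger recoveryAttempts = new AtomicInteger();

    /** Checks health every periodMillis and runs recover whenever the check fails. */
    void start(BooleanSupplier healthy, Runnable recover, long periodMillis) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                if (!healthy.getAsBoolean()) {
                    recoveryAttempts.incrementAndGet();
                    recover.run();
                }
            } catch (RuntimeException e) {
                // An uncaught exception would silently cancel all future runs
                System.err.println("Health watcher cycle failed: " + e);
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Number of times recovery was triggered; a steadily rising value deserves an alert. */
    int recoveryAttempts() {
        return recoveryAttempts.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

A caller would wrap it in try-with-resources, or close it from a shutdown hook, so the scheduler thread does not outlive the component it watches.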

🔧 Refactoring Legacy Code

Before

No health endpoint
No recovery logic

After

Add /actuator/health
Add liveness/readiness probe support
Add periodic checks to restart flaky services

🌟 Best Practices

Separate liveness and readiness checks.
Customize health indicators for all critical components.
Use exponential backoff in retries.
Add alerting on repeated self-healing.
Document what constitutes an “unhealthy” state.
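
The exponential backoff mentioned above can be sketched as a small delay calculation: the wait doubles with each failed attempt until it reaches a cap. The class name Backoff, the base delay, and the cap are assumptions for illustration; production code often adds random jitter as well, which is left out here to keep the example short.

```java
/** Sketch: capped exponential backoff delays for retries or recovery attempts. */
class Backoff {

    /** Delay before retry number attempt (0-based): baseMillis * 2^attempt, capped at maxMillis. */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 0 || baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long delay = baseMillis;
        // Stop doubling once the cap is reached; comparing against max/2 avoids long overflow
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
        }
        return delay;
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt <= 5; attempt++) {
            // prints 100, 200, 400, 800, 1000, 1000 (one value per line)
            System.out.println(delayMillis(attempt, 100, 1000));
        }
    }
}
```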

🧠 Real-World Analogy

Think of your car's dashboard lights as health checks: when something is wrong, they tell you. Modern cars also react on their own, for example by switching to a reduced-power mode when a sensor reports a fault. That's self-healing.

☕ Java Feature Relevance

Spring Boot Actuator: Easily expose health metrics.
@Scheduled: Run periodic recovery checks.
Records/Sealed Types: Model recovery response objects.
CompletableFuture: Retry or parallel healing logic.
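
As one possible illustration of the last two items, the sketch below models a check result as a record and runs several checks in parallel with CompletableFuture, treating a timeout or exception as unhealthy. It assumes Java 16 or later; the names ParallelHealthChecks and CheckResult are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Sketch: parallel health checks with per-check timeouts. */
class ParallelHealthChecks {

    /** Result of one component check; a record keeps it immutable and concise. */
    record CheckResult(String component, boolean healthy) {}

    /** Runs one check asynchronously; a timeout or exception counts as unhealthy. */
    static CompletableFuture<CheckResult> check(String component, BooleanSupplier probe, long timeoutMillis) {
        return CompletableFuture
                .supplyAsync(() -> new CheckResult(component, probe.getAsBoolean()))
                // Completes the future with "unhealthy" on timeout; the probe itself is not interrupted
                .completeOnTimeout(new CheckResult(component, false), timeoutMillis, TimeUnit.MILLISECONDS)
                .exceptionally(error -> new CheckResult(component, false));
    }

    /** Waits for every check and returns the results in the same order. */
    static List<CheckResult> checkAll(List<CompletableFuture<CheckResult>> checks) {
        return checks.stream().map(CompletableFuture::join).toList();
    }

    public static void main(String[] args) {
        List<CheckResult> results = checkAll(List.of(
                check("db", () -> true, 1000),
                check("cache", () -> { throw new IllegalStateException("connection refused"); }, 1000)));
        results.forEach(System.out::println);
    }
}
```

Bounding each check with a timeout keeps one hung dependency from stalling the whole health report.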

🔚 Conclusion & Key Takeaways

Health checks detect problems. Self-healing fixes them.

Together, they form the backbone of microservice resilience, ensuring systems recover fast, scale smartly, and stay available even in partial failure scenarios.

✅ Summary

Use /actuator/health with custom indicators.
Integrate with Kubernetes probes.
Implement self-healing with scheduled logic.
Monitor, test, and refine frequently.

❓ FAQ – Health Check & Self-Healing in Java

1. What’s the difference between liveness and readiness?

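Liveness: Is the app running and not stuck? A failing liveness probe makes Kubernetes restart the container.
Readiness: Is the app ready to receive traffic? A failing readiness probe removes the pod from load balancing without restarting it.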

2. Can I customize Spring Boot health checks?

Yes. Implement HealthIndicator.

3. Should I restart on all failures?

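No. Restart only when a restart can actually fix the problem, such as a hung or deadlocked process. For a failing dependency, prefer retries, fallbacks, or marking the service as not ready.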

4. What’s a good retry interval?

Start with 30 seconds and monitor.

5. Can self-healing mask real issues?

Yes. Add alerting to monitor recovery attempts.

6. What tools integrate with health checks?

Kubernetes, Prometheus, Grafana, ELK.

7. Do all services need readiness probes?

Any service that receives traffic can benefit from one, but they matter most for services with long startup processes or external dependencies.

8. How can I simulate failure locally?

Kill the DB connection to test readiness and recovery logic, or simulate high CPU or a hung thread to test liveness.

9. What about multi-region failover?

Use readiness checks to direct traffic only to healthy regions.

10. Can I use retries and self-healing together?

Yes. Retry short failures, self-heal persistent issues.
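
One way to combine the two is to escalate: retry a few times first, and run the heavier self-healing step only if the failure persists. The sketch below shows that flow under simple assumptions; the class name RetryThenHeal and its parameters are hypothetical, and a real version would normally wait between attempts.

```java
import java.util.function.BooleanSupplier;

/** Sketch: retry transient failures first, self-heal only when they persist. */
class RetryThenHeal {

    /**
     * Tries the operation up to maxAttempts times. If every attempt fails,
     * runs selfHeal once and makes one final attempt.
     */
    static boolean run(BooleanSupplier operation, int maxAttempts, Runnable selfHeal) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (operation.getAsBoolean()) {
                return true; // transient failure cleared by a plain retry
            }
        }
        selfHeal.run(); // failure persisted: escalate to the heavier recovery step
        return operation.getAsBoolean(); // one final attempt after healing
    }
}
```

For example, run(() -> sendMessage(), 3, () -> restartClient()) would call the hypothetical restartClient() only after three failed sends.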