Debugging Kubernetes CrashLoopBackOff: A Deep Dive into Redis Eviction Policies
By Jun Nguyen


TL;DR: Your pod is crashing because it’s refusing to start with an unsafe Redis configuration. This article explains why that’s actually good design and how to fix it.

🔴 The Problem: A Stubborn Pod

Picture this: You’ve just deployed your service, but it keeps crashing. The Kubernetes logs show a familiar nightmare:

```
stream closed: EOF for k8s-namespace/statistics-service-deployment-xxx
```

The pod enters `CrashLoopBackOff`, and before you can even exec in to debug, it's gone. Again. And again.

But buried in the logs, there’s a critical clue:

```
IMPORTANT! Eviction policy is volatile-lru. It should be "noeviction"
```

This isn’t just a warning—it’s your application taking a stand against configuration drift that could lead to silent data loss in production.

📊 The Root Cause Analysis

Our service connects to Redis during initialization and performs a configuration sanity check. When it detects `volatile-lru` instead of the expected `noeviction` policy, it makes a bold decision: refuse to start.

Why? Because running with the wrong eviction policy is like building a house on quicksand—it might work today, but eventually, something important will disappear beneath your feet.

🎓 Redis Eviction Policies 101

Before we dive deeper, let’s understand what Redis eviction policies are and why they matter.

When Redis reaches its configured memory limit (`maxmemory`), it needs to make a choice: evict (delete) some data, or refuse new writes? The eviction policy determines exactly how this decision is made.

Think of Redis as a parking lot with limited spaces. The eviction policy is your parking enforcement strategy:

  • Do you tow random cars? (`allkeys-random`)
  • Do you tow the cars that have gone unused the longest? (`allkeys-lru`)
  • Do you refuse entry when full? (`noeviction`)

The Complete Eviction Policy Catalog


Redis offers 8 distinct eviction policies, each with unique characteristics and trade-offs. Let’s explore them:


🛡️ No Eviction Policy

1. `noeviction` (Recommended for Critical Data)

What it does: Returns errors on write operations when the memory limit is reached. Reads continue to work normally.

When to use:

  • When data loss is unacceptable
  • For databases or critical application state
  • When you want explicit control over memory pressure

Benefits:

  • Prevents silent data loss
  • Forces the application to handle memory pressure explicitly
  • More predictable behavior for critical data

Trade-off: Application must handle write errors and implement cleanup logic

Real-world analogy: Like a bouncer at a nightclub—when capacity is reached, new guests wait outside rather than randomly kicking people out.


📉 LRU (Least Recently Used) Policies

2. `volatile-lru` ⚠️ (Current Setting in Our Case)

What it does: Evicts the least recently used keys that have an expiration (TTL) set

When to use:

  • Caching scenarios where only temporary data has TTLs
  • When you want to protect non-expiring keys

Risks:

  • Loss of cached data unexpectedly
  • Application errors if it expects certain keys to exist
  • Potential data inconsistency issues
  • If no keys have a TTL, behaves like `noeviction`

Real-world analogy: Evicting cars with temporary parking permits, but only the ones that haven’t moved recently.

3. `allkeys-lru`

What it does: Evicts the least recently used keys among ALL keys (regardless of TTL)

When to use:

  • Pure caching use cases
  • When all data is equally evictable
  • When you want automatic cache management

Benefits: Simple, effective for general-purpose caching

Risk: Any key can be evicted, including important ones

Real-world analogy: A true LRU cache—evict whatever hasn’t been accessed recently, no exceptions.
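To build intuition for the LRU variants, here is a minimal sketch of the selection logic. It is illustrative only: Redis actually uses an approximated LRU based on random sampling, not an exact ordering like this one.

```typescript
// Toy exact-LRU cache: a Map remembers insertion order, and re-inserting a
// key on access moves it to the back, so the first key in iteration order
// is always the least recently used.
class LruCache<V> {
  private entries = new Map<string, V>();

  constructor(private capacity: number) {}

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value !== undefined) {
      // Touch: move the key to the most-recently-used position.
      this.entries.delete(key);
      this.entries.set(key, value);
    }
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // Evict the least recently used key (front of the Map's order).
      const lru = this.entries.keys().next().value as string;
      this.entries.delete(lru);
    }
  }

  has(key: string): boolean {
    return this.entries.has(key);
  }
}
```

With capacity 2, writing `a` and `b`, reading `a`, then writing `c` evicts `b` — exactly the "whatever hasn't been accessed recently" behavior described above.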


🎯 LFU (Least Frequently Used) Policies

4. `volatile-lfu`

What it does: Evicts the least frequently used keys that have an expiration (TTL) set

When to use:

  • When access frequency matters more than recency
  • Caching hot/popular data

Benefits: Better retention of frequently accessed data

Trade-off: More complex tracking, slower than LRU

5. `allkeys-lfu`

What it does: Evicts the least frequently used keys among ALL keys

When to use:

  • When you want to keep frequently accessed data regardless of when it was last used
  • Advanced caching scenarios

Benefits: Optimizes for access patterns over time

Trade-off: More CPU overhead for tracking

Real-world analogy: Like a library that keeps popular books readily available, even if they haven’t been checked out recently.
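The frequency-based variants can be sketched the same way. This toy cache counts accesses and evicts the key with the lowest count; Redis's real implementation uses a probabilistic 8-bit counter with decay, so treat this purely as intuition.

```typescript
// Toy exact-LFU cache: one map for values, one for access counts.
class LfuCache<V> {
  private values = new Map<string, V>();
  private counts = new Map<string, number>();

  constructor(private capacity: number) {}

  get(key: string): V | undefined {
    if (this.values.has(key)) {
      this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
    }
    return this.values.get(key);
  }

  set(key: string, value: V): void {
    this.values.set(key, value);
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
    if (this.values.size > this.capacity) {
      // Evict the least frequently used key, never the one just written.
      let victim = '';
      let min = Infinity;
      for (const [k, n] of this.counts) {
        if (k !== key && n < min) {
          min = n;
          victim = k;
        }
      }
      this.values.delete(victim);
      this.counts.delete(victim);
    }
  }

  has(key: string): boolean {
    return this.values.has(key);
  }
}
```

Note the difference from LRU: a key read three times long ago survives, while a key written once recently is the first to go.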


🎲 Random Eviction Policies

6. `volatile-random`

What it does: Evicts a random key that has an expiration (TTL) set

When to use:

  • When you don’t care about access patterns
  • Low-overhead eviction
  • Testing or non-critical caching

Benefits: Very fast, minimal CPU overhead

Risk: May evict frequently used data

7. `allkeys-random`

What it does: Evicts a random key from all keys

When to use:

  • Like `volatile-random`, but across all keys
  • Simplest eviction strategy

Benefits: Fastest eviction, no tracking needed

Risk: Unpredictable behavior

Real-world analogy: Like playing Russian roulette with your data—fast but chaotic.


TTL-Based Policy

8. `volatile-ttl`

What it does: Evicts keys with an expiration (TTL) set, choosing the ones with the shortest time-to-live first

When to use:

  • When you want to evict data that’s about to expire anyway
  • Time-sensitive caching (session data, temporary tokens)

Benefits: Natural eviction order based on expiration

Risk: May evict recently set keys with short TTLs over old keys with long TTLs

Real-world analogy: Like eating food in your fridge based on expiration dates—what’s about to spoil goes first.
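The selection rule is simple enough to sketch directly: among keys that carry a TTL, pick the one expiring soonest. (This is an exact, illustrative version; Redis samples candidates rather than sorting everything.)

```typescript
// Sketch of volatile-ttl's choice. Keys without a TTL (ttlMs === null) are
// never candidates — if nothing has a TTL, no eviction can happen, just as
// with the other volatile-* policies.
interface Entry {
  key: string;
  ttlMs: number | null;
}

function pickVolatileTtlVictim(entries: Entry[]): string | null {
  const withTtl = entries.filter((e) => e.ttlMs !== null);
  if (withTtl.length === 0) return null; // behaves like noeviction
  withTtl.sort((a, b) => (a.ttlMs as number) - (b.ttlMs as number));
  return withTtl[0].key;
}
```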


📊 Quick Reference Comparison

| Policy | Scope | Algorithm | CPU Overhead | Best For | Predictability |
| --- | --- | --- | --- | --- | --- |
| `noeviction` | N/A | None | None | Critical data, databases | ⭐⭐⭐⭐⭐ |
| `volatile-lru` | Keys with TTL | Least Recently Used | Low | Mixed data with TTLs | ⭐⭐⭐ |
| `allkeys-lru` | All keys | Least Recently Used | Low | General caching | ⭐⭐⭐ |
| `volatile-lfu` | Keys with TTL | Least Frequently Used | Medium | Popular data with TTLs | ⭐⭐⭐⭐ |
| `allkeys-lfu` | All keys | Least Frequently Used | Medium | Frequency-based caching | ⭐⭐⭐⭐ |
| `volatile-random` | Keys with TTL | Random | Very Low | Non-critical caching, testing | ⭐ |
| `allkeys-random` | All keys | Random | Very Low | Testing, simple caches | ⭐ |
| `volatile-ttl` | Keys with TTL | Shortest TTL first | Low | Time-sensitive data | ⭐⭐⭐⭐ |

🔍 Why Our Service is Refusing to Start

Now that we understand eviction policies, let’s connect the dots:

Current State: `volatile-lru` (evicts the least recently used keys with TTL)

Required State: `noeviction` (no automatic eviction; return errors instead)

The Architecture Decision

This isn’t a bug—it’s a feature. The development team implemented a fail-fast pattern with configuration validation at startup. Here’s why:

Scenario 1: The Silent Data Loss Problem

Imagine this production horror story:

```
09:00 - Service starts normally with volatile-lru
09:15 - Redis reaches memory limit
09:15 - Redis silently evicts critical booking state with TTL
09:16 - User completes checkout, app expects booking data
09:16 - Data is gone → payment processed but booking lost
09:17 - Customer service nightmare begins 😱
```

Scenario 2: The Race Condition

```
Thread 1: Write booking data with 1-hour TTL
Thread 2: Redis evicts the booking (memory pressure)
Thread 1: Try to read booking → 404 Not Found
Thread 1: Throws unhandled exception → Pod crashes
```

Scenario 3: The Debugging Nightmare

With `volatile-lru`, data disappears silently. No errors, no logs, just… gone. With `noeviction`, you get explicit OOM errors that you can monitor, alert on, and handle gracefully.

Why Fail-Fast is Good Engineering

The service implements the “fail loud, fail early” principle:

  1. Detection at startup - Problems caught before serving traffic
  2. Clear error message - No ambiguity about what’s wrong
  3. Prevents data loss - Won’t run in an unsafe configuration
  4. Forces proper fix - Can’t be ignored or worked around

This is production-ready design at its finest. 🎯
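The article doesn't show the service's actual validation code, but a minimal sketch of such a startup check might look like this. The `assertSafeEvictionPolicy` helper is hypothetical; the error text mirrors the warning seen in the logs, and the commented wiring assumes an ioredis-style client.

```typescript
// Fail-fast configuration check: CONFIG GET replies with a [name, value]
// pair, and anything other than noeviction aborts startup.
function assertSafeEvictionPolicy(reply: string[]): void {
  const policy = reply[1];
  if (policy !== 'noeviction') {
    throw new Error(
      `IMPORTANT! Eviction policy is ${policy}. It should be "noeviction"`,
    );
  }
}

// Hypothetical wiring during application bootstrap (ioredis assumed):
//
//   const reply = (await redis.config('GET', 'maxmemory-policy')) as string[];
//   assertSafeEvictionPolicy(reply); // throws → process exits → CrashLoopBackOff
```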


💥 Understanding the CrashLoopBackOff Cycle

Let’s trace what’s actually happening in your cluster:

1. Kubernetes starts the pod
2. Container starts, NestJS initializes
3. App connects to Redis and checks configuration
4. Detects: `volatile-lru` ≠ `noeviction`
5. Logs warning and exits with error code
6. Kubernetes sees exit code ≠ 0
7. Restart policy: Always → restart the pod
8. Wait 10s, then 20s, then 40s… (exponential backoff)
9. Eventually: CrashLoopBackOff status
10. Go to step 1 (until you fix the Redis config)

The pod isn’t broken—it’s protecting you from running with dangerous configuration.


🚀 The Solution: Three Ways to Fix This

Let’s explore your options, from quick fixes to proper solutions:

Option 1: Fix at Redis Instance Level (Recommended) ✅

This is the proper, infrastructure-level fix that persists across restarts and deployments.

If Using Google Cloud Memorystore

Via gcloud CLI:

```bash
# List your Redis instances first
gcloud redis instances list --region=asia-southeast1

# Update the eviction policy
gcloud redis instances update blink-redis-test \
  --region=asia-southeast1 \
  --update-redis-config maxmemory-policy=noeviction

# Verify the change
gcloud redis instances describe blink-redis-test \
  --region=asia-southeast1 \
  --format="value(redisConfigs.maxmemory-policy)"
```

Via GCP Console:

  1. Navigate to Memorystore → Redis
  2. Select your instance (e.g.,
    blink-redis-test
    )
  3. Click Edit
  4. Scroll to Redis configuration
  5. Add/modify:
    maxmemory-policy = noeviction
  6. Click Save
  7. Wait for the update to apply (~2-5 minutes)

Pros:

  • ✅ Permanent fix
  • ✅ Survives instance restarts
  • ✅ Auditable via infrastructure-as-code
  • ✅ No manual intervention needed per pod

Cons:

  • ⏱️ Requires infrastructure access
  • ⏱️ May need approval for production changes

Option 2: Fix via Kubernetes ConfigMap

If Redis is deployed inside Kubernetes (not a managed service):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-config
  namespace: blink-test-qa
data:
  redis.conf: |
    maxmemory 2gb
    maxmemory-policy noeviction
    save ""
```

Then reference it in your Redis deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  template:
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          command:
            - redis-server
            - /etc/redis/redis.conf
          volumeMounts:
            - name: config
              mountPath: /etc/redis
      volumes:
        - name: config
          configMap:
            name: redis-config
```

Apply with:

```bash
kubectl apply -f redis-config.yaml -n k8s-namespace
kubectl rollout restart deployment/redis -n k8s-namespace
```

Pros:

  • ✅ Version controlled
  • ✅ GitOps friendly
  • ✅ Environment-specific configurations

Cons:

  • ⚠️ Only works for self-hosted Redis
  • ⚠️ Requires Redis restart

Option 3: Runtime Fix via Redis CLI (Temporary)

For immediate debugging or hot-fix scenarios:

```bash
# 1. Find the Redis pod (if in Kubernetes)
kubectl get pods -n k8s-namespace | grep redis

# 2. Connect to Redis
kubectl exec -it redis-master-0 -n k8s-namespace -- redis-cli

# 3. Or if using password authentication
kubectl exec -it redis-master-0 -n k8s-namespace -- redis-cli -a "your-password"

# 4. Set the policy
127.0.0.1:6379> CONFIG SET maxmemory-policy noeviction
OK

# 5. Verify
127.0.0.1:6379> CONFIG GET maxmemory-policy
1) "maxmemory-policy"
2) "noeviction"

# 6. Persist to disk (if CONFIG REWRITE is supported)
127.0.0.1:6379> CONFIG REWRITE
OK

# 7. Exit
127.0.0.1:6379> exit
```

For managed services (Memorystore), connect via Cloud Shell:

```bash
# 1. Get the Redis IP from your service configuration
REDIS_IP="10.0.0.3"

# 2. Install redis-cli if not available
sudo apt-get install redis-tools

# 3. Connect
redis-cli -h $REDIS_IP

# 4. Follow the same commands as above
```

⚠️ Warning: This fix is NOT persistent for managed services unless you use `CONFIG REWRITE` (which may not be available).

Pros:

  • ✅ Immediate effect
  • ✅ No deployment needed
  • ✅ Great for testing

Cons:

  • ❌ May not persist across Redis restarts
  • ❌ Not infrastructure-as-code
  • ❌ Easy to forget and lose on next deployment

✅ Verification & Testing

After applying any fix, verify the change worked:

Step 1: Check Redis Configuration

```bash
# 1. Via kubectl (if Redis is in K8s)
kubectl exec -it redis-master-0 -n k8s-namespace -- redis-cli CONFIG GET maxmemory-policy

# 2. Via redis-cli (direct connection)
redis-cli -h <redis-host> CONFIG GET maxmemory-policy

# 3. Expected output:
# 1) "maxmemory-policy"
# 2) "noeviction"
```

Step 2: Restart Your Application Pod

```bash
# 1. Delete the failing pod to trigger a restart
kubectl delete pod statistics-service-deployment-xxx -n k8s-namespace

# 2. Or rollout restart the deployment
kubectl rollout restart deployment/statistics-service-deployment -n k8s-namespace

# Watch the new pod come up
kubectl get pods -n k8s-namespace -w
```

Step 3: Check Application Logs

```bash
# Watch the logs in real time
kubectl logs -f deployment/statistics-service-deployment -n k8s-namespace

# Look for success indicators:
# ✅ No "IMPORTANT! Eviction policy is volatile-lru" warning
# ✅ Application starts successfully
# ✅ Health checks passing
```

Step 4: Verify Pod Status

```bash
# Check pod status
kubectl get pods -n k8s-namespace | grep statistics

# Should show: Running (1/1)
# NOT: CrashLoopBackOff or Error
```

🎯 Best Practices & Recommendations

For Infrastructure Teams

  1. Infrastructure as Code:

    ```hcl
    # Terraform example for GCP Memorystore
    resource "google_redis_instance" "cache" {
      name           = "redis-${var.environment}"
      tier           = "STANDARD_HA"
      memory_size_gb = 4

      redis_configs = {
        "maxmemory-policy" = "noeviction"
        # Add other configs as needed
      }
    }
    ```
  2. Document Redis dependencies in your service README:

    ```markdown
    ## Redis Configuration Requirements

    - `maxmemory-policy`: Must be set to `noeviction`
    - `maxmemory`: Recommended 2GB minimum
    - Reason: Prevents data loss for critical booking state
    ```
  3. Add monitoring alerts:

    ```yaml
    # Example Prometheus alert
    - alert: RedisMemoryHigh
      expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Redis memory usage above 90%"
        description: "Consider scaling Redis or implementing data cleanup"
    ```

For Application Teams

  1. Configuration validation is good! Keep that startup check—it saved you from production issues.

  2. Handle OOM errors gracefully:

    ```typescript
    try {
      await redis.set(key, value);
    } catch (error) {
      if (error.message.includes('OOM')) {
        logger.error('Redis out of memory', { key });
        // Implement fallback: clean up old data, use alternative storage, etc.
        await cleanupOldEntries();
        // Retry or fail gracefully
      }
    }
    ```
  3. Monitor Redis metrics:

    • Memory usage
    • Evicted keys (should be 0 with noeviction)
    • Connection errors
    • Command latency
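As a starting point for that monitoring, a small helper can derive the usage ratio from Redis's `INFO memory` output. The `used_memory` and `maxmemory` fields are real INFO fields; the helper itself is a hypothetical sketch.

```typescript
// Parse memory-pressure data out of a raw `INFO memory` reply string.
// Returns used/max as a ratio, or null when maxmemory is 0 (unlimited)
// or a field is missing.
function memoryUsageRatio(info: string): number | null {
  const num = (field: string): number | null => {
    const match = info.match(new RegExp(`^${field}:(\\d+)`, 'm'));
    return match ? Number(match[1]) : null;
  };
  const used = num('used_memory');
  const max = num('maxmemory');
  if (used === null || max === null || max === 0) return null;
  return used / max;
}
```

A ratio creeping toward 1.0 under `noeviction` is your cue to clean up data or scale the instance before writes start failing.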

For DevOps Teams

  1. Pre-deployment checks:

    ```bash
    #!/bin/bash
    # verify-redis-config.sh
    REDIS_HOST=$1

    POLICY=$(redis-cli -h $REDIS_HOST CONFIG GET maxmemory-policy | tail -1)

    if [ "$POLICY" != "noeviction" ]; then
      echo "❌ Error: Redis eviction policy is $POLICY, expected noeviction"
      exit 1
    fi

    echo "✅ Redis configuration validated"
    ```
  2. Environment parity: Ensure dev/staging/prod all use the same Redis configuration to catch issues early.

  3. Capacity planning: With `noeviction`, you need proper memory sizing. Monitor and scale proactively.


🎬 Conclusion

What looked like a frustrating CrashLoopBackOff was actually smart engineering preventing silent data loss. By understanding Redis eviction policies and implementing proper configuration validation, your team built a resilient system that fails safely rather than failing silently.

Key Takeaways:

  • ✅ `noeviction` prevents silent data loss
  • ✅ Configuration validation at startup is good practice
  • ✅ Fix the infrastructure, not the symptoms
  • ✅ Monitor Redis memory and plan capacity accordingly
  • ✅ Document configuration requirements clearly

Now go forth and configure that Redis instance properly! 🚀


Questions or issues? Check your specific environment:

  • Environment: `k8s-namespace` namespace
  • Service: `statistics-service`
  • Pod: `statistics-service-deployment-xxx`

🐛 Troubleshooting Common Issues

Issue: “CONFIG SET is disabled”

If you see this error when trying to change the configuration:

```
(error) ERR unknown command 'CONFIG', with args beginning with: 'SET' 'maxmemory-policy' 'noeviction'
```

Solution: This means your Redis instance has disabled the CONFIG command (common in managed services). You must use the cloud provider’s management interface (Option 1) instead.

Issue: “Connection refused” when connecting to Redis

Check 1: Network policies

```bash
# Test connectivity from your app pod
kubectl exec -it statistics-service-deployment-xxx -n k8s-namespace -- nc -zv <redis-host> 6379
```

Check 2: Firewall rules (GCP)

```bash
gcloud compute firewall-rules list | grep redis
```

Issue: Pod still crashing after fixing Redis config

Possible causes:

  1. Old configuration cached: Restart the entire deployment

    ```bash
    kubectl rollout restart deployment/statistics-service-deployment -n k8s-namespace
    ```
  2. Wrong Redis instance: Verify your app is connecting to the correct Redis

    ```bash
    # Check environment variables
    kubectl get deployment statistics-service-deployment -n k8s-namespace -o yaml | grep -i redis
    ```

🔄 Post-Fix Validation Checklist

Use this checklist after applying the fix:

  • Redis eviction policy is set to `noeviction`
  • Pod starts successfully without CrashLoopBackOff
  • No “IMPORTANT! Eviction policy…” warning in logs
  • Application health checks are passing
  • Service is accepting traffic (test with curl/Postman)
  • Redis memory usage is monitored and alerted
  • Configuration change is documented in your infra repo
  • Terraform/IaC updated (if applicable)
  • Similar environments (dev/stage/prod) checked for same issue

📚 Additional Resources