Debugging Kubernetes CrashLoopBackOff: A Deep Dive into Redis Eviction Policies
By Jun Nguyen


TL;DR: Your pod is crashing because it’s refusing to start with an unsafe Redis configuration. This article explains why that’s actually good design and how to fix it.

🔴 The Problem: A Stubborn Pod

Picture this: You’ve just deployed your service, but it keeps crashing. The Kubernetes logs show a familiar nightmare:

```
stream closed: EOF for k8s-namespace/statistics-service-deployment-xxx
```

The pod enters `CrashLoopBackOff`, and before you can even exec in to debug, it's gone. Again. And again.

But buried in the logs, there’s a critical clue:

```
IMPORTANT! Eviction policy is volatile-lru. It should be "noeviction"
```

This isn’t just a warning—it’s your application taking a stand against configuration drift that could lead to silent data loss in production.

📊 The Root Cause Analysis

Our service connects to Redis during initialization and performs a configuration sanity check. When it detects `volatile-lru` instead of the expected `noeviction` policy, it makes a bold decision: refuse to start.

Why? Because running with the wrong eviction policy is like building a house on quicksand—it might work today, but eventually, something important will disappear beneath your feet.

🎓 Redis Eviction Policies 101

Before we dive deeper, let’s understand what Redis eviction policies are and why they matter.

When Redis reaches its configured memory limit (`maxmemory`), it needs to make a choice: evict (delete) some data, or refuse new writes? The eviction policy determines exactly how this decision is made.

Think of Redis as a parking lot with limited spaces. The eviction policy is your parking enforcement strategy:

  • Do you tow random cars? (`allkeys-random`)
  • Do you tow the cars that have gone unused the longest? (`allkeys-lru`)
  • Do you refuse entry when full? (`noeviction`)

The Complete Eviction Policy Catalog


Redis offers 8 distinct eviction policies, each with unique characteristics and trade-offs. Let’s explore them:


🛡️ No Eviction Policy

1. `noeviction` (Recommended for Critical Data)

What it does: Returns errors on write operations when the memory limit is reached. Reads continue to work normally.

When to use:

  • When data loss is unacceptable
  • For databases or critical application state
  • When you want explicit control over memory pressure

Benefits:

  • Prevents silent data loss
  • Forces the application to handle memory pressure explicitly
  • More predictable behavior for critical data

Trade-off: Application must handle write errors and implement cleanup logic

Real-world analogy: Like a bouncer at a nightclub—when capacity is reached, new guests wait outside rather than randomly kicking people out.


📉 LRU (Least Recently Used) Policies

2. `volatile-lru` ⚠️ (Current Setting in Our Case)

What it does: Evicts the least recently used keys that have an expiration (TTL) set

When to use:

  • Caching scenarios where only temporary data has TTLs
  • When you want to protect non-expiring keys

Risks:

  • Loss of cached data unexpectedly
  • Application errors if it expects certain keys to exist
  • Potential data inconsistency issues
  • If no keys have a TTL, behaves like `noeviction`

Real-world analogy: Evicting cars with temporary parking permits, but only the ones that haven’t moved recently.

3. `allkeys-lru`

What it does: Evicts the least recently used keys among ALL keys (regardless of TTL)

When to use:

  • Pure caching use cases
  • When all data is equally evictable
  • When you want automatic cache management

Benefits: Simple, effective for general-purpose caching

Risk: Any key can be evicted, including important ones

Real-world analogy: A true LRU cache—evict whatever hasn’t been accessed recently, no exceptions.
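To build intuition for the LRU variants, here is a minimal sketch of the selection logic. It is illustrative only: Redis actually uses an approximated LRU based on random sampling, not an exact ordering like this one.

```typescript
// Toy exact-LRU cache: a Map remembers insertion order, and re-inserting a
// key on access moves it to the back, so the first key in iteration order
// is always the least recently used.
class LruCache<V> {
  private entries = new Map<string, V>();

  constructor(private capacity: number) {}

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value !== undefined) {
      // Touch: move the key to the most-recently-used position.
      this.entries.delete(key);
      this.entries.set(key, value);
    }
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // Evict the least recently used key (front of the Map's order).
      const lru = this.entries.keys().next().value as string;
      this.entries.delete(lru);
    }
  }

  has(key: string): boolean {
    return this.entries.has(key);
  }
}
```

With capacity 2, writing `a` and `b`, reading `a`, then writing `c` evicts `b` — exactly the "whatever hasn't been accessed recently" behavior described above.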


🎯 LFU (Least Frequently Used) Policies

4. `volatile-lfu`

What it does: Evicts the least frequently used keys that have an expiration (TTL) set

When to use:

  • When access frequency matters more than recency
  • Caching hot/popular data

Benefits: Better retention of frequently accessed data

Trade-off: More complex tracking, slower than LRU

5. `allkeys-lfu`

What it does: Evicts the least frequently used keys among ALL keys

When to use:

  • When you want to keep frequently accessed data regardless of when it was last used
  • Advanced caching scenarios

Benefits: Optimizes for access patterns over time

Trade-off: More CPU overhead for tracking

Real-world analogy: Like a library that keeps popular books readily available, even if they haven’t been checked out recently.
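The frequency-based variants can be sketched the same way. This toy cache counts accesses and evicts the key with the lowest count; Redis's real implementation uses a probabilistic 8-bit counter with decay, so treat this purely as intuition.

```typescript
// Toy exact-LFU cache: one map for values, one for access counts.
class LfuCache<V> {
  private values = new Map<string, V>();
  private counts = new Map<string, number>();

  constructor(private capacity: number) {}

  get(key: string): V | undefined {
    if (this.values.has(key)) {
      this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
    }
    return this.values.get(key);
  }

  set(key: string, value: V): void {
    this.values.set(key, value);
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
    if (this.values.size > this.capacity) {
      // Evict the least frequently used key, never the one just written.
      let victim = '';
      let min = Infinity;
      for (const [k, n] of this.counts) {
        if (k !== key && n < min) {
          min = n;
          victim = k;
        }
      }
      this.values.delete(victim);
      this.counts.delete(victim);
    }
  }

  has(key: string): boolean {
    return this.values.has(key);
  }
}
```

Note the difference from LRU: a key read three times long ago survives, while a key written once recently is the first to go.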


🎲 Random Eviction Policies

6. `volatile-random`

What it does: Evicts a random key that has an expiration (TTL) set

When to use:

  • When you don’t care about access patterns
  • Low-overhead eviction
  • Testing or non-critical caching

Benefits: Very fast, minimal CPU overhead

Risk: May evict frequently used data

7. `allkeys-random`

What it does: Evicts a random key from all keys

When to use:

  • Like `volatile-random`, but across all keys
  • Simplest eviction strategy

Benefits: Fastest eviction, no tracking needed

Risk: Unpredictable behavior

Real-world analogy: Like playing Russian roulette with your data—fast but chaotic.


TTL-Based Policy

8. `volatile-ttl`

What it does: Evicts keys with an expiration (TTL) set, choosing the ones with the shortest time-to-live first

When to use:

  • When you want to evict data that’s about to expire anyway
  • Time-sensitive caching (session data, temporary tokens)

Benefits: Natural eviction order based on expiration

Risk: May evict recently set keys with short TTLs over old keys with long TTLs

Real-world analogy: Like eating food in your fridge based on expiration dates—what’s about to spoil goes first.
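The selection rule is simple enough to sketch directly: among keys that carry a TTL, pick the one expiring soonest. (This is an exact, illustrative version; Redis samples candidates rather than sorting everything.)

```typescript
// Sketch of volatile-ttl's choice. Keys without a TTL (ttlMs === null) are
// never candidates — if nothing has a TTL, no eviction can happen, just as
// with the other volatile-* policies.
interface Entry {
  key: string;
  ttlMs: number | null;
}

function pickVolatileTtlVictim(entries: Entry[]): string | null {
  const withTtl = entries.filter((e) => e.ttlMs !== null);
  if (withTtl.length === 0) return null; // behaves like noeviction
  withTtl.sort((a, b) => (a.ttlMs as number) - (b.ttlMs as number));
  return withTtl[0].key;
}
```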


📊 Quick Reference Comparison

| Policy | Scope | Algorithm | CPU Overhead | Best For | Predictability |
| --- | --- | --- | --- | --- | --- |
| `noeviction` | N/A | None | None | Critical data, databases | ⭐⭐⭐⭐⭐ |
| `volatile-lru` | Keys with TTL | Least Recently Used | Low | Mixed data with TTLs | ⭐⭐⭐ |
| `allkeys-lru` | All keys | Least Recently Used | Low | General caching | ⭐⭐⭐ |
| `volatile-lfu` | Keys with TTL | Least Frequently Used | Medium | Popular data with TTLs | ⭐⭐⭐⭐ |
| `allkeys-lfu` | All keys | Least Frequently Used | Medium | Frequency-based caching | ⭐⭐⭐⭐ |
| `volatile-random` | Keys with TTL | Random | Very Low | Non-critical caching, testing | ⭐ |
| `allkeys-random` | All keys | Random | Very Low | Testing, simple caches | ⭐ |
| `volatile-ttl` | Keys with TTL | Shortest TTL first | Low | Time-sensitive data | ⭐⭐⭐⭐ |

🔍 Why Our Service is Refusing to Start

Now that we understand eviction policies, let’s connect the dots:

Current State: `volatile-lru` (evicts the least recently used keys with TTL)

Required State: `noeviction` (no automatic eviction; return errors instead)

The Architecture Decision

This isn’t a bug—it’s a feature. The development team implemented a fail-fast pattern with configuration validation at startup. Here’s why:

Scenario 1: The Silent Data Loss Problem

Imagine this production horror story:

```
09:00 - Service starts normally with volatile-lru
09:15 - Redis reaches memory limit
09:15 - Redis silently evicts critical booking state with TTL
09:16 - User completes checkout, app expects booking data
09:16 - Data is gone → payment processed but booking lost
09:17 - Customer service nightmare begins 😱
```

Scenario 2: The Race Condition

```
Thread 1: Write booking data with 1-hour TTL
Thread 2: Redis evicts the booking (memory pressure)
Thread 1: Try to read booking → 404 Not Found
Thread 1: Throws unhandled exception → Pod crashes
```

Scenario 3: The Debugging Nightmare

With `volatile-lru`, data disappears silently. No errors, no logs, just… gone. With `noeviction`, you get explicit OOM errors that you can monitor, alert on, and handle gracefully.

Why Fail-Fast is Good Engineering

The service implements the “fail loud, fail early” principle:

  1. Detection at startup - Problems caught before serving traffic
  2. Clear error message - No ambiguity about what’s wrong
  3. Prevents data loss - Won’t run in an unsafe configuration
  4. Forces proper fix - Can’t be ignored or worked around

This is production-ready design at its finest. 🎯
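The article doesn't show the service's actual validation code, but a minimal sketch of such a startup check might look like this. The `assertSafeEvictionPolicy` helper is hypothetical; the error text mirrors the warning seen in the logs, and the commented wiring assumes an ioredis-style client.

```typescript
// Fail-fast configuration check: CONFIG GET replies with a [name, value]
// pair, and anything other than noeviction aborts startup.
function assertSafeEvictionPolicy(reply: string[]): void {
  const policy = reply[1];
  if (policy !== 'noeviction') {
    throw new Error(
      `IMPORTANT! Eviction policy is ${policy}. It should be "noeviction"`,
    );
  }
}

// Hypothetical wiring during application bootstrap (ioredis assumed):
//
//   const reply = (await redis.config('GET', 'maxmemory-policy')) as string[];
//   assertSafeEvictionPolicy(reply); // throws → process exits → CrashLoopBackOff
```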


💥 Understanding the CrashLoopBackOff Cycle

Let’s trace what’s actually happening in your cluster:

1. Kubernetes starts the pod
2. Container starts, NestJS initializes
3. App connects to Redis and checks configuration
4. Detects: `volatile-lru` ≠ `noeviction`
5. Logs warning and exits with error code
6. Kubernetes sees exit code ≠ 0
7. Restart policy: Always → restart the pod
8. Wait 10s, then 20s, then 40s… (exponential backoff)
9. Eventually: CrashLoopBackOff status
10. Go to step 1 (until you fix the Redis config)

The pod isn’t broken—it’s protecting you from running with dangerous configuration.


🚀 The Solution: Three Ways to Fix This

Let’s explore your options, from quick fixes to proper solutions:

Option 1: Fix at Redis Instance Level (Recommended) ✅

This is the proper, infrastructure-level fix that persists across restarts and deployments.

If Using Google Cloud Memorystore

Via gcloud CLI:

```bash
# List your Redis instances first
gcloud redis instances list --region=asia-southeast1

# Update the eviction policy
gcloud redis instances update blink-redis-test \
  --region=asia-southeast1 \
  --update-redis-config maxmemory-policy=noeviction

# Verify the change
gcloud redis instances describe blink-redis-test \
  --region=asia-southeast1 \
  --format="value(redisConfigs.maxmemory-policy)"
```

Via GCP Console:

  1. Navigate to Memorystore → Redis
  2. Select your instance (e.g.,
    blink-redis-test
    )
  3. Click Edit
  4. Scroll to Redis configuration
  5. Add/modify:
    maxmemory-policy = noeviction
  6. Click Save
  7. Wait for the update to apply (~2-5 minutes)

Pros:

  • ✅ Permanent fix
  • ✅ Survives instance restarts
  • ✅ Auditable via infrastructure-as-code
  • ✅ No manual intervention needed per pod

Cons:

  • ⏱️ Requires infrastructure access
  • ⏱️ May need approval for production changes

Option 2: Fix via Kubernetes ConfigMap

If Redis is deployed inside Kubernetes (not a managed service):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-config
  namespace: blink-test-qa
data:
  redis.conf: |
    maxmemory 2gb
    maxmemory-policy noeviction
    save ""
```

Then reference it in your Redis deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  template:
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          command:
            - redis-server
            - /etc/redis/redis.conf
          volumeMounts:
            - name: config
              mountPath: /etc/redis
      volumes:
        - name: config
          configMap:
            name: redis-config
```

Apply with:

```bash
kubectl apply -f redis-config.yaml -n k8s-namespace
kubectl rollout restart deployment/redis -n k8s-namespace
```

Pros:

  • ✅ Version controlled
  • ✅ GitOps friendly
  • ✅ Environment-specific configurations

Cons:

  • ⚠️ Only works for self-hosted Redis
  • ⚠️ Requires Redis restart

Option 3: Runtime Fix via Redis CLI (Temporary)

For immediate debugging or hot-fix scenarios:

```bash
# 1. Find the Redis pod (if in Kubernetes)
kubectl get pods -n k8s-namespace | grep redis

# 2. Connect to Redis
kubectl exec -it redis-master-0 -n k8s-namespace -- redis-cli

# 3. Or if using password authentication
kubectl exec -it redis-master-0 -n k8s-namespace -- redis-cli -a "your-password"

# 4. Set the policy
127.0.0.1:6379> CONFIG SET maxmemory-policy noeviction
OK

# 5. Verify
127.0.0.1:6379> CONFIG GET maxmemory-policy
1) "maxmemory-policy"
2) "noeviction"

# 6. Persist to disk (if CONFIG REWRITE is supported)
127.0.0.1:6379> CONFIG REWRITE
OK

# 7. Exit
127.0.0.1:6379> exit
```

For managed services (Memorystore), connect via Cloud Shell:

```bash
# 1. Get the Redis IP from your service configuration
REDIS_IP="10.0.0.3"

# 2. Install redis-cli if not available
sudo apt-get install redis-tools

# 3. Connect
redis-cli -h $REDIS_IP

# 4. Follow the same commands as above
```

⚠️ Warning: This fix is NOT persistent for managed services unless you use `CONFIG REWRITE` (which may not be available).

Pros:

  • ✅ Immediate effect
  • ✅ No deployment needed
  • ✅ Great for testing

Cons:

  • ❌ May not persist across Redis restarts
  • ❌ Not infrastructure-as-code
  • ❌ Easy to forget and lose on next deployment

✅ Verification & Testing

After applying any fix, verify the change worked:

Step 1: Check Redis Configuration

```bash
# 1. Via kubectl (if Redis is in K8s)
kubectl exec -it redis-master-0 -n k8s-namespace -- redis-cli CONFIG GET maxmemory-policy

# 2. Via redis-cli (direct connection)
redis-cli -h <redis-host> CONFIG GET maxmemory-policy

# 3. Expected output:
# 1) "maxmemory-policy"
# 2) "noeviction"
```

Step 2: Restart Your Application Pod

```bash
# 1. Delete the failing pod to trigger a restart
kubectl delete pod statistics-service-deployment-xxx -n k8s-namespace

# 2. Or rollout restart the deployment
kubectl rollout restart deployment/statistics-service-deployment -n k8s-namespace

# Watch the new pod come up
kubectl get pods -n k8s-namespace -w
```

Step 3: Check Application Logs

```bash
# Watch the logs in real time
kubectl logs -f deployment/statistics-service-deployment -n k8s-namespace

# Look for success indicators:
# ✅ No "IMPORTANT! Eviction policy is volatile-lru" warning
# ✅ Application starts successfully
# ✅ Health checks passing
```

Step 4: Verify Pod Status

```bash
# Check pod status
kubectl get pods -n k8s-namespace | grep statistics

# Should show: Running (1/1)
# NOT: CrashLoopBackOff or Error
```

🎯 Best Practices & Recommendations

For Infrastructure Teams

  1. Infrastructure as Code:

    ```hcl
    # Terraform example for GCP Memorystore
    resource "google_redis_instance" "cache" {
      name           = "redis-${var.environment}"
      tier           = "STANDARD_HA"
      memory_size_gb = 4

      redis_configs = {
        "maxmemory-policy" = "noeviction"
        # Add other configs as needed
      }
    }
    ```
  2. Document Redis dependencies in your service README:

    ```markdown
    ## Redis Configuration Requirements

    - `maxmemory-policy`: Must be set to `noeviction`
    - `maxmemory`: Recommended 2GB minimum
    - Reason: Prevents data loss for critical booking state
    ```
  3. Add monitoring alerts:

    ```yaml
    # Example Prometheus alert
    - alert: RedisMemoryHigh
      expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Redis memory usage above 90%"
        description: "Consider scaling Redis or implementing data cleanup"
    ```

For Application Teams

  1. Configuration validation is good! Keep that startup check—it saved you from production issues.

  2. Handle OOM errors gracefully:

    ```typescript
    try {
      await redis.set(key, value);
    } catch (error) {
      if (error.message.includes('OOM')) {
        logger.error('Redis out of memory', { key });
        // Implement fallback: clean up old data, use alternative storage, etc.
        await cleanupOldEntries();
        // Retry or fail gracefully
      }
    }
    ```
  3. Monitor Redis metrics:

    • Memory usage
    • Evicted keys (should be 0 with noeviction)
    • Connection errors
    • Command latency
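As a starting point for that monitoring, a small helper can derive the usage ratio from Redis's `INFO memory` output. The `used_memory` and `maxmemory` fields are real INFO fields; the helper itself is a hypothetical sketch.

```typescript
// Parse memory-pressure data out of a raw `INFO memory` reply string.
// Returns used/max as a ratio, or null when maxmemory is 0 (unlimited)
// or a field is missing.
function memoryUsageRatio(info: string): number | null {
  const num = (field: string): number | null => {
    const match = info.match(new RegExp(`^${field}:(\\d+)`, 'm'));
    return match ? Number(match[1]) : null;
  };
  const used = num('used_memory');
  const max = num('maxmemory');
  if (used === null || max === null || max === 0) return null;
  return used / max;
}
```

A ratio creeping toward 1.0 under `noeviction` is your cue to clean up data or scale the instance before writes start failing.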

For DevOps Teams

  1. Pre-deployment checks:

    ```bash
    #!/bin/bash
    # verify-redis-config.sh
    REDIS_HOST=$1

    POLICY=$(redis-cli -h $REDIS_HOST CONFIG GET maxmemory-policy | tail -1)

    if [ "$POLICY" != "noeviction" ]; then
      echo "❌ Error: Redis eviction policy is $POLICY, expected noeviction"
      exit 1
    fi

    echo "✅ Redis configuration validated"
    ```
  2. Environment parity: Ensure dev/staging/prod all use the same Redis configuration to catch issues early.

  3. Capacity planning: With `noeviction`, you need proper memory sizing. Monitor and scale proactively.


🎬 Conclusion

What looked like a frustrating CrashLoopBackOff was actually smart engineering preventing silent data loss. By understanding Redis eviction policies and implementing proper configuration validation, your team built a resilient system that fails safely rather than failing silently.

Key Takeaways:

  • ✅ `noeviction` prevents silent data loss
  • ✅ Configuration validation at startup is good practice
  • ✅ Fix the infrastructure, not the symptoms
  • ✅ Monitor Redis memory and plan capacity accordingly
  • ✅ Document configuration requirements clearly

Now go forth and configure that Redis instance properly! 🚀


Questions or issues? Check your specific environment:

  • Environment: `k8s-namespace` namespace
  • Service: `statistics-service`
  • Pod: `statistics-service-deployment-xxx`

🐛 Troubleshooting Common Issues

Issue: “CONFIG SET is disabled”

If you see this error when trying to change the configuration:

```
(error) ERR unknown command 'CONFIG', with args beginning with: 'SET' 'maxmemory-policy' 'noeviction'
```

Solution: This means your Redis instance has disabled the CONFIG command (common in managed services). You must use the cloud provider’s management interface (Option 1) instead.

Issue: “Connection refused” when connecting to Redis

Check 1: Network policies

```bash
# Test connectivity from your app pod
kubectl exec -it statistics-service-deployment-xxx -n k8s-namespace -- nc -zv <redis-host> 6379
```

Check 2: Firewall rules (GCP)

```bash
gcloud compute firewall-rules list | grep redis
```

Issue: Pod still crashing after fixing Redis config

Possible causes:

  1. Old configuration cached: Restart the entire deployment

    ```bash
    kubectl rollout restart deployment/statistics-service-deployment -n k8s-namespace
    ```
  2. Wrong Redis instance: Verify your app is connecting to the correct Redis

    ```bash
    # Check environment variables
    kubectl get deployment statistics-service-deployment -n k8s-namespace -o yaml | grep -i redis
    ```

🔄 Post-Fix Validation Checklist

Use this checklist after applying the fix:

  • Redis eviction policy is set to `noeviction`
  • Pod starts successfully without CrashLoopBackOff
  • No “IMPORTANT! Eviction policy…” warning in logs
  • Application health checks are passing
  • Service is accepting traffic (test with curl/Postman)
  • Redis memory usage is monitored and alerted
  • Configuration change is documented in your infra repo
  • Terraform/IaC updated (if applicable)
  • Similar environments (dev/stage/prod) checked for same issue

📚 Additional Resources