Debugging Kubernetes CrashLoopBackOff: A Deep Dive into Redis Eviction Policies
TL;DR: Your pod is crashing because it’s refusing to start with an unsafe Redis configuration. This article explains why that’s actually good design and how to fix it.
🔴 The Problem: A Stubborn Pod
Picture this: You’ve just deployed your service, but it keeps crashing. The Kubernetes logs show a familiar nightmare:
stream closed: EOF for k8s-namespace/statistics-service-deployment-xxx
The pod enters CrashLoopBackOff, restarting again and again with growing delays.
But buried in the logs, there’s a critical clue:
IMPORTANT! Eviction policy is volatile-lru. It should be "noeviction"
This isn’t just a warning—it’s your application taking a stand against configuration drift that could lead to silent data loss in production.
📊 The Root Cause Analysis
Our service connects to Redis during initialization and performs a configuration sanity check. When it detects an eviction policy other than noeviction, it logs the warning above and exits instead of serving traffic.
Why? Because running with the wrong eviction policy is like building a house on quicksand—it might work today, but eventually, something important will disappear beneath your feet.
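As an illustration, the startup guard can be sketched as a pure check over the CONFIG GET reply (a minimal sketch — the function name and the ioredis-style client mentioned in the comments are assumptions, not the service's actual code):

```typescript
// Minimal sketch of a fail-fast eviction-policy check. CONFIG GET
// replies arrive as a flat [name, value] array; any policy other
// than "noeviction" aborts startup. All names here are hypothetical.
function assertSafeEvictionPolicy(reply: string[]): void {
  const policy = reply[1];
  if (policy !== "noeviction") {
    // Refuse to serve traffic with an unsafe configuration.
    throw new Error(
      `IMPORTANT! Eviction policy is ${policy}. It should be "noeviction"`,
    );
  }
}

// Hypothetical usage at bootstrap, e.g. with an ioredis-style client:
//   const reply = await redis.config("GET", "maxmemory-policy");
//   assertSafeEvictionPolicy(reply as string[]);
// On failure, exit non-zero so Kubernetes restarts the pod.
```

The key design point is that the check throws instead of logging and continuing, which is exactly what produces the CrashLoopBackOff described above.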
🎓 Redis Eviction Policies 101
Before we dive deeper, let’s understand what Redis eviction policies are and why they matter.
When Redis reaches its configured memory limit (maxmemory), it has to decide what to do with new writes: evict existing keys to make room, or reject the writes outright. The eviction policy is that decision rule.
Think of Redis as a parking lot with limited spaces. The eviction policy is your parking enforcement strategy:
- Do you tow random cars? (allkeys-random)
- Do you tow cars that have been there longest? (allkeys-lru)
- Do you refuse entry when full? (noeviction)
The Complete Eviction Policy Catalog
Redis offers 8 distinct eviction policies, each with unique characteristics and trade-offs. Let’s explore them:
🛡️ No Eviction Policy
1. noeviction (Recommended for Critical Data) ✅
What it does: Returns errors on write operations when the memory limit is reached. Reads continue to work normally.
When to use:
- When data loss is unacceptable
- For databases or critical application state
- When you want explicit control over memory pressure
Benefits:
- Prevents silent data loss
- Forces the application to handle memory pressure explicitly
- More predictable behavior for critical data
Trade-off: Application must handle write errors and implement cleanup logic
Real-world analogy: Like a bouncer at a nightclub—when capacity is reached, new guests wait outside instead of current guests being randomly thrown out.
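To make the write-rejection behavior concrete, here is a toy in-memory store mimicking noeviction semantics (illustrative only — real Redis enforces a byte-based maxmemory limit, not a key count, and the class name is made up):

```typescript
// Toy store mimicking noeviction: once "memory" (here, a key count)
// is exhausted, writes of NEW keys fail loudly; reads and updates of
// existing keys keep working. The error string mirrors Redis's OOM
// reply so callers can pattern-match on it.
class NoEvictionStore {
  private data = new Map<string, string>();
  constructor(private maxKeys: number) {}

  set(key: string, value: string): void {
    if (!this.data.has(key) && this.data.size >= this.maxKeys) {
      throw new Error(
        "OOM command not allowed when used memory > 'maxmemory'.",
      );
    }
    this.data.set(key, value);
  }

  get(key: string): string | undefined {
    return this.data.get(key); // reads are unaffected by memory pressure
  }
}
```

Note how nothing disappears silently: the caller sees an explicit error and must react, which is the whole point of the policy.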
📉 LRU (Least Recently Used) Policies
2. volatile-lru ⚠️ (Current Setting in Our Case)
What it does: Evicts the least recently used keys that have an expiration (TTL) set
When to use:
- Caching scenarios where only temporary data has TTLs
- When you want to protect non-expiring keys
Risks:
- Loss of cached data unexpectedly
- Application errors if it expects certain keys to exist
- Potential data inconsistency issues
- If no keys have TTL, behaves like noeviction
Real-world analogy: Evicting cars with temporary parking permits, but only the ones that haven’t moved recently.
3. allkeys-lru
What it does: Evicts the least recently used keys among ALL keys (regardless of TTL)
When to use:
- Pure caching use cases
- When all data is equally evictable
- When you want automatic cache management
Benefits: Simple, effective for general-purpose caching
Risk: Any key can be evicted, including important ones
Real-world analogy: A true LRU cache—evict whatever hasn’t been accessed recently, no exceptions.
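The core LRU idea fits in a few lines (a sketch, not Redis's actual implementation — Redis approximates LRU by sampling a few keys per eviction, tuned via maxmemory-samples, rather than tracking exact order):

```typescript
// Minimal LRU cache: JavaScript's Map preserves insertion order, so
// re-inserting a key on every access makes iteration order double as
// recency order. The first key in iteration order is the least
// recently used and is the one evicted when capacity is exceeded.
class LruCache<V> {
  private map = new Map<string, V>();
  constructor(private capacity: number) {}

  get(key: string): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key)!;
    this.map.delete(key); // refresh recency
    this.map.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    if (this.map.has(key)) {
      this.map.delete(key);
    } else if (this.map.size >= this.capacity) {
      const lru = this.map.keys().next().value as string;
      this.map.delete(lru); // evict least recently used
    }
    this.map.set(key, value);
  }
}
```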
🎯 LFU (Least Frequently Used) Policies
4. volatile-lfu
What it does: Evicts the least frequently used keys that have an expiration (TTL) set
When to use:
- When access frequency matters more than recency
- Caching hot/popular data
Benefits: Better retention of frequently accessed data
Trade-off: More complex tracking, slower than LRU
5. allkeys-lfu
What it does: Evicts the least frequently used keys among ALL keys
When to use:
- When you want to keep frequently accessed data regardless of when it was last used
- Advanced caching scenarios
Benefits: Optimizes for access patterns over time
Trade-off: More CPU overhead for tracking
Real-world analogy: Like a library that keeps popular books readily available, even if they haven’t been checked out recently.
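For contrast with LRU, here is a minimal exact-count LFU sketch (real Redis instead uses an approximate, decaying 8-bit frequency counter controlled by lfu-log-factor and lfu-decay-time, which is where the extra CPU overhead mentioned above comes from):

```typescript
// Minimal LFU cache: track an access count per key and, when full,
// evict the key with the smallest count. Exact counting like this is
// simple but memory-hungry; Redis approximates it instead.
class LfuCache<V> {
  private values = new Map<string, V>();
  private counts = new Map<string, number>();
  constructor(private capacity: number) {}

  get(key: string): V | undefined {
    if (!this.values.has(key)) return undefined;
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
    return this.values.get(key);
  }

  set(key: string, value: V): void {
    if (!this.values.has(key) && this.values.size >= this.capacity) {
      // Find and evict the least frequently used key.
      let victim = "";
      let min = Infinity;
      for (const [k, c] of this.counts) {
        if (c < min) { min = c; victim = k; }
      }
      this.values.delete(victim);
      this.counts.delete(victim);
    }
    this.values.set(key, value);
    this.counts.set(key, this.counts.get(key) ?? 0);
  }
}
```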
🎲 Random Eviction Policies
6. volatile-random
What it does: Evicts a random key that has an expiration (TTL) set
When to use:
- When you don’t care about access patterns
- Low-overhead eviction
- Testing or non-critical caching
Benefits: Very fast, minimal CPU overhead
Risk: May evict frequently used data
7. allkeys-random
What it does: Evicts a random key from all keys
When to use:
- Similar to volatile-random but for all keys
- Simplest eviction strategy
Benefits: Fastest eviction, no tracking needed
Risk: Unpredictable behavior
Real-world analogy: Like playing Russian roulette with your data—fast but chaotic.
⏰ TTL-Based Policy
8. volatile-ttl
What it does: Evicts keys with an expiration (TTL) set, choosing the ones with the shortest time-to-live first
When to use:
- When you want to evict data that’s about to expire anyway
- Time-sensitive caching (session data, temporary tokens)
Benefits: Natural eviction order based on expiration
Risk: May evict recently set keys with short TTLs over old keys with long TTLs
Real-world analogy: Like eating food in your fridge based on expiration dates—what’s about to spoil goes first.
📊 Quick Reference Comparison
| Policy | Scope | Algorithm | CPU Overhead | Best For | Predictability |
|---|---|---|---|---|---|
| noeviction | N/A | None | None | Critical data, databases | ⭐⭐⭐⭐⭐ |
| volatile-lru | Keys with TTL | Least Recently Used | Low | Mixed data with TTLs | ⭐⭐⭐ |
| allkeys-lru | All keys | Least Recently Used | Low | General caching | ⭐⭐⭐ |
| volatile-lfu | Keys with TTL | Least Frequently Used | Medium | Popular data with TTLs | ⭐⭐⭐⭐ |
| allkeys-lfu | All keys | Least Frequently Used | Medium | Frequency-based caching | ⭐⭐⭐⭐ |
| volatile-random | Keys with TTL | Random | Very Low | Non-critical caching, testing | ⭐ |
| allkeys-random | All keys | Random | Very Low | Testing, simple caches | ⭐ |
| volatile-ttl | Keys with TTL | Shortest TTL first | Low | Time-sensitive data | ⭐⭐⭐⭐ |
🔍 Why Our Service is Refusing to Start
Now that we understand eviction policies, let’s connect the dots:
Current State: maxmemory-policy = volatile-lru
Required State: maxmemory-policy = noeviction
The Architecture Decision
This isn’t a bug—it’s a feature. The development team implemented a fail-fast pattern with configuration validation at startup. Here’s why:
Scenario 1: The Silent Data Loss Problem
Imagine this production horror story:
```
09:00 - Service starts normally with volatile-lru
09:15 - Redis reaches memory limit
09:15 - Redis silently evicts critical booking state with TTL
09:16 - User completes checkout, app expects booking data
09:16 - Data is gone → Payment processed but booking lost
09:17 - Customer service nightmare begins 😱
```
Scenario 2: The Race Condition
```
Thread 1: Write booking data with 1-hour TTL
Thread 2: Redis evicts the booking (memory pressure)
Thread 1: Try to read booking → 404 Not Found
Thread 1: Throws unhandled exception → Pod crashes
```
Scenario 3: The Debugging Nightmare
With volatile-lru, keys vanish without any error or audit trail. Bugs look random, are impossible to reproduce locally, and only surface under production memory pressure.
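Until the configuration is fixed, reads of possibly-evicted keys should at least fail explicitly rather than crash the pod. A sketch of that defensive pattern (the BookingStore interface and every name here are hypothetical, not the service's real code):

```typescript
// Treat a missing key as an explicit, recoverable condition instead
// of letting an unexpected undefined blow up the request path.
interface BookingStore {
  get(key: string): string | undefined;
}

class BookingEvictedError extends Error {
  constructor(public readonly key: string) {
    super(`Booking state missing for ${key} (expired or evicted)`);
  }
}

function loadBooking(store: BookingStore, bookingId: string): string {
  const raw = store.get(`booking:${bookingId}`);
  if (raw === undefined) {
    // Surface a typed error the caller can map to a retry or a
    // user-facing failure, instead of an unhandled crash.
    throw new BookingEvictedError(`booking:${bookingId}`);
  }
  return raw;
}
```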
Why Fail-Fast is Good Engineering
The service implements the “fail loud, fail early” principle:
- Detection at startup - Problems caught before serving traffic
- Clear error message - No ambiguity about what’s wrong
- Prevents data loss - Won’t run in an unsafe configuration
- Forces proper fix - Can’t be ignored or worked around
This is production-ready design at its finest. 🎯
💥 Understanding the CrashLoopBackOff Cycle
Let’s trace what’s actually happening in your cluster:
1. Kubernetes starts the pod
2. Container starts, NestJS initializes
3. App connects to Redis and checks configuration
4. Detects: volatile-lru ≠ noeviction
5. Logs warning and exits with error code
6. Kubernetes sees exit code ≠ 0
7. Restart policy: Always → Restart the pod
8. Wait 10s, then 20s, then 40s... (exponential backoff)
9. Eventually: CrashLoopBackOff status
10. Go to step 1 (until you fix the Redis config)
The pod isn’t broken—it’s protecting you from running with dangerous configuration.
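The backoff in step 8 follows a simple doubling schedule; as a quick sketch of the delays you will observe between restarts:

```typescript
// CrashLoopBackOff restart delay: the kubelet starts at 10s, doubles
// on each crash, and caps the delay at 300s (5 minutes). The backoff
// resets once a container has run cleanly for 10 minutes.
function crashLoopDelaySeconds(restartCount: number): number {
  return Math.min(10 * 2 ** restartCount, 300);
}

// Delays for restarts 0..6: 10, 20, 40, 80, 160, 300, 300
```

This is why a misconfigured pod flaps quickly at first, then settles into a restart roughly every five minutes until the root cause is fixed.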
🚀 The Solution: Three Ways to Fix This
Let’s explore your options, from quick fixes to proper solutions:
Option 1: Fix at Redis Instance Level (Recommended) ✅
This is the proper, infrastructure-level fix that persists across restarts and deployments.
If Using Google Cloud Memorystore
Via gcloud CLI:
```shell
# List your Redis instances first
gcloud redis instances list --region=asia-southeast1

# Update the eviction policy
gcloud redis instances update blink-redis-test \
  --region=asia-southeast1 \
  --redis-config maxmemory-policy=noeviction

# Verify the change
gcloud redis instances describe blink-redis-test \
  --region=asia-southeast1 \
  --format="value(redisConfigs.maxmemory-policy)"
```
Via GCP Console:
- Navigate to Memorystore → Redis
- Select your instance (e.g., blink-redis-test)
- Click Edit
- Scroll to Redis configuration
- Add/modify: maxmemory-policy = noeviction
- Click Save
- Wait for the update to apply (~2-5 minutes)
Pros:
- ✅ Permanent fix
- ✅ Survives instance restarts
- ✅ Auditable via infrastructure-as-code
- ✅ No manual intervention needed per pod
Cons:
- ⏱️ Requires infrastructure access
- ⏱️ May need approval for production changes
Option 2: Fix via Kubernetes ConfigMap
If Redis is deployed inside Kubernetes (not a managed service):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-config
  namespace: blink-test-qa
data:
  redis.conf: |
    maxmemory 2gb
    maxmemory-policy noeviction
    save ""
```
Then reference it in your Redis deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  template:
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          command:
            - redis-server
            - /etc/redis/redis.conf
          volumeMounts:
            - name: config
              mountPath: /etc/redis
      volumes:
        - name: config
          configMap:
            name: redis-config
```
Apply with:
```shell
kubectl apply -f redis-config.yaml -n k8s-namespace
kubectl rollout restart deployment/redis -n k8s-namespace
```
Pros:
- ✅ Version controlled
- ✅ GitOps friendly
- ✅ Environment-specific configurations
Cons:
- ⚠️ Only works for self-hosted Redis
- ⚠️ Requires Redis restart
Option 3: Runtime Fix via Redis CLI (Temporary)
For immediate debugging or hot-fix scenarios:
```shell
# 1. Find the Redis pod (if in Kubernetes)
kubectl get pods -n k8s-namespace | grep redis

# 2. Connect to Redis
kubectl exec -it redis-master-0 -n k8s-namespace -- redis-cli

# 3. Or if using password authentication
kubectl exec -it redis-master-0 -n k8s-namespace -- redis-cli -a "your-password"

# 4. Set the policy
127.0.0.1:6379> CONFIG SET maxmemory-policy noeviction
OK

# 5. Verify
127.0.0.1:6379> CONFIG GET maxmemory-policy
1) "maxmemory-policy"
2) "noeviction"

# 6. Persist to disk (if CONFIG REWRITE is supported)
127.0.0.1:6379> CONFIG REWRITE
OK

# 7. Exit
127.0.0.1:6379> exit
```
For managed services (Memorystore), connect via Cloud Shell:
```shell
# 1. Get Redis IP from your service configuration
REDIS_IP="10.0.0.3"

# 2. Install redis-cli if not available
sudo apt-get install redis-tools

# 3. Connect
redis-cli -h $REDIS_IP

# 4. Follow same commands as above
```
⚠️ Warning: This fix is NOT persistent for managed services unless you also update the configuration through the provider (Option 1)—a restart will otherwise revert the setting.
Pros:
- ✅ Immediate effect
- ✅ No deployment needed
- ✅ Great for testing
Cons:
- ❌ May not persist across Redis restarts
- ❌ Not infrastructure-as-code
- ❌ Easy to forget and lose on next deployment
✅ Verification & Testing
After applying any fix, verify the change worked:
Step 1: Check Redis Configuration
```shell
# 1. Via kubectl (if Redis is in K8s)
kubectl exec -it redis-master-0 -n k8s-namespace -- redis-cli CONFIG GET maxmemory-policy

# 2. Via redis-cli (direct connection)
redis-cli -h <redis-host> CONFIG GET maxmemory-policy

# 3. Expected output:
# 1) "maxmemory-policy"
# 2) "noeviction"
```
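If you script this verification, note that CONFIG GET replies are flat name/value pairs rather than a map; a small helper to fold them into an object (a sketch — the function name is made up):

```typescript
// CONFIG GET returns a flat array of alternating parameter names and
// values, e.g. ["maxmemory-policy", "noeviction"]. Fold the pairs
// into a plain object for easy lookups.
function parseConfigReply(reply: string[]): Record<string, string> {
  const out: Record<string, string> = {};
  for (let i = 0; i + 1 < reply.length; i += 2) {
    out[reply[i]] = reply[i + 1];
  }
  return out;
}
```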
Step 2: Restart Your Application Pod
```shell
# 1. Delete the failing pod to trigger a restart
kubectl delete pod statistics-service-deployment-xxx -n k8s-namespace

# 2. Or rollout restart the deployment
kubectl rollout restart deployment/statistics-service-deployment -n k8s-namespace

# Watch the new pod come up
kubectl get pods -n k8s-namespace -w
```
Step 3: Check Application Logs
```shell
# Watch the logs in real-time
kubectl logs -f deployment/statistics-service-deployment -n k8s-namespace

# Look for success indicators:
# ✅ No "IMPORTANT! Eviction policy is volatile-lru" warning
# ✅ Application starts successfully
# ✅ Health checks passing
```
Step 4: Verify Pod Status
```shell
# Check pod status
kubectl get pods -n k8s-namespace | grep statistics

# Should show: Running (1/1)
# NOT: CrashLoopBackOff or Error
```
🎯 Best Practices & Recommendations
For Infrastructure Teams
1. Infrastructure as Code:

```hcl
# Terraform example for GCP Memorystore
resource "google_redis_instance" "cache" {
  name           = "redis-${var.environment}"
  tier           = "STANDARD_HA"
  memory_size_gb = 4

  redis_configs = {
    maxmemory-policy = "noeviction"
    # Add other configs as needed
  }
}
```

2. Document Redis dependencies in your service README:

```markdown
## Redis Configuration Requirements

- `maxmemory-policy`: Must be set to `noeviction`
- `maxmemory`: Recommended 2GB minimum
- Reason: Prevents data loss for critical booking state
```

3. Add monitoring alerts:

```yaml
# Example Prometheus alert
- alert: RedisMemoryHigh
  expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Redis memory usage above 90%"
    description: "Consider scaling Redis or implementing data cleanup"
```
For Application Teams
1. Configuration validation is good! Keep that startup check—it saved you from production issues.

2. Handle OOM errors gracefully:

```typescript
try {
  await redis.set(key, value);
} catch (error) {
  if (error.message.includes("OOM")) {
    logger.error("Redis out of memory", { key });
    // Implement fallback: cleanup old data, use alternative storage, etc.
    await cleanupOldEntries();
    // Retry or fail gracefully
  }
}
```

3. Monitor Redis metrics:
- Memory usage
- Evicted keys (should be 0 with noeviction)
- Connection errors
- Command latency
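For the eviction metric specifically, Redis exposes an evicted_keys counter in its INFO stats output, which is plain "name:value" lines. A small parser sketch for use in a health check (the function name is made up):

```typescript
// Extract a numeric counter (e.g. evicted_keys) from Redis INFO
// output. INFO is a series of "key:value" lines separated by CRLF,
// interleaved with "# Section" comment lines.
function infoCounter(info: string, name: string): number | undefined {
  for (const line of info.split(/\r?\n/)) {
    const [k, v] = line.split(":");
    if (k === name && v !== undefined) return Number(v);
  }
  return undefined;
}

// With noeviction in place, evicted_keys should stay at 0; a nonzero
// value means the policy changed or the counter predates the fix.
```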
For DevOps Teams
1. Pre-deployment checks:

```shell
#!/bin/bash
# verify-redis-config.sh

REDIS_HOST=$1
POLICY=$(redis-cli -h $REDIS_HOST CONFIG GET maxmemory-policy | tail -1)

if [ "$POLICY" != "noeviction" ]; then
  echo "❌ Error: Redis eviction policy is $POLICY, expected noeviction"
  exit 1
fi

echo "✅ Redis configuration validated"
```

2. Environment parity: Ensure dev/staging/prod all use the same Redis configuration to catch issues early.

3. Capacity planning: With noeviction, you need proper memory sizing. Monitor and scale proactively.
🎬 Conclusion
What looked like a frustrating CrashLoopBackOff was actually smart engineering preventing silent data loss. By understanding Redis eviction policies and implementing proper configuration validation, your team built a resilient system that fails safely rather than failing silently.
Key Takeaways:
- ✅ noeviction prevents silent data loss
- ✅ Configuration validation at startup is good practice
- ✅ Fix the infrastructure, not the symptoms
- ✅ Monitor Redis memory and plan capacity accordingly
- ✅ Document configuration requirements clearly
Now go forth and configure that Redis instance properly! 🚀
Questions or issues? Check your specific environment:
- Environment: k8s-namespace namespace
- Service: statistics-service
- Pod: statistics-service-deployment-xxx
🐛 Troubleshooting Common Issues
Issue: “CONFIG SET is disabled”
If you see this error when trying to change the configuration:
(error) ERR unknown command 'CONFIG', with args beginning with: 'SET' 'maxmemory-policy' 'noeviction'
Solution: This means your Redis instance has disabled the CONFIG command (common in managed services). You must use the cloud provider’s management interface (Option 1) instead.
Issue: “Connection refused” when connecting to Redis
Check 1: Network policies
```shell
# Test connectivity from your app pod
kubectl exec -it statistics-service-deployment-xxx -n k8s-namespace -- nc -zv <redis-host> 6379
```
Check 2: Firewall rules (GCP)
gcloud compute firewall-rules list | grep redis
Issue: Pod still crashing after fixing Redis config
Possible causes:
1. Old configuration cached: Restart the entire deployment

```shell
kubectl rollout restart deployment/statistics-service-deployment -n k8s-namespace
```

2. Wrong Redis instance: Verify your app is connecting to the correct Redis

```shell
# Check environment variables
kubectl get deployment statistics-service-deployment -n k8s-namespace -o yaml | grep -i redis
```
🔄 Post-Fix Validation Checklist
Use this checklist after applying the fix:
- [ ] Redis eviction policy is set to noeviction
- [ ] Pod starts successfully without CrashLoopBackOff
- [ ] No "IMPORTANT! Eviction policy…" warning in logs
- [ ] Application health checks are passing
- [ ] Service is accepting traffic (test with curl/Postman)
- [ ] Redis memory usage is monitored and alerted
- [ ] Configuration change is documented in your infra repo
- [ ] Terraform/IaC updated (if applicable)
- [ ] Similar environments (dev/stage/prod) checked for same issue