Cloud SQL for PostgreSQL Production Maintenance — A Runbook for Zero Data Loss, Zero Corruption
Scope: Scale the compute tier (vCPUs/RAM) of a Cloud SQL for PostgreSQL instance that is heavily utilized by GKE microservices with Kafka-based upstream and downstream event streams.
Key constraint: Scaling the compute tier requires an instance restart. This is not a zero-downtime operation. The entire purpose of this runbook is to make the downtime safe — guaranteeing zero data loss and zero data corruption (no torn transactions, no split-brain, no desynced Kafka offsets).
Phase 0: Rehearsal & Operational Readiness
This phase is non-negotiable. Do not execute this runbook in production without completing it first.
0.1 — Dry Run in Staging
Execute the full runbook end-to-end in a staging environment that mirrors production as closely as possible. Use synthetic Kafka traffic to simulate realistic load. The goals are:
- Validate the shutdown and bring-up sequence with your actual deployment names and namespaces.
- Measure timing: how long does the Kafka drain take? How long does the Cloud SQL restart take at your database size?
- Identify any services not covered in the shutdown list (orphan consumers, scheduled jobs in other namespaces, third-party integrations writing directly to the DB).
- Train the on-call team on the procedure.
Document the staging results: total window duration, any surprises, and adjustments made. Attach this to the change request.
0.2 — Establish the War Room
| Role | Responsibility |
|---|---|
| Runbook Lead | Calls each phase gate, makes go/no-go decisions |
| Infra Operator | Executes Terraform/gcloud commands, monitors Cloud SQL |
| Kafka Operator | Monitors consumer lag, verifies offset commits |
| GKE Operator | Scales deployments, monitors pod health |
| Comms Lead | Updates status page, coordinates with stakeholders |
Open a dedicated Slack/Teams channel. All commands and observations are posted there in real time. No tribal knowledge.
0.3 — Shared Timeline
Publish a minute-by-minute timeline to the team 24 hours before the window. Example:
00:00 Phase 1 — Pre-flight checks begin 00:15 Phase 2 — Graceful shutdown begins 00:15 Step 1: Pause ingress 00:20 Step 2: Scale down producers 00:25 Step 3: Kafka drain (variable — budget 15-30 min) 00:55 Step 4: Scale down consumers and workers 01:00 Step 5: Verify DB idleness 01:05 Phase 3 — Scaling event (budget 10-15 min) 01:20 Phase 4 — Validation and bring-up 01:20 Step 1: DB health check 01:25 Step 2: Spin up consumers 01:30 Step 3: Spin up producers 01:35 Step 4: Restore ingress 01:40 Phase 5 — Post-scaling validation 02:00 Window closes
Adjust based on your staging dry-run measurements.
Phase 1: Pre-Flight Assessment & Security Checks
1.1 — Determine Target Capacity (IaC)
Manage infrastructure via Terraform, not ClickOps. Ensure the Terraform change has been reviewed, approved, and merged to your IaC repo.
Critical: Run
1.2 — Inventory All Database Clients
Before the window, produce a complete list of every service that connects to this Cloud SQL instance. Sources:
- GKE deployments and statefulsets across all namespaces (not just production).
- CronJobs and Jobs.
- Cloud Functions, Cloud Run services, or App Engine services in the same project.
- Any external services connecting via authorized networks or Cloud SQL Auth Proxy.
- Read replica consumers (see 1.3).
If you discover clients outside the shutdown list, add them. An unknown writer is the most common cause of data corruption during maintenance.
1.3 — Audit Read Replicas
If the primary instance has read replicas:
- Read replicas do not need to be scaled in the same window, but they will experience replication lag during and after the primary restart.
- Decide whether to pause replica-dependent services (analytics dashboards, reporting pipelines) or accept stale reads during the window.
- After the primary comes back, monitor cloudsql.googleapis.com/database/postgresql/replication/replica_byte_laguntil it returns to zero.
- If you also need to scale replicas, do so after the primary is confirmed healthy and replication lag is zero.
1.4 — Verify GKE Workload Identity & Connection Pooling
- Confirm pods connect via the Cloud SQL Auth Proxy using Workload Identity (no hardcoded passwords).
- Confirm a connection pooler (PgBouncer, either as a sidecar or a central GKE service) is in place. Without it, the connection storm during bring-up can exhaust max_connectionsor OOM the database.
1.5 — Create a Manual On-Demand Backup
gcloud sql backups create --instance=my-postgres-db \ --project=my-gcp-project \ --description="Pre-scale manual backup $(date -u +%Y%m%dT%H%M%SZ)"
Wait for the backup to complete before proceeding. Verify status:
gcloud sql backups list --instance=my-postgres-db \ --project=my-gcp-project \ --sort-by=~startTime --limit=1
The status must be
1.6 — Verify Kafka Consumer Offset Commit Strategy
For every consumer group that writes to this database:
- Determine whether the consumer uses auto-commit or manual (synchronous) commit.
- If auto-commit is enabled with a long interval (e.g., 5 seconds), there is a risk window: a consumer processes a message, gets scaled down before the offset commits, and reprocesses the message on restart.
- Preferred: Consumers should use manual synchronous offset commits after each successful DB write. If this is not the case, ensure your consumers implement idempotent writes (e.g., upserts keyed on the Kafka message ID or a deduplication table).
- If neither manual commits nor idempotent writes are in place, you must accept the risk of duplicate processing on restart, or fix the consumer code before the window.
1.7 — Confirm No Pending Schema Migrations
This maintenance window is for scaling compute only. If there are pending DDL changes (migrations), either:
- Run them in a separate, earlier window so you can isolate failures, or
- Explicitly decide to combine them and document the additional risk.
Do not let a migration slip in unplanned during the bring-up phase.
1.8 — Silence and Configure Alerts
- Silence PagerDuty/Opsgenie alerts for "DB Down," "Service 5xx," and "Pod CrashLoopBackOff" during the maintenance window.
- Keep active: "Kafka Consumer Lag > threshold after maintenance window ends" and "Cloud SQL instance not RUNNING after 20 minutes."
- Set up a manual alert or timer: if Cloud SQL has not returned to RUNNINGwithin 15 minutes of the scaling command, trigger the rollback plan (Phase 3B).
Phase 1 Gate: Go / No-Go
The Runbook Lead confirms:
- Terraform plan reviewed — in-place update, not destroy/recreate
- All database clients inventoried
- Read replica strategy decided
- Connection pooler verified
- Manual backup status: SUCCESSFUL
- Consumer offset commit strategy verified (manual commit or idempotent writes)
- No pending schema migrations (or explicitly combined)
- Alerts configured
- War room open, all operators present
If any item is not confirmed, do not proceed.
Phase 2: The Graceful Shutdown Sequence ("Quiet State")
The principle: shut off the water at the source, let the pipe drain completely, then work on the pipe.
Step 1 — Pause Ingress / Upstream Traffic
Route external traffic to a static "Maintenance" page via Cloud Load Balancing. This stops new data from entering the system.
- Update your Gateway API, Ingress object, or Cloud Armor policy to route to a maintenance backend.
- Verify by hitting the public endpoint and confirming the maintenance page is served.
Step 2 — Gracefully Terminate Upstream Producers
Scale down the GKE deployments that receive user requests and write to the DB or publish to Kafka.
kubectl scale deployment <upstream-producer-1> --replicas=0 -n production kubectl scale deployment <upstream-producer-2> --replicas=0 -n production # ... repeat for all upstream producers
Why: This caps the Kafka backlog. No new events are published after this point.
Verify: Confirm all producer pods have terminated (
Step 3 — Drain the Kafka Queues (Critical)
Do not shut down downstream consumers yet. Let them continue processing the remaining messages in Kafka and writing final states to PostgreSQL.
Monitor consumer lag for every consumer group that writes to this database. Use your Kafka monitoring (Prometheus + Grafana, Confluent Control Center, Cloud Monitoring with exported Kafka metrics, or
kafka-consumer-groups.sh --bootstrap-server <broker> \ --describe --group <consumer-group-name>
Do not proceed until lag is exactly
Budget 15–30 minutes for this step based on your staging measurements. If lag is not converging toward zero, investigate stuck consumers before proceeding.
Step 4 — Verify Consumer Offset Commits
Before scaling down consumers, ensure all offsets are committed:
# For each consumer group, verify CURRENT-OFFSET matches LOG-END-OFFSET kafka-consumer-groups.sh --bootstrap-server <broker> \ --describe --group <consumer-group-name>
If you are using manual commit, trigger a final commit by sending a graceful shutdown signal (SIGTERM) to each consumer pod and allowing the shutdown hook to commit offsets. Kubernetes does this by default when scaling to 0, provided your application handles SIGTERM correctly and commits offsets in its shutdown handler.
Step 5 — Scale Down Consumers, Workers, and CronJobs
kubectl scale deployment <downstream-consumer-1> --replicas=0 -n production kubectl scale deployment <downstream-consumer-2> --replicas=0 -n production kubectl scale deployment <background-db-workers> --replicas=0 -n production # Suspend CronJobs kubectl patch cronjobs <cron-1> -p '{"spec":{"suspend":true}}' -n production kubectl patch cronjobs <cron-2> -p '{"spec":{"suspend":true}}' -n production
Also scale down any services identified in Phase 1.2 that are outside the
Step 6 — Verify Database Idleness
Confirm no active connections or long-running transactions remain.
- Check Cloud Monitoring: cloudsql.googleapis.com/database/postgresql/num_connectionsshould drop to near zero (accounting for system/superuser connections).
- Optionally connect via Cloud SQL Proxy and run:
SELECT pid, usename, application_name, state, query_start, query FROM pg_stat_activity WHERE state != 'idle' AND pid != pg_backend_pid();
If long-running transactions exist, investigate. Do not proceed with active transactions — they will be killed during restart, causing partial writes.
Phase 2 Gate: Go / No-Go
- Ingress routed to maintenance page (verified)
- All producer pods terminated
- Kafka lag = 0 for all relevant consumer groups
- Consumer offsets committed
- All consumer, worker, and CronJob pods terminated
- External clients (Cloud Functions, Cloud Run, etc.) disabled
- Database connections near zero, no active transactions
Phase 3A: The Scaling Event
The database is now isolated and quiet.
Option A: Terraform (Recommended)
Update your
resource "google_sql_database_instance" "primary" { name = "my-postgres-db" database_version = "POSTGRES_15" settings { tier = "db-custom-16-65536" # New target tier # ... other settings unchanged } }
terraform plan # Review the diff one final time terraform apply # Execute
Option B: gcloud CLI
gcloud sql instances patch my-postgres-db \ --tier=db-custom-16-65536 \ --project=my-gcp-project
What to Expect
- The instance will restart. Typical duration: 3–10 minutes depending on database size and state.
- If Regional HA is enabled, the primary will fail over to the standby. This causes connection drops (typically 30–120 seconds for the failover itself, plus Cloud SQL Auth Proxy reconnection time). Connection drops are guaranteed even with HA.
- Monitor instance status:
watch -n 10 "gcloud sql instances describe my-postgres-db \ --project=my-gcp-project --format='value(state)'"
- Start a timer when the command is issued. If the instance has not returned to RUNNINGwithin 15 minutes, proceed to Phase 3B (Rollback).
Phase 3B: Rollback Plan
This phase is only executed if the scaling event fails.
Trigger conditions:
- Instance has not returned to RUNNINGwithin 15 minutes.
- Instance reports an error state.
- Instance comes back but Cloud Logging shows critical startup errors (e.g., WAL corruption, OOM during startup).
Rollback steps:
-
Revert the tier change:
gcloud sql instances patch my-postgres-db \ --tier=<ORIGINAL_TIER> \ --project=my-gcp-projectIf the instance is responsive enough to accept a patch, this will restart it again at the original tier.
-
If the instance is unresponsive or data appears corrupted, restore from the manual backup taken in Phase 1.5:
gcloud sql backups list --instance=my-postgres-db \ --project=my-gcp-project --sort-by=~startTime --limit=3 gcloud sql backups restore <BACKUP_ID> \ --restore-instance=my-postgres-db \ --project=my-gcp-projectWarning: Restoring from backup loses all data written between the backup and the failure. Because we drained all traffic before the scaling event, this loss should be zero or near-zero.
-
If restoring to the same instance fails, create a new instance from the backup and update DNS/connection strings. This is the nuclear option and requires repointing all Cloud SQL Auth Proxy configurations.
-
Regardless of rollback path, proceed to Phase 4 (bring-up sequence) once the database is confirmed healthy.
-
Conduct a post-incident review to determine the root cause before rescheduling the scaling attempt.
Phase 4: Validation & Bring-Up Sequence
Once Cloud SQL reports
Step 1 — Database Health Check
Connect via the Cloud SQL Auth Proxy or a bastion host:
-- Basic connectivity SELECT 1; -- Verify the new tier is active SELECT version(); -- Cross-reference with Cloud Console: instance should show the new vCPU/RAM -- Check for startup anomalies -- Review Cloud Logging: resource.type="cloudsql_database" severity>=WARNING
If anything looks wrong, do not proceed. Evaluate whether Phase 3B (Rollback) is needed.
Step 2 — Spin Up Downstream Consumers First
Start the services that read from Kafka and write to the DB:
kubectl scale deployment <downstream-consumer-1> --replicas=<desired> -n production kubectl scale deployment <downstream-consumer-2> --replicas=<desired> -n production kubectl scale deployment <background-db-workers> --replicas=<desired> -n production # Unsuspend CronJobs kubectl patch cronjobs <cron-1> -p '{"spec":{"suspend":false}}' -n production kubectl patch cronjobs <cron-2> -p '{"spec":{"suspend":false}}' -n production
Why consumers first: If producers emit messages immediately, consumers must already be alive to process them. This prevents a Kafka backlog spike.
Verify: Confirm pods reach
Also re-enable any external clients (Cloud Functions, Cloud Run) disabled in Phase 2.
Step 3 — Spin Up Upstream Producers
kubectl scale deployment <upstream-producer-1> --replicas=<desired> -n production kubectl scale deployment <upstream-producer-2> --replicas=<desired> -n production
Verify: Pods are
Step 4 — Restore Ingress
Remove the maintenance routing and let external traffic flow back in:
- Revert your Gateway API / Ingress object / Cloud Armor policy to production routing.
- Verify by hitting the public endpoint and confirming the real application is served.
Step 5 — Monitor the Rewarming Period (15–30 minutes)
After traffic is restored, closely watch for the first 15–30 minutes:
- Cloud SQL: CPU utilization, memory utilization, active connections, query latency (p50, p95, p99), replication lag (if replicas exist).
- Kafka: Consumer lag across all groups — it should stay near zero. If lag is growing, consumers may be struggling with the new DB or the connection pool is saturated.
- GKE: Pod restarts, OOMKills, readiness probe failures.
- Application: Error rates (5xx), latency from the load balancer, business-critical transaction success rates.
Phase 5: Post-Scaling Validation & Closeout
5.1 — Read Replica Health (if applicable)
Monitor
5.2 — Re-Enable Alerts
Un-silence all PagerDuty/Opsgenie alerts that were suppressed in Phase 1.8.
5.3 — Secret Rotation Opportunity
Because all pods were drained and restarted, this is an ideal time to rotate Cloud SQL passwords (if you are not fully on IAM Database Authentication). Ensure pods fetch credentials from Secret Manager on startup so a rotation only requires a rolling restart.
5.4 — Update Documentation
- Update your IaC repo (Terraform should already reflect the change).
- Update any capacity planning spreadsheets or runbooks that reference the old tier.
- Update monitoring dashboards and alert thresholds if the new tier changes baseline metrics.
5.5 — Post-Maintenance Report
Within 24 hours, the Runbook Lead files a brief report:
- Actual timeline vs. planned timeline.
- Any deviations from the runbook and why.
- Issues encountered and how they were resolved.
- Recommendations for next time (e.g., "Kafka drain took 40 minutes, budget 45 next time").
Quick Reference: Shutdown and Bring-Up Order
SHUTDOWN (top to bottom): 1. Ingress → maintenance page 2. Upstream producers → replicas=0 3. Kafka queues → drain to lag=0 4. Downstream consumers → replicas=0 5. Background workers / CronJobs → replicas=0 / suspend 6. Verify DB idle SCALING EVENT BRING-UP (bottom to top): 1. Verify DB healthy 2. Downstream consumers → replicas=N 3. Background workers / CronJobs → replicas=N / unsuspend 4. Upstream producers → replicas=N 5. Ingress → restore production routing
The metaphor: shut off the water at the source, let the pipe drain completely, expand the pipe, then turn the water back on — in reverse order.