March 29, 2026By Jun Nguyen13 min read2856 words

Cloud SQL for PostgreSQL Production Maintenance — A Runbook for Zero Data Loss, Zero Corruption

Technology

Database

Postgres

Kubernetes

GCP

Scope: Scale the compute tier (vCPUs/RAM) of a Cloud SQL for PostgreSQL instance that is heavily utilized by GKE microservices with Kafka-based upstream and downstream event streams.

Key constraint: Scaling the compute tier requires an instance restart. This is not a zero-downtime operation. The entire purpose of this runbook is to make the downtime safe — guaranteeing zero data loss and zero data corruption (no torn transactions, no split-brain, no desynced Kafka offsets).

Phase 0: Rehearsal & Operational Readiness

This phase is non-negotiable. Do not execute this runbook in production without completing it first.

0.1 — Dry Run in Staging

Execute the full runbook end-to-end in a staging environment that mirrors production as closely as possible. Use synthetic Kafka traffic to simulate realistic load. The goals are:

Validate the shutdown and bring-up sequence with your actual deployment names and namespaces.
Measure timing: how long does the Kafka drain take? How long does the Cloud SQL restart take at your database size?
Identify any services not covered in the shutdown list (orphan consumers, scheduled jobs in other namespaces, third-party integrations writing directly to the DB).
Train the on-call team on the procedure.

Document the staging results: total window duration, any surprises, and adjustments made. Attach this to the change request.

0.2 — Establish the War Room

Role	Responsibility
Runbook Lead	Calls each phase gate, makes go/no-go decisions
Infra Operator	Executes Terraform/gcloud commands, monitors Cloud SQL
Kafka Operator	Monitors consumer lag, verifies offset commits
GKE Operator	Scales deployments, monitors pod health
Comms Lead	Updates status page, coordinates with stakeholders

Open a dedicated Slack/Teams channel. All commands and observations are posted there in real time. No tribal knowledge.

0.3 — Shared Timeline

Publish a minute-by-minute timeline to the team 24 hours before the window. Example:

00:00  Phase 1 — Pre-flight checks begin
00:15  Phase 2 — Graceful shutdown begins
00:15    Step 1: Pause ingress
00:20    Step 2: Scale down producers
00:25    Step 3: Kafka drain (variable — budget 15-30 min)
00:55    Step 4: Scale down consumers and workers
01:00    Step 5: Verify DB idleness
01:05  Phase 3 — Scaling event (budget 10-15 min)
01:20  Phase 4 — Validation and bring-up
01:20    Step 1: DB health check
01:25    Step 2: Spin up consumers
01:30    Step 3: Spin up producers
01:35    Step 4: Restore ingress
01:40  Phase 5 — Post-scaling validation
02:00  Window closes

Adjust based on your staging dry-run measurements.

Phase 1: Pre-Flight Assessment & Security Checks

1.1 — Determine Target Capacity (IaC)

Manage infrastructure via Terraform, not ClickOps. Ensure the Terraform change has been reviewed, approved, and merged to your IaC repo.

Critical: Run

terraform plan

before the window and inspect the diff. A tier change should show an in-place update with restart, not a destroy-and-recreate. If Terraform plans to destroy the instance, stop immediately and fix the configuration.

1.2 — Inventory All Database Clients

Before the window, produce a complete list of every service that connects to this Cloud SQL instance. Sources:

GKE deployments and statefulsets across all namespaces (not just
production
).
CronJobs and Jobs.
Cloud Functions, Cloud Run services, or App Engine services in the same project.
Any external services connecting via authorized networks or Cloud SQL Auth Proxy.
Read replica consumers (see 1.3).

If you discover clients outside the shutdown list, add them. An unknown writer is the most common cause of data corruption during maintenance.

1.3 — Audit Read Replicas

If the primary instance has read replicas:

Read replicas do not need to be scaled in the same window, but they will experience replication lag during and after the primary restart.
Decide whether to pause replica-dependent services (analytics dashboards, reporting pipelines) or accept stale reads during the window.
After the primary comes back, monitor
cloudsql.googleapis.com/database/postgresql/replication/replica_byte_lag
until it returns to zero.
If you also need to scale replicas, do so after the primary is confirmed healthy and replication lag is zero.

1.4 — Verify GKE Workload Identity & Connection Pooling

Confirm pods connect via the Cloud SQL Auth Proxy using Workload Identity (no hardcoded passwords).
Confirm a connection pooler (PgBouncer, either as a sidecar or a central GKE service) is in place. Without it, the connection storm during bring-up can exhaust
max_connections
or OOM the database.

1.5 — Create a Manual On-Demand Backup

gcloud sql backups create --instance=my-postgres-db \
    --project=my-gcp-project \
    --description="Pre-scale manual backup $(date -u +%Y%m%dT%H%M%SZ)"

Wait for the backup to complete before proceeding. Verify status:

gcloud sql backups list --instance=my-postgres-db \
    --project=my-gcp-project \
    --sort-by=~startTime --limit=1

The status must be

SUCCESSFUL

. If it fails, investigate and retry. Do not proceed without a verified backup.

1.6 — Verify Kafka Consumer Offset Commit Strategy

For every consumer group that writes to this database:

Determine whether the consumer uses auto-commit or manual (synchronous) commit.
If auto-commit is enabled with a long interval (e.g., 5 seconds), there is a risk window: a consumer processes a message, gets scaled down before the offset commits, and reprocesses the message on restart.
Preferred: Consumers should use manual synchronous offset commits after each successful DB write. If this is not the case, ensure your consumers implement idempotent writes (e.g., upserts keyed on the Kafka message ID or a deduplication table).
If neither manual commits nor idempotent writes are in place, you must accept the risk of duplicate processing on restart, or fix the consumer code before the window.

1.7 — Confirm No Pending Schema Migrations

This maintenance window is for scaling compute only. If there are pending DDL changes (migrations), either:

Run them in a separate, earlier window so you can isolate failures, or
Explicitly decide to combine them and document the additional risk.

Do not let a migration slip in unplanned during the bring-up phase.

1.8 — Silence and Configure Alerts

Silence PagerDuty/Opsgenie alerts for "DB Down," "Service 5xx," and "Pod CrashLoopBackOff" during the maintenance window.
Keep active: "Kafka Consumer Lag > threshold after maintenance window ends" and "Cloud SQL instance not RUNNING after 20 minutes."
Set up a manual alert or timer: if Cloud SQL has not returned to
RUNNING
within 15 minutes of the scaling command, trigger the rollback plan (Phase 3B).

Phase 1 Gate: Go / No-Go

The Runbook Lead confirms:

Terraform plan reviewed — in-place update, not destroy/recreate
All database clients inventoried
Read replica strategy decided
Connection pooler verified
Manual backup status: SUCCESSFUL
Consumer offset commit strategy verified (manual commit or idempotent writes)
No pending schema migrations (or explicitly combined)
Alerts configured
War room open, all operators present

If any item is not confirmed, do not proceed.

Phase 2: The Graceful Shutdown Sequence ("Quiet State")

The principle: shut off the water at the source, let the pipe drain completely, then work on the pipe.

Step 1 — Pause Ingress / Upstream Traffic

Route external traffic to a static "Maintenance" page via Cloud Load Balancing. This stops new data from entering the system.

Update your Gateway API, Ingress object, or Cloud Armor policy to route to a maintenance backend.
Verify by hitting the public endpoint and confirming the maintenance page is served.

Step 2 — Gracefully Terminate Upstream Producers

Scale down the GKE deployments that receive user requests and write to the DB or publish to Kafka.

kubectl scale deployment <upstream-producer-1> --replicas=0 -n production
kubectl scale deployment <upstream-producer-2> --replicas=0 -n production
# ... repeat for all upstream producers

Why: This caps the Kafka backlog. No new events are published after this point.

Verify: Confirm all producer pods have terminated (

kubectl get pods -n production | grep <producer>

). Watch for pods stuck in

Terminating

— if a pod does not terminate within its

terminationGracePeriodSeconds

, investigate before proceeding.

Step 3 — Drain the Kafka Queues (Critical)

Do not shut down downstream consumers yet. Let them continue processing the remaining messages in Kafka and writing final states to PostgreSQL.

Monitor consumer lag for every consumer group that writes to this database. Use your Kafka monitoring (Prometheus + Grafana, Confluent Control Center, Cloud Monitoring with exported Kafka metrics, or

kafka-consumer-groups.sh

kafka-consumer-groups.sh --bootstrap-server <broker> \
    --describe --group <consumer-group-name>

Do not proceed until lag is exactly

for all relevant consumer groups and topics.

Budget 15–30 minutes for this step based on your staging measurements. If lag is not converging toward zero, investigate stuck consumers before proceeding.

Step 4 — Verify Consumer Offset Commits

Before scaling down consumers, ensure all offsets are committed:

# For each consumer group, verify CURRENT-OFFSET matches LOG-END-OFFSET
kafka-consumer-groups.sh --bootstrap-server <broker> \
    --describe --group <consumer-group-name>

If you are using manual commit, trigger a final commit by sending a graceful shutdown signal (SIGTERM) to each consumer pod and allowing the shutdown hook to commit offsets. Kubernetes does this by default when scaling to 0, provided your application handles SIGTERM correctly and commits offsets in its shutdown handler.

Step 5 — Scale Down Consumers, Workers, and CronJobs

kubectl scale deployment <downstream-consumer-1> --replicas=0 -n production
kubectl scale deployment <downstream-consumer-2> --replicas=0 -n production
kubectl scale deployment <background-db-workers> --replicas=0 -n production

# Suspend CronJobs
kubectl patch cronjobs <cron-1> -p '{"spec":{"suspend":true}}' -n production
kubectl patch cronjobs <cron-2> -p '{"spec":{"suspend":true}}' -n production

Also scale down any services identified in Phase 1.2 that are outside the

production

namespace (Cloud Functions, Cloud Run, etc.). Disable their triggers or set concurrency to 0 as appropriate.

Step 6 — Verify Database Idleness

Confirm no active connections or long-running transactions remain.

Check Cloud Monitoring:
cloudsql.googleapis.com/database/postgresql/num_connections
should drop to near zero (accounting for system/superuser connections).
Optionally connect via Cloud SQL Proxy and run:

SELECT pid, usename, application_name, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle' AND pid != pg_backend_pid();

If long-running transactions exist, investigate. Do not proceed with active transactions — they will be killed during restart, causing partial writes.

Phase 2 Gate: Go / No-Go

Ingress routed to maintenance page (verified)
All producer pods terminated
Kafka lag = 0 for all relevant consumer groups
Consumer offsets committed
All consumer, worker, and CronJob pods terminated
External clients (Cloud Functions, Cloud Run, etc.) disabled
Database connections near zero, no active transactions

Phase 3A: The Scaling Event

The database is now isolated and quiet.

Option A: Terraform (Recommended)

Update your

google_sql_database_instance

resource:

resource "google_sql_database_instance" "primary" {
  name             = "my-postgres-db"
  database_version = "POSTGRES_15"

  settings {
    tier = "db-custom-16-65536"  # New target tier
    # ... other settings unchanged
  }
}

terraform plan   # Review the diff one final time
terraform apply  # Execute

Option B: gcloud CLI

gcloud sql instances patch my-postgres-db \
    --tier=db-custom-16-65536 \
    --project=my-gcp-project

What to Expect

The instance will restart. Typical duration: 3–10 minutes depending on database size and state.
If Regional HA is enabled, the primary will fail over to the standby. This causes connection drops (typically 30–120 seconds for the failover itself, plus Cloud SQL Auth Proxy reconnection time). Connection drops are guaranteed even with HA.
Monitor instance status:

watch -n 10 "gcloud sql instances describe my-postgres-db \
    --project=my-gcp-project --format='value(state)'"

Start a timer when the command is issued. If the instance has not returned to
RUNNING
within 15 minutes, proceed to Phase 3B (Rollback).

Phase 3B: Rollback Plan

This phase is only executed if the scaling event fails.

Trigger conditions:

Instance has not returned to
RUNNING
within 15 minutes.
Instance reports an error state.
Instance comes back but Cloud Logging shows critical startup errors (e.g., WAL corruption, OOM during startup).

Rollback steps:

Revert the tier change:
```
gcloud sql instances patch my-postgres-db \
    --tier=<ORIGINAL_TIER> \
    --project=my-gcp-project
```
If the instance is responsive enough to accept a patch, this will restart it again at the original tier.
If the instance is unresponsive or data appears corrupted, restore from the manual backup taken in Phase 1.5:
```
gcloud sql backups list --instance=my-postgres-db \
    --project=my-gcp-project --sort-by=~startTime --limit=3

gcloud sql backups restore <BACKUP_ID> \
    --restore-instance=my-postgres-db \
    --project=my-gcp-project
```
Warning: Restoring from backup loses all data written between the backup and the failure. Because we drained all traffic before the scaling event, this loss should be zero or near-zero.
If restoring to the same instance fails, create a new instance from the backup and update DNS/connection strings. This is the nuclear option and requires repointing all Cloud SQL Auth Proxy configurations.
Regardless of rollback path, proceed to Phase 4 (bring-up sequence) once the database is confirmed healthy.
Conduct a post-incident review to determine the root cause before rescheduling the scaling attempt.

Phase 4: Validation & Bring-Up Sequence

Once Cloud SQL reports

RUNNING

, reverse the shutdown sequence. The order matters: consumers before producers.

Step 1 — Database Health Check

Connect via the Cloud SQL Auth Proxy or a bastion host:

-- Basic connectivity
SELECT 1;

-- Verify the new tier is active
SELECT version();
-- Cross-reference with Cloud Console: instance should show the new vCPU/RAM

-- Check for startup anomalies
-- Review Cloud Logging: resource.type="cloudsql_database" severity>=WARNING

If anything looks wrong, do not proceed. Evaluate whether Phase 3B (Rollback) is needed.

Step 2 — Spin Up Downstream Consumers First

Start the services that read from Kafka and write to the DB:

kubectl scale deployment <downstream-consumer-1> --replicas=<desired> -n production
kubectl scale deployment <downstream-consumer-2> --replicas=<desired> -n production
kubectl scale deployment <background-db-workers> --replicas=<desired> -n production

# Unsuspend CronJobs
kubectl patch cronjobs <cron-1> -p '{"spec":{"suspend":false}}' -n production
kubectl patch cronjobs <cron-2> -p '{"spec":{"suspend":false}}' -n production

Why consumers first: If producers emit messages immediately, consumers must already be alive to process them. This prevents a Kafka backlog spike.

Verify: Confirm pods reach

Running

state and are passing readiness probes. Check Cloud SQL connection count is rising as expected (but not spiking to

max_connections

— if it is, your PgBouncer config needs adjustment).

Also re-enable any external clients (Cloud Functions, Cloud Run) disabled in Phase 2.

Step 3 — Spin Up Upstream Producers

kubectl scale deployment <upstream-producer-1> --replicas=<desired> -n production
kubectl scale deployment <upstream-producer-2> --replicas=<desired> -n production

Verify: Pods are

Running

, readiness probes pass, Kafka topics are receiving messages.

Step 4 — Restore Ingress

Remove the maintenance routing and let external traffic flow back in:

Revert your Gateway API / Ingress object / Cloud Armor policy to production routing.
Verify by hitting the public endpoint and confirming the real application is served.

Step 5 — Monitor the Rewarming Period (15–30 minutes)

After traffic is restored, closely watch for the first 15–30 minutes:

Cloud SQL: CPU utilization, memory utilization, active connections, query latency (p50, p95, p99), replication lag (if replicas exist).
Kafka: Consumer lag across all groups — it should stay near zero. If lag is growing, consumers may be struggling with the new DB or the connection pool is saturated.
GKE: Pod restarts, OOMKills, readiness probe failures.
Application: Error rates (5xx), latency from the load balancer, business-critical transaction success rates.

Phase 5: Post-Scaling Validation & Closeout

5.1 — Read Replica Health (if applicable)

Monitor

cloudsql.googleapis.com/database/postgresql/replication/replica_byte_lag

until it returns to zero. If replicas need scaling too, execute that as a separate operation now.

5.2 — Re-Enable Alerts

Un-silence all PagerDuty/Opsgenie alerts that were suppressed in Phase 1.8.

5.3 — Secret Rotation Opportunity

Because all pods were drained and restarted, this is an ideal time to rotate Cloud SQL passwords (if you are not fully on IAM Database Authentication). Ensure pods fetch credentials from Secret Manager on startup so a rotation only requires a rolling restart.

5.4 — Update Documentation

Update your IaC repo (Terraform should already reflect the change).
Update any capacity planning spreadsheets or runbooks that reference the old tier.
Update monitoring dashboards and alert thresholds if the new tier changes baseline metrics.

5.5 — Post-Maintenance Report

Within 24 hours, the Runbook Lead files a brief report:

Actual timeline vs. planned timeline.
Any deviations from the runbook and why.
Issues encountered and how they were resolved.
Recommendations for next time (e.g., "Kafka drain took 40 minutes, budget 45 next time").

Quick Reference: Shutdown and Bring-Up Order

SHUTDOWN (top to bottom):
  1. Ingress → maintenance page
  2. Upstream producers → replicas=0
  3. Kafka queues → drain to lag=0
  4. Downstream consumers → replicas=0
  5. Background workers / CronJobs → replicas=0 / suspend
  6. Verify DB idle

SCALING EVENT

BRING-UP (bottom to top):
  1. Verify DB healthy
  2. Downstream consumers → replicas=N
  3. Background workers / CronJobs → replicas=N / unsuspend
  4. Upstream producers → replicas=N
  5. Ingress → restore production routing

The metaphor: shut off the water at the source, let the pipe drain completely, expand the pipe, then turn the water back on — in reverse order.