Runbook: Customer Reconnection Steps After Large-Scale Wireless Outages
Operational runbook for carriers: orchestrate device and network reconnections, messaging, credits and telemetry after multi-hour wireless outages.
When eight-hour outages cost millions — a pragmatic runbook to get customers reconnected
Large-scale wireless outages in 2025–2026 (see major national incidents in January 2026) exposed a hard truth: restoring core services is only half the battle. The other half is orchestrating device reconnections, customer communication, automated crediting and telemetry to prove recovery. This operational runbook gives ISPs and carriers a step-by-step checklist to reduce MTTR, avoid call-center overload and protect revenue and reputation after multi-hour outages.
Why this matters now (2026 trends)
In late 2025 and early 2026 carriers faced wider blast-radius software failures rather than localized hardware or weather events. With 5G standalone (SA) deployments and tighter dependence on cloud-based network functions (CNFs), a single software rollout can create national scale impact. Regulators and customers now expect faster remediation, transparent compensation and forensic telemetry. This runbook is built for that reality: cloud-native cores, automated CI/CD, and an expectation of integrated remediation and billing automation.
Scope and assumptions
- Scope: Multi-hour nationwide/regional wireless outage that has been resolved in the core, but subscriber devices still need to reattach or be compensated.
- Assumptions: Core network (MME/AMF, HSS/UDM, PCRF/PCF, IMS) is declared healthy and stable. Billing and OSS/BSS systems are accessible. You have tooling for bulk messaging and telemetry (prometheus/splunk/ELK + OSS APIs).
High-level phases
- Contain & confirm: Validate core recovery and prevent further configuration drift.
- Pre-reconnect validation: Check signaling, authentication, and AAA behavior before telling customers to restart.
- Device-side orchestration: Controlled reattach methods and customer guidance.
- Network-side remediation: Targeted detach/reattach, policy pushes, and SIM reprovisioning.
- Customer messaging & support scripts: Clear templates and escalation criteria.
- Compensation & credits: Automated, auditable crediting workflow.
- Telemetry & verification: Real-time dashboards, success rate KPIs and post-mortem evidence.
- Contingency & legal: Fraud protection, regulatory notification, and public relations coordination.
Phase 1 — Contain & confirm (first 0–30 mins)
Before instructing customers to reboot or initiating mass changes, ensure the incident didn’t mask ongoing instability. Premature mass restarts can create a second outage spike.
- Confirm stability of core control planes: Verify MME/AMF/SMF, HSS/UDM, and DNS have been stable over at least a one-minute rolling window with no exception spikes.
- Check third-party dependencies: Authentication providers, cloud VPCs, and RIC/SMO orchestration. Validate that a recent software rollback is complete and health checks pass.
- Enable verbose logging for attach/auth: Temporarily increase sampling of attach/auth flows in core controllers to capture 1–5% full traces for forensic analysis.
- Lock CI/CD pipelines: Block further deployments into core domains until post-mortem concludes.
Phase 2 — Pre-reconnect validation (30–90 mins)
These tests prevent a wave of failed reattaches. Automate them and make results visible to support and leadership.
- Sanity checks:
  - Successful IMS/SIP REGISTER for test subscribers across regions.
  - End-to-end voice/SMS/data microflows from the automation fleet (real devices and emulators).
  - Authentication reject ratio below 1% (expected baseline).
- Attach stress tests: Ramp a small cohort (1–5k simulated attaches) to monitor control-plane capacity and attach error codes (e.g., EMM cause codes, 3GPP reject reasons).
- Billing/Policy verification: Confirm that PCRF/PCF and billing connectors accept new sessions and that session records reach charging systems.
- Approval gates: Only if attach success & authentication return to baseline for a 10-minute window do you advance to customer messaging/reconnect steps.
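The approval gate above can be sketched as a small check over rolling metric samples. A minimal illustration, assuming one sample per minute; the threshold values (`baseline_success`, `max_reject`) are placeholders, not measured baselines:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    attach_success_rate: float  # 0.0-1.0, fraction of attach attempts that succeeded
    auth_reject_rate: float     # 0.0-1.0, fraction of auth attempts rejected

def gate_passes(samples, baseline_success=0.98, max_reject=0.01, window=10):
    """Return True only if the last `window` samples all meet baseline.

    Mirrors the approval gate: attach success and auth rejects must hold
    at baseline for the full 10-minute window before advancing to
    customer messaging.
    """
    recent = samples[-window:]
    if len(recent) < window:
        return False  # not enough history yet - keep waiting
    return all(
        s.attach_success_rate >= baseline_success and s.auth_reject_rate <= max_reject
        for s in recent
    )
```

In practice the samples would come from the attach-rate query in the telemetry phase; the gate itself should be automated so the decision is auditable rather than eyeballed.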
Phase 3 — Device-side orchestration (bulk reconnection)
Devices vary: smartphones, fixed wireless CPE, IoT. Use tiered reconnect approaches to avoid surges and to maximize success.
Step A — Segmented rollout
- Segment population by priority: enterprise accounts, legacy prepaid, high-value subscribers, geographies.
- Stagger messages and restart prompts by segment (e.g., 10% cohorts every 5 minutes).
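The 10%-cohort / 5-minute stagger can be expressed as a small scheduler. A sketch only, with `cohort_pct` and `interval_min` taken from the example parameters above:

```python
from datetime import datetime, timedelta

def stagger_schedule(subscribers, cohort_pct=10, interval_min=5, start=None):
    """Split subscribers into fixed-size cohorts and assign each a send time.

    Returns a list of (send_time, cohort) tuples, e.g. 10% cohorts
    every 5 minutes as in the segmented rollout above.
    """
    start = start or datetime.utcnow()
    size = max(1, len(subscribers) * cohort_pct // 100)
    schedule = []
    for i, offset in enumerate(range(0, len(subscribers), size)):
        cohort = subscribers[offset:offset + size]
        schedule.append((start + timedelta(minutes=i * interval_min), cohort))
    return schedule
```

The important property is that each cohort's send time is fixed up front, so the next cohort can be held back if the attach dashboards degrade between sends.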
Step B — Preferred reconnect methods
- Customer device restart guidance (lowest friction):
Send an SMS + email + in-app banner asking the customer to power-cycle or toggle airplane mode. Provide a one-step script for non-technical users:
"We're back online — please restart your phone or toggle Airplane Mode ON then OFF to reconnect. If that doesn't work, open Settings → Network → Reset Network Settings."
- Silent network triggers (network-initiated reattach):
For managed devices and CPE, send an OMA-DM or proprietary MDM command to reboot the device or refresh network config. Use verification callbacks to confirm success.
- Forced reattach via detach:
Where supported, issue a network-initiated detach to a subset of IMSIs to force devices to reattach under healthy control-plane logic. Use vendor OSS CLI tools; example pseudo-command:
# Pseudo: force detach a cohort
network-cli detach --imsi-list cohort-a.imsi --reason operator-initiated
Always monitor attach result counts and error codes per cohort before progressing.
Step C — Fallbacks for stubborn devices
- Offer an automated callback or scheduling for technician visits if a device remains unreachable after network-side attempts.
- Provision temporary eSIM reactivation tokens for customers who cannot reattach due to corrupted provisioning state.
- Escalate persistent attach failures to a SIM re-provisioning flow: re-issue AKA keys or push a new subscription profile.
Phase 4 — Network-side remediation
These are targeted operations on the network to restore attach and service continuity.
- Clear stale sessions: Sweep and clear orphaned S1/N2 sessions and stale GTP tunnels that prevent new attaches.
- Restart affected CNFs selectively: Use blue/green or canary rollback patterns to avoid wholesale restarts. Document each service restart with timestamps and change IDs.
- Adjust rate-limiting & queue thresholds: Temporarily relax non-critical throttles that could block legitimate reattaches (but monitor for abuse).
- Sync provisioning caches: Ensure distributed UDM/HSS caches are invalidated or warmed; run a cache warmup script to sync subscriber records.
- Throttle inbound reattach bursts: Use queue shaping to smooth incoming load from mass device restarts and avoid control-plane saturation.
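Queue shaping for reattach bursts is commonly implemented as a token bucket. A minimal sketch with illustrative rate and capacity values; tune both to measured control-plane headroom:

```python
class TokenBucket:
    """Simple token bucket to smooth inbound reattach bursts."""

    def __init__(self, rate_per_sec=500, capacity=2000):
        self.rate = rate_per_sec       # sustained attaches per second
        self.capacity = capacity       # maximum burst size
        self.tokens = float(capacity)  # start full
        self.last = 0.0                # timestamp of last refill

    def allow(self, now):
        # Refill based on elapsed time, then spend one token per attach.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed or queue this attach for a later retry
```

Rejected attaches should be queued with backoff rather than dropped, so devices eventually reattach without the operator re-prompting them.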
Phase 5 — Customer messaging & support playbook
Messaging must be consistent across channels (SMS, email, push, web, IVR, social). Provide CSRs with concise scripts and escalation criteria.
Templates (shortened)
- SMS (urgent, short):
"Service update: Our network is restored. Please restart your device or toggle Airplane mode. If still offline, visit carrier.example/reconnect or call 1-800-XXX-XXXX."
- Email (detailed):
Include timeline, what was affected, steps to reconnect (with images), compensation policy, and links to self-help tools and live support chat.
- CSR script:
  - Confirm the customer's device and last-seen attach timestamp.
  - Ask them to toggle Airplane Mode; if still failing, walk through network reset steps or schedule field support.
  - If eligible, initiate the crediting flow and inform the customer of the timeframe (e.g., up to 3 billing cycles for the credit to appear).
Phase 6 — Credits & compensation (auditable automation)
Compensation is both a customer-experience lever and a regulatory/financial risk. Automate credits but keep human oversight for exceptions.
Principles
- Predictable policy: Flat credit per affected time window (example: $20 after a multi-hour outage as used by major carriers in Jan 2026) or pro-rata based on minutes lost.
- Idempotent automation: Credits must be applied once per customer per incident. Use unique incident IDs and token-based reconciliation.
- Verification: Only apply credits to accounts confirmed as affected via last-known attach timestamp and support logs to prevent fraud.
- Audit trail: Log every credit with who/what triggered it, timestamps, and a reference to the outage incident ID.
Automated credit workflow (example)
- Generate incident ID and declare credit policy (flat/pro-rata).
- Query OSS for subscribers whose last successful attach predates the outage start and who did not reattach until after the outage end, or who have failed attach attempts recorded during the window.
- Run a dry-run to estimate credit amount and cost exposure; require billing ops sign-off above threshold.
- Apply credits in bulk via secure billing APIs. Example pseudo-API call:
POST /billing/v1/credits
{
  "incident_id": "OUT-2026-01-xx",
  "amount": 20.00,
  "reason": "Service outage compensation",
  "subscribers": ["sub-IMSIs-or-IDs"...]
}
- Send notification to affected subscribers with credit details and expected posting date.
- Reconcile and publish KPIs on credit uptake, disputes and overall cost.
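The idempotency principle above can be illustrated with a hypothetical in-memory ledger. A real implementation would sit behind the billing API, but the duplicate-suppression logic is the same:

```python
class CreditLedger:
    """Idempotent credit application keyed on (incident_id, subscriber_id).

    A hypothetical stand-in for the billing API: applying the same
    credit twice is a no-op, matching the once-per-customer-per-incident
    rule in the principles above.
    """

    def __init__(self):
        self._applied = {}

    def apply(self, incident_id, subscriber_id, amount):
        key = f"{incident_id}-{subscriber_id}"  # idempotency key
        if key in self._applied:
            # Duplicate request: return the original credit, flag not re-applied.
            return self._applied[key], False
        self._applied[key] = amount
        return amount, True

    def total_exposure(self):
        # Dry-run style aggregate for billing-ops sign-off above threshold.
        return sum(self._applied.values())
```

Because retries are safe, the bulk credit job can simply be re-run after any partial failure, which is what makes the audit trail reconcilable.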
Phase 7 — Telemetry, KPIs and verification
Telemetry proves your recovery and feeds the post-mortem. Design dashboards ahead of time that you can switch to incident-mode.
Top KPIs to monitor (real-time)
- Attach success rate: Percent of attempted attaches that succeed — goal: back to baseline within SLA window.
- Time-to-first-successful-attach: Median time from restart prompt to successful attach.
- Control-plane latency: Auth & session setup latency percentiles (p50/p95/p99).
- Service health by region: Map of successful voice/data sessions per region.
- Support volume & wait time: Calls, chats and tickets per minute and average handle time.
- Credit reconciliation: Count of credits issued, failures and disputes.
Example PromQL & Splunk searches (pseudo)
# Prometheus: attach success rate
sum(rate(attach_success_total[5m]))/sum(rate(attach_attempt_total[5m]))
# Splunk: failed attach reason codes last 60m
index=network attach_status=failed | stats count by attach_cause_code
Phase 8 — Post-incident review & future prevention
A technical post-mortem must be accompanied by an operational and CX review. Tie outcomes to remediation projects and KPIs for the next quarter.
- Cross-functional incident replay: Ops, network eng, security, billing, legal and PR review timeline and decisions.
- Root cause and corrective actions: Identify systemic fixes (e.g., deployment gates, additional smoke tests, back-out procedures).
- Runbook updates: Capture new scripts, CLI commands, API endpoints and lessons learned into version-controlled runbook docs.
- Readiness drills: Schedule quarterly simulated nationwide reconnection drills using synthetic traffic to validate runbook effectiveness.
Cross-functional RACI (quick)
- Network Ops: R for core checks, detach/reattach and CNF restarts.
- Platform/SRE: A for automation scripts, telemetry dashboards, rate control.
- Customer Support: R for messaging, CSR scripts, escalations and ticket handling.
- Billing: A/R for credits workflow and reconciliation.
- Legal & PR: C for public statements and regulatory notifications.
Operational examples & mini case study
Example: During the January 2026 national incident, a large carrier validated core health then rolled a segmented customer restart plan. They used a 10% cohort stagger to avoid control-plane overload and issued a flat $20 credit to affected customers — communicated via SMS and email. The approach reduced CSR calls by 45% and restored 85% of subscribers within 40 minutes of the first restart cohort being prompted. Documentation from that event shows this pattern — validate core, stagger reconnects, automate credits — consistently reduces MTTR and support load.
Security, fraud and compliance guardrails
- Protect credit issuance: Use multi-factor authorization for bulk credit operations and pre-check lists to avoid mass mis-crediting.
- Privacy: Ensure messages and automated calls don't leak sensitive profiling information. Follow CPNI/data protection rules.
- Audit logs: Retain 1:1 mapping of who/what/when for all automated and manual remediation actions for regulatory audits.
Automation playbook snippets
Automate with short, idempotent jobs. Below are pseudo-examples to illustrate flow control — adapt to your tooling and vendor APIs.
1) Cohort selector (pseudo SQL)
-- select 10% cohort of affected subscribers
WITH affected AS (
SELECT subscriber_id, last_attach_ts
FROM subscriber_state
WHERE last_attach_ts BETWEEN '2026-01-xxT00:00Z' AND '2026-01-xxT08:00Z'
)
SELECT subscriber_id
FROM affected
WHERE MOD(HASH(subscriber_id), 10) = 0;
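A Python equivalent of the hash-based cohort selection above. Note it uses sha256 rather than Python's built-in `hash()`, which is salted per process and would reshuffle cohorts between runs:

```python
import hashlib

def in_cohort(subscriber_id, cohort, n_cohorts=10):
    """Deterministically assign a subscriber to one of n_cohorts buckets.

    sha256 gives a stable assignment across processes and reruns, so a
    cohort selected on Monday is the same cohort on Tuesday's retry.
    """
    digest = hashlib.sha256(str(subscriber_id).encode()).hexdigest()
    return int(digest, 16) % n_cohorts == cohort
```

Stability matters operationally: if the selector re-hashed differently on each run, retrying a cohort would message a different 10% of subscribers.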
2) Idempotent credit API (pseudo)
POST /billing/credits
{
"incident_id":"OUT-2026-01-xx",
"subscriber_id":"12345",
"amount":20.00,
"idempotency_key":"OUT-2026-01-xx-12345"
}
Checklist — Quick reference (printable)
- Confirm core stability & lock deployments.
- Run attach sanity & stress tests.
- Approve segmented reconnect plan and messaging.
- Execute cohort restarts (SMS + MDM/OMA-DM where available).
- Throttle & monitor control-plane load.
- Issue credits via idempotent billing API; notify customers.
- Monitor attach rate, support volume and credit reconciliation dashboards.
- Initiate post-incident review and update runbooks.
Final recommendations & predictions (2026+)
Expect regulators to increasingly demand demonstrable reconciliation of outages and clear customer remediation. Best-practice trends for 2026 include: automated, auditable credit workflows; pre-approved segmented reconnect policies; and integrated telemetry that ties control-plane state directly to customer-facing communications. Carriers investing in automated reconnect orchestration and incident-mode dashboards will see meaningful reductions in MTTR, CSR load and legal exposure in the year ahead.
Actionable takeaways
- Do not instruct mass restarts until core health gates pass.
- Use segmented rollouts (cohorts) to prevent control-plane surges.
- Automate credits with idempotency keys and pre-flight verification to reduce disputes.
- Instrument attach flows for immediate visibility — success-rate dashboards are your proof to regulators and customers.
Call to action
Need a ready-to-run template tailored to your OSS/BSS and CI/CD stack? Download the incident-ready reconnect automation pack from quickfix.cloud or schedule a 30-minute incident readiness review with our carrier ops team. Convert lessons from past outages into a deterministic, auditable remediation machine — before the next one hits.