Runbook: Troubleshooting Unexpected Timing Violations in AUTOSAR ECUs

2026-02-25

Operational runbook to investigate AUTOSAR ECU timing violations using VectorCAST + RocqStat, with root-cause steps and automated remediation playbooks.

Stop firefighting timing violations: get to root cause fast

Unexpected timing violations in AUTOSAR ECUs drive up MTTR, erode customer trust, and can trigger expensive recalls. In 2026, with Vector's acquisition of RocqStat and tighter safety requirements, teams need to move beyond ad-hoc fixes and adopt a repeatable runbook that combines static WCET analysis, runtime observability, and automated remediation playbooks. This runbook gives SREs, embedded engineers, and test leads a step-by-step method to verify, triage, and remediate ECU timing failures using VectorCAST and RocqStat outputs, runtime traces, and CI/CD enforcement.

What changed in 2026 and why it matters now

Late 2025 and early 2026 brought two industry shifts that change how timing violations should be handled:

Vector/StatInf (RocqStat) consolidation — Vector's acquisition of RocqStat (announced January 2026) means tighter coupling of static WCET tools into VectorCAST pipelines, enabling unified analysis and traceability across verification and timing domains (Automotive World, Jan 2026).
Observability-first embedded fleets — ECUs now ship richer telemetry (task histograms, execution traces, ISR durations) to cloud observability backends, so you can correlate static estimates with production behavior in near real-time.

Runbook overview — immediate goals

This runbook focuses on three operational goals:

Confirm that a timing violation is real and reproducible.
Root-cause it by correlating RocqStat/VectorCAST outputs with runtime traces and code changes.
Remediate using fast mitigations, CI gating, and automated playbooks that preserve safety and compliance.

Quick taxonomy: timing events you will see

Deadline miss: a task or runnable completes after its real-time deadline.
Overrun: the observed execution time exceeds the static WCET bound or the budget for the expected worst-case path.
Priority inversion / preemption issue: a higher-priority job is blocked by a lower-priority job that holds a shared resource, or is delayed by a long-running ISR.
System overload: CPU, DMA, or bus contention causes system-wide delays.

Step 0 — Triage checklist (first 15 minutes)

Confirm the alert: check the original alert source (CAN trace, telemetry alert, Fleet monitoring dashboard).
Capture and snapshot all artifacts: ECU firmware version, build ID, VectorCAST test report, RocqStat timing report, trace logs, crash dumps.
Identify and isolate the affected fleet subset: device IDs, VINs, build variants, and configuration flags.
Apply a short-term mitigation (if safety-critical): issue a configuration switch to disable the affected feature, or increase the soft deadline if the product spec allows it.
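
The snapshot step in this checklist can be scripted so every incident starts from the same evidence bundle. The following is a minimal sketch in Python, assuming the artifacts already exist as local files. The function name `snapshot_artifacts` and the manifest layout are illustrative assumptions, not part of VectorCAST or RocqStat.

```python
import hashlib
import time
from pathlib import Path


def snapshot_artifacts(paths, firmware_version, build_id):
    """Return a manifest dict with a SHA-256 digest for each artifact file.

    Missing files are listed under "missing" instead of raising, so gaps in
    the evidence bundle are visible immediately.
    """
    manifest = {
        "firmware_version": firmware_version,
        "build_id": build_id,
        "captured_at_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "artifacts": {},
        "missing": [],
    }
    for raw in paths:
        path = Path(raw)
        if not path.is_file():
            manifest["missing"].append(str(path))
            continue
        manifest["artifacts"][str(path)] = {
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            "size_bytes": path.stat().st_size,
        }
    return manifest
```

Serialising the returned dict with `json.dumps` and attaching it to the incident ticket gives reviewers a way to confirm later that the reports and traces under analysis are the ones captured at incident time.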

Step 1 — Confirm: reproduce the violation

Don't assume that a production trace describes conditions you can reproduce. Use this order:

Replay traces on a local or cloud ECU simulator (Vector CANoe, VectorCAST runtime) to recreate the sequence.
Run the same workload on a hardware-in-the-loop (HIL) rig or passthrough ECU and capture task-level traces (Percepio Tracealyzer or Vector tracing).
Compare the runtime maximum observed execution time (MOET) with the RocqStat WCET and the VectorCAST test paths.

Artifacts to compare

RocqStat report: call-tree WCETs, path IDs, assumptions.
VectorCAST results: unit/integration tests, code coverage on routines flagged by RocqStat.
Runtime traces: task_start/end timestamps, ISR durations, CPU load samples.
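
The MOET-versus-WCET comparison can be sketched in a few lines of Python. This assumes trace events have already been parsed into `(timestamp_us, task, kind)` tuples and that static bounds are available as a plain dict. Both layouts are assumptions made for illustration, not export formats of any specific tool.

```python
from collections import defaultdict


def max_observed_exec_times(events):
    """Compute MOET per task from (timestamp_us, task, kind) tuples.

    kind is "start" or "end". Durations are wall-clock time between start and
    end, so preemption time is included; net execution time would also need
    ISR and preemption events.
    """
    open_starts = {}
    moet = defaultdict(int)
    for ts, task, kind in sorted(events):
        if kind == "start":
            open_starts[task] = ts
        elif kind == "end" and task in open_starts:
            duration = ts - open_starts.pop(task)
            moet[task] = max(moet[task], duration)
    return dict(moet)


def find_overruns(moet_us, wcet_us):
    """Return {task: (moet, wcet)} for tasks whose MOET exceeds the static bound."""
    return {
        task: (observed, wcet_us[task])
        for task, observed in moet_us.items()
        if task in wcet_us and observed > wcet_us[task]
    }
```

Tasks that appear in the traces but have no static bound are skipped by `find_overruns`; in practice those are worth listing separately, because they point to analysis coverage gaps.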

Step 2 — Static vs Dynamic: where the mismatch is

Decide which side of the analysis is the root cause.

Static underestimation: RocqStat WCET is too low for actual runtime paths. Indicators: MOET consistently exceeds static WCET for specific call-tree IDs.
Dynamic anomaly: Runtime path includes unexpected behavior (new ISR, higher frequency messages, hardware contention). Indicators: spikes in ISR time, DMA stalls, or increased input rates.
Regression introduced in build: A recent commit affects timing (e.g., new algorithm, logging, or disabled compiler optimization).
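
The three cases above can be turned into a first-pass heuristic that suggests where to look first. The sketch below is only a starting point: the function name, the inputs, and the 10% tolerance are assumptions, and the result is a list of hypotheses to check, not a diagnosis.

```python
def classify_mismatch(moet_us, wcet_us, isr_us, baseline_isr_us,
                      prev_build_moet_us=None, tolerance=0.10):
    """Return candidate causes for a static/dynamic timing mismatch.

    build_regression:       MOET grew noticeably versus the previous build.
    dynamic_anomaly:        ISR time grew noticeably versus its baseline.
    static_underestimation: MOET exceeds the static WCET and neither of the
                            other two explanations applies.
    """
    causes = []
    if (prev_build_moet_us is not None
            and moet_us > prev_build_moet_us * (1 + tolerance)):
        causes.append("build_regression")
    if isr_us > baseline_isr_us * (1 + tolerance):
        causes.append("dynamic_anomaly")
    if moet_us > wcet_us and not causes:
        causes.append("static_underestimation")
    return causes
```

Static underestimation is deliberately the fallback: questioning the WCET model is usually more expensive than ruling out a build regression or a runtime anomaly first.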

Step 3 — Root-cause checklist (detailed detective work)

Work through these checks in parallel where possible:

Call-tree link: From the RocqStat report, identify the hot call-tree ID and map it to source files and functions. RocqStat gives per-path WCET annotations—use them to prioritize.
Instrumentation gaps: Ensure your runtime telemetry includes per-runnable timing. If not, instrument and redeploy to a subset of devices.
Scheduling and priorities: Check for changes to the AUTOSAR OS configuration (OsTaskPriority, OsAlarm) and for recent changes to AUTOSAR modules (RTE, Com, PduR).
ISR analysis: Use trace to determine if a new or extended ISR is preempting tasks. Look for high-frequency CAN frames, sensor bursts, or watchdog triggers.
Compiler/runtime flags: Confirm compiler version and optimization flags. Some optimizations change instruction timing; RocqStat assumptions must match the actual toolchain.
Memory/cache effects: Cold-cache vs warm-cache runs can explain differences—pay attention to WCET bounding assumptions in RocqStat (cache model).
Hardware load: Bus contention, DMA throughput or peripheral issues can extend execution time indirectly.
Input-rate change: Validate sensor/input frequency from fleet telemetry; higher event rates can push worst-case activation sequences into reality.
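
For the ISR analysis item, it helps to quantify how much of a task's wall-clock window was consumed by interrupts. The sketch below assumes the trace has been reduced to `(start, end)` intervals in a common time unit and that ISR intervals do not overlap each other (no nested interrupts). The function names are illustrative.

```python
def isr_time_within(task_window, isr_intervals):
    """Sum the time that ISRs ran inside a task's (start, end) window.

    ISR intervals are assumed not to overlap each other; nested ISRs would
    be double-counted by this simple approach.
    """
    t_start, t_end = task_window
    total = 0
    for i_start, i_end in isr_intervals:
        overlap = min(t_end, i_end) - max(t_start, i_start)
        if overlap > 0:
            total += overlap
    return total


def net_execution_time(task_window, isr_intervals):
    """Task wall-clock duration minus ISR time spent inside the window."""
    t_start, t_end = task_window
    return (t_end - t_start) - isr_time_within(task_window, isr_intervals)
```

If the net execution time stays within the static WCET while the wall-clock duration does not, the evidence points toward interrupt load rather than the task's own code path.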

Step 4 — Quick mitigations (reduce risk while investigating)

Implement temporary, safety-first mitigations with traceable change control:

Throttle input rates (sensor sampling) or apply an ECU configuration to drop non-essential messages.
Increase task deadlines or temporarily extend watchdog timeouts where the safety specs allow it.
Disable optional features that increase load (debug logs, non-critical services).
Deploy a canary firmware with extra logging to a small fleet segment to gather more detailed data.
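
Choosing the canary segment deterministically keeps the cohort stable across repeated rollouts. One common approach, sketched here in Python, hashes each device ID into a bucket. The salt value and the function name `in_canary` are assumptions; a real OTA backend may offer its own cohort selection.

```python
import hashlib


def in_canary(device_id, rollout_percent, salt="timing-canary"):
    """Deterministically decide whether a device belongs to the canary cohort.

    The device ID is hashed with a salt into one of 10,000 buckets, so the
    same device always gets the same answer, and raising rollout_percent
    only ever adds devices to the cohort.
    """
    digest = hashlib.sha256(f"{salt}:{device_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < rollout_percent * 100
```

Changing the salt per incident selects a different cohort, which avoids always burdening the same vehicles with extra logging.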

Step 5 — Fix patterns and permanent remediation

Common fixes, ordered from low-risk to higher-effort:

Configuration changes
- Raise task priority or adjust activation scheme in AUTOSAR OS if it doesn’t violate the timing analysis assumptions.
- Adjust scheduling windows or add cooperative yield points.
Runtime throttling and backpressure
- Add input filtering or rate limiting (sensor debouncing, CAN message suppression).
Code changes
- Refactor to reduce worst-case paths (split long runnables, use static buffers, avoid dynamic allocation in hot paths).
- Optimize hot functions (algorithmic improvements, reduce blocking I/O).
Toolchain adjustments
- Update compiler flags to match RocqStat assumptions, or re-run WCET analysis with the actual toolchain settings.
Testing & verification
- Run RocqStat with the updated code and VectorCAST regression tests to verify WCET margins. Use coverage-guided path synthesis to exercise worst-case paths.
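
The runtime throttling item above (input filtering or rate limiting) comes down to a small piece of logic. The sketch below shows a minimum-interval filter in Python purely to illustrate the behavior. Production ECU code would typically be written in C inside the relevant AUTOSAR module, and the class name is an assumption.

```python
class MinIntervalFilter:
    """Drop events that arrive sooner than min_interval_us after the last accepted one."""

    def __init__(self, min_interval_us):
        self.min_interval_us = min_interval_us
        self._last_accepted_us = None

    def accept(self, timestamp_us):
        """Return True if the event should be processed, False if dropped."""
        if (self._last_accepted_us is not None
                and timestamp_us - self._last_accepted_us < self.min_interval_us):
            return False
        self._last_accepted_us = timestamp_us
        return True
```

A 50Hz cap corresponds to `MinIntervalFilter(20_000)`. Because the filter bounds the activation rate, it also gives the static analysis a firm minimum inter-arrival time to assume.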

Automated remediation playbooks — examples

Integrate remediation into CI/CD and incident management so fixes are repeatable and auditable. Below are practical playbook templates you can adapt.

1) CI gate: Block merge if RocqStat WCET increases

GitLab CI job that runs RocqStat and fails the pipeline if WCET delta > threshold.

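# .gitlab-ci.yml snippet
# run_build.sh, run_analysis.sh and compare_wcet.py are assumed to be
# project-local wrapper scripts, not commands shipped with the tools.
stages:
  - build
  - wcet-check

wcet-check:
  stage: wcet-check
  script:
    - ./vectorcast/run_build.sh --build-id "$CI_COMMIT_SHA"
    - ./rocqstat/run_analysis.sh --binary build/app.elf --report wcet.json
    # compare_wcet.py is expected to exit non-zero on a regression, failing the job
    - python tools/compare_wcet.py --baseline artifacts/wcet_baseline.json --current wcet.json --threshold 5
  artifacts:
    # Keep the report even when the gate fails, so it can be inspected
    when: always
    paths:
      - wcet.json
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"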
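
The CI job calls `tools/compare_wcet.py`, which is not shown. A minimal sketch of what such a script might contain follows. It assumes both reports have been reduced to a flat JSON object mapping path IDs to WCET values and that `--threshold` is a percentage. The real RocqStat report format may differ, so treat the loader as a placeholder.

```python
import argparse
import json


def load_wcet(path):
    """Load a report assumed to map path IDs to WCET values (e.g. microseconds)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_regressions(baseline, current, threshold_pct):
    """Return {path_id: delta_pct} for paths whose WCET grew beyond the threshold.

    Paths without a usable baseline value are reported with delta None so
    that new code cannot bypass the gate silently.
    """
    regressions = {}
    for path_id, wcet in current.items():
        base = baseline.get(path_id)
        if not base:
            regressions[path_id] = None
            continue
        delta_pct = (wcet - base) / base * 100.0
        if delta_pct > threshold_pct:
            regressions[path_id] = round(delta_pct, 2)
    return regressions


def main(argv=None):
    parser = argparse.ArgumentParser(description="Fail on WCET regressions")
    parser.add_argument("--baseline", required=True)
    parser.add_argument("--current", required=True)
    parser.add_argument("--threshold", type=float, default=5.0,
                        help="allowed WCET growth in percent")
    args = parser.parse_args(argv)
    regressions = find_regressions(
        load_wcet(args.baseline), load_wcet(args.current), args.threshold)
    for path_id, delta in sorted(regressions.items()):
        label = "no usable baseline" if delta is None else f"+{delta}%"
        print(f"WCET regression on {path_id}: {label}")
    return 1 if regressions else 0
```

To use it from the pipeline, end the file with the usual `if __name__ == "__main__":` guard that passes the return value of `main()` to `sys.exit`, so a regression produces a non-zero exit code and fails the job.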

2) Incident remediation playbook (Rundeck/Ansible style)

Automate safe steps: collect artifacts, throttle traffic, create ticket, deploy canary patch.

# Ansible-like pseudo-playbook
# rtm_ticket and ota_deploy are placeholder module names; substitute the
# modules or scripts that your ticketing and OTA systems provide.
- name: ECU timing incident remediation
  hosts: fleet_segment
  tasks:
    - name: Collect trace and RocqStat report
      shell: /opt/tools/collect_trace.sh --since "{{ incident_start }}"
    - name: Apply input-rate limit
      shell: /opt/tools/set_rate_limit.sh --sensor accel --rate 50Hz
    - name: Create ticket
      rtm_ticket:
        project: SW_TIMING
        summary: "Timing violation on {{ ecu_id }}"
    - name: Deploy canary image with extra logging
      ota_deploy:
        image: "repo/images/app:canary-{{ build }}"
        rollout: "1%"

Observability: queries and alerts you should have

Make sure telemetry and alerting correlate static analysis and runtime. Example observability rules:

Prometheus alert rule (task overrun)

groups:
- name: ecu_timing.rules
  rules:
  - alert: ECU_Task_Overrun
    expr: histogram_quantile(0.99, sum(rate(task_exec_seconds_bucket[5m])) by (le, task)) > 0.020
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Task {{ $labels.task }} 99th percentile execution > 20ms"

Splunk query (find traces exceeding static WCET; assumes each trace event is enriched with a rocqstat.wcet_ms field)

event.type=trace task.name=MyTask
| where 'task.duration_ms' > 'rocqstat.wcet_ms'
| sort - task.duration_ms
| head 50

Automated correlation: map RocqStat IDs to runtime traces

Best practice: include the RocqStat path ID and VectorCAST test ID in build metadata. When runtime traces are uploaded, enrich them with that build metadata so automated tooling can map MOET values to static call-trees.

Example metadata JSON embedded in builds:

{
  "build": "1.2.3",
  "rocqstat_report": "sha:abc123",
  "path_index_map": "/artifacts/rocqstat/path_map.json"
}
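
Given metadata like this, the enrichment step can be sketched as follows. The trace record fields (`function`, `duration_us`) and the shape of the path map (function name to path ID) are assumptions for illustration. The actual contents of `path_map.json` depend on how your pipeline exports them.

```python
def enrich_traces(traces, build_metadata, path_index_map):
    """Attach build info and a static path ID to each runtime trace record.

    traces: list of dicts with at least "function" and "duration_us".
    path_index_map: dict mapping function name -> static path ID (assumed layout).
    """
    enriched = []
    for trace in traces:
        record = dict(trace)  # copy so the caller's data is not mutated
        record["build"] = build_metadata["build"]
        record["rocqstat_report"] = build_metadata["rocqstat_report"]
        record["path_id"] = path_index_map.get(trace["function"])
        enriched.append(record)
    return enriched


def moet_by_path(enriched):
    """Return the maximum observed duration per static path ID."""
    result = {}
    for record in enriched:
        path_id = record["path_id"]
        if path_id is None:
            continue  # no static counterpart; worth flagging separately
        result[path_id] = max(result.get(path_id, 0), record["duration_us"])
    return result
```

The output of `moet_by_path` can then be compared directly against the per-path WCET values from the static report for the same build.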

Audit, safety and compliance considerations

When applying remediations, you must preserve traceability required by ISO 26262 and supplier contracts:

Record all analysis artifacts (RocqStat reports, VectorCAST logs, runtime traces) in your ticketing system.
Sign and verify firmware images for OTA deployments; maintain rollback images and safety justification.
If altering scheduling or task priorities, run a full SIL/PIL test and update the safety case with the new timing evidence.

Example forensic case study (short)

Situation: A Tier-1 fleet reported occasional braking-assist latency: the task BA_Runnable missed its deadline under peak sensor load. Logged MOET spiked to 35ms, while the RocqStat WCET for the path was 28ms.

Investigation: Traces showed a burst of CAN messages from a new radar firmware version increasing ISR time. RocqStat path mapping pointed to a library function that exercised a heavy loop only when a specific flag was set.

Remediation: In the short term, the team applied a runtime filter to drop low-priority radar frames and deployed a canary to 2% of the fleet. In the mid term, they refactored the library to remove blocking calls, re-ran RocqStat, and verified that the WCET margin had increased to 32ms. All artifacts were logged and the safety case was updated.

Operationalizing for scale — 6 tactical recommendations

Automate RocqStat runs in CI and fail merges on significant WCET regressions.
Embed RocqStat path IDs in telemetry so run-time traces correlate to static analysis automatically.
Create a standard incident playbook that every on-call engineer can run from a single dashboard: collect, throttle, canary, fix.
Use canary OTA flows to limit blast radius while gathering better observability.
Maintain a timing baseline per ECU variant and make it part of the signed release artifacts.
Train teams on interpretation: developers, testers, and SREs must all be able to read RocqStat call trees, VectorCAST coverage, and runtime traces.

Future-proofing: trends to track in 2026+

Integrated toolchains — Expect VectorCAST + RocqStat to provide tighter automation (single reports that include WCET, coverage and test traces).
ML-assisted path synthesis — Tools will propose likely worst-case activation scenarios based on fleet telemetry patterns.
Edge observability libs — Standardized lightweight timing telemetry shipped in ECU SDKs to close static-dynamic gaps.
Policy-as-code for timing — Gate releases with machine-enforced policies like “no WCET regression > X% without mitigating plan.”

Actionable takeaways

Always collect RocqStat and VectorCAST artifacts at the time of incident — they are required to prove regressions and safety compliance.
Map static path IDs to runtime traces so that root-cause analysis starts from the affected call-tree instead of a manual search.
Automate WCET checks in CI and use canary OTAs for production safety while you iterate on fixes.
Preserve traceability and update your safety case for every scheduling or code fix.

Call to action

If your team is still handling timing violations manually, start by integrating RocqStat or your WCET tool into CI and adding path IDs to telemetry. Need a hands-on template? Download our VectorCAST+RocqStat CI starter pack and the incident remediation playbook to reduce MTTR and automate safe rollouts. Contact your tooling owner or ops lead and schedule a 30-minute runbook workshop this week.

Reference: Vector's acquisition of RocqStat (StatInf) announced Jan 2026; integration with VectorCAST promises unified timing and verification workflows — Automotive World, Jan 2026.
