Thursday, February 19, 2026

AI Prompt - Observability / Resilience Testing

AI Prompt

"List resilience/chaos test cases for [service]. Include dependency outages, slow responses, retries/backoff, circuit breakers, fallbacks, and logging/metrics/alerts verification."

Applying Critical Thinking

·         Dependency map: List every outbound dependency (DB, cache, other APIs, message bus); each can fail or slow.

·         Failure modes: Hard failure (connection refused, 5xx), slow (high latency), partial (degraded or stale), and flaky (intermittent); the sketch after this list pairs the dependency map with these failure modes to enumerate candidate cases.

·         Observability as requirement: If we can’t see it, we can’t fix it; tests should assert logs, metrics, and alerts, not only behavior.

·         Graceful degradation: Define acceptable fallback (cached data, default value, or user-visible error) and test it.
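
As a minimal illustration of the dependency-map and failure-mode bullets above, the sketch below crosses every outbound dependency with every failure mode to enumerate candidate resilience cases. It is plain Python; the dependency names and the resilience_matrix() helper are illustrative assumptions, not taken from any specific service.

# Minimal sketch in plain Python; dependency names are illustrative placeholders.
from itertools import product

DEPENDENCIES = ["postgres", "redis-cache", "billing-api", "message-bus"]
FAILURE_MODES = {
    "hard_failure": "connection refused / 5xx",
    "slow": "high latency, near or past the timeout",
    "partial": "degraded or stale data",
    "flaky": "intermittent errors",
}

def resilience_matrix():
    """Yield one candidate test title per dependency x failure-mode pair."""
    pairs = product(DEPENDENCIES, FAILURE_MODES.items())
    for i, (dep, (mode, desc)) in enumerate(pairs, start=1):
        yield f"TC-{i:03d}: {dep} under {mode} ({desc})"

if __name__ == "__main__":
    for case in resilience_matrix():
        print(case)

Each generated title is only a starting point; the expected behavior and the logs/metrics/alerts to assert still have to be filled in per case.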

Generate Test Cases for Each Feature

·         Dependency outage: Kill or block DB, cache, downstream API; verify service returns 503 or fallback, no cascade crash; recovery when dependency returns.

·         Slow responses: Add latency (e.g. 5s) to dependency; verify timeouts and no thread exhaustion; user sees loading or timeout message.

·         Retries/backoff: Dependency returns 5xx then success; verify retry count and backoff; no unbounded retries; idempotency where needed (see the retry sketch after this list).

·         Circuit breaker: After N failures the circuit opens; no calls reach the dependency during the open period; it half-opens and then closes after a successful probe; verify circuit-state metrics.

·         Fallbacks: When the dependency fails, serve stale cache, a default value, or a friendly error; never surface a raw exception to the user; test the fallback path itself.

·         Logging: Errors logged with context (request id, dependency name); no secrets in logs; structured fields for querying.

·         Metrics: Latency, error rate, and dependency call counts; circuit state; saturation (queue depth, connection pool).

·         Alerts: Trigger failure; verify alert fires and has correct severity; runbook or playbook referenced.

·         Chaos: Random kill/latency on one dependency at a time; verify no full outage and observability intact.
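
The retries/backoff case above is the easiest of these to pin down in code. The sketch below is a self-contained pytest example under one loud assumption: call_with_retries() is a hypothetical stand-in for the service's real retry logic, written here only so the test can run. The shape of the assertions is the point: a bounded number of attempts, exponential backoff, and a hard stop instead of unbounded retries.

# Sketch only: call_with_retries() stands in for the service's real retry logic.
import pytest


class ServerError(Exception):
    """Raised by the fake dependency to simulate a 5xx response."""


def call_with_retries(fn, max_attempts=3, base_delay=0.1, sleep=None):
    """Call fn(), retrying on ServerError with exponential backoff."""
    sleep = sleep or (lambda seconds: None)
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ServerError:
            if attempt == max_attempts:
                raise                                   # retries are bounded
            sleep(base_delay * 2 ** (attempt - 1))      # exponential backoff


def test_retries_with_backoff_then_success():
    calls = []
    delays = []

    def flaky_dependency():
        calls.append(1)
        if len(calls) < 3:
            raise ServerError("HTTP 503")               # fail twice, then succeed
        return "ok"

    result = call_with_retries(flaky_dependency, max_attempts=3,
                               base_delay=0.1, sleep=delays.append)

    assert result == "ok"
    assert len(calls) == 3                              # exactly two retries
    assert delays == [0.1, 0.2]                         # backoff doubles, no sleep after success


def test_retries_are_not_unbounded():
    def always_failing_dependency():
        raise ServerError("HTTP 500")

    with pytest.raises(ServerError):
        call_with_retries(always_failing_dependency, max_attempts=3)

In a real suite the fake dependency would be a stubbed HTTP endpoint or a fault-injection proxy rather than a local function, but the assertions (attempt count, backoff sequence, final outcome) stay the same.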

Questions on Ambiguities

·         What is the SLA for [service] (e.g. 99.9%) and which dependencies are in the critical path?

·         What timeout is configured for each dependency, and what should the user see when it hits?

·         Retry policy: max attempts, backoff (linear/exponential), and which status codes trigger retry?

·         Circuit breaker thresholds (failure count, open duration) and who can change them?

·         What fallback is acceptable per dependency (none, cache, default, or “maintenance” message)?

·         Who is on call, which alerts are routed to them, and are the runbooks up to date?

Areas Where Test Ideas Might Be Missed

·         Multiple failures: two dependencies down at once; order of failure and recovery.

·         Partial failure: dependency returns 200 but an empty or malformed body; the service should handle it rather than assume success (see the parsing sketch after this list).

·         Resource exhaustion: connection pool or thread pool full due to slow dependency; no deadlock or silent hang.

·         Clock skew: timeouts measured against a skewed clock (e.g. NTP drift) can fire too early or too late while a dependency is slow to respond.

·         Log volume: under failure, log volume doesn’t overwhelm storage or hide root cause.

·         Alert fatigue: repeated failures don’t spam alerts; deduplication or severity escalation.

·         Deployment during failure: new version deployed while dependency is down; startup and health checks still correct.
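
The partial-failure item deserves a concrete example, because a 200 with a bad body is exactly the case naive tests skip. The sketch below is a self-contained pytest example; parse_recommendations(), FakeResponse, and the default value are hypothetical stand-ins for whatever the real service does with a dependency response.

# Sketch: names are illustrative; the point is asserting graceful handling
# of a 200 response whose body is empty or malformed.
import json

DEFAULT_RECOMMENDATIONS = []


class FakeResponse:
    def __init__(self, status_code, body):
        self.status_code = status_code
        self.body = body


def parse_recommendations(response):
    """Treat a 200 with an empty or malformed body as a soft failure."""
    if response.status_code != 200:
        return DEFAULT_RECOMMENDATIONS
    try:
        payload = json.loads(response.body)
    except (json.JSONDecodeError, TypeError):
        return DEFAULT_RECOMMENDATIONS                  # malformed body: fall back, don't crash
    items = payload.get("items") if isinstance(payload, dict) else None
    return items if isinstance(items, list) else DEFAULT_RECOMMENDATIONS


def test_200_with_malformed_body_falls_back():
    assert parse_recommendations(FakeResponse(200, "<html>oops</html>")) == []


def test_200_with_empty_body_falls_back():
    assert parse_recommendations(FakeResponse(200, "")) == []


def test_200_with_valid_body_is_used():
    body = json.dumps({"items": [{"id": 1}]})
    assert parse_recommendations(FakeResponse(200, body)) == [{"id": 1}]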

Output Template

Context: [system/feature under test, dependencies, environment]

Assumptions: [e.g., auth method, data availability, feature flags]

Test Types: [observability, resilience, chaos]

Test Cases:

ID: [TC-001]

Type: [resilience/chaos]

Title: [short name]

Preconditions/Setup: [data, env, mocks, flags]

Steps: [ordered steps or request details]

Variations: [inputs/edges/negative cases]

Expected Results: [responses/UI states/metrics]

Cleanup: [teardown/reset]

Coverage notes: [gaps, out-of-scope items, risk areas]

Non-functionals: [perf targets, security considerations, accessibility notes]

Data/fixtures: [test users, payloads, seeds]

Environments: [dev/stage/prod-parity requirements]

Ambiguity Questions:

- [Question 1 about unclear behavior]

- [Question 2 about edge case]

Potential Missed Ideas:

- [Suspicious area where tests might still be thin]
