AI Prompt
"List resilience/chaos test cases for
[service]. Include dependency outages, slow responses, retries/backoff,
circuit breakers, fallbacks, and
logging/metrics/alerts verification."
Applying Critical Thinking
· Dependency map: List every outbound dependency (DB, cache, other APIs, message bus); each one can fail or slow down.
· Failure modes: Hard failure (connection refused, 5xx), slow (high latency), partial (degraded or stale), and flaky (intermittent).
· Observability as requirement: If we can’t see it, we can’t fix it; tests should assert logs, metrics, and alerts, not only behavior (see the sketch after this list).
· Graceful degradation: Define the acceptable fallback (cached data, default value, or user-visible error) and test it.
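The observability bullet above is easiest to enforce when every resilience test asserts signals as well as status codes. A minimal pytest sketch, assuming a hypothetical get_order handler, an in-memory METRICS dict standing in for the real metrics client, and pytest's standard caplog fixture; all names are illustrative, not the real service:

# Assert behaviour, the failure metric, and the structured error log together.
import logging

logger = logging.getLogger("orders")
METRICS = {"dependency_failures_total": 0}   # stand-in for the real metrics client


def get_order(request_id, db):
    try:
        return {"status": 200, "body": db.fetch(request_id)}
    except ConnectionError:
        METRICS["dependency_failures_total"] += 1
        logger.error(
            "dependency call failed",
            extra={"request_id": request_id, "dependency": "postgres"},
        )
        return {"status": 503, "body": {"error": "temporarily unavailable"}}


class DownDatabase:
    def fetch(self, _request_id):
        raise ConnectionError("connection refused")


def test_db_outage_is_observable(caplog):
    before = METRICS["dependency_failures_total"]
    with caplog.at_level(logging.ERROR, logger="orders"):
        response = get_order("req-123", DownDatabase())

    assert response["status"] == 503                            # behaviour
    assert METRICS["dependency_failures_total"] == before + 1   # metric
    record = caplog.records[-1]                                 # structured log
    assert record.request_id == "req-123"
    assert record.dependency == "postgres"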
Generate Test Cases for Each Feature
· Dependency outage: Kill or block the DB, cache, or downstream API; verify the service returns 503 or a fallback with no cascading crash, and that it recovers when the dependency returns.
· Slow responses: Add latency (e.g. 5s) to a dependency; verify timeouts and no thread exhaustion; the user sees a loading or timeout message (sketch below).
· Retries/backoff: Dependency returns 5xx then success; verify retry count and backoff; no unbounded retries; idempotency where needed (sketch below).
· Circuit breaker: After N failures the circuit opens; no calls to the dependency for a period; half-open and then close after success; verify metrics (sketch below).
· Fallbacks: When a dependency fails: stale cache, default value, or friendly error; no raw exception reaches the user; the fallback path is tested (sketch below).
· Logging: Errors logged with context (request id, dependency name); no secrets in logs; structured fields for querying (sketch below).
· Metrics: Latency, error rate, and dependency call counts; circuit state; saturation (queue depth, connection pool).
· Alerts: Trigger a failure; verify the alert fires with the correct severity and references a runbook or playbook.
· Chaos: Randomly kill or add latency to one dependency at a time; verify no full outage and that observability stays intact.
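For the slow-response bullet: inject latency well past the configured timeout and check that the caller gets a quick, friendly timeout response rather than a hung request. A sketch under assumed values (0.2s timeout, a 504 mapping, and an invented slow_inventory_api stub); the real timeout and message come from the service's configuration:

# Timeout must bound the user-visible latency, not the injected latency.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CallTimeout

DEPENDENCY_TIMEOUT_S = 0.2             # stand-in for the real configured timeout


def slow_inventory_api():
    time.sleep(2)                      # injected latency (think "5s" in a real run)
    return {"stock": 7}


def get_stock(call=slow_inventory_api, timeout=DEPENDENCY_TIMEOUT_S):
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call)
    try:
        return {"status": 200, "body": future.result(timeout=timeout)}
    except CallTimeout:
        return {"status": 504,
                "body": {"message": "Inventory is taking too long, please try again."}}
    finally:
        pool.shutdown(wait=False)      # do not block the caller on the abandoned call


def test_slow_dependency_times_out_quickly_with_friendly_message():
    start = time.monotonic()
    response = get_stock()
    elapsed = time.monotonic() - start

    assert response["status"] == 504
    assert "try again" in response["body"]["message"]
    assert elapsed < 1.0               # bounded by the timeout, not by the injected latency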
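For the retries/backoff bullet: a hand-rolled retry loop (not any particular retry library's API) against a stub that fails twice with 500 and then succeeds, with the backoff delays captured instead of actually slept:

# Verify retry count, exponential backoff, and that retries are bounded.
import time


class FlakyPaymentsAPI:
    def __init__(self, failures_before_success=2):
        self.calls = 0
        self.failures_before_success = failures_before_success

    def charge(self, order_id):
        self.calls += 1
        if self.calls <= self.failures_before_success:
            return {"status": 500}
        return {"status": 200}


def charge_with_retry(api, order_id, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    for attempt in range(max_attempts):
        response = api.charge(order_id)
        if response["status"] < 500:            # success or client error: do not retry
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
    return response                             # bounded: give up after max_attempts


def test_retries_then_succeeds_with_exponential_backoff():
    api = FlakyPaymentsAPI(failures_before_success=2)
    delays = []

    response = charge_with_retry(api, "order-1", sleep=delays.append)

    assert response["status"] == 200
    assert api.calls == 3                       # two failures plus one success, no extras
    assert delays == [0.1, 0.2]                 # exponential backoff between attempts


def test_retries_are_bounded():
    api = FlakyPaymentsAPI(failures_before_success=99)

    response = charge_with_retry(api, "order-1", sleep=lambda _: None)

    assert response["status"] == 500            # gives up cleanly
    assert api.calls == 3                       # never unbounded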
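For the circuit-breaker bullet: a toy breaker, not pybreaker, resilience4j, or any other real library, used only to show what the open, half-open, and closed assertions look like with an injected clock. Thresholds are illustrative:

# State transitions: closed -> open after N failures -> half-open -> closed.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, open_seconds=30, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_seconds:
                raise RuntimeError("circuit open")   # short-circuit: no downstream call
            self.state = "half-open"                 # cool-down elapsed, allow one probe
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "closed"
        return result


def test_circuit_opens_half_opens_and_closes_again():
    now = [0.0]                                      # controllable fake clock
    breaker = CircuitBreaker(failure_threshold=3, open_seconds=30, clock=lambda: now[0])
    downstream_calls = []

    def failing_dependency():
        downstream_calls.append("call")
        raise ConnectionError("refused")

    # Three consecutive failures trip the breaker.
    for _ in range(3):
        try:
            breaker.call(failing_dependency)
        except ConnectionError:
            pass
    assert breaker.state == "open"

    # While open, the dependency is not called at all.
    try:
        breaker.call(failing_dependency)
    except RuntimeError:
        pass
    assert len(downstream_calls) == 3

    # After the cool-down, one successful probe closes the circuit.
    now[0] += 31
    assert breaker.call(lambda: "ok") == "ok"
    assert breaker.state == "closed"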
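For the fallbacks bullet: when the live dependency fails, serve the last cached value marked as stale, or a safe default, instead of a raw exception. The cache layout and response shape are assumptions for illustration:

# Fallback path: stale cache first, safe default second, never a traceback.
class DownRecommendationsAPI:
    def fetch(self, user_id):
        raise ConnectionError("connection refused")


def get_recommendations(user_id, api, cache):
    try:
        fresh = api.fetch(user_id)
        cache[user_id] = fresh
        return {"status": 200, "stale": False, "items": fresh}
    except ConnectionError:
        if user_id in cache:
            return {"status": 200, "stale": True, "items": cache[user_id]}
        return {"status": 200, "stale": True, "items": []}   # safe default, not a 500


def test_fallback_serves_stale_cache_not_an_exception():
    cache = {"u1": ["book-42", "book-7"]}

    response = get_recommendations("u1", DownRecommendationsAPI(), cache)

    assert response["status"] == 200
    assert response["stale"] is True
    assert response["items"] == ["book-42", "book-7"]


def test_fallback_without_cache_returns_safe_default():
    response = get_recommendations("u2", DownRecommendationsAPI(), cache={})

    assert response["status"] == 200
    assert response["items"] == []        # friendly default, no raw exception to the user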
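For the logging bullet: assert both the structured context fields and the absence of secrets in the captured output. Field names and the checkout handler are assumptions; caplog is pytest's standard capture fixture:

# Context goes into the log; credentials never do.
import logging

logger = logging.getLogger("checkout")


def charge_card(request_id, card_token, gateway):
    try:
        return gateway(card_token)
    except ConnectionError:
        # Log context for debugging, but never the credential itself.
        logger.error(
            "payment gateway call failed",
            extra={"request_id": request_id, "dependency": "payments-api"},
        )
        return {"status": 503}


def test_error_logs_have_context_but_no_secrets(caplog):
    secret = "tok_live_SECRET123"

    def down_gateway(_token):
        raise ConnectionError("refused")

    with caplog.at_level(logging.ERROR, logger="checkout"):
        charge_card("req-9", secret, down_gateway)

    record = caplog.records[-1]
    assert record.request_id == "req-9"              # structured, queryable context
    assert record.dependency == "payments-api"
    assert secret not in caplog.text                 # secrets never reach the logs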
Questions on Ambiguities
· What is the SLA for [service] (e.g. 99.9%), and which dependencies are in the critical path?
· What timeout is configured for each dependency, and what should the user see when it hits?
· Retry policy: max attempts, backoff (linear/exponential), and which status codes trigger a retry?
· Circuit breaker thresholds (failure count, open duration), and who can change them?
· What fallback is acceptable per dependency (none, cache, default, or “maintenance” message)?
· Who is on-call, what alerts are routed to them, and are runbooks up to date?
Areas Where Test Ideas Might Be Missed
· Multiple failures: two dependencies down at once; order of failure and recovery matters.
· Partial failure: dependency returns 200 but with an empty or malformed body; the service should handle it, not assume success (sketch after this list).
· Resource exhaustion: connection pool or thread pool full because of a slow dependency; no deadlock or silent hang (sketch after this list).
· Clock skew: timeouts measured against the wrong clock (e.g. NTP skew) while a dependency is slow to respond.
· Log volume: under failure, log volume doesn’t overwhelm storage or bury the root cause.
· Alert fatigue: repeated failures don’t spam alerts; deduplication or severity escalation.
· Deployment during failure: a new version is deployed while a dependency is down; startup and health checks remain correct.
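For the partial-failure item: a parametrized sketch where the upstream answers 200 with an empty or malformed body and the service must reject it rather than assume success. parse_price and the 502 mapping are hypothetical:

# A 200 status is not the same as a usable payload.
import json
import pytest


def parse_price(raw_body):
    try:
        payload = json.loads(raw_body)
        price = payload["price"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"status": 502, "error": "invalid upstream response"}
    if not isinstance(price, (int, float)) or price < 0:
        return {"status": 502, "error": "invalid upstream response"}
    return {"status": 200, "price": price}


@pytest.mark.parametrize(
    "raw_body",
    ["", "{}", "not json", '{"price": null}', '{"price": "NaN"}', '{"price": -1}'],
)
def test_200_with_bad_body_is_not_treated_as_success(raw_body):
    response = parse_price(raw_body)

    assert response["status"] == 502
    assert "error" in response


def test_valid_body_still_works():
    assert parse_price('{"price": 9.99}') == {"status": 200, "price": 9.99}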
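For the resource-exhaustion item: hammer a handler whose stand-in connection pool is bounded while the dependency is slow, and assert that load is shed with an explicit error rather than the pool deadlocking or hanging silently. Pool size, timings, and status codes are illustrative:

# Shedding load beats hanging: every caller gets an answer, and promptly.
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL = threading.Semaphore(5)          # stand-in for a connection pool of size 5


def slow_dependency():
    time.sleep(0.3)                    # simulated slow downstream call


def handle_request():
    # Bounded wait for a pooled connection: shed load rather than wait forever.
    if not POOL.acquire(timeout=0.05):
        return 503
    try:
        slow_dependency()
        return 200
    finally:
        POOL.release()


def test_slow_dependency_does_not_hang_or_deadlock_the_pool():
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=50) as workers:
        results = list(workers.map(lambda _: handle_request(), range(50)))
    elapsed = time.monotonic() - start

    assert len(results) == 50                  # every caller got an answer
    assert set(results) <= {200, 503}          # no exceptions, no silent hang
    assert elapsed < 3.0                       # finished promptly: no deadlock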
Output Template
Context: [system/feature under test, dependencies, environment]
Assumptions: [e.g., auth method, data availability, feature flags]
Test Types: [observability, resilience, chaos]
Test Cases:
ID: [TC-001]
Type: [resilience/chaos]
Title: [short name]
Preconditions/Setup: [data, env, mocks, flags]
Steps: [ordered steps or request details]
Variations: [inputs/edges/negative cases]
Expected Results: [responses/UI states/metrics]
Cleanup: [teardown/reset]
Coverage notes: [gaps, out-of-scope items, risk areas]
Non-functionals: [perf targets, security considerations, accessibility notes]
Data/fixtures: [test users, payloads, seeds]
Environments: [dev/stage/prod-parity requirements]
Ambiguity Questions:
- [Question 1 about unclear behavior]
- [Question 2 about edge case]
Potential Missed Ideas:
- [Suspicious area where tests might still be thin]