
Thursday, February 19, 2026

AI Prompt - ML / AI-Specific Testing

AI Prompt

"Create test cases for [ML model/pipeline]. Include data quality checks, drift detection, edge inputs, fairness/bias slices, latency under load, and rollback if metrics regress."

Applying Critical Thinking

·         Data is part of the contract: Model behavior depends on input distribution; tests should cover data quality, schema, and known edge distributions.

·         Slices matter: Aggregate metrics can hide regressions for subgroups; define slices (demographic, region, product) and test them.

·         Drift and lifecycle: Models face concept and data drift over time; tests should include drift detection and clear criteria for retraining or rollback.

·         Determinism and reproducibility: Use fixed seeds and fixtures where possible; document non-deterministic areas and their impact on assertions.
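As a minimal sketch of the seed-pinning idea (the `predict` function here is a hypothetical stand-in for any model with a seeded stochastic component), a reproducibility test can fix the seed and assert identical outputs across runs:

```python
import random

def predict(features, seed=42):
    # Hypothetical model stand-in: any component with seeded randomness
    rng = random.Random(seed)
    return [f + rng.gauss(0, 0.01) for f in features]

def test_same_seed_same_output():
    features = [0.1, 0.5, 0.9]
    first = predict(features, seed=42)
    second = predict(features, seed=42)
    # Fixed seed -> bitwise-identical output across runs
    assert first == second

test_same_seed_same_output()
```

Where true non-determinism remains (e.g. GPU kernels), the assertion becomes a tolerance check instead of exact equality, and that tolerance should be documented.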

Generate Test Cases for Each Feature

·         Data quality: Schema validation (types, ranges, nulls); missing or corrupt features; duplicates; label quality (if supervised); train/serve skew (same preprocessing).

·         Drift detection: Input distribution drift (stats, histograms); concept drift (label distribution or performance over time); thresholds and alerts; dashboard or pipeline step.

·         Edge inputs: Empty input, all nulls, out-of-range values, very long text, special characters; model doesn’t crash and returns safe default or error.

·         Fairness/bias slices: Performance (accuracy, F1, etc.) per slice (e.g. demographic, region); disparity metrics; minimum performance bar per slice; bias mitigation checks.

·         Latency under load: P50/P95/P99 latency at target RPS; batch vs online; GPU/CPU utilization; timeout and degradation behavior under overload.

·         Rollback / regression: When metrics regress (e.g. accuracy drop, fairness violation): roll back to the previous model, flip a feature flag, or switch to a fallback; test the rollback pipeline and alerts themselves.

·         Reproducibility: Same input + version → same output (where applicable); training run reproducible from config and data version.

·         Adversarial / robustness: Known adversarial or worst-case inputs; model doesn’t fail badly or expose unsafe behavior.
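One concrete way to back the drift-detection cases above is a Population Stability Index over binned histograms; it is a common heuristic rather than a formal test, and the bin count and the 0.2 alert threshold below are illustrative assumptions:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.

    Rule of thumb (illustrative): < 0.1 stable, 0.1-0.2 moderate drift,
    > 0.2 significant drift worth an alert.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def histogram(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # Smooth empty bins to avoid log(0)
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]          # training distribution
shifted = [0.5 + i / 200 for i in range(100)]     # drifted serving data
assert psi(baseline, baseline) < 0.01             # no drift against itself
assert psi(baseline, shifted) > 0.2               # drift crosses alert threshold
```

In a pipeline this would run as a scheduled check against a stored training-time histogram, emitting the PSI as a metric so thresholds and alerts can be tuned without redeploying.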

Questions on Ambiguities

·         What metrics define “good” (e.g. accuracy, F1, fairness parity) and what regression threshold triggers rollback?

·         Which slices are required for fairness reporting (e.g. age, region, product type) and what data is available?

·         How is drift defined (statistical test, threshold on distribution) and who acts on drift alerts?

·         What is acceptable latency (online vs batch) and what is the fallback when the model is slow or down?

·         Are explainability or audit logs required (e.g. feature contributions, request/response logging)?

·         Who approves model promotions and rollbacks (ML team, product, compliance)?

Areas Where Test Ideas Might Be Missed

·         Label and annotation quality: wrong or inconsistent labels in training/eval; impact on reported metrics and fairness.

·         Preprocessing parity: train vs serve preprocessing (tokenization, normalization, feature store); subtle skew.

·         Cold start and rare categories: new users or rare items; model behavior and fallback.

·         Feedback loops: model predictions influence future data (e.g. recommendations); long-term bias or collapse.

·         Versioning and A/B: multiple model versions in production; routing and metric attribution per version.

·         Security of model artifact: tampering, extraction, or inversion; not always in scope but worth noting.

·         Cost and resource: inference cost per request; GPU memory; batch size vs latency tradeoff under load.

Output Template

Context: [system/feature under test, dependencies, environment]

Assumptions: [e.g., auth method, data availability, feature flags]

Test Types: [ML, data quality, fairness, performance]

Test Cases:

ID: [TC-001]

Type: [ML/AI]

Title: [short name]

Preconditions/Setup: [data, env, mocks, flags]

Steps: [ordered steps or request details]

Variations: [inputs/edges/negative cases]

Expected Results: [responses/UI states/metrics]

Cleanup: [teardown/reset]

Coverage notes: [gaps, out-of-scope items, risk areas]

Non-functionals: [perf targets, security considerations, accessibility notes]

Data/fixtures: [test users, payloads, seeds]

Environments: [dev/stage/prod-parity requirements]

Ambiguity Questions:

- [Question 1 about unclear behavior]

- [Question 2 about edge case]

Potential Missed Ideas:

- [Suspicious area where tests might still be thin]

AI Prompt - Observability / Resilience Testing

AI Prompt

"List resilience/chaos test cases for [service]. Include dependency outages, slow responses, retries/backoff, circuit breakers, fallbacks, and logging/metrics/alerts verification."

Applying Critical Thinking

·         Dependency map: List every outbound dependency (DB, cache, other APIs, message bus); each can fail or slow.

·         Failure modes: Hard failure (connection refused, 5xx), slow (high latency), partial (degraded or stale), and flaky (intermittent).

·         Observability as requirement: If we can’t see it, we can’t fix it; tests should assert logs, metrics, and alerts, not only behavior.

·         Graceful degradation: Define acceptable fallback (cached data, default value, or user-visible error) and test it.
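The degradation ladder described above (live call, then stale cache, then a safe default) can be sketched in a few lines; the function names and the returned source label are illustrative, not a real library API:

```python
def fetch_with_fallback(fetch, cache, key, default="unavailable"):
    """Graceful degradation sketch: try the live call first, fall back to
    stale cache, and finally to a safe default. `fetch` may raise."""
    try:
        value = fetch()
        cache[key] = value          # refresh cache on success
        return value, "live"
    except Exception:
        if key in cache:
            return cache[key], "stale-cache"
        return default, "default"

cache = {}
assert fetch_with_fallback(lambda: "fresh", cache, "k") == ("fresh", "live")

def dependency_down():
    raise ConnectionError("dependency unreachable")

assert fetch_with_fallback(dependency_down, cache, "k") == ("fresh", "stale-cache")
assert fetch_with_fallback(dependency_down, cache, "missing") == ("unavailable", "default")
```

Returning the source label alongside the value makes the degradation path observable: tests can assert which tier served the response, and the same label can be emitted as a metric.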

Generate Test Cases for Each Feature

·         Dependency outage: Kill or block DB, cache, downstream API; verify service returns 503 or fallback, no cascade crash; recovery when dependency returns.

·         Slow responses: Add latency (e.g. 5s) to dependency; verify timeouts and no thread exhaustion; user sees loading or timeout message.

·         Retries/backoff: Dependency returns 5xx then success; verify retry count and backoff; no unbounded retries; idempotency where needed.

·         Circuit breaker: After N failures, circuit opens; no calls to dependency for period; half-open and close after success; verify metrics.

·         Fallbacks: When dependency fails: stale cache, default value, or friendly error; no raw exception to user; fallback path tested.

·         Logging: Errors logged with context (request id, dependency name); no secrets in logs; structured fields for querying.

·         Metrics: Latency, error rate, and dependency call counts; circuit state; saturation (queue depth, connection pool).

·         Alerts: Trigger failure; verify alert fires and has correct severity; runbook or playbook referenced.

·         Chaos: Random kill/latency on one dependency at a time; verify no full outage and observability intact.
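To make the retry and circuit-breaker cases above testable without a real dependency, both can be modeled as small injectable components; this is a deliberately minimal sketch (thresholds, cooldown, and the state names are illustrative), with the clock and sleep injected so a test can drive state transitions deterministically:

```python
import time

def call_with_retry(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry with exponential backoff; re-raises after the last attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)   # 0.1s, 0.2s, 0.4s, ...

class CircuitBreaker:
    """Opens after `threshold` consecutive failures, fails fast while open,
    half-opens after `cooldown` seconds, and closes again on success."""
    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown:
            return "half-open"
        return "open"

    def call(self, fn):
        if self.state == "open":
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None    # success closes a half-open circuit
        return result
```

A chaos-style test then drives `fn` to fail N times, asserts the state moves closed → open, advances the fake clock past the cooldown, and asserts half-open → closed on the next success, alongside checks that the state is exported as a metric.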

Questions on Ambiguities

·         What is the SLA for [service] (e.g. 99.9%) and which dependencies are in the critical path?

·         What timeout is configured for each dependency, and what should the user see when it hits?

·         Retry policy: max attempts, backoff (linear/exponential), and which status codes trigger retry?

·         Circuit breaker thresholds (failure count, open duration) and who can change them?

·         What fallback is acceptable per dependency (none, cache, default, or “maintenance” message)?

·         Who is on-call and what alerts are routed to them; are runbooks up to date?

Areas Where Test Ideas Might Be Missed

·         Multiple failures: two dependencies down at once; order of failure and recovery.

·         Partial failure: dependency returns 200 but empty or malformed body; service should handle, not assume success.

·         Resource exhaustion: connection pool or thread pool full due to slow dependency; no deadlock or silent hang.

·         Clock skew: timeouts computed against a skewed clock (e.g. NTP drift) can fire too early or too late; slow dependencies make this visible.

·         Log volume: under failure, log volume doesn’t overwhelm storage or hide root cause.

·         Alert fatigue: repeated failures don’t spam alerts; deduplication or severity escalation.

·         Deployment during failure: new version deployed while dependency is down; startup and health checks still correct.

Output Template

Context: [system/feature under test, dependencies, environment]

Assumptions: [e.g., auth method, data availability, feature flags]

Test Types: [observability, resilience, chaos]

Test Cases:

ID: [TC-001]

Type: [resilience/chaos]

Title: [short name]

Preconditions/Setup: [data, env, mocks, flags]

Steps: [ordered steps or request details]

Variations: [inputs/edges/negative cases]

Expected Results: [responses/UI states/metrics]

Cleanup: [teardown/reset]

Coverage notes: [gaps, out-of-scope items, risk areas]

Non-functionals: [perf targets, security considerations, accessibility notes]

Data/fixtures: [test users, payloads, seeds]

Environments: [dev/stage/prod-parity requirements]

Ambiguity Questions:

- [Question 1 about unclear behavior]

- [Question 2 about edge case]

Potential Missed Ideas:

- [Suspicious area where tests might still be thin]

AI Prompt - Data Integrity / Migration Testing

AI Prompt

"Generate test cases for migrating data from [source] to [target]. Cover mapping rules, null/invalid handling, rounding, duplicates, referential integrity, rollback, and reconciliation checks."

Applying Critical Thinking

·         Define the golden source: Which system is authoritative for each entity? What is “correct” when source and target disagree?

·         Map every field: Document type, range, nullability, and transformation; explicit handling for unknown or invalid values.

·         Relationships first: Parent-child and foreign keys; order of migration and dependency graph; what happens to orphans.

·         Rollback is part of the design: Test rollback as a first-class scenario; ensure idempotency or clear “migrated” markers so re-runs are safe.
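The "migrated" marker idea above can be sketched as an idempotent copy loop (the row shapes and the in-memory target here are illustrative stand-ins for real tables): a second run after a partial failure must apply nothing that the first run already applied.

```python
def migrate(source_rows, target, migrated_ids):
    """Idempotent sketch: copy rows keyed by `id`, skipping rows already
    marked migrated, so re-runs after a partial failure are safe."""
    applied = 0
    for row in source_rows:
        if row["id"] in migrated_ids:
            continue                    # already migrated: re-run is a no-op
        target[row["id"]] = dict(row)
        migrated_ids.add(row["id"])
        applied += 1
    return applied

source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
target, done = {}, set()
assert migrate(source, target, done) == 2   # first run applies both rows
assert migrate(source, target, done) == 0   # second run is a no-op
```

In a real migration the marker set would itself be durable (a `migrated_at` column or checkpoint table), and the rollback test asserts that clearing the markers and reversing the writes restores a consistent state.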

Generate Test Cases for Each Feature

·         Mapping rules: Each field: source → target mapping; default values; derived fields; conditional logic (e.g. if status=X then target flag Y).

·         Null/invalid: Source null → target null or default; invalid enum/date/number → reject, default, or error row; empty string vs null.

·         Rounding/type: Decimals: precision and rounding mode; dates: timezone and truncation; strings: length limits and encoding; integers: overflow.

·         Duplicates: Same business key in source multiple times: first wins, last wins, merge, or reject; how duplicates are logged.

·         Referential integrity: Parent migrated before child; FKs valid; orphans: fail, skip, or default parent; circular refs handled.

·         Rollback: After partial/full run: rollback script leaves target in consistent state; re-run after rollback behaves as defined.

·         Reconciliation: Record counts (per table/type); checksums or hashes for critical fields; spot checks: sample IDs in source vs target; delta report.

·         Idempotency: Running migration twice: no duplicate rows, no double application of side effects; safe resume from checkpoint.
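The reconciliation case above (counts plus checksums plus a delta report) can be sketched as a comparison keyed on the business key; the field canonicalization here is a simplifying assumption, and real runs would also normalize types before hashing:

```python
import hashlib

def row_checksum(row, fields):
    """Checksum over the critical fields of one row, in a fixed field order."""
    canonical = "|".join(str(row.get(f, "")) for f in sorted(fields))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def reconcile(source_rows, target_rows, key, fields):
    """Compare source vs target by business key: counts, missing/extra keys,
    and per-row checksums on the critical fields. Returns a delta report."""
    src = {r[key]: row_checksum(r, fields) for r in source_rows}
    tgt = {r[key]: row_checksum(r, fields) for r in target_rows}
    return {
        "source_count": len(src),
        "target_count": len(tgt),
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "extra_in_target": sorted(tgt.keys() - src.keys()),
        "checksum_mismatch": sorted(
            k for k in src.keys() & tgt.keys() if src[k] != tgt[k]
        ),
    }

source = [{"id": 1, "amount": "10.00"}, {"id": 2, "amount": "5.50"}]
target = [{"id": 1, "amount": "10.00"}, {"id": 2, "amount": "5.5"}]
report = reconcile(source, target, key="id", fields=["amount"])
assert report["checksum_mismatch"] == [2]   # "5.50" vs "5.5": formatting drift
```

Note how the mismatch on row 2 is exactly the kind of rounding/formatting skew the mapping-rule and rounding cases are meant to catch before sign-off.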

Questions on Ambiguities

·         What is the authoritative definition of each entity (source vs target) after go-live?

·         How should invalid or legacy-bad data be handled: reject row, write to quarantine table, or apply default and log?

·         What precision and rounding apply to money and percentages (e.g. half-up, bank rounding)?

·         Duplicate key strategy: which duplicate wins, and are others logged for manual review?

·         What is the order of migration (tables/entities) and the rollback order; are there cross-system dependencies (e.g. message queue)?

·         Who signs off on reconciliation (counts vs full checksum vs sampling) and what is the go/no-go criterion?

Areas Where Test Ideas Might Be Missed

·         Concurrent writes: source or target updated during migration; locking or snapshot strategy.

·         Large objects / blobs: size limits, streaming, and checksum for binaries; timeouts.

·         Soft deletes and history: migrate only active rows vs full history; deleted parent and child handling.

·         Encoded/encrypted fields: decrypt in source, transform, re-encrypt in target; key rotation during migration.

·         Audit and metadata: created_at, updated_at, migrated_at; preserving vs overwriting.

·         Feature flags or tenant scope: migrate only certain tenants or segments; rest stay on source until phased.

·         Downstream consumers: after cutover, do downstream systems see consistent data (e.g. cache invalidation, event replay)?

Output Template

Context: [system/feature under test, dependencies, environment]

Assumptions: [e.g., auth method, data availability, feature flags]

Test Types: [data integrity, migration]

Test Cases:

ID: [TC-001]

Type: [data integrity/migration]

Title: [short name]

Preconditions/Setup: [data, env, mocks, flags]

Steps: [ordered steps or request details]

Variations: [inputs/edges/negative cases]

Expected Results: [responses/UI states/metrics]

Cleanup: [teardown/reset]

Coverage notes: [gaps, out-of-scope items, risk areas]

Non-functionals: [perf targets, security considerations, accessibility notes]

Data/fixtures: [test users, payloads, seeds]

Environments: [dev/stage/prod-parity requirements]

Ambiguity Questions:

- [Question 1 about unclear behavior]

- [Question 2 about edge case]

Potential Missed Ideas:

- [Suspicious area where tests might still be thin]

AI Prompt - Usability testing

AI Prompt

"Provide accessibility/usability test cases for [page/flow]. Include keyboard-only navigation, screen reader announcements, focus management, color contrast, error messaging clarity, and timeouts."

Applying Critical Thinking

·         User diversity: Test as a keyboard-only user and as a screen reader user; avoid “we use a mouse so it’s fine”.

·         Focus and order: Focus order should match visual order and task flow; focus must not be lost in modals, dropdowns, or dynamic content.

·         Meaning not just presence: ARIA and semantics must convey meaning (e.g. “button”, “alert”, “current step”), not just labels.

·         Errors and timeouts: Messages must be clear, associated with fields, and not rely on color alone; timeouts should warn and allow extension where possible.

Generate Test Cases for Each Feature

·         Keyboard-only: Tab through all interactive elements; no trap; Enter/Space activate buttons/links; Escape closes modals; arrow keys in menus/listboxes; skip link works.

·         Screen reader: Landmarks and headings; button/link names and roles; form labels and errors; live regions for dynamic updates; table headers and scope; no redundant announcements.

·         Focus management: Focus moves to modal when opened and returns on close; focus visible (outline/ring); focus not lost after AJAX or route change; first focusable in view on load.

·         Color contrast: Text and UI components meet contrast ratio (e.g. 4.5:1 normal, 3:1 large); focus indicators visible; don’t rely on color alone for required/error/state.

·         Error messaging: Errors are announced (live region or aria-describedby); message text clear and actionable; associated with field; success/error distinguishable without color only.

·         Timeouts: Session timeout: warning before expiry, option to extend; long operations: progress or status announced; no silent failure.

·         Usability: Labels and instructions clear; destructive actions confirmed; consistent patterns (e.g. submit always same place).
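The contrast-ratio checks above can be automated directly from the WCAG 2.1 formulas (the two color pairs at the end are illustrative test data):

```python
def relative_luminance(rgb):
    """WCAG 2.1 relative luminance for an (R, G, B) tuple of 0-255 ints."""
    def channel(c):
        c = c / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """(L1 + 0.05) / (L2 + 0.05) with the lighter color on top: 1:1 to 21:1."""
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

# Black on white is the maximum 21:1; AA requires >= 4.5 for normal text.
assert round(contrast_ratio((0, 0, 0), (255, 255, 255)), 1) == 21.0
assert contrast_ratio((118, 118, 118), (255, 255, 255)) >= 4.5
```

A test suite can sample the computed styles of text and its effective background, then assert the ratio against the 4.5:1 (normal text) or 3:1 (large text and UI components) AA thresholds per element.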

Questions on Ambiguities

·         What level are we targeting (WCAG 2.1 A, AA, AAA) and for which pages/flows?

·         Which screen readers and browsers are in scope (e.g. NVDA + Firefox, VoiceOver + Safari, JAWS)?

·         How should session timeout behave: warning at N minutes, extend button, and what happens to in-progress form data?

·         Are error messages written in plain language and reviewed by support/copy?

·         Do we support reduced motion and prefers-color-scheme (dark/light), and are they part of this test set?

·         Who is responsible for remediation (dev vs design) when contrast or focus order fails?

Areas Where Test Ideas Might Be Missed

·         Dynamic content: injected lists, infinite scroll, SPA route changes: focus and announcements after load.

·         Third-party widgets: chat, video, payment iframes: keyboard access and screen reader support inside iframe.

·         CAPTCHA / auth challenges: alternative (e.g. audio CAPTCHA) or exemption path for assistive tech users.

·         Complex widgets: custom combo boxes, date pickers, tree views: full keyboard and ARIA pattern (e.g. roving tabindex).

·         Mobile screen readers: VoiceOver (iOS), TalkBack (Android): gestures and focus different from desktop.

·         RTL and localization: focus order in RTL; translated labels and errors; font size scaling.

·         Timeout during data entry: user types in form; session expires mid-field; ensure data loss is communicated and recovery path exists.

Output Template

Context: [system/feature under test, dependencies, environment]

Assumptions: [e.g., auth method, data availability, feature flags]

Test Types: [usability, accessibility, UI]

Test Cases:

ID: [TC-001]

Type: [usability/accessibility]

Title: [short name]

Preconditions/Setup: [data, env, mocks, flags]

Steps: [ordered steps or request details]

Variations: [inputs/edges/negative cases]

Expected Results: [responses/UI states/metrics]

Cleanup: [teardown/reset]

Coverage notes: [gaps, out-of-scope items, risk areas]

Non-functionals: [perf targets, security considerations, accessibility notes]

Data/fixtures: [test users, payloads, seeds]

Environments: [dev/stage/prod-parity requirements]

Ambiguity Questions:

- [Question 1 about unclear behavior]

- [Question 2 about edge case]

Potential Missed Ideas:

- [Suspicious area where tests might still be thin]

AI in Software Testing: How Artificial Intelligence Is Transforming QA

For years, software testing has lived under pressure: more features, faster releases, fewer bugs, smaller teams. Traditional QA has done her...