Thursday, February 19, 2026

AI Prompt - ML / AI-Specific Testing

AI Prompt

"Create test cases for [ML model/pipeline]. Include data quality checks, drift detection, edge inputs, fairness/bias slices, latency under load, and rollback if metrics regress."

Applying Critical Thinking

·         Data is part of the contract: Model behavior depends on input distribution; tests should cover data quality, schema, and known edge distributions.

·         Slices matter: Aggregate metrics can hide regressions for subgroups; define slices (demographic, region, product) and test them.

·         Drift and lifecycle: Concept and data drift accumulate over time; tests should include drift detection and clear criteria for retraining or rollback.

·         Determinism and reproducibility: Where possible, use fixed seeds and fixtures; document non-deterministic areas and their impact on assertions (see the reproducibility sketch after this list).
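As a concrete illustration of fixed seeds and fixtures, here is a minimal sketch of a reproducibility test. The predict() wrapper, numpy, and the pytest-style test function are assumptions for illustration, not part of the original prompt; substitute the real model call.

# Reproducibility sketch (assumptions: numpy available, a deterministic predict()
# wrapper around the real model; all names here are illustrative).
import numpy as np

SEED = 1234

def predict(features: np.ndarray) -> np.ndarray:
    # Placeholder for the real inference call; replace with the model under test.
    rng = np.random.default_rng(SEED)
    weights = rng.normal(size=features.shape[1])
    return features @ weights

def test_same_input_and_version_give_same_output():
    rng = np.random.default_rng(SEED)
    features = rng.normal(size=(8, 4))
    first = predict(features)
    second = predict(features)
    np.testing.assert_allclose(first, second, rtol=1e-7)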

Generate Test Cases for Each Feature

·         Data quality: Schema validation (types, ranges, nulls); missing or corrupt features; duplicates; label quality (if supervised); train/serve skew (same preprocessing). See the data-quality sketch after this list.

·         Drift detection: Input distribution drift (statistics, histograms); concept drift (label distribution or performance over time); thresholds and alerts; a dashboard or pipeline step. See the drift-detection sketch after this list.

·         Edge inputs: Empty input, all nulls, out-of-range values, very long text, special characters; the model doesn’t crash and returns a safe default or error. See the edge-input sketch after this list.

·         Fairness/bias slices: Performance (accuracy, F1, etc.) per slice (e.g. demographic, region); disparity metrics; a minimum performance bar per slice; bias mitigation checks. See the fairness-slice sketch after this list.

·         Latency under load: P50/P95/P99 latency at target RPS; batch vs online; GPU/CPU utilization; timeout and degradation behavior under overload. See the latency sketch after this list.

·         Rollback / regression: When metrics regress (e.g. accuracy drop, fairness violation), roll back to the previous model, toggle a feature flag, or switch to a fallback; test the rollback pipeline and alerts. See the promotion-gate sketch after this list.

·         Reproducibility: Same input + version → same output (where applicable); training run reproducible from config and data version.

·         Adversarial / robustness: Known adversarial or worst-case inputs; model doesn’t fail badly or expose unsafe behavior.
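Data-quality sketch: a minimal example of schema, null, duplicate, and range checks, assuming pandas and a pytest-style runner. The column names, dtypes, value ranges, and fixture path are hypothetical placeholders.

# Data-quality gate: schema, nulls, duplicates, and range checks on a feature frame.
# Column names, dtypes, bounds, and the fixture path below are hypothetical.
import pandas as pd

EXPECTED_DTYPES = {"user_id": "int64", "age": "int64", "amount": "float64"}
VALUE_RANGES = {"age": (0, 120), "amount": (0.0, 1e6)}

def validate_batch(df: pd.DataFrame) -> list[str]:
    errors = []
    # Schema: required columns with expected dtypes.
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    # Nulls and duplicates.
    null_counts = df.isna().sum()
    errors += [f"{c}: {n} nulls" for c, n in null_counts.items() if n > 0]
    if df.duplicated().any():
        errors.append("duplicate rows present")
    # Range checks on known numeric features.
    for col, (lo, hi) in VALUE_RANGES.items():
        if col in df.columns and not df[col].between(lo, hi).all():
            errors.append(f"{col}: values outside [{lo}, {hi}]")
    return errors

def test_training_batch_passes_quality_gate():
    df = pd.read_parquet("fixtures/train_sample.parquet")  # hypothetical fixture
    assert validate_batch(df) == []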
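Drift-detection sketch: a two-sample Kolmogorov-Smirnov test per numeric feature between a training reference and a recent serving sample, assuming scipy and pandas. The p-value threshold and fixture paths are illustrative assumptions; the team's alerting policy should set the real values.

# Input-drift check: compare each numeric feature's serving distribution
# against the training reference. Thresholds and paths are illustrative.
import pandas as pd
from scipy.stats import ks_2samp

P_VALUE_THRESHOLD = 0.01  # assumed alerting threshold, not from the source

def drifted_features(reference: pd.DataFrame, current: pd.DataFrame) -> list[str]:
    drifted = []
    for col in reference.select_dtypes("number").columns:
        if col in current.columns:
            stat, p_value = ks_2samp(reference[col].dropna(), current[col].dropna())
            if p_value < P_VALUE_THRESHOLD:
                drifted.append(col)
    return drifted

def test_no_input_drift_against_training_reference():
    reference = pd.read_parquet("fixtures/train_sample.parquet")  # hypothetical
    current = pd.read_parquet("fixtures/last_7_days.parquet")     # hypothetical
    assert drifted_features(reference, current) == []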
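Edge-input sketch: feed empty, all-null, oversized, out-of-range, and special-character payloads through a hypothetical predict_one() wrapper and assert a safe response rather than a crash.

# Edge inputs: the service should return a safe default or a clear error,
# never an unhandled exception. predict_one() is a hypothetical wrapper.
import pytest

EDGE_INPUTS = [
    {},                                         # empty payload
    {"text": None, "amount": None},             # all nulls
    {"text": "a" * 100_000},                    # very long text
    {"text": "💥 <script>alert(1)</script>"},   # special characters
    {"amount": -1e12},                          # out-of-range value
]

def predict_one(payload: dict) -> dict:
    # Placeholder for the real inference call; replace with a client call.
    if not payload or all(v is None for v in payload.values()):
        return {"status": "rejected", "reason": "empty_or_null_input"}
    return {"status": "ok", "score": 0.0}

@pytest.mark.parametrize("payload", EDGE_INPUTS)
def test_edge_inputs_do_not_crash(payload):
    result = predict_one(payload)
    assert result["status"] in {"ok", "rejected"}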
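Fairness-slice sketch: compute F1 per slice and enforce a minimum bar plus a disparity cap, assuming scikit-learn and pandas. The slice column, metric, thresholds, and fixture path are assumptions for illustration.

# Per-slice F1 with a minimum bar and a cap on the best-to-worst gap.
import pandas as pd
from sklearn.metrics import f1_score

MIN_F1_PER_SLICE = 0.70   # assumed minimum performance bar
MAX_F1_GAP = 0.10         # assumed allowed disparity between slices

def f1_by_slice(df: pd.DataFrame, slice_col: str) -> dict[str, float]:
    return {
        str(value): f1_score(group["label"], group["prediction"])
        for value, group in df.groupby(slice_col)
    }

def test_region_slices_meet_fairness_bar():
    df = pd.read_parquet("fixtures/eval_with_predictions.parquet")  # hypothetical
    scores = f1_by_slice(df, "region")
    assert min(scores.values()) >= MIN_F1_PER_SLICE, scores
    assert max(scores.values()) - min(scores.values()) <= MAX_F1_GAP, scores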
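Latency sketch: a smoke-level check that measures P95 over repeated calls. Real load tests at target RPS belong in a dedicated tool such as Locust or k6; the budget and the call_model() stub here are assumptions.

# P95 latency smoke check; replace call_model() with a real client request.
import time
import statistics

P95_BUDGET_MS = 200.0   # assumed online latency budget
SAMPLES = 200

def call_model() -> None:
    # Placeholder for the real request; replace with an HTTP client call.
    time.sleep(0.005)

def test_p95_latency_within_budget():
    latencies_ms = []
    for _ in range(SAMPLES):
        start = time.perf_counter()
        call_model()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    p95 = statistics.quantiles(latencies_ms, n=100)[94]
    assert p95 <= P95_BUDGET_MS, f"p95={p95:.1f}ms exceeds {P95_BUDGET_MS}ms"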
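Promotion-gate sketch: flag a rollback when the candidate model regresses on accuracy or fairness against the live model. Metric names and thresholds are illustrative assumptions.

# Promotion gate: block promotion (and trigger rollback) on regression.
MAX_ACCURACY_DROP = 0.02   # assumed tolerated accuracy drop
MAX_FAIRNESS_GAP = 0.10    # assumed tolerated worst-slice gap

def should_rollback(live_metrics: dict, candidate_metrics: dict) -> bool:
    accuracy_drop = live_metrics["accuracy"] - candidate_metrics["accuracy"]
    fairness_gap = candidate_metrics["worst_slice_gap"]
    return accuracy_drop > MAX_ACCURACY_DROP or fairness_gap > MAX_FAIRNESS_GAP

def test_regressed_candidate_triggers_rollback():
    live = {"accuracy": 0.91, "worst_slice_gap": 0.05}
    candidate = {"accuracy": 0.86, "worst_slice_gap": 0.04}  # clear accuracy drop
    assert should_rollback(live, candidate)

def test_healthy_candidate_is_promoted():
    live = {"accuracy": 0.91, "worst_slice_gap": 0.05}
    candidate = {"accuracy": 0.905, "worst_slice_gap": 0.06}
    assert not should_rollback(live, candidate)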

Questions on Ambiguities

·         What metrics define “good” (e.g. accuracy, F1, fairness parity) and what regression threshold triggers rollback?

·         Which slices are required for fairness reporting (e.g. age, region, product type) and what data is available?

·         How is drift defined (statistical test, threshold on distribution) and who acts on drift alerts?

·         What is acceptable latency (online vs batch) and what is the fallback when the model is slow or down?

·         Are explainability or audit logs required (e.g. feature contributions, request/response logging)?

·         Who approves model promotions and rollbacks (ML team, product, compliance)?

Areas Where Test Ideas Might Be Missed

·         Label and annotation quality: wrong or inconsistent labels in training/eval; impact on reported metrics and fairness.

·         Preprocessing parity: train vs serve preprocessing (tokenization, normalization, feature store); subtle skew. See the parity sketch after this list.

·         Cold start and rare categories: new users or rare items; model behavior and fallback.

·         Feedback loops: model predictions influence future data (e.g. recommendations); long-term bias or collapse.

·         Versioning and A/B: multiple model versions in production; routing and metric attribution per version.

·         Security of model artifact: tampering, extraction, or inversion; not always in scope but worth noting.

·         Cost and resource: inference cost per request; GPU memory; batch size vs latency tradeoff under load.
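Preprocessing-parity sketch: assert that the offline and online transforms produce identical features for the same raw record. Both entry points below are hypothetical placeholders for the real training-pipeline and serving-side code.

# Train/serve parity: same raw record in, same feature vector out.
import numpy as np

def offline_preprocess(record: dict) -> np.ndarray:
    # Placeholder for the training-pipeline transform.
    return np.array([len(record.get("text", "")), float(record.get("amount", 0.0))])

def online_preprocess(record: dict) -> np.ndarray:
    # Placeholder for the serving-side transform (feature store / service code).
    return np.array([len(record.get("text", "")), float(record.get("amount", 0.0))])

def test_train_serve_preprocessing_parity():
    record = {"text": "Größe 42 prüfen", "amount": "19.99"}
    np.testing.assert_allclose(offline_preprocess(record), online_preprocess(record))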

Output Template

Context: [system/feature under test, dependencies, environment]

Assumptions: [e.g., auth method, data availability, feature flags]

Test Types: [ML, data quality, fairness, performance]

Test Cases:

ID: [TC-001]

Type: [ML/AI]

Title: [short name]

Preconditions/Setup: [data, env, mocks, flags]

Steps: [ordered steps or request details]

Variations: [inputs/edges/negative cases]

Expected Results: [responses/UI states/metrics]

Cleanup: [teardown/reset]

Coverage notes: [gaps, out-of-scope items, risk areas]

Non-functionals: [perf targets, security considerations, accessibility notes]

Data/fixtures: [test users, payloads, seeds]

Environments: [dev/stage/prod-parity requirements]

Ambiguity Questions:

- [Question 1 about unclear behavior]

- [Question 2 about edge case]

Potential Missed Ideas:

- [Suspicious area where tests might still be thin]
