AI Prompt
"Create test cases for [ML model/pipeline].
Include data quality checks, drift detection, edge inputs,
fairness/bias slices, latency under load, and rollback if metrics
regress."
Applying Critical Thinking
· Data is part of the contract: Model behavior depends on input distribution; tests should cover data quality, schema, and known edge distributions.
· Slices matter: Aggregate metrics can hide regressions for subgroups; define slices (demographic, region, product) and test them.
· Drift and lifecycle: Concept and data drift occur over time; tests should include drift detection and criteria for retraining or rollback.
· Determinism and reproducibility: Where possible, use fixed seeds and fixtures (see the seed-fixture sketch after this list); document non-deterministic areas and their impact on assertions.
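The determinism point can be made concrete with a small pytest sketch. It assumes a pytest suite and a model fixture named model; the seed value and the predict call are placeholders, not a prescribed implementation.

    # Minimal sketch: pin common sources of randomness before every test.
    import random

    import numpy as np
    import pytest

    @pytest.fixture(autouse=True)
    def fixed_seeds():
        random.seed(1234)          # assumed seed; any fixed value works
        np.random.seed(1234)
        yield                      # nothing to tear down

    def test_same_input_same_output(model):   # 'model' fixture assumed to exist
        features = np.array([[0.1, 0.2, 0.3]])
        first = model.predict(features)
        second = model.predict(features)
        np.testing.assert_allclose(first, second)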
Generate Test Cases for Each Feature
· Data quality (sketch below): Schema validation (types, ranges, nulls); missing or corrupt features; duplicates; label quality (if supervised); train/serve skew (same preprocessing).
· Drift detection (sketch below): Input distribution drift (stats, histograms); concept drift (label distribution or performance over time); thresholds and alerts; dashboard or pipeline step.
· Edge inputs (sketch below): Empty input, all nulls, out-of-range values, very long text, special characters; the model doesn’t crash and returns a safe default or error.
· Fairness/bias slices (sketch below): Performance (accuracy, F1, etc.) per slice (e.g. demographic, region); disparity metrics; minimum performance bar per slice; bias mitigation checks.
· Latency under load (sketch below): P50/P95/P99 latency at target RPS; batch vs online; GPU/CPU utilization; timeout and degradation behavior under overload.
· Rollback / regression (sketch below): When metrics regress (e.g. accuracy drop, fairness violation), roll back to the previous model, flip a feature flag, or switch to a fallback; the rollback pipeline and alerts are themselves tested.
· Reproducibility: Same input + version → same output (where applicable); training run reproducible from config and data version.
· Adversarial / robustness: Known adversarial or worst-case inputs; model doesn’t fail badly or expose unsafe behavior.
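Minimal sketches for several of the checks above follow; all column names, fixtures, function names, and thresholds are assumptions for illustration, not a definitive implementation. First, a schema and data-quality check over a pandas batch:

    # Hypothetical schema/data-quality check for one batch of features.
    import pandas as pd

    EXPECTED_SCHEMA = {"age": "int64", "income": "float64", "country": "object"}  # assumed columns

    def check_batch(df: pd.DataFrame) -> list[str]:
        problems = []
        for col, dtype in EXPECTED_SCHEMA.items():           # types present and correct
            if col not in df.columns:
                problems.append(f"missing column: {col}")
            elif str(df[col].dtype) != dtype:
                problems.append(f"wrong dtype for {col}: {df[col].dtype}")
        if "age" in df.columns and not df["age"].between(0, 120).all():
            problems.append("age out of range")               # assumed valid range
        if df.isnull().mean().max() > 0.01:                   # assumed tolerance: <=1% nulls per column
            problems.append("null rate too high")
        if df.duplicated().sum() > 0:
            problems.append("duplicate rows found")
        return problems

A test would then assert that check_batch returns an empty list for both a training batch and a sample of serving traffic, which also exercises preprocessing parity.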
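For drift detection, one option (an assumption, not the only one) is a two-sample Kolmogorov-Smirnov test per numeric feature, with the threshold agreed with whoever acts on drift alerts:

    # Hypothetical drift check: reference sample vs recent serving traffic.
    from scipy.stats import ks_2samp

    def test_no_input_drift(reference_sample, recent_sample):   # 1-D arrays; fixtures assumed
        result = ks_2samp(reference_sample, recent_sample)
        # Placeholder threshold; real criteria belong in the drift policy.
        assert result.pvalue > 0.01, (
            f"possible drift: statistic={result.statistic:.3f}, p={result.pvalue:.4f}"
        )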
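Edge inputs map naturally onto a parametrized test; the predict_safe wrapper and its contract (return a default or structured error, never raise) are assumed:

    # Hypothetical edge-input tests for a text model.
    import pytest

    EDGE_INPUTS = [
        "",                               # empty input
        None,                             # null
        "a" * 100_000,                    # very long text
        "💥 <script>alert(1)</script>",   # special characters / markup
    ]

    @pytest.mark.parametrize("raw_input", EDGE_INPUTS)
    def test_edge_inputs_do_not_crash(model, raw_input):     # 'model' fixture assumed
        result = model.predict_safe(raw_input)
        assert result is not None                             # safe default or error object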
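Per-slice performance with a minimum bar and a disparity bound might be asserted like this; the slice column, the 0.80 bar, and the 0.05 gap are illustrative only:

    # Hypothetical per-slice accuracy check with a disparity bound.
    from sklearn.metrics import accuracy_score

    def test_slice_performance(eval_df):   # DataFrame with 'region', 'label', 'prediction' (assumed)
        per_slice = {
            region: accuracy_score(group["label"], group["prediction"])
            for region, group in eval_df.groupby("region")
        }
        assert min(per_slice.values()) >= 0.80, f"slice below bar: {per_slice}"
        assert max(per_slice.values()) - min(per_slice.values()) <= 0.05, (
            f"accuracy disparity across slices too large: {per_slice}"
        )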
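A latency check against a local callable shows the shape of the assertion; real load tests should drive the deployed endpoint at the target RPS with a dedicated load tool, and the budgets below are placeholders:

    # Hypothetical latency percentile check over repeated predictions.
    import time
    import numpy as np

    def test_latency_percentiles(model, sample_request):     # fixtures assumed
        latencies_ms = []
        for _ in range(200):
            start = time.perf_counter()
            model.predict(sample_request)
            latencies_ms.append((time.perf_counter() - start) * 1000)
        p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
        assert p95 < 200, f"P95 too slow: {p95:.1f} ms (P50={p50:.1f}, P99={p99:.1f})"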
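Finally, a promotion/rollback gate can be reduced to a comparison between candidate and baseline metrics; the metric names and tolerances here are assumed, and the actual rollback mechanics (traffic routing, feature flags) live in the deployment pipeline:

    # Hypothetical promotion gate: block deployment if the candidate regresses.
    def should_promote(baseline: dict, candidate: dict) -> bool:
        accuracy_ok = candidate["accuracy"] >= baseline["accuracy"] - 0.01   # assumed: max 1-point drop
        fairness_ok = candidate["fairness_gap"] <= baseline["fairness_gap"]  # assumed: no gap increase
        return accuracy_ok and fairness_ok

    def test_candidate_is_not_a_regression(baseline_metrics, candidate_metrics):  # fixtures assumed
        assert should_promote(baseline_metrics, candidate_metrics), (
            "candidate regressed; keep baseline / roll back"
        )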
Questions on Ambiguities
· What metrics define “good” (e.g. accuracy, F1, fairness parity) and what regression threshold triggers rollback?
· Which slices are required for fairness reporting (e.g. age, region, product type) and what data is available?
· How is drift defined (statistical test, threshold on distribution) and who acts on drift alerts?
· What is acceptable latency (online vs batch) and what is the fallback when the model is slow or down?
· Are explainability or audit logs required (e.g. feature contributions, request/response logging)?
· Who approves model promotions and rollbacks (ML team, product, compliance)?
Areas Where Test Ideas Might Be Missed
· Label and annotation quality: wrong or inconsistent labels in training/eval; impact on reported metrics and fairness.
· Preprocessing parity (sketch after this list): train vs serve preprocessing (tokenization, normalization, feature store); subtle skew.
· Cold start and rare categories: new users or rare items; model behavior and fallback.
· Feedback loops: model predictions influence future data (e.g. recommendations); long-term bias or collapse.
· Versioning and A/B: multiple model versions in production; routing and metric attribution per version.
· Security of model artifact: tampering, extraction, or inversion; not always in scope but worth noting.
· Cost and resource: inference cost per request; GPU memory; batch size vs latency tradeoff under load.
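Preprocessing parity can be tested directly when both the training and serving code paths are importable; the two preprocess functions named here are hypothetical:

    # Hypothetical parity test: training and serving preprocessing must agree.
    import numpy as np

    def test_train_serve_preprocessing_parity(raw_examples, train_preprocess, serve_preprocess):
        # All three arguments are fixtures pointing at real data and code paths (assumed).
        for example in raw_examples:
            np.testing.assert_allclose(
                train_preprocess(example),
                serve_preprocess(example),
                err_msg=f"train/serve skew for: {example!r}",
            )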
Output Template
Context: [system/feature under test, dependencies, environment]
Assumptions: [e.g., auth method, data availability, feature flags]
Test Types: [ML, data quality, fairness, performance]
Test Cases:
ID: [TC-001]
Type: [ML/AI]
Title: [short name]
Preconditions/Setup: [data, env, mocks, flags]
Steps: [ordered steps or request details]
Variations: [inputs/edges/negative cases]
Expected Results: [responses/UI states/metrics]
Cleanup: [teardown/reset]
Coverage notes: [gaps, out-of-scope items, risk areas]
Non-functionals: [perf targets, security considerations, accessibility notes]
Data/fixtures: [test users, payloads, seeds]
Environments: [dev/stage/prod-parity requirements]
Ambiguity Questions:
- [Question 1 about unclear behavior]
- [Question 2 about edge case]
Potential Missed Ideas:
- [Suspicious area where tests might still be thin]