Human oversight still required, even as AI in testing soars
Use of AI in testing digital products is soaring, but human input remains hugely important. That’s the key finding of the Applause State of Digital Quality in Functional Testing in 2025 report. More than 2,100 software testing and development professionals participated in the survey, which is now available online.
In the 12 months since the previous report, AI adoption in testing has doubled, with 60% of respondents deploying generative and agentic solutions, up from 30% in 2024. The most common uses involve developing test cases (70%), optimizing testing (55%) and automating test scripts (48%).
With AI adoption accelerating, 80% of organizations find themselves hindered by a lack of in-house testing expertise. Specific areas of concern include:
The need to keep pace with rapidly changing requirements (92%), with a third of respondents needing the assistance of a testing partner.
Unstable testing environments (87%).
Insufficient time for adequate testing (85%).
Given these responses, the report concludes that keeping humans in the loop, e.g., through crowd-testing, remains an effective solution to ensure quality.
“To meet increasing user expectations while managing AI risks, it’s critical to assess and evaluate the tools, processes and capabilities we’re using for quality assurance on an ongoing basis – before even thinking about testing the apps and websites themselves,” said Applause’s CTO.
Other Voices in AI Draw Similar Conclusions
OpenAI co-founder Greg Brockman and Codex engineering lead Thibault Sottiaux discussed what happens when AI becomes a true coding collaborator on a recent episode of “The OpenAI Podcast.” Brockman emphasized the importance of keeping humans “in the driver’s seat” alongside Codex. Sottiaux added that their commitment to maintaining a safe environment involves knowing when the agent needs humans to steer or approve actions.
“We're going to be continuing to invest a lot in making the environment safe, invest in understanding when humans need to steer, when humans need to approve certain actions and in giving more and more permissions so that your agent has its own set of permissions that you allow it to use,” he said.
Key Reasons for Human Oversight
In software testing
Trust and explainability: AI-driven testing is probabilistic and can flag or miss issues in non-obvious and opaque ways; humans are needed to interpret failures, validate AI reasoning, and decide whether a build is truly safe to ship.
Awareness of risk and impact: Testers understand domain-specific consequences (e.g., security, compliance, user harm) and can prioritize or override AI judgments when the model’s confidence or logic is questionable.
Ambiguity and edge cases: Humans can imagine and design tests for unruly, real-world edge cases and socio-technical scenarios that AI may not see in training data.
Preventing and detecting bias: Human review is needed to spot biased data, unfair scenarios, or tests that ignore certain user groups, which pure automation often amplifies rather than fixes.
Continuous improvement: Human feedback on false positives/negatives and poor test suggestions is essential to retrain models and reduce hallucinations or flaky AI-generated tests over time.
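That feedback loop can be made concrete. Below is a minimal sketch, under assumed data shapes, of how human verdicts on AI-flagged results might be used to quarantine flaky AI-generated tests; the function name `quarantine_flaky` and the verdict labels are illustrative, not from any particular tool:

```python
from collections import Counter

# Hypothetical human verdicts on AI-generated test outcomes:
# "tp"/"tn" = AI call confirmed by a reviewer, "fp" = false alarm, "fn" = missed bug.
def quarantine_flaky(verdicts_by_test, max_fp_rate=0.3):
    """Return AI-generated tests whose human-reviewed false-positive rate
    exceeds max_fp_rate, so they can be pulled for repair or retraining."""
    flaky = []
    for test_id, verdicts in verdicts_by_test.items():
        counts = Counter(verdicts)
        total = sum(counts.values())
        if total and counts["fp"] / total > max_fp_rate:
            flaky.append(test_id)
    return sorted(flaky)

verdicts = {
    "test_login": ["tp", "fp", "fp", "fp"],     # mostly false alarms
    "test_checkout": ["tp", "tp", "tn", "fp"],  # acceptable rate
}
print(quarantine_flaky(verdicts))  # ['test_login']
```

The point is not the arithmetic but the loop: reviewer verdicts accumulate into a signal that routinely removes low-quality AI output from the suite.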
In hardware and system testing
Safety and high-risk operations: Where failures have physical consequences (devices, embedded systems, safety-critical hardware), regulations and emerging standards implicitly and explicitly expect human oversight to minimize risks to health and safety.
Real-world judgment: Humans are better at assessing whether a hardware or system behavior is acceptable under noisy, imperfect conditions (temperature, interference, wear) that may not match lab or training assumptions.
Intervention and overrides: HITL setups allow engineers to halt autonomous tests, adjust parameters, or override AI decisions when the system reaches risk thresholds or behaves unexpectedly.
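A HITL harness of this kind can be sketched in a few lines. The following is an illustrative simplification, assuming each test step carries a precomputed risk score and that a human decision is available through a callback; all names here are hypothetical:

```python
# Illustrative HITL test harness: autonomous steps run freely below a risk
# threshold; anything at or above it requires an engineer's approval.
def run_with_oversight(steps, risk_threshold=0.8, approve=lambda step: False):
    """Execute (name, risk) test steps; route high-risk steps through the
    human `approve` callback and skip those the engineer declines."""
    executed, halted = [], []
    for name, risk in steps:
        if risk >= risk_threshold and not approve(name):
            halted.append(name)   # engineer declined: do not run this step
        else:
            executed.append(name)  # low-risk, or explicitly approved
    return executed, halted

steps = [("thermal_sweep", 0.2), ("overvoltage_probe", 0.95), ("idle_check", 0.1)]
executed, halted = run_with_oversight(steps)
print(executed)  # ['thermal_sweep', 'idle_check']
print(halted)    # ['overvoltage_probe']
```

In a real rig the `approve` callback would block on an operator console or ticket system rather than return immediately, but the control flow is the same: the agent proposes, the human disposes.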
Ethics and governance
Accountability: Organizations need to show who is responsible for go/no‑go decisions and cannot delegate legal or ethical accountability entirely to an AI system.
Standards compliance: Emerging frameworks (e.g., ISO/IEC AI management systems and the EU AI Act) explicitly call for human oversight to detect anomalies, manage automation bias, and protect fundamental rights of designers, testers and consumers of the products under test.
Overcoming automation bias: Recent studies have shown that humans tend to over-trust automation. Structured oversight roles, review gates, and training are needed so testers critically evaluate AI outputs instead of accepting them uncritically, effectively rubber‑stamping them.
On the practical side
Humans are still needed to guard against AI misbehavior. AI tools can be manipulated (prompt injection, adversarial inputs) or can simply produce incorrect or inappropriate code/tests. Human reviewers are needed to detect malicious or nonsensical actions before they hit CI/CD or production labs. Moreover, human testers provide strategic direction—what quality means for a product, which risks merit priority, and how to trade off coverage vs. time—while AI handles scale and routine generation/execution.
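One practical shape for such review is a pre-merge gate: AI-generated changes enter the pipeline only once a human has signed off. The sketch below assumes a simple record format for proposed changes and approvals; the field names and `ci_gate` function are invented for illustration:

```python
# Sketch of a CI/CD admission gate: anything marked as AI-generated needs a
# human approval record before it is admitted; everything else passes through.
def ci_gate(changes, approvals):
    """Split proposed changes into (admitted, blocked) IDs, requiring a
    human approval for any change flagged as AI-generated."""
    admitted, blocked = [], []
    for change in changes:
        if change.get("ai_generated", False) and change["id"] not in approvals:
            blocked.append(change["id"])   # no human sign-off yet
        else:
            admitted.append(change["id"])
    return admitted, blocked

changes = [
    {"id": "tc-101", "ai_generated": True},   # reviewed and approved
    {"id": "tc-102", "ai_generated": False},  # human-authored
    {"id": "tc-103", "ai_generated": True},   # awaiting review
]
print(ci_gate(changes, approvals={"tc-101"}))  # (['tc-101', 'tc-102'], ['tc-103'])
```

In practice the same gate catches both honest AI mistakes and injected or adversarial content, because nothing AI-authored reaches the pipeline without a named human behind it.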
Then there is the issue of "socio-technical" fit. Human testers act as “trust architects,” ensuring that AI-augmented testing stays aligned with organizational values, user expectations, and team workflows, rather than letting the tooling dictate practice.
Conclusion - Shaping the Loop
Designers and curators of AI‑generated tests need to spend more time reviewing, refining, and selecting AI-generated cases for inclusion in automated suites, rather than simply prompting and accepting AI outputs as near-production-ready. Key activities for supervisors and test auditors include monitoring AI-driven pipelines, investigating anomalies, and auditing AI decisions and metrics over time. Ultimately, the "loop" is shaped through iterative feedback: integrating corrections, labeling processes and sub-processes, and feeding clarifications back into training and evaluation pipelines.
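The feedback half of that loop can be sketched concretely. Assuming a hypothetical record format for reviewer decisions, the snippet below turns human corrections into labeled examples for retraining plus a simple acceptance-rate metric for auditing AI suggestions over time; `build_feedback` and its field names are assumptions, not an established API:

```python
# Illustrative feedback aggregation: reviewer decisions on AI-generated test
# cases become (a) labeled records for model retraining and (b) an
# acceptance-rate metric that auditors can track over time.
def build_feedback(reviews):
    """Return (labeled_records, acceptance_rate) from reviewer decisions."""
    labeled = [
        {"case": r["case"], "label": r["decision"], "note": r.get("note", "")}
        for r in reviews
    ]
    accepted = sum(1 for r in reviews if r["decision"] == "accept")
    rate = accepted / len(reviews) if reviews else 0.0
    return labeled, rate

reviews = [
    {"case": "tc-7", "decision": "accept"},
    {"case": "tc-8", "decision": "reject", "note": "asserts on flaky timing"},
]
labeled, rate = build_feedback(reviews)
print(rate)  # 0.5
```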