AI tools in healthcare often perform well in isolation but stall in practice. Reflections from Oxford's AI in Healthcare event on why implementation, not algorithms, is the real bottleneck – and what primary care research brings to the problem.

[Image: hand-drawn sketchnote by Scriberia capturing key themes in speech bubbles and vignettes, including 'What is AI?', 'Hospital at Home: Getting it Right', 'The Future of Primary Care', 'Machine Learning and Proteomics for Complex Disease', and public perspectives on communicating AI in healthcare.]
Visual summary of the event's patient and public involvement day, illustrated by Scriberia. The blog post covers themes from all three days.

Give a state-of-the-art computer vision model a photograph of a cat instead of a lung x-ray and ask it to assess for COVID-19. One hundred per cent certainty of infection.

This wasn’t a hypothetical. Dr Bartek Papiez showed us this result on the second day of the Computational Health Informatics Lab’s AI in Healthcare event at Worcester College. The model was, by the standards of its training data, performing well. It had simply learned the wrong thing.
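
Why is that even possible? A minimal sketch, with invented numbers: a standard softmax classifier has to divide its confidence among the classes it was trained on, so even an input from another world entirely gets a confident label. 'None of the above' is not on the menu.

```python
import numpy as np

def softmax(logits):
    """Turn raw model scores into probabilities that must sum to 1."""
    exp = np.exp(logits - np.max(logits))  # subtract max for numerical stability
    return exp / exp.sum()

# Hypothetical raw scores from a two-class chest x-ray model that has just
# been shown a photograph of a cat. The values are invented for illustration.
logits = np.array([11.0, -2.0])  # [COVID-19, no COVID-19]

probs = softmax(logits)
print(f"P(COVID-19) = {probs[0]:.6f}")  # ~0.999998: near-certain, and meaningless
```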

I mention this not as a gotcha – anyone working in AI knows the genre. But, perhaps surprisingly, the cat was the exception. The more common story across two days was its mirror image: tools that mostly worked as designed, algorithms that performed well, and a question that kept surfacing – from radiologists, obstetricians, critical care doctors, and social scientists alike – that was harder to answer.

Now what?

I’m not a clinician or a computer scientist. I’m the department’s senior communications manager. But I’ve spent the past few years building practical AI tools, writing policy on their use, and thinking about how these technologies change work. Sitting between disciplines – not an expert in any one speaker’s field but engaged with all of them – turned out to be a useful vantage point. Here’s what I saw.

The tools work; the system doesn’t

The moment that solidified this for me came from Dr Alex Novak, an emergency doctor with the Oxford Clinical AI Research (OxCAIR) group who also sits on the NICE diagnostics committee. He described a trial he’s leading that tested an AI tool for reading CT head scans in emergency departments. The algorithm performed well. It did what it was supposed to do: speed up the process of finding the needle in the haystack – the actual brain injury among the thousands of scans that show nothing.

But when they measured the thing that actually matters – turnaround time for patients – nothing changed. The tool worked. The pathway around it didn’t.

Dr Lucy Mackillop, an obstetrician who has lived through the full cycle of developing an app, running an RCT, and attempting commercialisation – “and has the scars to prove it”, as she put it – made the same point from a different angle. Technology is never going to deliver the big wins alone. Human factors, workflow redesign, interoperability, post-deployment surveillance, regulatory approvals: these are the difference between a clever tool and a safer service.

AI in maternity care could save lives. Foetal monitoring alone, Lucy told us, could prevent roughly half the 1,100 annual cases of brain injury or death in term babies. But detecting the problem is only one piece. You need the follow-up pathway ready. You need staff trained. You need the system to move.

On the event’s third day – dedicated to patient and public involvement – Prof Andrew Farmer from our department described the Whole System Demonstrator, a large-scale telehealth trial from 2008. Six thousand patients, three sites, substantial Department of Health funding. It failed on logistics: equipment that needed an engineer to install, schedules that didn’t fit patients’ lives, nurses chasing missing data instead of providing care. No benefit. Worse, the nurses who lived through it were left disillusioned – not just about that system, but about any new technology that promised to “help”.

Nearly two decades later, the pattern persists.

Dr David Clarke, also of OxCAIR, offered a characteristically pragmatic prescription: start boring. Start with operations and admin. Build a safety platform and earn trust. Then do the interesting stuff underneath.

I kept hearing this theme and initially framed it as change management. By the second day, I’d sharpened it. The better term – or at least the one that’s within throwing distance – is implementation science: the study of how interventions that work in trials get adopted, or don’t, in real-world settings. We have researchers who study exactly this, though they might describe it in terms of adoption, spread, sustainability, and the ‘sociotechnical complexity of health systems’.

The bottleneck is the stuff around the algorithm. And “the stuff around the algorithm” is a substantial part of what this department does.

Two kinds of data

Prof David Clifton, who hosted the event, drew a distinction that stayed with me.

Much of the national research infrastructure is built around large retrospective datasets like OpenSAFELY, ORCHID, and CPRD. These are powerful, hard-won resources and the Goldacre Report’s secure data environments (SDEs) make sense in that world. The logic is epidemiological: what happened to this population? What can it tell us about risk?

But there is another kind of question – what is happening to this patient, right now, and what should we do next? – that requires different data entirely. Vital signs streaming at 75 Hz. Wearables. The real-time signals that could guide clinical decisions in the moment, not population-level policies weeks, months, or years later.

You can't ask the local hospital for real-time vital signs linked to a patient's longitudinal record through a standard data request. And as Prof Richard Hobbs noted, SDEs also exclude the unstructured clinical text recorded in every GP consultation – precisely the kind of data large language models can now work with. As David put it, SDEs are predicated on an inference model that may not match what's actually needed at the bedside or in a consultation.

His lab works across both worlds – using retrospective datasets and building real-time clinical AI systems. The two approaches aren’t competing. They need each other. But they require different infrastructure, different governance, different skills. An SDE is built like a vault. Real-time clinical AI requires something closer to a nervous system. The two must connect, but they can’t be built the same way.
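
To give the 'nervous system' a concrete flavour, here is a toy sketch. The 75 Hz sampling rate echoes the talk; the threshold, units, and signal are all invented. The point is the shape of the computation: a rolling window over a continuous stream, flagging deterioration as it happens, rather than a question answered from a vault.

```python
from collections import deque

SAMPLE_HZ = 75          # illustrative waveform sampling rate
WINDOW_SECONDS = 10

def monitor(stream, spo2_alert=90.0):
    """Toy bedside loop: hold a rolling ten-second window over a 75 Hz
    signal and flag deterioration the moment the window mean drops.
    Threshold and units are invented for illustration."""
    window = deque(maxlen=SAMPLE_HZ * WINDOW_SECONDS)
    for sample in stream:                  # samples arrive continuously
        window.append(sample)
        if len(window) == window.maxlen:
            mean = sum(window) / len(window)
            if mean < spo2_alert:
                yield f"ALERT: 10 s mean SpO2 {mean:.1f}% below {spo2_alert}%"

# A made-up signal: steady at 97%, then a slow slide downwards
fake_stream = [97.0] * 1500 + [97.0 - i * 0.01 for i in range(1500)]
for alert in monitor(fake_stream):
    print(alert)
    break  # one alert is enough to show the shape of the thing
```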

The human who stops writing

Prof John Powell presented on ambient scribes – the AI tools now entering general practice that listen to consultations, summarise them, code them, and write the clinical notes. His title, a nod to Michael Balint’s The Doctor, His Patient and the Illness, signalled that this was about the consultation relationship, not just the technology.

John raised important concerns: accuracy, regulation, the effect on workloads, and what he called “the art of practising medicine”. These are well-evidenced worries. But the thing I couldn’t let go of was a question about cognition.

There is a principle in learning science called elaboration: processing information in your own words deepens understanding. William Zinsser put it more simply – writing is thinking. When a clinician writes their own notes, they’re not just recording what happened. They’re processing it, making sense of it, integrating it with what they know.

If the AI writes the notes, does the clinician lose that cognitive step? Does the consultation become something that happened to them rather than something they actively understood?

I don’t know the answer. But it sits at the intersection of health experiences research, qualitative methods, and digital health – all areas where this department has people doing serious work.

David Clarke’s experience with ambient scribes on critical care wards was more immediately sobering. In practice: multi-voice confusion, with half of one patient’s history ending up in another’s record; confabulated data; and notes four to five times longer than a clinician would write. Nobody used it for family consultations.

There were some positives – colleagues with English as a second language found it helpful, for example – but David remains unconvinced.

Where the bridges need building

Two other threads from the event connect to places the department is already heading.

The first is global health. Prof Guy Thwaites gave a vivid account of healthcare in Vietnam and Indonesia – systems that spend roughly £150 per person per year (in the UK it’s north of £3,000), where fewer than 4 per cent of records are digital, but where the absence of legacy infrastructure means new technology could be adopted faster than in the UK. Prof Louise Thwaites described nine years of painstaking work building monitoring in Ho Chi Minh City that made AI-assisted prediction of tetanus deterioration possible – using low-cost wearables and pulse oximetry on cheap tablets in settings where the nearest ventilator might be a seven-hour drive away.

David Clifton connected this to a technical opportunity: federated learning, where the model moves to the data rather than the other way around. Pre-train on NHS data, fine-tune on Vietnamese patient data, without either dataset crossing a border.
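
He described it as pre-train here, fine-tune there; the simplest general form of the pattern is federated averaging (FedAvg). A minimal sketch, with entirely invented data standing in for the two sites: each trains locally and shares only model weights, which a coordinator averages.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One site's training pass: plain gradient descent on a linear model.
    Only the updated weights leave the site -- never the data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

# Invented stand-ins for two datasets that never cross a border
rng = np.random.default_rng(0)
true_w = np.array([0.5, -1.2, 2.0])                  # shared underlying signal
X_a = rng.normal(size=(200, 3)); y_a = X_a @ true_w + rng.normal(0, 0.1, 200)
X_b = rng.normal(size=(80, 3));  y_b = X_b @ true_w + rng.normal(0, 0.1, 80)

global_w = np.zeros(3)
for _ in range(10):
    # Each site trains locally, starting from the current shared model
    w_a = local_update(global_w, X_a, y_a)
    w_b = local_update(global_w, X_b, y_b)
    # The coordinator sees only weights, averaged by site size (FedAvg)
    global_w = (len(y_a) * w_a + len(y_b) * w_b) / (len(y_a) + len(y_b))

print("Shared model after 10 rounds:", np.round(global_w, 2))  # close to true_w
```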

The second thread is trust – and its limits. Regulators may accept digital twin simulations as evidence. Models may drift after routine software updates. And explanations can mislead: human-like language can make clinicians trust a system more than it deserves.

And then there was Catherine Pope’s 911 call – the image that brought this into sharpest relief. A woman phoning to order a pizza – except she wasn’t. She was in a domestic abuse situation, speaking in code, hoping the operator would understand. The operator did. An algorithm, trained on patterns, would hear a misdial. The human heard a life.

It’s an edge case. Most calls are not like this. But medicine is full of edge cases – and the question of whether a system can recognise the ones it can’t handle may be the most important trust problem of all.

What this means for us

This department builds things – the Bennett Institute’s OpenSAFELY platform is substantial software engineering, and the Clinical Trials Unit runs studies at national scale. But the distinctive contribution, the thing I heard the need for in almost every session, is the work that makes those tools matter: running trials to see if they change outcomes, understanding what patients experience, studying why effective interventions stall at the point of adoption, evaluating whether a policy that sounds good on paper works in a consulting room.

Andrew Farmer put it plainly on the PPI day. The reason most hospital-at-home services still lack remote monitoring, despite years of policy emphasis, is simple: “we don’t understand the problems of the people we’re trying to look after”. The technology exists. What’s still often missing is the research that starts with patients, carers, and communities – the people who can tell you what safe or workable looks like from the inside.

Prof Sir Aziz Sheikh made a broader version of this point in his opening keynote. Drawing on two decades – from evaluating the National Programme for IT (he "helped write the obituary for it," as he put it) to building Scotland's linked COVID cohort – he arrived at a conclusion that the rest of the event spent two days confirming: the technology is rarely the hard part. The sociotechnical system around it is.

The event was itself an example of something the department’s strategy calls for: partnership across disciplinary boundaries. Engineers from David Clifton’s lab, clinicians from OUH, social scientists, researchers working in Vietnam and Indonesia, patients and members of the public who gave up a Wednesday morning to help shape the questions – and, for reasons I’m still slightly surprised by, a communications manager. The conversations between talks were as instructive as the talks themselves. That kind of cross-pollination doesn’t happen by accident. It happens because someone builds the room and opens the door.

The algorithms work. Most of them, most of the time. Somewhere in Ho Chi Minh City, a pulse oximeter on a cheap tablet is watching for the moment a tetanus patient starts to deteriorate, seven hours from the nearest ventilator.

The question, for a department like ours, was never whether we could build that sensor. It was always whether the systems and people around it would be ready when it spoke.

 

The AI in Healthcare event (23–25 February 2026) was hosted by the Computational Health Informatics Lab, Department of Engineering Science, at Worcester College. The event included a dedicated day for patient and public involvement. I'd like to thank Dr Lei Clifton, co-organiser and lead of the department's Applied Digital Health MSc, for inviting me.

Any mistakes in figures or facts are my own, drawn from my hasty notes.
