animal health consulting

"AI" chatbots

how reliable are they for equine topics?

Christine King  BVSc, MANZCVS (equine), MVetClinStud

The other day I came across the summary of a study that examined how accurate some commonly used artificial so-called intelligence ("AI") chatbots are when it comes to equine topics:


Accuracy of large language model-based artificial intelligence tools for equine topics. Journal of Equine Veterinary Science, 2026; May 1:105924.


Turns out, not very!


Here is the abstract:


Background  Artificial intelligence (AI) platforms are becoming increasingly popular as resources for equine information. However, these platforms generate responses from a wide range of sources and do not always distinguish between fact and opinion.


Aims/objectives  The objective of this study was to assess the accuracy and quality of AI-generated answers to equine-related questions. Researchers hypothesized that AI platforms could answer basic equine questions effectively but would perform poorly on complex topics or questions.


Methods  Forty questions were written covering the following categories:

* general horse care

* facilities management

* nutrition

* genetics

* reproduction


Each question was categorized by difficulty level:

* beginner

* intermediate

* advanced

* trending


Three AI platforms were tested:

* ChatGPT (CGPT)

* Microsoft Copilot (MicCP)

* ExtensionBot (ExtBot)


FYI: “ExtensionBot is the Extension Foundation’s purpose-built AI platform for the Cooperative Extension System — designed to deliver trustworthy, research-based answers to the public without the high hallucination risk and generic output that characterizes general-purpose AI.”


Responses were scored for the following features:

* accuracy

* relevance

* thoroughness

* source quality


Each feature was given a score of up to 5 points, for a total score of up to 20 points.


Data were analyzed using PROC GLM in SAS (v. 9.4).


Results  Total score was affected by level (P = 0.002). Intermediate questions had the highest total score (15.95 ± 1.99).


Accuracy was affected by platform (P < 0.001), level (P < 0.001), and topic (P = 0.015). CGPT (4.18 ± 0.93) and MicCP (4.08 ± 0.83) outperformed ExtBot (3.26 ± 1.21).


Relevance was affected by platform (P = 0.042) and level (P < 0.001).


Thoroughness was affected by platform (P < 0.001).


Source quality differed by platform (P = 0.037).


Conclusion  AI platforms could be resources; currently they fall short of the knowledge that Equine Extension Specialists can offer. AI platforms had difficulty addressing complex topics and demonstrated inconsistent performance across criteria.


* * *


Let me tease that apart for you with a few key points:


1. Overall performance was 80%, at best.


At best, these chatbots scored an average of 16/20 (total score) on various equine topics. That's an average of 80%. At best.


As the authors themselves wrote [emphasis added]:


If we interpret the total score as a percent, the highest scoring level (Int.) received a "grade" of 80%. Although the perception of a grade is often subjective, it is generally accepted that a grade of 80% indicates room for improvement. This is important to note because AI tools are frequently perceived as impressive and "all-knowing" due to their ability to access massive amounts of information in a matter of seconds.


But that's just the average, or 'mean' in statistical parlance. The standard deviation around that mean was +/- 2 points.


When data are normally distributed, about 68% of all data points fall within one standard deviation of the mean, as plotted on the classic bell curve.


Classic 'bell curve' formed by normally distributed data.

The figures along the bottom (X axis) represent 1, 2, and 3 standard deviations on either side of the mean (0).

One standard deviation from the mean in both directions (+/-) accounts for 68.2% of all data points.
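
For anyone who wants to check those bell-curve percentages, here is a minimal sketch in Python. It assumes a perfectly normal distribution, and the function name share_within is my own, not something from the study.

```python
import math


def share_within(k: float) -> float:
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    # For a normal distribution, P(|Z| < k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```

The values for 1 and 2 standard deviations come out at roughly 68% and 95%, which are the figures used in the rest of this section.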



So, while the average total score for the chatbots was 80%, it actually ranged from 70% to 90% for the majority of queries.


When we look at two standard deviations from the mean, which describes about 95% of all data points, the total score ranged from a low of 60% to a high of 100%.


That's quite a range!
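
The arithmetic behind those ranges can be sketched as follows. This assumes the "±" figure in the abstract (15.95 ± 1.99) is a standard deviation and that the scores are roughly normally distributed; the abstract does not confirm either. The names below are mine, for illustration only.

```python
MAX_TOTAL = 20             # 4 features x 5 points each
MEAN, SD = 15.95, 1.99     # intermediate-level total score, from the abstract


def band_percent(mean: float, sd: float, k: int, max_score: int = MAX_TOTAL):
    """Return (low, high) as percentages for mean +/- k standard deviations."""
    low = (mean - k * sd) / max_score * 100
    high = (mean + k * sd) / max_score * 100
    return low, high


for k in (1, 2):
    low, high = band_percent(MEAN, SD, k)
    print(f"+/- {k} SD: {low:.0f}% to {high:.0f}%")
```

Rounded to whole numbers, that gives 70% to 90% for one standard deviation and 60% to 100% for two.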


But that was just for the intermediate-level questions. It was significantly lower for beginner, advanced, and 'trending' levels.


2. Overall performance was lower for beginner-level questions.


More concerning to me was the fact that overall performance was lower for the beginner-level questions than for the intermediate (the pinnacle of the bots’ dubious success).


New horse owners are the ones most in need of reliable information, and arguably the ones most likely to consult a chatbot for information about horses and horse care — and the ones least likely to know which information is reliable and which is not!


3. Accuracy averaged 77% across all three platforms.


As for accuracy as a specific component of total score, ChatGPT and Copilot were the most accurate, although they averaged just 83.6% (ChatGPT) and 81.6% (Copilot). In comparison, ExtensionBot had an average accuracy of only 65.2%.
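
Here is how those percentages fall out of the mean accuracy scores reported in the abstract. This is a minimal sketch: the overall figure is a simple unweighted average of the three platforms, which assumes each platform answered the same set of questions.

```python
MAX_PER_FEATURE = 5  # each feature was scored out of 5 points

# Mean accuracy scores from the abstract
accuracy = {
    "ChatGPT": 4.18,
    "Microsoft Copilot": 4.08,
    "ExtensionBot": 3.26,
}

# Convert each mean score to a percentage of the maximum
percent = {name: score / MAX_PER_FEATURE * 100 for name, score in accuracy.items()}

# Unweighted average across the three platforms (an assumption, see above)
overall = sum(percent.values()) / len(percent)

for name, value in percent.items():
    print(f"{name}: {value:.1f}%")
print(f"All three platforms: {overall:.1f}%")
```

The unweighted average works out to 76.8%, which is where the 77% in the heading comes from.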


I’ve been trying to think of a situation in which I would be happy with an accuracy of 84%. Horseshoes and hand grenades perhaps?


I would never be happy with an accuracy of 65% for an information-generating system.


* * *


And those are just the highlights. Only the most supportive (positive, persuasive, statistically significant) data make it into the abstract.


The repeated refrain by proponents of “AI” is that these things just need a little more time, a bit more road testing, a few more tweaks — that is, when they’re not disclaiming that “it was never intended to be used for that.”


In my considered opinion, these things are fatally flawed, and no amount of time or tweaking will render them fit for purpose.


I’m also reminded of the saying I first heard applied to social media platforms: If the product is free, then you are the product.


User beware.


* * *


If you're interested in this topic, I explore the limitations of "AI" in other articles you can access for free here.



© Christine M. King, 2026. All rights reserved.

