There’s an ongoing debate about whether (or rather when) we should and shouldn’t trust artificial intelligence. The public LLM hallucination leaderboard on GitHub is a fantastic resource for seeing how the various publicly available models perform. GPT-4o was the king of this regularly updated dataset for a long time, but when I checked it today, a new champion had emerged, with the catchy name of GLM-4-9B-Chat. This new, open-source model hallucinates 0.2 percentage points less than GPT-4o; on the other hand, casual users will have a hard time deciphering its getting-started instructions.
All in all, large language models have come a long way in the past year. When I started checking the leaderboard, a hallucination rate under 5% was exceptional. Now we are close to a factual consistency of 99%. So, should you trust AI, you ask? First, as you can see, it depends on the model: some will come up with a fabricated answer in almost a third of cases, while others will almost always produce a factually sound result. Not telling a lie and being highly useful or informative are not the same thing, though; that is worth remembering when interacting with these models.
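The two headline numbers above are just complements of each other: factual consistency is 100% minus the hallucination rate. A minimal sketch of that arithmetic (the function names and the sample counts are hypothetical, not taken from the leaderboard):

```python
def hallucination_rate(hallucinated: int, total: int) -> float:
    """Fraction of generated summaries judged factually inconsistent."""
    return hallucinated / total

def factual_consistency(hallucinated: int, total: int) -> float:
    """Complement of the hallucination rate."""
    return 1 - hallucination_rate(hallucinated, total)

# Hypothetical example: a model that hallucinates in 13 of 1,000 summaries
rate = hallucination_rate(13, 1000)
consistency = factual_consistency(13, 1000)
print(f"hallucination rate: {rate:.1%}, factual consistency: {consistency:.1%}")
```

With numbers in this range, a 0.2-percentage-point gap between two models amounts to only a couple of extra fabricated answers per thousand, which is why the top of the leaderboard is so tightly packed.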