There’s an ongoing debate about whether (or rather when) we should and shouldn’t trust artificial intelligence. The public LLM hallucination leaderboard on GitHub is a fantastic resource for seeing how the various publicly available models perform. GPT-4o was the king of this regularly updated dataset for a long time, but when I checked it today, a new champion had emerged, with the catchy name of GLM-4-9B-Chat. This new, open-source model hallucinates 0.2 percentage points less than GPT-4o; on the other hand, casual users will have a hard time deciphering its getting-started instructions.
All in all, large language models have come a long way in the past year. When I started checking the leaderboard, a hallucination rate under 5% was exceptional. Now we are close to a factual consistency of 99%. So, should you trust AI, you ask? First, as you can see, it depends on the model: some will come up with a fabricated answer in almost a third of cases, while others will almost always produce a factually sound result. Not telling a lie and being highly useful or informative are not the same thing, though; that is worth remembering when interacting with these models.
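The two headline numbers above are just complements of each other: factual consistency is 100% minus the hallucination rate. A minimal sketch of that arithmetic (the function names and the sample counts are hypothetical, not taken from the leaderboard):

```python
def hallucination_rate(hallucinated: int, total: int) -> float:
    """Fraction of generated summaries judged factually inconsistent."""
    return hallucinated / total

def factual_consistency(hallucinated: int, total: int) -> float:
    """Complement of the hallucination rate."""
    return 1 - hallucination_rate(hallucinated, total)

# Hypothetical example: a model that hallucinates in 13 of 1,000 summaries
rate = hallucination_rate(13, 1000)
consistency = factual_consistency(13, 1000)
print(f"hallucination rate: {rate:.1%}, factual consistency: {consistency:.1%}")
```

With numbers in this range, a 0.2-percentage-point gap between two models amounts to only a couple of extra fabricated answers per thousand, which is why the top of the leaderboard is so tightly packed.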