ChatKPI 3: Lies

You have Lies, Big Lies, and Artificial Intelligence.

ChatKPI, our future application for formulating Key Performance Indicators using Chat, must of course produce accurate KPIs. Accuracy is considered a basic expectation. Nobody wants to find out that 35% of the meat at the butcher is spoiled, occasionally not receive a salary, or have their children come home unexpectedly from school on random days because there happened to be no classes.

Of course, something can go wrong, but that should be the exception.

However, making mistakes is actually quite common with AI.

Given the enormous hype around AI and ChatGPT, you might not expect that. But those are simply the facts.

Microsoft Build 2023: Finding the Right Sleeping Bag

At Microsoft Build 2023, Microsoft made a great effort to convince developers worldwide of the excellence of its products. The event uses examples for this purpose. These examples are very carefully selected in advance, as they must serve as supporting evidence for the simplicity, reliability, and immediate accessibility of the technology.

This year was all about generative AI, with a focus on ChatGPT. And Microsoft had selected a brilliant example for this.

What was this example?

Someone wanted to buy a sleeping bag on a website and asked the AI machine: ‘Which sleeping bag do I need if I’m going to Chile?’. And the AI machine provided a list of sleeping bags.


The idea behind this example was that the machine knew it could be cold in Chile and, based on that, selected the right sleeping bags.

35% of the Answers Were Wrong

Unfortunately, this list was not correct. A Microsoft speaker revealed that as much as 35% of the answers were wrong: you got the wrong sleeping bag.

Great, if you like shivering from cold in your tent at night.

And this was an example that had been carefully selected in advance.

Microsoft is very honest and clear about this, almost to the point of being tiresome, emphasizing “grounding”: the need to always check whether an AI answer is correct before using it.

But does this also apply to ChatKPI?

Our lessons from this experience made it clear: it is not wise to use AI to automate something that is essentially human work, like selecting items in a webshop.

And frankly, it would have surprised us – with 30 years of experience in building AI software – if it had worked well. With experience, you get a reasonably good insight into what is and isn’t feasible.

Therefore, we were much less ambitious with ChatKPI: we limited ourselves to producing computer code. This is much easier for a machine because computer language is artificial and very tightly formalized.
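One reason generated code is a safer target than free-form answers is that it can be checked mechanically before anyone relies on it. A sketch of such a gate, assuming (hypothetically) that the AI returns a Python expression for a KPI; the example expressions are invented for illustration:

```python
import ast

def is_valid_kpi_expression(source: str) -> bool:
    """Reject AI output that is not even syntactically valid Python.
    A formal language makes this check automatic; a natural-language
    answer about sleeping bags offers no such test."""
    try:
        ast.parse(source, mode="eval")
        return True
    except SyntaxError:
        return False

# A plausible KPI formula passes; a half-finished hallucination does not.
print(is_valid_kpi_expression("spoiled_kg / total_kg * 100"))  # True
print(is_valid_kpi_expression("spoiled_kg / total_kg *"))      # False
```

A syntax check is only the first gate, of course: code can parse and still compute the wrong KPI, which is why grounding against known cases remains necessary.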

And ChatKPI seemed to work well, see our previous article for a precise account of how we did it.

But would it always work? How reliable is generative AI for producing computer code? What could go wrong?

The answer is: a lot.

The Science: Stanford University Found 52% Errors

Stanford University conducted research to determine how well ChatGPT could solve a standard collection of simple programming problems.

Stanford discovered that as much as half of the answers were simply wrong.

ChatGPT gets code questions wrong 52% of the time • The Register

That was a significant disappointment.

The Hard Reality: Stack Overflow Bans Generative AI

We were not discouraged and continued our research. While the Stanford questions involve simple programming problems, they seem artificial. They are not real-world problems, and that’s what we’re really interested in.

Where do you find real-world problems?

For this answer, Stack Overflow immediately comes to mind. There is no more prestigious platform in the world. On this platform, professional programmers ask each other technical questions about programming issues. To ensure quality, they use a strict evaluation system for both the answers and each other.


And it turns out that Stack Overflow, after some experience, quickly and strictly banned the use of AI. They go quite far in this regard: anyone who dares to use ChatGPT risks being banned from the platform for life.

Stack Overflow bans ChatGPT as ‘substantially harmful’ • The Register

“High error rates mean thousands of AI answers need checking by humans”

“The average rate of getting correct answers from ChatGPT is too low”

AI is Based on Statistics, and That’s Where It Goes Wrong

If you think about it more deeply, it makes sense that AI is often unreliable. After all, it is essentially a statistical analysis. And to paraphrase Mark Twain’s “There are three kinds of lies: lies, damned lies, and statistics”: you have lies, big lies, and Artificial Intelligence.

How Do We Proceed with ChatKPI?

For our projects, we always use this curve. Horizontally, you see time; vertically, the confidence, the enthusiasm for the project. Rarely does a project follow the green curve; almost always, it follows the blue curve. And here too, with ChatKPI 1 and 2, we were potentially still on the green curve, but now we suddenly find ourselves on the blue curve.

The enthusiasm curve for projects

Our conclusion: It is irresponsible to integrate AI into our software at this time. We’ll wait a year before we proceed with artificial intelligence. A miracle might happen, but our 30 years of experience with AI tell us it won’t. At some point, you reach a plateau, and it takes a tremendous amount of effort to make any improvements, if it’s even possible. So, even a year from now, AI will not be reliable enough. However, the realization that AI is not reliable may sink in, and our users might understand that they need to be cautious when using it.

(Disclaimer: Microsoft is a sponsor of Actoren)
