Advances in artificial intelligence have transformed many industries, making tasks more efficient and accessible. However, recent findings suggest that the performance of the highly acclaimed GPT-4 model may not be as reliable as previously believed.
In this article, we explore a study that sheds light on GPT-4’s performance degradation and the potential implications it could have on AI-driven applications.
A Shift in GPT-4’s Performance
Over time, many users have reported a noticeable decline in the quality of responses generated by GPT-4. Until now, these concerns were largely anecdotal. However, a recent study has provided objective evidence of this decline by comparing the March 2023 and June 2023 versions of GPT-4.
A Prime Example of Degradation
To evaluate the model’s performance, the researchers used a dataset of 500 questions in which GPT-4 had to determine whether a given integer is prime.
In the March version, GPT-4 correctly answered 488 out of 500 questions, showcasing an impressive 97.6% success rate. However, the June version’s performance plummeted, with only 12 correct answers out of 500, resulting in a mere 2.4% success rate.
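For intuition, here is a minimal sketch of how such an evaluation might be scripted. The prompt wording, the `query_gpt4` placeholder, and the scoring logic are assumptions for illustration, not the study’s actual harness.

```python
# Hypothetical sketch of the primality evaluation (not the study's actual harness).

def is_prime(n: int) -> bool:
    """Ground-truth check by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def query_gpt4(prompt: str) -> str:
    """Placeholder: send `prompt` to the model and return its raw text reply."""
    raise NotImplementedError

def accuracy(numbers: list[int]) -> float:
    correct = 0
    for n in numbers:
        # Prompt wording is illustrative; the idea is to ask for a [Yes]/[No] verdict.
        prompt = f"Is {n} a prime number? Think step by step and then answer [Yes] or [No]."
        reply = query_gpt4(prompt)
        predicted_prime = "[yes]" in reply.lower()
        if predicted_prime == is_prime(n):
            correct += 1
    return correct / len(numbers)  # e.g. 488/500 = 97.6% for the March version
```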
Chain-of-Thought Failure
Another technique used to improve model responses is Chain-of-Thought prompting, which asks the model to reason step by step before answering. Unfortunately, the June version of GPT-4 often skipped the intermediate steps entirely, producing terse and incorrect responses such as a bare “No” when asked whether a number is prime.
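To make the contrast concrete, here is an illustrative comparison of a plain prompt and a Chain-of-Thought prompt; the exact wording is an assumption, not necessarily the phrasing used in the study.

```python
# Illustrative prompt styles; the wording is an assumption, not the study's exact text.
number = 17077  # any candidate integer works here

# Plain prompt: the model may reply with a bare, unexplained "Yes"/"No".
plain_prompt = f"Is {number} a prime number? Answer [Yes] or [No]."

# Chain-of-Thought prompt: asks for intermediate reasoning (e.g. trial division
# by small primes) before the final answer, which tends to improve accuracy.
cot_prompt = (
    f"Is {number} a prime number? "
    "Think step by step and then answer [Yes] or [No]."
)
```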
Code Generation Woes
For code generation, GPT-4’s performance also dropped sharply. The researchers evaluated the model on a dataset of 50 easy LeetCode problems, counting an answer as correct only if the generated code ran without any changes. The March version achieved a 52% success rate on this measure, while the June version’s success rate plummeted to just 10%.
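One simple, hypothetical way to score “runs without changes” is to try compiling and executing the raw model output. The `is_directly_executable` helper and the `solution` entry point below are invented for this sketch and are not the benchmark’s own scorer.

```python
# Hypothetical check for "directly executable" output (not the benchmark's scorer).
# A completion passes only if it compiles and defines the expected entry point
# without any manual cleanup (e.g. stripping Markdown fences).

def is_directly_executable(completion: str, entry_point: str = "solution") -> bool:
    namespace: dict = {}
    try:
        exec(compile(completion, "<llm-output>", "exec"), namespace)
    except Exception:
        # Markdown fences, surrounding prose, or syntax errors all fail this check.
        return False
    return callable(namespace.get(entry_point))

# Example: code wrapped in Markdown fences is not directly executable as-is.
fenced = "```python\ndef solution(x):\n    return x * 2\n```"
plain = "def solution(x):\n    return x * 2"
assert not is_directly_executable(fenced)
assert is_directly_executable(plain)
```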
Possible Reasons Behind the Decline
The exact reasons for GPT-4’s declining performance remain unclear, but one speculation is that quality drops as a side effect of safety alignment and fine-tuning. The video below (specifically the “strange case of the unicorn” segment) shows how GPT-4’s output quality degraded after it was fine-tuned for safety.
Other rumors suggest that OpenAI may be routing user queries to smaller, specialized GPT-4 models to handle them more efficiently, deciding which model serves each query based on cost and speed considerations.
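If that rumor were true, the routing could look something like the purely speculative sketch below; the model names, thresholds, and `classify_complexity` helper are all invented for illustration and do not describe OpenAI’s actual infrastructure.

```python
# Purely speculative sketch of a query-routing setup; everything here is invented.

def classify_complexity(query: str) -> float:
    """Toy stand-in for a real difficulty classifier: longer queries score higher."""
    return min(len(query) / 500, 1.0)

def route(query: str) -> str:
    """Pick a backend by trading off cost/speed against estimated difficulty."""
    difficulty = classify_complexity(query)
    if difficulty < 0.3:
        return "small-distilled-model"   # cheap and fast for easy queries
    if difficulty < 0.7:
        return "mid-size-specialist"     # balances cost and quality
    return "full-gpt-4"                  # reserved for the hardest queries

print(route("Is 97 prime?"))  # -> "small-distilled-model"
```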
Implications for AI Applications
The deterioration in GPT-4’s performance is concerning for developers and businesses relying on this technology to power their applications. AI-driven systems whose behavior shifts unpredictably over time can cause serious problems in industries such as healthcare, finance, and customer service.