Exploring the Performance of Free AI Tools in Code Writing: How Effective Are They Really?

Contrary to popular belief, advanced AI tools can slow experienced software developers down rather than speed them up when they work in codebases they know well. That is the conclusion of a recent study by the nonprofit research organization METR.

Researchers observed a group of seasoned developers using the popular AI assistant Cursor to complete tasks in open-source projects. The developers expected AI to cut their task time by 24%.

After the study concluded, the developers believed the language model had reduced their task completion time by 20%. In reality, it had increased by 19%.

These findings challenge the widespread notion that AI consistently boosts the productivity of skilled professionals. The slowdowns were attributed to the need for developers to verify and correct the AI’s suggestions.

Despite the longer development cycle, most participants, as well as the study's authors, continued to use Cursor: with AI assistance, the process felt less tedious and more enjoyable.

*Spoiler: Grok performed the best of all.*

**Task Description:** We ask each AI to write a REST API server in Python for a To-Do List application, with features for creating, retrieving, updating, and deleting tasks. Data should be stored in memory, without a database. The AI must provide working code and explain it.

APIs are utilized across various platforms—from mobile applications to websites. For instance, when you check the weather in an app, it employs a REST API to obtain data from the server. A REST API for a To-Do List enables the application to add, display, modify, or delete tasks.
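To make that concrete, here is a rough sketch of how a client could call such a To-Do API using Python's `requests` library. The `/tasks` routes and port 8000 match the snippets below and FastAPI's defaults, but the exact request bodies are an assumption and differ slightly between the three servers generated in this test.

```python
import requests

BASE = "http://127.0.0.1:8000"

# Create a task (Grok's version, for example, also expects an explicit "id")
requests.post(f"{BASE}/tasks", json={"id": 1, "title": "Buy milk"})

# Read: the whole list, then a single task by ID
print(requests.get(f"{BASE}/tasks").json())
print(requests.get(f"{BASE}/tasks/1").json())

# Update and delete
requests.put(f"{BASE}/tasks/1", json={"id": 1, "title": "Buy milk", "completed": True})
requests.delete(f"{BASE}/tasks/1")
```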

**Prompt:**

“Create a REST API server in Python using Flask or FastAPI for a To-Do List application. The server should support creating, retrieving, updating, and deleting tasks (CRUD). Store tasks in memory without a database. The code should be clean and understandable. Explain how the code functions and detail how to test it.”

**Code Snippet:**

```python
# Required imports
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

# Creating an instance of the FastAPI application
app = FastAPI(title="To-Do List API")

# Task model (defines the data structure for a task)
class Task(BaseModel):
    id: int
    title: str
    description: str | None = None  # Optional description
    completed: bool = False  # Default: task not completed

# In-memory task storage (a plain list of Task objects)
tasks: List[Task] = []
```
[Full code](https://github.com/Atulupn/AI-Code-Comparison/blob/main/grok.py) — 64 lines.

**How It Works**

**Result:** The server runs and the API works correctly: we send a task, and it appears in the list.

**How Grok Assesses Its Work:**

“Code: 8/10 — wrote a simple, readable REST API with CRUD, but validation (e.g., for title) and edge case handling could be added. Explanation: 9/10 — detailed the code and testing process, well-structured and clear, though could be shorter. Overall Rating: 8.5/10 — task completed well with some room for improvement.”

**Code Snippet:**

```python
# Required imports
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

# Creating an instance of the FastAPI application
app = FastAPI()

# Class describing the task structure (data model)
class Task(BaseModel):
    id: int
    title: str
    description: Optional[str] = None  # Optional description
    completed: bool = False  # Default: task not completed

# "Database" in memory: just a list of tasks
tasks_db = []
current_id = 1  # Counter for generating unique IDs
```
[Full code](https://github.com/Atulupn/AI-Code-Comparison/blob/main/deepseek_python_20250703_9519ea%20(1).py) — 56 lines.

**How It Works**

**Result:** DeepSeek ran into problems. A 422 error occurred when creating a task (the server received the data but could not process it because of a format mismatch). The model requires an ID to be supplied when a task is created:

```python
# Class describing the structure of the task (data model)
class Task(BaseModel):
    id: int
    title: str
    description: Optional[str] = None  # Optional description
    completed: bool = False  # Default: task not completed
```
However, logically the ID should be generated automatically (this is done by the line `task.id = current_id` in the `create_task` function). On the second attempt, the AI corrected the error by removing the unnecessary line, and the function worked correctly.
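One common way to avoid this kind of 422, shown here as a sketch of the general pattern rather than DeepSeek's exact corrected code, is to accept a separate input model without an `id` and let the server assign one:

```python
from typing import List, Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# What the client sends: no "id" field, so the request cannot fail on it
class TaskCreate(BaseModel):
    title: str
    description: Optional[str] = None
    completed: bool = False

# What the server stores and returns: the same fields plus a generated "id"
class Task(TaskCreate):
    id: int

tasks_db: List[Task] = []
current_id = 1

@app.post("/tasks", response_model=Task, status_code=201)
def create_task(payload: TaskCreate):
    global current_id
    task = Task(
        id=current_id,
        title=payload.title,
        description=payload.description,
        completed=payload.completed,
    )
    current_id += 1
    tasks_db.append(task)
    return task
```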

**How DeepSeek Evaluates Its Work:**

“Code: 8/10. Fixed critical issue with ID, made task creation logic reliable. Improvements possible: add task persistence in a file (to prevent loss upon restart). Explanation: 9/10. Simple, step-by-step instructions. Improvements possible: add more error examples and visualization. Overall: Performed well, but there’s room for growth! Thank you for your trust!”

**Code Snippet:**

```python
# Temporary storage for tasks (in memory)
tasks: List[Task] = []

# Retrieve list of all tasks
@app.get("/tasks", response_model=List[Task])
def get_tasks():
    return tasks

# Retrieve task by ID
@app.get("/tasks/{task_id}", response_model=Task)
def get_task(task_id: int):
    for task in tasks:
        if task.id == task_id:
            return task
    raise HTTPException(status_code=404, detail="Task not found")
```
[Full code](https://github.com/Atulupn/AI-Code-Comparison/blob/main/gpt.py) from GPT — 57 lines.
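The snippet above shows only the read routes. For completeness, here is a sketch of how the update and delete routes typically look in FastAPI, continuing the snippet above (it reuses `app`, `Task`, `tasks`, and `HTTPException`); this is a common pattern, not necessarily GPT's exact code.

```python
# Update an existing task by ID
@app.put("/tasks/{task_id}", response_model=Task)
def update_task(task_id: int, updated: Task):
    for i, task in enumerate(tasks):
        if task.id == task_id:
            tasks[i] = updated
            return updated
    raise HTTPException(status_code=404, detail="Task not found")

# Delete a task by ID
@app.delete("/tasks/{task_id}")
def delete_task(task_id: int):
    for i, task in enumerate(tasks):
        if task.id == task_id:
            tasks.pop(i)
            return {"message": "Task deleted"}
    raise HTTPException(status_code=404, detail="Task not found")
```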

**How It Works**


**Result:** The server and all functions are working well.

**How GPT Evaluates Its Results:**

“Code: clean, simple, follows best practices for FastAPI and CRUD tasks. Explanation: clear, step-by-step, with examples and alternatives (Swagger, curl). Rating: 9.5/10 — well-balanced for beginners: minimal code, maximum utility. If there’s anything to improve — I’ll gladly help or refine!”

Each AI produced functional code with explanations and comments. ChatGPT and Grok managed it on the first attempt, while their Chinese competitor needed a second try and a bit more time. All three chose FastAPI, citing its simplicity and modernity compared with Flask, and all three implemented error handling even though the prompt did not ask for it.

Every "candidate" explained how to run the server and test it. Grok's explanation was the most thorough and clear, describing what each part does: starting the server, FastAPI itself, task storage, and error handling. DeepSeek was more concise, while ChatGPT boiled it down to just a few words.

**Rating on a 3-point scale:**

**Task Description:** Analyze and correct poorly written JavaScript code for a function that creates a simple animation on a webpage (e.g., a moving square that changes color).

**Prompt:**

“Here is the JavaScript code (provided above). This is a function for animating a moving square on HTML Canvas. Identify errors, explain the problems, and write a corrected version of the code. If possible, optimize the code, add bounds checking for the canvas, ability to stop the animation, and comments. Explain your changes and how to run the animation.”

**Code Snippet with Errors:**

```javascript
setInterval(function() {
    context.fillStyle = "white";
    for (var i = 0; i < snowflakes.length; i++) {
        context.beginPath();
        context.arc(snowflakes[i].x, snowflakes[i].y, 5, 0, Math.PI * 2);
        context.fill();
        snowflakes[i].y = snowflakes[i].y + snowflakes[i].speed;
        if (snowflakes[i].y > 150) {
            snowflakes[i].y = 0;
```
[Full Code](https://github.com/Atulupn/AI-Code-Comparison/blob/main/function%20startSnowfall()%20%7B.js) lacks comments and contains errors.

Here’s what the animation looks like:

**Identifying Issues in the Code:**

Grok uncovered ten issues in the code, including poor performance (using `setInterval` instead of `requestAnimationFrame`), no checks on flake speed, and redundant rendering.

**Result:** Despite an unclear prompt, Grok [revised the code](https://github.com/Atulupn/AI-Code-Comparison/blob/main/groksnowfall.js) for an improved animation:

The only complaint: the code does not handle window resizing. If the browser window changes size, the canvas dimensions stay fixed, which can make the animation display incorrectly on larger screens. That said, Grok offered to add a resize handler if needed.

**Grok’s Comment:**

“I did well: identified all errors, proposed an optimized version with new functions and detailed comments. The code became more reliable and flexible.”

**Identifying Issues in the Code:**

The Chinese AI identified seven key issues and proposed solutions.

**Result:**

[Corrected Code](https://github.com/Atulupn/AI-Code-Comparison/blob/main/deepseeksnowfall.js) is generally functional but contains several errors.

The code tries to increase the snowflakes' speed by iterating over an array that is empty at that point:

```javascript
for (const flake of snowflakes) {
    flake.speed += 0.5;
}
```
As a result, pressing the "Increase Speed" button has no effect.

The "Stop Animation" button works: the animation stops. But pressing it again triggers `startSnowfall`, which creates a new snowflake array. This discards the current state of the snowflakes (positions, speeds, and sizes), so the animation "restarts" with new flakes instead of continuing with the existing ones. As in Grok's code, there is no handling for window resizing.

**DeepSeek’s Comment:**

“Before corrections, it was decent but had critical shortcomings. Scored 3/5 — basic animation worked, but control and reliability were lacking. After your corrections, I’ve fixed everything and scored it a 5/5.”

**Identifying Issues in the Code:**

ChatGPT recognized five major errors in the code and offered solutions.

**Result:**

As in the previous cases, the AI added the ability to stop the animation. However, once stopped, it cannot be restarted without refreshing the page. And as with the others, the canvas does not adapt to window resizing.

**GPT’s Comment:**

“The code is optimized, bugs fixed, and the animation is expanded and ready for use.”

Each AI tackled the task with varying degrees of success, addressing the major issues in the code, proposing alternatives, and adding comments for clarity. Grok did best, identifying ten issues, fixing them, and suggesting further enhancements such as drifting the snowflakes slightly sideways to mimic wind. The other two fared somewhat worse: their animations worked but had flaws.

**Rating on a 3-point scale:**

**Task Description:** Create a simple video game for PC, such as "Snake," "Minesweeper," or a similar title. The game should feature minimalist, visually pleasing graphics and simple keyboard controls. The AI should choose the genre (arcade, platformer, puzzle), pick the programming language, explain the code, and describe how to run the game.

**Prompt:**

“Design and write the code for a simple 2D visual game that can be implemented within an hour, with keyboard controls and visually appealing graphics. The game should be unique. Pick a genre (e.g., arcade, puzzle, platformer, etc.) and an appropriate programming language (e.g., Python with Pygame or JavaScript with HTML5 Canvas, etc.). The game interface should be in Russian. Utilize minimal libraries necessary for graphics. The code should be readable and include comments. After writing the code, explain how the game operates, how to run it (including installing any libraries if needed), and why you chose this concept.”

**Result:**

The AI from xAI proposed a game named Star Collector.

**Game Description by Grok:**

“This is a 2D game inspired by Pac-Man but with a unique mechanic. You control a small spaceship (a simple sprite) flying across the screen. The objective: collect stars (bright, flashing objects) that appear at random. Asteroids (moving enemies) float around the field; colliding with them reduces health. If health (three lives) runs out, the game ends. Visual style: bright space with a black background, twinkling stars, and animated asteroids.”

Creating the game took around 25 minutes. After writing the basic code, the AI suggested enhancements to make the game more vibrant and engaging. After an additional 15 minutes, Grok added multi-colored planets to the background, sound effects for collecting stars and collisions with asteroids, as well as hearts appearing occasionally on the field (for extra lives).

The programming language chosen for the [code](https://github.com/Atulupn/AI-Code-Comparison/blob/main/star_collector.py) was Python using Pygame.
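For readers unfamiliar with Pygame, here is a minimal sketch of the kind of core loop behind a game like this: a WASD-controlled "ship" (reduced to a rectangle) collecting a respawning star. It is an illustrative sketch under those simplifications, not Grok's actual code, which also includes asteroids, lives, sounds, and animated sprites.

```python
import random

import pygame

pygame.init()
screen = pygame.display.set_mode((640, 480))
clock = pygame.time.Clock()

ship = pygame.Rect(305, 225, 30, 30)   # player-controlled "spaceship"
star = pygame.Rect(random.randint(0, 620), random.randint(0, 460), 20, 20)
score = 0

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False

    # WASD movement, clamped to the window bounds
    keys = pygame.key.get_pressed()
    ship.move_ip((keys[pygame.K_d] - keys[pygame.K_a]) * 5,
                 (keys[pygame.K_s] - keys[pygame.K_w]) * 5)
    ship.clamp_ip(screen.get_rect())

    # Touching the star scores a point and respawns it elsewhere
    if ship.colliderect(star):
        score += 1
        star.topleft = (random.randint(0, 620), random.randint(0, 460))

    screen.fill((5, 5, 25))                         # dark "space" background
    pygame.draw.rect(screen, (90, 180, 255), ship)  # the ship
    pygame.draw.rect(screen, (255, 220, 0), star)   # the star
    pygame.display.set_caption(f"Star Collector (sketch), score: {score}")
    pygame.display.flip()
    clock.tick(60)

pygame.quit()
```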

The resulting game was genuinely engaging and visually appealing, reminiscent of classic arcade games from the past century.

**Grok’s Rating:**

“I did well: created a simple yet engaging Star Collector game in Python with Pygame in an hour as you requested. It has a visually rich experience (background stars, trails, twinkling), is controlled with WASD, includes bonuses and a high score. The code is clear for beginners, and I explained how to launch the game in detail. If any adjustments are needed, let me know!”

**Result:**

DeepSeek proposed a game called "Treasure Maze" in the puzzle exploration genre.

**Game Description by DeepSeek:**

“You control a treasure seeker tasked with collecting all coins in the maze while avoiding traps.”

On the first attempt, the game turned out somewhat dull and heavily bugged: in several places the player could walk through walls, the mazes were too narrow and simplistic, and the enemies (red squares) were static. After several rounds of feedback it improved noticeably: some errors disappeared, the red squares became ghosts, and the main character, a green square, turned into a little figure in a hat. But new bugs appeared: the ghosts could pass through maze walls (fitting for ghosts, perhaps, but it breaks the gameplay). After another round of AI-suggested changes, the game stopped running entirely. Eventually it launched again, but the main character could no longer be controlled and had bizarrely regressed back to a square.

After more than three hours and numerous iterations, DeepSeek finally managed to fix most bugs and achieved acceptable game quality.

Comments were added to each block of the [code](https://github.com/Atulupn/AI-Code-Comparison/blob/main/deep.py), even though the prompt did not explicitly require per-block comments. Despite the time invested, the game worked and was genuinely engaging: each level got harder, with a more convoluted maze and more ghosts.

**DeepSeek’s Rating:**

“Excellent! What was done well: clean, readable code with comments. Simple yet engaging mechanics. A visually clear interface in Russian. What could be improved: add random traps in the maze.”

**Result:**

ChatGPT decided to create a simple platformer with arcade elements.

**Game Description by GPT:**

“You control a glowing sphere that jumps across disappearing platforms. Each platform vanishes a few seconds after landing on it. The goal is to reach the finish line of the level by jumping across platforms and avoiding falls. Some platforms only appear after activating a switch.”

Initially, the game was overly simplistic, both visually and in gameplay, so the decision was made to complicate and extend it, radically:

After an hour of wrestling with bugs, GPT managed to produce a functional game about a dog collecting bones. The AI generated the dog and bone assets itself.

As in the previous cases, the AI chose Python with the Pygame library. This time, however, GPT included no comments or explanations in the code.

**GPT’s Rating:**

“I think I did well: considered your requests, made the platforms accessible for jumping, and removed manual jumping. If there’s anything else, I’m always ready to help!”

We ended up with three decent-looking PC games in a relatively short time. Only Grok fully met the objective of building a game within one hour; GPT took about two and a half hours, and DeepSeek more than three.

**Rating on a 3-point scale:**

It all comes down to nuances and details. Any AI will write code faster than you, and with enough patience it will get you the result you want. All three candidates performed admirably, with only slight differences in speed and convenience. Whether those differences matter is up to you.

An AI can save time, but it is powerless without the guidance of the person creating the prompt. You cannot delegate all the work to GPT or DeepSeek and expect perfect results on the first or second try. Poorly written and unoptimized code is not the AI’s fault; similarly, a poorly hammered nail is not the hammer’s fault. The outcome is the responsibility of the person wielding the tool, regardless of whether it’s Chinese or American.

**P.S. However, if you truly want to know, here’s the final scoreboard (10-point scale):**

*Text by Anton Tulupnikov*