
Agentic Mode: Guide to AI Development (MCP, AGENTS.md, Prompt Techniques)


TL;DR

  • Choose a client for agentic development with MCP support - OpenCode / Zed / Antigravity (or any similar tool).
  • For complex analysis and reviews, use a "thinking" model; for daily coding, use a fast "working" one (specific choice depends on stack and task).
  • Connect MCP and use a config as a base - opencode.json example.
  • Create an AGENTS.md file with rules, constraints, code style, agent response format, and architectural principles (arguably the most important part).
  • Use techniques like meta-prompting, Tree-of-Thought, prompt chaining, reflection, and others.
  • For complex tasks, enable the consensus mechanism - delegate the same task to multiple models in parallel and compare results.
  • Profit 🎉

Vibecoding and AI-Assisted Development

In recent months, I've frequently shared how modern AI models and Agentic Mode can dramatically accelerate the development process—specifically accelerate, not replace the developer. In this article, I've collected the main methods and techniques I use across different projects and roles. I began deeply studying this topic in late June 2025, when one of my projects started considering cutting the projected team size in half.

Let me immediately answer one of the most popular questions:

Can you write an app without writing a single line of code or even opening the code?

Yes, you can. But it's important to understand—AI agents are powerful accelerators, not just for good solutions, but for bad ones too.

If your processes and architecture are already rolling the project into the abyss of technical debt, agents will fly you there at supersonic speed. At first, it's unnoticeable—a duplicate class here, an unnecessary dependency there, a suboptimal import. But as entropy grows, maintenance complexity increases exponentially, and at some point, the agent simply stops "digesting" the project context.

Of course, one can hope that model quality and context size will grow faster than your tech debt, but relying on that right now isn't viable.

I walked this path over three months of experiments on https://gear-picker.com. It was my sandbox: a real pet project where I learned my lessons and honed my prompt engineering skills. I didn't write a single line of code manually, although sometimes I really wanted to. For the first two or three weeks, with almost no established practices, I achieved decent results with Sonnet 3.7, but it quickly became clear that moving forward without structure, strict rules, and boundaries was impossible.

Important Disclaimer: for prototypes and one-off scripts, this approach works almost perfectly "out of the box" without extra hassle. It's well suited for writing debug scripts or single-use utilities.

Is there real efficiency in implementing AI in the development process?

Currently, there aren't many studies on this topic. One thing is clear—developers' internal sense of efficiency is often inflated relative to real metrics. On average, the increase in development speed and quality ranges from 15 to 25 percent, but the outcome heavily depends on the task type, stack, and the developer's own experience.

You can learn more about research results on this topic in this video:

Additionally, you can check out:

The question of how much to trust such studies and how to interpret their results remains open. But for now, this is all I've managed to find.

Stack

An important point—I tested only agentic mode and only for the following languages: Python, Go, TypeScript, Lisp, Rust. For other stacks, results may differ significantly.

By the end of 2025, I had tested almost all popular models and tools. Below is my personal top list based on practical experience.

  • Codex 5.2 (and 5.1) - a strong "thinking" model, but too slow for daily work. I use it mainly for verifying other models' results, complex analysis, and code reviews.
  • Opus 4.5 / Sonnet 4.5 - the ideal daily driver for everyday development. Fast models with some of the best code generation quality to date.
  • Gemini 3 Pro - compared to version 2.5, it has improved noticeably, but it still periodically ignores instructions and generates garbage. Overall, a solid third place. Its $20 tier comes with very generous limits, and it is practically the undisputed leader in media analysis tasks and in working with Playwright.

Next are models that can be considered supplementary, but one shouldn't expect consistently strong results:

  • Grok 4 - noticeably stronger than most models listed below. Can perform very well or very poorly, with the reason for such behavior not always being obvious. Additionally, the model is quite expensive.

It's worth noting that I used Grok 4 over four months ago, so the situation might have changed.

  • GLM 4.6 - more hype than real utility; however, the $3 price makes the model relatively attractive. It can be useful for tasks with a small context.
  • DeepSeek (3.2) - a cheap model, roughly comparable to Sonnet 3.5 (at best 3.7). I sometimes use it as an alternative second opinion on reasoning.
  • Qwen3 - a model from a Chinese giant, capable of writing code at an average level without obvious strengths.
  • Llama 3.x (70B, Code) - Can be useful in scenarios where privacy-first principles are critical, however, in generation quality and reasoning, it noticeably lags behind most models on this list.

Best IDEs and Editors for AI Development

A good model alone is not enough. Having convenient, predictable tooling for interacting with the model in agentic mode is just as important.

  • OpenCode — a CLI tool for agentic development. Supports modes, commands, MCP, fast switching between different providers (OpenAI, Anthropic, Google), and includes a built-in LSP. It is especially convenient to use inside a terminal multiplexer (tmux, zellij): you can keep OpenCode in one pane and have git status / diff / gitu next to it to quickly inspect changes and keep full control over what the agent actually did. In such a setup, OpenCode fits naturally into a day-to-day engineering workflow and works well together with manual code review. At the moment, this is my second favorite tool, and for many people it can easily become their number one.
  • Eca — a tool written in Clojure. Very similar to OpenCode, and personally my daily driver, as it is tightly integrated with Emacs (while also supporting Neovim, Visual Studio Code, and IntelliJ IDEA). Functionally, it covers roughly the same use cases as OpenCode.
  • Zed — a popular and very fast editor written in Rust. Supports plugins and AI assistants, and is a good choice for those who value speed, responsiveness, and a clean UI.
  • Antigravity — a VS Code–based tool developed by Google. It enables full use of Gemini 3 Pro in agentic mode. As of late 2025, full Gemini 3 Pro support is available exclusively in Antigravity (personally, I was not able to get Gemini 3 Pro working reliably in OpenCode or via the Gemini CLI). Antigravity also includes an agent manager that allows interacting with agents in a chat-like interface. For vibe-coders who care less about direct code control and diffs, this approach may feel more convenient than a classic engineering workflow.

I don't use the following environments for AI development, but they are worth mentioning as they are quite popular. In my opinion, each has flaws (Electron-based, subscriptions with lower limits, etc.), but they are fine for getting started:

  • Cursor - A VS Code fork with deep AI integration. Quite unfavorable limits.
  • aider - a popular CLI tool for AI development. I started with it, but experience showed it was less suitable for agentic development than other tools on the list.
  • Visual Studio Code - Classic with plugins (GitHub Copilot, Cline, etc.).
  • Replit - a cloud IDE with an AI agent. Convenient for a quick start, but I am only superficially familiar with it, as it's important for me to fully control the environment. In my opinion, more suitable for "real" vibe-coders.
  • Warp - a terminal with AI functions. Personally, it didn't seem very convenient, though this might be due to my habits and workflow setup. As an AI provider, it also raises questions: its prices are noticeably higher than working through a regular subscription, and there is no way to connect custom providers.
  • Windsurf - a tool very similar to Antigravity in concept and capabilities. To me, it looks like a VS Code clone with a separate subscription, which doesn't seem justified.
  • Copilot/Gemini/Claude/Codex CLI (highly specialized terminal clients)

Configuring MCP Servers (Model Context Protocol)

MCP is perhaps the easiest way to significantly increase work efficiency with a model. The cost is usually increased token consumption. However, with proper setup, these expenses quickly pay off, as the desired result is achieved in fewer iterations.

A list of new MCP servers can always be found here.

The undisputed top picks:

  1. sequential-thinking - allows the model to build complex reasoning chains, revise its decisions, and dynamically change the plan of action. Critically important for truly complex tasks.
  2. context7 - provides up-to-date documentation for libraries and frameworks. Significantly reduces hallucinations when the model starts using non-existent methods or outdated APIs.
  3. deepcontext - a tool for context management. Allows saving important knowledge fragments and reusing them between sessions. Especially useful in conjunction with OpenCode.
  4. serena - semantic codebase analysis. Helps the agent "understand" code: finding function definitions, relationships between them, and actual usage points, rather than just searching text files.
  5. playwright - browser automation. Allows the model to open pages, interact with UI, run tests, and see the result of real rendering.
  6. DuckDuckGo - extended web search
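
To wire these up, each server is declared in the client config. Below is a minimal opencode.json sketch; the field layout follows the OpenCode documentation as I remember it, and the package name and remote URL are illustrative, so double-check them against each server's README.

```json
{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "sequential-thinking": {
      "type": "local",
      "command": ["npx", "-y", "@modelcontextprotocol/server-sequential-thinking"],
      "enabled": true
    },
    "context7": {
      "type": "remote",
      "url": "https://mcp.context7.com/mcp",
      "enabled": true
    }
  }
}
```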

For more advanced scenarios, it makes sense to use sub-agents (via MCP or native integrations). This allows delegating different types of tasks to different models, depending on what they handle best. For example, Gemini 3 Pro is, in my opinion, one of the best models for analyzing media content.

AGENTS.md

The main file with all instructions for agents within the project.

In the process, I tried different approaches to forming it, including separating instructions by context. Something like:

- if the task is to make a commit - read ./agents/commit.md,
- if you need to cover code with tests - read ./agents/tests.md,

In practice, this approach worked poorly. Agents started ignoring parts of instructions, getting confused in rules, or simply forgetting them. Eventually, I concluded that it's much more reliable to keep all instructions in one file, but with a clear structure and table of contents.

Personally, for programming, I use a universal AGENTS.org file. I import its content into the AGENTS.org of each project, supplementing it with rules and constraints specific to the particular context.

Main sections and rules I put there:

  • available CLI tools and utilities (eza, gh, jq, etc.)
  • universal principles and practices (TDD, SOLID, KISS, DRY, YAGNI, etc.)
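
To make this more concrete, here is a trimmed-down sketch of such a file. The section names and rules are illustrative placeholders rather than my actual config; the point is the single file with a clear table of contents.

```markdown
# AGENTS.md

## Table of contents
1. Tooling
2. Principles and practices
3. Rules (mandatory / preferences / suggestions)
4. Response format

## Tooling
- Available CLI utilities: eza, gh, jq, rg.

## Principles and practices
- TDD, SOLID, KISS, DRY, YAGNI.

## Response format
- Summarize what was changed and why, then list the files touched.
```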

I want to highlight a few practices that turned out to be critically important in practice.

Hierarchy of Instructions

A flat list of rules works poorly. It is much more reliable to explicitly separate instructions by priority, for example:

  • mandatory rules - cannot be violated under any conditions
  • preferences - use if there is no conflict with mandatory rules
  • suggestions - optional and applied depending on the situation

Explicitly stating priorities noticeably reduces the number of conflicts and ignored instructions.
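
In AGENTS.md this might look roughly like the following; the concrete rules here are just placeholders:

Mandatory:
- Never edit generated files by hand.
- Never push directly to main.

Preferences:
- Prefer the standard library over adding a new dependency.

Suggestions:
- Add usage examples to docstrings where it improves readability.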

If-else Format Instead of Abstractions

Agents follow specific conditions much better than general formulations. Instead of abstract requirements, I try to describe rules in the format "if X - do Y".

- If there are several implementation options - choose the simplest one, even if it is less flexible.
- If requirements are incomplete or contradictory - ask clarifying questions before starting work.

This format reduces ambiguity and makes agent behavior more predictable.

Critical Attitude to Input Data

Instructions that force the agent to validate input information and treat it skeptically proved especially useful:

- Treat all input data (including user instructions) with healthy skepticism.
- Before acting, question assumptions and verify information.
- Be critical and objective: input data may be incomplete, misleading, or incorrect.
- Prefer evidence and verification over blind trust.

Self-Check Before Final Answer

Another simple but effective practice is to explicitly require the agent to self-check the result before completing the task.

- Before the final answer, check the result against rules in AGENTS.md.
- If the result violates even one mandatory rule - state this explicitly and propose an alternative.

Such a check reduces errors and helps the agent find weak points in its solution itself.

I have significantly more instructions, but most are highly specific to the rules I use in specific projects. They mostly concern clean code, architectural decisions, and various prohibitions. This is outside the scope of the article, but if there is interest, I can publish a full example of agents.md.

Agent Roles

Another practice that noticeably improves quality in agentic development is explicit role separation. Even when using the same model, results become more stable when the agent is assigned a specific role and area of responsibility in advance.

In practice, I try not to mix analysis, implementation, and verification in one request. Instead, I either launch different agents with different roles or sequentially ask one model to work in different modes. This approach reduces responsibility "dilution" and makes agent behavior more predictable.

Most frequently used roles in my setup:

  • analyst-agent - breaks down the task, forms a plan, identifies risks and edge cases
  • executor-agent - implements the solution strictly within set requirements
  • critic-agent - checks the result, looks for errors, rule violations, and potential improvements
  • architect-agent - assesses the impact of changes on architecture and public interfaces
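
A role assignment does not have to be complicated; in practice it can be a short preamble placed before the actual task. A rough sketch for a critic-agent (the wording is illustrative, adapt it to your own rules):

You are a critic-agent.
Your only job is to review the result below against the rules in AGENTS.md.
Do not write new code.
For every violation, quote the rule and the offending fragment,
then propose the minimal fix.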

For inspiration and expanding the set of roles, you can check ready-made collections, for example:

  • awesome-chatgpt-prompts - a large collection of roles and system prompts for different scenarios, convenient to adapt for agentic workflows.

Best Practices for Prompt Engineering

Below is a set of techniques (translated and adapted) that genuinely help squeeze the maximum out of a model. Most are useful in regular chat too, but they shine especially well in agentic mode.

Meta Prompting

Ask the AI to first rewrite or improve your initial request, and only then generate the final answer.

Example:

I need to add a feature to an existing service.

Before proposing a solution, rewrite this request as a clear,
structured prompt, which will include:
- goal of the change
- stack constraints
- definition of done
- risks and edge cases

Tree-of-Thought

Allows the model to consider several different paths to a solution, evaluate them, and choose the best one.

Example:

Propose three different approaches to implementing rate limiting for a public API.

For each approach:
1. Describe the principle of operation.
2. List pros and cons.
3. Estimate complexity of implementation and maintenance.

After that, choose the most suitable option for a service
with 10k rps load and explain the choice.

Prompt Chaining

Break the task into a series of sequential steps, where the output of one prompt becomes the input for the next. This works well when the task is complex and the context is large.

Example:

Prompt 1:
"Analyze the current authorization implementation and list weak points."

-> Answer: "no refresh token, weak role validation, logic duplication"

Prompt 2:
"Take these problems: {answer from prompt 1} and sort them by risk
to security and business."

-> Answer: "1. absence of refresh token, 2. weak role validation, 3. duplication"

Prompt 3:
"For point 1, propose a fix plan without changing the public API."

Generate Knowledge

Ask the AI to first explain necessary frameworks, techniques, or concepts, and only then answer the applied question. This reduces hallucinations and improves the quality of the final solution.

Example:

Before proposing caching implementation in our service, first:
1. Explain the difference between write-through, write-back, and cache-aside.
2. When each approach makes sense.
3. Typical mistakes during implementation.

Then propose the optimal option for an API with frequent reads
and rare writes.

Retrieval-Augmented Generation (RAG)

Using external data (search, documentation, knowledge bases). This can be implemented via context7 or any other source that provides up-to-date facts and APIs.

Example:

Find up-to-date documentation for Playwright (latest versions),
then propose an example e2e test for a login form considering:
- asynchronous validation
- error handling
- test stability

Reflection

Ask the model to check its own answer, find errors, and improve the result. Especially useful for code where quality, security, and edge cases are important.

Example:

Implement a data migration function between schema versions.

After that:
- check correctness on edge cases
- assess what happens during partial failure
- propose improvements for idempotency

Then output the final version of the solution.

ReAct (Reasoning + Acting)

Ask the AI to plan the solution, perform necessary actions (e.g., search or code execution), and then output the result. In agentic mode, this is often the "natural" style of work.

Example:

I need to understand why CI fails periodically.

Reasoning:
First, list possible causes of unstable failures.

Action 1:
Analyze the latest CI logs and find recurring errors.

Action 2:
Check recent changes in tests and infrastructure.

Result:
Formulate the probable cause and propose a fix plan.

Step-by-step

A simple technique - ask the model to break the solution into steps and show intermediate calculations. This often reduces logic and math errors.

Example:

Calculate the approximate infrastructure cost in AWS
for a service with the following parameters:
- 3 backend instances
- managed database
- CDN
- logging and monitoring

Think step-by-step:
1. Estimate each component separately.
2. Sum up the total cost.
3. Indicate where the estimate might be imprecise.

Few-Shot Prompting

Give the model 2-5 examples (input -> output) to fix the format and style of the answer.

Example:

Form a commit message in Conventional Commits format:

Change: "Added JWT token validation"
-> feat(auth): add JWT token validation

Change: "Fixed memory leak in worker"
-> fix(worker): fix memory leak

Change: "Added retry for external API"
-> ?

Self-Consistency

Ask the model to solve the problem several times in different ways and compare results. If answers converge, the probability of error is lower. If they diverge, it's a signal to clarify the task or use another approach.

Example:

Propose 3 options for implementing feature flags in a backend service.
For each option estimate:
- complexity
- impact on performance
- convenience of disabling the feature

After that, compare options and choose the most reliable one.

It is important to understand that these techniques produce the greatest effect not individually, but when used together. Their action is cumulative: meta prompting improves the initial task formulation, prompt chaining and tree-of-thought help break it into manageable steps, RAG and reflection reduce errors and hallucinations, and self-consistency and consensus increase the reliability of the final result. Combined, these approaches reinforce each other and produce noticeably more stable and predictable results than any technique alone.

For manual debugging and experimenting with prompts, it is also useful to use official playground tools:

  • OpenAI Playground - convenient for quick testing of prompts and comparing model behavior.
  • Anthropic Console - well suited for working with long instructions, complex reasoning chains, and agentic scenarios.
  • LMArena - allows comparing different models on the same prompts, useful for selecting the optimal model for specific tasks.

Consensus Mechanism

The essence of the method is to use several agents in parallel to solve the same task. By delegating the work to them independently, we get diverse opinions. Often the results are similar, but the differences help surface missed details, edge cases, or hidden bugs that a single model might overlook.

In practice, I rarely send the initial task directly to multiple agents. First, it goes through a separate aggregator agent, whose task is to formalize, clarify, and improve the initial prompt. It helps remove ambiguities, fix requirements, and bring the task to the clearest possible form.

After that, the improved prompt is sent in parallel to several agents or models (e.g., Codex 5.2, Opus 4.5, Gemini 3 Pro, DeepSeek, etc.) via MCP. The received results are then aggregated by the "strongest" model with a large context window and good reasoning, which filters out noise and forms a single, balanced plan of fixes or solutions.

This approach scales well to complex tasks, reduces the probability of systematic errors by one model, and is especially useful in situations where the cost of error is high.

```mermaid
flowchart TD
    Task(["Original Task"]) --> PromptAggregator["Prompt Formalization Agent, Meta-Prompting"]
    PromptAggregator --> AgentA["Agent A (Opus 4.5)"]
    PromptAggregator --> AgentB["Agent B (Codex 5.2)"]
    PromptAggregator --> AgentC["Agent C (Gemini 3 Pro)"]
    AgentA --> ReviewA["Opinion A"]
    AgentB --> ReviewB["Opinion B"]
    AgentC --> ReviewC["Opinion C"]
    ReviewA --> Aggregator{"Aggregator (Consensus)"}
    ReviewB --> Aggregator
    ReviewC --> Aggregator
    Aggregator --> FinalPlan["Final Plan"]
    style PromptAggregator fill:#1976d2,stroke:#333,stroke-width:2px,color:#fff
    style Aggregator fill:#b02e78,stroke:#333,stroke-width:2px,color:#fff
    style FinalPlan fill:#2e7d32,stroke:#333,stroke-width:2px,color:#fff
```
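
A rough Python sketch of the same flow, using asyncio and an OpenAI-compatible gateway (OpenRouter here) purely as an illustration. The model slugs, the aggregator choice, and the base URL are assumptions you would adjust to your own setup.

```python
import asyncio
import os

from openai import AsyncOpenAI

# OpenRouter exposes an OpenAI-compatible API, which makes it easy to reach
# several model families through one client; any other gateway or a mix of
# native SDKs would work just as well.
client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

WORKERS = [  # placeholder model slugs - pick whatever you actually use
    "anthropic/claude-sonnet-4.5",
    "openai/gpt-5.1",
    "google/gemini-3-pro-preview",
]
AGGREGATOR = "anthropic/claude-opus-4.5"  # the "strongest" model with a large context


async def ask(model: str, prompt: str) -> str:
    """Send one prompt to one model and return its text answer."""
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


async def consensus(task: str) -> str:
    # 1. Formalize the task (meta-prompting).
    improved = await ask(AGGREGATOR, f"Rewrite this task as a clear, unambiguous prompt:\n{task}")
    # 2. Fan the improved prompt out to several independent agents in parallel.
    opinions = await asyncio.gather(*(ask(m, improved) for m in WORKERS))
    # 3. Let the aggregator compare the opinions and produce a single plan.
    merged = "\n\n---\n\n".join(opinions)
    return await ask(
        AGGREGATOR,
        f"Here are {len(opinions)} independent solutions to the same task:\n{merged}\n"
        "Compare them, point out disagreements, and produce one final plan.",
    )


if __name__ == "__main__":
    print(asyncio.run(consensus("Design rate limiting for our public API.")))
```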

This idea strongly echoes the "wisdom of crowds" effect, where the collective opinion of several independent participants turns out to be more accurate than the decision of a single expert (see Wikipedia).

Useful Links

AI and agentic development are evolving quite fast: as models change, some practices stop working as well, and new approaches and tools appear. To stay up to date, I recommend occasionally checking the following resources:

  • MCP Server Catalog - up-to-date list of MCP servers and new integrations.
  • Model Comparison Arena - allows comparing different models on the same prompts.
  • OpenRouter Ranking - shows popularity of models for programming tasks.
  • Benchmark Comparisons - aggregated results of various tests and leaderboards.
    I recommend treating benchmarks with extreme skepticism. They rarely reflect real efficiency in specific tasks and depend heavily on context, tooling, and usage scenarios. Use them for general orientation rather than as a source of absolute truth.
  • r/PromptEngineering - subreddit with practices, discussions, and real cases on prompt engineering.