I built a GitHub issue creator skill. Called it directly with mock inputs — it worked perfectly every time. Deployed it to the agent, gave it a natural language request. The model kept passing null for the repo field, which my code didn’t handle.
The bug wasn’t in the skill code. It was in the description field — the model didn’t understand it needed to provide a repo. I’d never tested how the model would interpret the tool definition.
That distinction matters: testing the function is not the same as testing the skill.
Three levels of testing
Agent skills have three distinct layers, and bugs can exist independently at each one.
Level 1 — The function
Does get_weather({ city: "Mumbai" }) return the right data?
Level 2 — The tool definition
Does the model understand when to call get_weather and what to pass?
Level 3 — The agent loop
Does the full dispatch cycle work end-to-end without spending tokens every time?
Most developers only test Level 1 and assume the rest works. That’s how you end up debugging in production.
Level 1 — Unit testing the function
This is standard — just test the JavaScript function directly. No AI involved.
// weather.js — the tool function
export async function get_weather({ city }) {
if (!city) return { error: "City is required" };
try {
const geo = await fetch(
`https://geocoding-api.open-meteo.com/v1/search?name=${encodeURIComponent(city)}&count=1`
).then(r => r.json());
if (!geo.results?.length) return { error: `City not found: ${city}` };
const { latitude, longitude, name, country } = geo.results[0];
const weather = await fetch(
`https://api.open-meteo.com/v1/forecast?latitude=${latitude}&longitude=${longitude}&current_weather=true`
).then(r => r.json());
return {
city: `${name}, ${country}`,
temperature: `${weather.current_weather.temperature}°C`,
condition: "Clear" // simplified: the API actually returns a numeric weathercode
};
} catch (err) {
return { error: err.message };
}
}
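One caveat about the listing above: the `condition` field is hardcoded. Open-Meteo's `current_weather` object includes a numeric `weathercode` (WMO codes). Here is a minimal mapping sketch; the ranges follow Open-Meteo's published code table, but verify them against the current docs before relying on them:

```javascript
// Sketch: map Open-Meteo's numeric WMO weathercode to a human-readable label.
// Code ranges are from Open-Meteo's documented table; double-check the docs.
function describeWeatherCode(code) {
  if (code === 0) return "Clear";
  if (code <= 3) return "Partly cloudy";
  if (code === 45 || code === 48) return "Fog";
  if (code >= 51 && code <= 67) return "Rain or drizzle";
  if (code >= 71 && code <= 77) return "Snow";
  if (code >= 80 && code <= 82) return "Rain showers";
  if (code >= 95) return "Thunderstorm";
  return `Unknown (code ${code})`;
}

console.log(describeWeatherCode(0));  // "Clear"
console.log(describeWeatherCode(61)); // "Rain or drizzle"
```

Inside `get_weather`, you would replace the hardcoded string with `describeWeatherCode(weather.current_weather.weathercode)`.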
// weather.test.js — unit tests with Node's built-in assert
import assert from "node:assert/strict";
import { get_weather } from "./weather.js";
// Test 1: valid city
const result = await get_weather({ city: "Mumbai" });
assert.ok(!result.error, `Should not error: ${result.error}`);
assert.ok(result.temperature, "Should return temperature");
assert.ok(result.city.includes("Mumbai"), "Should return city name");
console.log("✅ Valid city:", result);
// Test 2: missing input
const missing = await get_weather({});
assert.equal(missing.error, "City is required");
console.log("✅ Missing input handled:", missing);
// Test 3: city that doesn't exist
const notFound = await get_weather({ city: "Atlantis12345" });
assert.ok(notFound.error, "Should return error for unknown city");
console.log("✅ Unknown city handled:", notFound);
Run with:
node weather.test.js
No test framework needed. Add vitest if you want watch mode and better output:
npm install -D vitest
npx vitest run weather.test.js
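One wrinkle: the tests above hit the real Open-Meteo API, so they can fail on network hiccups or API changes. For deterministic unit tests you can stub `globalThis.fetch`. A sketch, where the fixture shapes mirror the two calls in `weather.js` but the values are made up (run it as an ES module, since it uses top-level await):

```javascript
// Sketch: stub globalThis.fetch so unit tests never touch the network.
// The URLs and response shapes mirror the Open-Meteo calls in weather.js,
// but these are hardcoded test fixtures, not real API responses.
const realFetch = globalThis.fetch;
globalThis.fetch = async (url) => ({
  json: async () =>
    String(url).includes("geocoding")
      ? { results: [{ latitude: 19.07, longitude: 72.87, name: "Mumbai", country: "India" }] }
      : { current_weather: { temperature: 31 } }
});

// Any code that calls fetch now gets the fixtures instead:
const geo = await fetch("https://geocoding-api.open-meteo.com/v1/search?name=Mumbai")
  .then(r => r.json());
console.log(geo.results[0].name); // "Mumbai" — from the fixture, not the network

globalThis.fetch = realFetch; // restore for any tests that need the real thing
```

With this stub in place before the `import` of your test cases, `get_weather` runs entirely offline.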
Level 2 — Testing the tool definition
This is the level most people skip. You’re testing whether the model understands your tool’s description and input_schema well enough to call it correctly.
The test: send a carefully controlled prompt to the real API and assert what the model decided to call and with what arguments.
// test-definition.js
import Anthropic from "@anthropic-ai/sdk";
import assert from "node:assert/strict";
const client = new Anthropic();
const weatherTool = {
name: "get_weather",
description:
"Get current weather conditions for a city. " +
"Use this when the user asks about weather, temperature, rain, " +
"or what to wear outdoors.",
input_schema: {
type: "object",
properties: {
city: { type: "string", description: "The city name, e.g. 'Mumbai'" }
},
required: ["city"]
}
};
async function assertToolCall(prompt, expectedTool, expectedInputKeys) {
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 256,
tools: [weatherTool],
messages: [{ role: "user", content: prompt }]
});
const toolCall = response.content.find(b => b.type === "tool_use");
// Assert the model called the right tool
assert.ok(toolCall, `Expected a tool call for: "${prompt}"`);
assert.equal(toolCall.name, expectedTool, `Wrong tool called`);
// Assert required input keys are present and not null
for (const key of expectedInputKeys) {
assert.ok(toolCall.input[key] != null, `Missing input: ${key} for prompt: "${prompt}"`);
}
console.log(`✅ "${prompt}"`);
console.log(` → ${toolCall.name}(${JSON.stringify(toolCall.input)})`);
return toolCall;
}
// Test cases
await assertToolCall("What's the weather in Delhi?", "get_weather", ["city"]);
await assertToolCall("Will it rain in Mumbai today?", "get_weather", ["city"]);
await assertToolCall("Should I bring a jacket in Chennai?", "get_weather", ["city"]);
// Negative test: the model should NOT call weather for this
const noToolResponse = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 256,
tools: [weatherTool],
messages: [{ role: "user", content: "What is the capital of France?" }]
});
const noToolCall = noToolResponse.content.find(b => b.type === "tool_use");
assert.ok(!noToolCall, "Should NOT call weather for a geography question");
console.log("✅ No false trigger for unrelated questions");
This costs a few API calls but tells you exactly how your description reads from the model’s perspective. If the model passes null for city on a particular phrasing, you know your description needs to be more specific.
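You can tighten the Level 2 assertions by checking the model’s arguments against the `input_schema` itself. Here is a minimal validator sketch — a hypothetical helper, not a full JSON Schema implementation; it only checks required keys and primitive `typeof` types:

```javascript
// Sketch: validate a tool_use input against its input_schema.
// Checks only required keys and primitive typeof types, not full JSON Schema
// (no "integer", arrays, or nested objects).
function validateToolInput(schema, input) {
  const errors = [];
  for (const key of schema.required ?? []) {
    if (input[key] == null) errors.push(`missing required field: ${key}`);
  }
  for (const [key, value] of Object.entries(input)) {
    if (value == null) continue; // null-ness is covered by the required check
    const expected = schema.properties?.[key]?.type;
    if (expected && typeof value !== expected) {
      errors.push(`${key}: expected ${expected}, got ${typeof value}`);
    }
  }
  return errors;
}

const schema = {
  type: "object",
  properties: { city: { type: "string" } },
  required: ["city"]
};

console.log(validateToolInput(schema, { city: "Delhi" })); // []
console.log(validateToolInput(schema, { city: null }));    // ["missing required field: city"]
console.log(validateToolInput(schema, { city: 42 }));      // ["city: expected string, got number"]
```

In `assertToolCall`, you could then add `assert.deepEqual(validateToolInput(weatherTool.input_schema, toolCall.input), [])` to catch type mismatches as well as missing keys.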
Level 3 — Mocking the AI to test the dispatch loop
This level tests the full dispatch loop end-to-end without spending tokens: intercept the API client and return synthetic responses.
// mock-loop.test.js
import assert from "node:assert/strict";
import { get_weather } from "./weather.js";
// A synthetic tool_use response — exactly what Claude returns when it wants to call a tool
function makeSyntheticToolUse(toolName, input) {
return {
id: "msg_mock_001",
type: "message",
role: "assistant",
stop_reason: "tool_use",
content: [
{ type: "text", text: "Let me check the weather for you." },
{
type: "tool_use",
id: "toolu_mock_001",
name: toolName,
input
}
]
};
}
// A synthetic final response — what Claude returns after receiving the tool result
function makeSyntheticFinalResponse(text) {
return {
id: "msg_mock_002",
type: "message",
role: "assistant",
stop_reason: "end_turn",
content: [{ type: "text", text }]
};
}
// Minimal mock client
function makeMockClient(toolName, toolInput, finalText) {
let callCount = 0;
return {
messages: {
create: async () => {
callCount++;
if (callCount === 1) return makeSyntheticToolUse(toolName, toolInput);
return makeSyntheticFinalResponse(finalText);
}
}
};
}
// The agent loop (same logic as production)
async function runLoop(client, tools, toolFunctions, userMessage) {
const messages = [{ role: "user", content: userMessage }];
let response = await client.messages.create({ model: "claude-sonnet-4-6", max_tokens: 1024, tools, messages });
while (response.stop_reason === "tool_use") {
const toolBlock = response.content.find(b => b.type === "tool_use");
const fn = toolFunctions[toolBlock.name];
const result = fn ? await fn(toolBlock.input) : { error: "Unknown tool" };
messages.push(
{ role: "assistant", content: response.content },
{ role: "user", content: [{ type: "tool_result", tool_use_id: toolBlock.id, content: JSON.stringify(result) }] }
);
response = await client.messages.create({ model: "claude-sonnet-4-6", max_tokens: 1024, tools, messages });
}
return response.content[0].text;
}
// Test: verify the loop calls get_weather correctly and uses the result
const mockClient = makeMockClient(
"get_weather",
{ city: "Mumbai" },
"Mumbai is currently 31°C and partly cloudy."
);
const tools = [{
name: "get_weather",
input_schema: { type: "object", properties: { city: { type: "string" } }, required: ["city"] }
}];
const result = await runLoop(mockClient, tools, { get_weather }, "What's the weather?");
assert.ok(result.includes("31°C") || result.includes("Mumbai"), "Loop should use tool result in response");
console.log("✅ Agent loop works correctly:", result);
No API calls, no tokens spent. You can run this in CI on every commit.
Debugging bad tool descriptions
When the model calls your tool with wrong or missing arguments, the issue is almost always the description. Here’s how to diagnose it:
Step 1 — Log what the model actually sent
Add this immediately before executing the tool:
console.log("[Tool call]", toolBlock.name, JSON.stringify(toolBlock.input, null, 2));
Run your agent with a real prompt. See exactly what the model passed. If you see { city: null } when you expected a city name, your description didn’t tell the model where to find it.
Step 2 — Compare your description to the test prompt
If the prompt says “What will it be like outside in Ahmedabad tomorrow?” and your description only mentions “current weather” — the model might not call the tool because “tomorrow” doesn’t match “current.”
Update the description to be explicit:
"Get current or forecasted weather for a city. Use this when the user asks about
weather, temperature, rain, sunshine, what to wear, or whether to bring an umbrella
— for any time frame (today, tomorrow, this week)."
Step 3 — Run the Level 2 definition test with the exact failing prompt
Add the failing prompt as a new test case. If it fails, you have a reproducible bug you can fix iteratively.
Common debug scenarios
| Symptom | Likely cause | Fix |
|---|---|---|
| Model never calls the tool | Description too narrow or vague | Add more trigger phrases to description |
| Model calls the wrong tool | Descriptions overlap | Add “Do NOT use this for X” to each |
| Model passes null for a required field | Description doesn’t explain where to get it | Specify: “Use the city the user mentioned” |
| Model passes wrong type (e.g. number instead of string) | Schema description unclear | Add "type": "string" and an example in description |
| Model calls tool in an infinite loop | Tool result is empty or ambiguous | Return more specific success/error messages |
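The last row is worth guarding against in code as well as in tool results. Here is a sketch of an iteration cap for the dispatch loop, exercised with a mock client that never stops asking for tools; `createSync` is a synchronous stand-in for the real async `client.messages.create`, used only to keep the example short:

```javascript
// Sketch: a dispatch loop with an iteration cap, so an ambiguous tool result
// (or a broken mock) can't spin forever. createSync stands in for the real
// async client.messages.create to keep this example synchronous.
function makeStuckClient() {
  return {
    createSync: () => ({
      stop_reason: "tool_use",
      content: [{ type: "tool_use", id: "t1", name: "get_weather", input: { city: "Mumbai" } }]
    })
  };
}

function runLoopWithCap(client, maxIterations = 5) {
  let response = client.createSync();
  let iterations = 0;
  while (response.stop_reason === "tool_use") {
    if (++iterations > maxIterations) {
      return { error: `Aborted after ${maxIterations} tool iterations` };
    }
    // In the real loop you would execute the tool and push messages here.
    response = client.createSync();
  }
  return { text: response.content[0].text };
}

const out = runLoopWithCap(makeStuckClient(), 3);
console.log(out.error); // "Aborted after 3 tool iterations"
```

Porting the cap into the async `runLoop` from Level 3 is a two-line change: the counter and the early return inside the `while`.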
The testSkill harness
Here’s a reusable harness you can drop into any project:
// test-harness.js
export async function testSkill(fn, testCases) {
let passed = 0;
let failed = 0;
for (const { label, input, assert: check } of testCases) {
try {
const result = await fn(input);
check(result);
console.log(` ✅ ${label}`);
passed++;
} catch (err) {
console.log(` ❌ ${label}: ${err.message}`);
failed++;
}
}
console.log(`\n${passed} passed, ${failed} failed`);
if (failed > 0) process.exit(1);
}
Use it:
import assert from "node:assert/strict";
import { testSkill } from "./test-harness.js";
import { get_weather } from "./weather.js";
await testSkill(get_weather, [
{
label: "returns temperature for valid city",
input: { city: "London" },
assert: r => assert.ok(r.temperature, "Missing temperature")
},
{
label: "handles missing city",
input: {},
assert: r => assert.ok(r.error, "Should have error")
},
{
label: "handles nonexistent city",
input: { city: "Nonexistentville99" },
assert: r => assert.ok(r.error, "Should have error")
}
]);
What’s next
Handle errors gracefully in your skills: Handling Errors in Agent Skills: Retries and Fallbacks
Give your agent persistent memory: Agent Skills with Memory: Persisting State Between Chats
Back to fundamentals: What Are Agent Skills? AI Tools Explained Simply
Related Reading
Handling Errors in Agent Skills: Retries and Fallbacks
What happens when a tool fails? Handle errors in agent skills — timeouts, bad API responses, retries, and graceful fallbacks with real Node.js code.
Agent Skills with Google Gemini: Function Calling Guide
Complete guide to Gemini function calling — define tools, handle function_call responses, return results, and compare syntax with Claude and OpenAI. Node.js.
Vercel AI SDK Tools: One API for Claude and OpenAI Skills
Vercel AI SDK's unified tool interface works with Claude, OpenAI, and Gemini. Write your skill once and switch AI providers without rewriting the agent loop.