I built a GitHub issue creator skill. Called it directly with mock inputs — it worked perfectly every time. Deployed it to the agent, gave it a natural language request. The model kept passing null for the repo field, which my code didn’t handle.
The bug wasn’t in the skill code. It was in the description field — the model didn’t understand it needed to provide a repo. I’d never tested how the model would interpret the tool definition.
That distinction matters: testing the function is not the same as testing the skill.
Three levels of testing
Agent skills have three distinct layers, and bugs can exist independently at each one.
Level 1 — The function
Does get_weather({ city: "Mumbai" }) return the right data?
Level 2 — The tool definition
Does the model understand when to call get_weather and what to pass?
Level 3 — The agent loop
Does the full dispatch cycle work end-to-end without spending tokens every time?
Most developers only test Level 1 and assume the rest works. That’s how you end up debugging in production.
Level 1 — Unit testing the function
This is standard — just test the JavaScript function directly. No AI involved.
// weather.js — the tool function
export async function get_weather({ city }) {
if (!city) return { error: "City is required" };
try {
const geo = await fetch(
`https://geocoding-api.open-meteo.com/v1/search?name=${encodeURIComponent(city)}&count=1`
).then(r => r.json());
if (!geo.results?.length) return { error: `City not found: ${city}` };
const { latitude, longitude, name, country } = geo.results[0];
const weather = await fetch(
`https://api.open-meteo.com/v1/forecast?latitude=${latitude}&longitude=${longitude}&current_weather=true`
).then(r => r.json());
return {
city: `${name}, ${country}`,
temperature: `${weather.current_weather.temperature}°C`,
condition: "Clear" // simplified: the API actually returns a numeric weathercode
};
} catch (err) {
return { error: err.message };
}
}
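One caveat about the listing above: the `condition` field is hardcoded. Open-Meteo's `current_weather` object includes a numeric `weathercode` (WMO codes). Here is a minimal mapping sketch; the ranges follow Open-Meteo's published code table, but verify them against the current docs before relying on them:

```javascript
// Sketch: map Open-Meteo's numeric WMO weathercode to a human-readable label.
// Code ranges are from Open-Meteo's documented table; double-check the docs.
function describeWeatherCode(code) {
  if (code === 0) return "Clear";
  if (code <= 3) return "Partly cloudy";
  if (code === 45 || code === 48) return "Fog";
  if (code >= 51 && code <= 67) return "Rain or drizzle";
  if (code >= 71 && code <= 77) return "Snow";
  if (code >= 80 && code <= 82) return "Rain showers";
  if (code >= 95) return "Thunderstorm";
  return `Unknown (code ${code})`;
}

console.log(describeWeatherCode(0));  // "Clear"
console.log(describeWeatherCode(61)); // "Rain or drizzle"
```

Inside `get_weather`, you would replace the hardcoded string with `describeWeatherCode(weather.current_weather.weathercode)`.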
// weather.test.js — unit tests with Node's built-in assert
import assert from "node:assert/strict";
import { get_weather } from "./weather.js";
// Test 1: valid city
const result = await get_weather({ city: "Mumbai" });
assert.ok(!result.error, `Should not error: ${result.error}`);
assert.ok(result.temperature, "Should return temperature");
assert.ok(result.city.includes("Mumbai"), "Should return city name");
console.log("✅ Valid city:", result);
// Test 2: missing input
const missing = await get_weather({});
assert.equal(missing.error, "City is required");
console.log("✅ Missing input handled:", missing);
// Test 3: city that doesn't exist
const notFound = await get_weather({ city: "Atlantis12345" });
assert.ok(notFound.error, "Should return error for unknown city");
console.log("✅ Unknown city handled:", notFound);
Run with:
node weather.test.js
No test framework needed. Add vitest if you want watch mode and better output:
npm install -D vitest
npx vitest run weather.test.js
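One wrinkle: the tests above hit the real Open-Meteo API, so they can fail on network hiccups or API changes. For deterministic unit tests you can stub `globalThis.fetch`. A sketch, where the fixture shapes mirror the two calls in `weather.js` but the values are made up (run it as an ES module, since it uses top-level await):

```javascript
// Sketch: stub globalThis.fetch so unit tests never touch the network.
// The URLs and response shapes mirror the Open-Meteo calls in weather.js,
// but these are hardcoded test fixtures, not real API responses.
const realFetch = globalThis.fetch;
globalThis.fetch = async (url) => ({
  json: async () =>
    String(url).includes("geocoding")
      ? { results: [{ latitude: 19.07, longitude: 72.87, name: "Mumbai", country: "India" }] }
      : { current_weather: { temperature: 31 } }
});

// Any code that calls fetch now gets the fixtures instead:
const geo = await fetch("https://geocoding-api.open-meteo.com/v1/search?name=Mumbai")
  .then(r => r.json());
console.log(geo.results[0].name); // "Mumbai" — from the fixture, not the network

globalThis.fetch = realFetch; // restore for any tests that need the real thing
```

With this stub in place before the `import` of your test cases, `get_weather` runs entirely offline.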
Level 2 — Testing the tool definition
This is the level most people skip. You’re testing whether the model understands your tool’s description and input_schema well enough to call it correctly.
The test: send a carefully controlled prompt to the real API and assert what the model decided to call and with what arguments.
// test-definition.js
import Anthropic from "@anthropic-ai/sdk";
import assert from "node:assert/strict";
const client = new Anthropic();
const weatherTool = {
name: "get_weather",
description:
"Get current weather conditions for a city. " +
"Use this when the user asks about weather, temperature, rain, " +
"or what to wear outdoors.",
input_schema: {
type: "object",
properties: {
city: { type: "string", description: "The city name, e.g. 'Mumbai'" }
},
required: ["city"]
}
};
async function assertToolCall(prompt, expectedTool, expectedInputKeys) {
const response = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 256,
tools: [weatherTool],
messages: [{ role: "user", content: prompt }]
});
const toolCall = response.content.find(b => b.type === "tool_use");
// Assert the model called the right tool
assert.ok(toolCall, `Expected a tool call for: "${prompt}"`);
assert.equal(toolCall.name, expectedTool, `Wrong tool called`);
// Assert required input keys are present and not null
for (const key of expectedInputKeys) {
assert.ok(toolCall.input[key] != null, `Missing input: ${key} for prompt: "${prompt}"`);
}
console.log(`✅ "${prompt}"`);
console.log(` → ${toolCall.name}(${JSON.stringify(toolCall.input)})`);
return toolCall;
}
// Test cases
await assertToolCall("What's the weather in Delhi?", "get_weather", ["city"]);
await assertToolCall("Will it rain in Mumbai today?", "get_weather", ["city"]);
await assertToolCall("Should I bring a jacket in Chennai?", "get_weather", ["city"]);
// Negative test: the model should NOT call weather for this
const noToolResponse = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 256,
tools: [weatherTool],
messages: [{ role: "user", content: "What is the capital of France?" }]
});
const noToolCall = noToolResponse.content.find(b => b.type === "tool_use");
assert.ok(!noToolCall, "Should NOT call weather for a geography question");
console.log("✅ No false trigger for unrelated questions");
This costs a few API calls but tells you exactly how your description reads from the model’s perspective. If the model passes null for city on a particular phrasing, you know your description needs to be more specific.
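You can tighten the Level 2 assertions by checking the model’s arguments against the `input_schema` itself. Here is a minimal validator sketch — a hypothetical helper, not a full JSON Schema implementation; it only checks required keys and primitive `typeof` types:

```javascript
// Sketch: validate a tool_use input against its input_schema.
// Checks only required keys and primitive typeof types, not full JSON Schema
// (no "integer", arrays, or nested objects).
function validateToolInput(schema, input) {
  const errors = [];
  for (const key of schema.required ?? []) {
    if (input[key] == null) errors.push(`missing required field: ${key}`);
  }
  for (const [key, value] of Object.entries(input)) {
    if (value == null) continue; // null-ness is covered by the required check
    const expected = schema.properties?.[key]?.type;
    if (expected && typeof value !== expected) {
      errors.push(`${key}: expected ${expected}, got ${typeof value}`);
    }
  }
  return errors;
}

const schema = {
  type: "object",
  properties: { city: { type: "string" } },
  required: ["city"]
};

console.log(validateToolInput(schema, { city: "Delhi" })); // []
console.log(validateToolInput(schema, { city: null }));    // ["missing required field: city"]
console.log(validateToolInput(schema, { city: 42 }));      // ["city: expected string, got number"]
```

In `assertToolCall`, you could then add `assert.deepEqual(validateToolInput(weatherTool.input_schema, toolCall.input), [])` to catch type mismatches as well as missing keys.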
Level 3 — Mocking the AI to test the dispatch loop
This level tests the full dispatch loop end-to-end without spending tokens: intercept the API client and return synthetic responses.
// mock-loop.test.js
import assert from "node:assert/strict";
import { get_weather } from "./weather.js";
// A synthetic tool_use response — exactly what Claude returns when it wants to call a tool
function makeSyntheticToolUse(toolName, input) {
return {
id: "msg_mock_001",
type: "message",
role: "assistant",
stop_reason: "tool_use",
content: [
{ type: "text", text: "Let me check the weather for you." },
{
type: "tool_use",
id: "toolu_mock_001",
name: toolName,
input
}
]
};
}
// A synthetic final response — what Claude returns after receiving the tool result
function makeSyntheticFinalResponse(text) {
return {
id: "msg_mock_002",
type: "message",
role: "assistant",
stop_reason: "end_turn",
content: [{ type: "text", text }]
};
}
// Minimal mock client
function makeMockClient(toolName, toolInput, finalText) {
let callCount = 0;
return {
messages: {
create: async () => {
callCount++;
if (callCount === 1) return makeSyntheticToolUse(toolName, toolInput);
return makeSyntheticFinalResponse(finalText);
}
}
};
}
// The agent loop (same logic as production)
async function runLoop(client, tools, toolFunctions, userMessage) {
const messages = [{ role: "user", content: userMessage }];
let response = await client.messages.create({ model: "claude-sonnet-4-6", max_tokens: 1024, tools, messages });
while (response.stop_reason === "tool_use") {
const toolBlock = response.content.find(b => b.type === "tool_use");
const fn = toolFunctions[toolBlock.name];
const result = fn ? await fn(toolBlock.input) : { error: "Unknown tool" };
messages.push(
{ role: "assistant", content: response.content },
{ role: "user", content: [{ type: "tool_result", tool_use_id: toolBlock.id, content: JSON.stringify(result) }] }
);
response = await client.messages.create({ model: "claude-sonnet-4-6", max_tokens: 1024, tools, messages });
}
return response.content[0].text;
}
// Test: verify the loop calls get_weather correctly and uses the result
const mockClient = makeMockClient(
"get_weather",
{ city: "Mumbai" },
"Mumbai is currently 31°C and partly cloudy."
);
const tools = [{
name: "get_weather",
input_schema: { type: "object", properties: { city: { type: "string" } }, required: ["city"] }
}];
const result = await runLoop(mockClient, tools, { get_weather }, "What's the weather?");
assert.ok(result.includes("31°C") || result.includes("Mumbai"), "Loop should use tool result in response");
console.log("✅ Agent loop works correctly:", result);
No API calls, no tokens spent. You can run this in CI on every commit.
Debugging bad tool descriptions
When the model calls your tool with wrong or missing arguments, the issue is almost always the description. Here’s how to diagnose it:
Step 1 — Log what the model actually sent
Add this immediately before executing the tool:
console.log("[Tool call]", toolBlock.name, JSON.stringify(toolBlock.input, null, 2));
Run your agent with a real prompt. See exactly what the model passed. If you see { city: null } when you expected a city name, your description didn’t tell the model where to find it.
Step 2 — Compare your description to the test prompt
If the prompt says “What will it be like outside in Ahmedabad tomorrow?” and your description only mentions “current weather” — the model might not call the tool because “tomorrow” doesn’t match “current.”
Update the description to be explicit:
"Get current or forecasted weather for a city. Use this when the user asks about
weather, temperature, rain, sunshine, what to wear, or whether to bring an umbrella
— for any time frame (today, tomorrow, this week)."
Step 3 — Run the Level 2 definition test with the exact failing prompt
Add the failing prompt as a new test case. If it fails, you have a reproducible bug you can fix iteratively.
Common debug scenarios
| Symptom | Likely cause | Fix |
|---|---|---|
| Model never calls the tool | Description too narrow or vague | Add more trigger phrases to description |
| Model calls the wrong tool | Descriptions overlap | Add “Do NOT use this for X” to each |
| Model passes null for a required field | Description doesn’t explain where to get it | Specify: “Use the city the user mentioned” |
| Model passes wrong type (e.g. number instead of string) | Schema description unclear | Add "type": "string" and an example in description |
| Model calls tool in an infinite loop | Tool result is empty or ambiguous | Return more specific success/error messages |
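The last row is worth guarding against in code as well as in tool results. Here is a sketch of an iteration cap for the dispatch loop, exercised with a mock client that never stops asking for tools; `createSync` is a synchronous stand-in for the real async `client.messages.create`, used only to keep the example short:

```javascript
// Sketch: a dispatch loop with an iteration cap, so an ambiguous tool result
// (or a broken mock) can't spin forever. createSync stands in for the real
// async client.messages.create to keep this example synchronous.
function makeStuckClient() {
  return {
    createSync: () => ({
      stop_reason: "tool_use",
      content: [{ type: "tool_use", id: "t1", name: "get_weather", input: { city: "Mumbai" } }]
    })
  };
}

function runLoopWithCap(client, maxIterations = 5) {
  let response = client.createSync();
  let iterations = 0;
  while (response.stop_reason === "tool_use") {
    if (++iterations > maxIterations) {
      return { error: `Aborted after ${maxIterations} tool iterations` };
    }
    // In the real loop you would execute the tool and push messages here.
    response = client.createSync();
  }
  return { text: response.content[0].text };
}

const out = runLoopWithCap(makeStuckClient(), 3);
console.log(out.error); // "Aborted after 3 tool iterations"
```

Porting the cap into the async `runLoop` from Level 3 is a two-line change: the counter and the early return inside the `while`.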
The testSkill harness
Here’s a reusable harness you can drop into any project:
// test-harness.js
export async function testSkill(fn, testCases) {
let passed = 0;
let failed = 0;
for (const { label, input, assert: check } of testCases) {
try {
const result = await fn(input);
check(result);
console.log(` ✅ ${label}`);
passed++;
} catch (err) {
console.log(` ❌ ${label}: ${err.message}`);
failed++;
}
}
console.log(`\n${passed} passed, ${failed} failed`);
if (failed > 0) process.exit(1);
}
Use it:
import assert from "node:assert/strict";
import { testSkill } from "./test-harness.js";
import { get_weather } from "./weather.js";
await testSkill(get_weather, [
{
label: "returns temperature for valid city",
input: { city: "London" },
assert: r => assert.ok(r.temperature, "Missing temperature")
},
{
label: "handles missing city",
input: {},
assert: r => assert.ok(r.error, "Should have error")
},
{
label: "handles nonexistent city",
input: { city: "Nonexistentville99" },
assert: r => assert.ok(r.error, "Should have error")
}
]);
What’s next
Handle errors gracefully in your skills: Handling Errors in Agent Skills: Retries and Fallbacks
Give your agent persistent memory: Agent Skills with Memory: Persisting State Between Chats
Back to fundamentals: What Are Agent Skills? AI Tools Explained Simply
Related Reading
Handling Errors in Agent Skills: Retries and Fallbacks
What happens when a tool fails? Handle errors in agent skills — timeouts, bad API responses, retries, and graceful fallbacks with real Node.js code.
Agent Skills with Google Gemini: Function Calling Guide
Complete guide to Gemini function calling — define tools, handle function_call responses, return results, and compare syntax with Claude and OpenAI. Node.js.
Vercel AI SDK Tools: One API for Claude and OpenAI Skills
Vercel AI SDK's unified tool interface works with Claude, OpenAI, and Gemini. Write your skill once and switch AI providers without rewriting the agent loop.