Prompt engineering sounds more sophisticated than it is. In practice, it is writing instructions for someone who is extremely capable but interprets everything literally, sometimes too literally, and other times not literally enough.
I have shipped prompts that worked beautifully in testing and failed in embarrassing ways in production. Here are the ones I remember most clearly, what went wrong, and what I changed.
The summarization prompt that got too creative
The prompt:
“Summarize the following customer feedback in 3 bullet points.”
What I expected: Three clean bullet points with the key themes.
What happened in production: Sometimes three bullets. Sometimes five. Sometimes a paragraph followed by bullets. Sometimes a numbered list. Sometimes bullets formatted with - and sometimes with •. Occasionally, an entire essay with a heading that said “Summary.”
The model interpreted “3 bullet points” as a suggestion rather than a constraint. In testing with my own inputs it was consistent. With a variety of customer feedback lengths and styles, it was not.
The fix:
“Summarize the following customer feedback. Return exactly 3 bullet points. Each bullet must start with -. Do not include any other text: no headings, no intro sentence, no conclusion. If there are fewer than 3 distinct points, combine or generalize.

Example output:
- Customers appreciate fast shipping but want better packaging
- Multiple complaints about the checkout flow on mobile
- Feature requests mostly around notification preferences”
Adding a concrete example and explicit formatting constraints made the output consistent. The model had a template to match instead of a vague instruction to interpret.
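Even with explicit constraints, I treat the format as something to verify rather than trust. A minimal sketch of the guard I run before displaying a summary (the function name is my own, not from any library):

```javascript
// Check that the model's summary matches the contract:
// exactly 3 lines, each a "- " bullet, nothing else.
// If this fails, retry the call or fall back instead of
// showing malformed output to the user.
function isValidSummary(output) {
  const lines = output.trim().split("\n");
  return lines.length === 3 && lines.every((line) => line.startsWith("- "));
}
```

The check is deliberately strict: a heading, an intro sentence, or a fourth bullet all fail it, which is exactly the drift that showed up in production.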
The classification prompt that kept saying “other”
The prompt:
“Classify this support ticket into one of these categories: billing, technical, account, or other.”
What happened: The model classified almost everything as “other.” It was extremely conservative. Even clear billing questions ended up as “other” because they could also be account-related.
The fix: I needed to define the categories.
“Classify this support ticket into exactly one of these categories:
- billing: questions about charges, invoices, subscriptions, refunds, or payment methods
- technical: bug reports, errors, app not working, feature not functioning as expected
- account: login issues, password reset, email changes, profile settings, account access
- other: anything that does not clearly fit the above
Return only the category name, lowercase. No explanation.”
Once I told the model what each category actually meant, classification got dramatically better. “Other” dropped from 40% of tickets to under 5%.
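The prompt asks for only the lowercase category name, but I still normalize the response in code, because the model occasionally capitalizes or adds a word. A sketch (names are mine, not from the source):

```javascript
// The allowed labels, mirroring the prompt's category list.
const CATEGORIES = ["billing", "technical", "account", "other"];

// Normalize the raw model response: trim whitespace, lowercase,
// and map anything outside the known set back to "other".
function normalizeCategory(raw) {
  const cleaned = raw.trim().toLowerCase();
  return CATEGORIES.includes(cleaned) ? cleaned : "other";
}
```

Falling back to "other" on an unrecognized response keeps a single weird completion from breaking the pipeline.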
The tone prompt that made the chatbot passive-aggressive
The prompt:
“You are a helpful customer support agent. Always be professional and empathetic.”
What happened: The model was professional. But under pressure — when users were frustrated or rude — it became strangely formal in a way that read as passive-aggressive.
User: “This is completely broken and I’ve been waiting for 3 days!”
Bot: “I understand your frustration with the situation you have described. I will do my best to assist you with this matter.”
That response is technically empathetic. It also sounds like a form letter from a corporation that does not care.
The fix: I added specific guidance for difficult conversations.
“When a user is frustrated or angry, acknowledge their specific situation directly. Do not use generic phrases like ‘I understand your frustration.’ Instead, reference what they actually described — the wait time, the specific feature that broke, the impact on their work. Then focus immediately on what you can do to help. Never be defensive about the product.”
Specific instructions beat vague values. “Be empathetic” is too abstract. “Acknowledge their specific situation and reference the actual details they mentioned” is something the model can do reliably.
The JSON prompt that put JSON inside a code block
The prompt:
“Extract the following fields from the email and return as JSON: name, email, subject, urgency.”
What happened: The model returned JSON. Inside a markdown code block. Every time.
```json
{
  "name": "John Smith",
  "email": "[email protected]",
  "subject": "Can't log in",
  "urgency": "high"
}
```
When I tried to JSON.parse() the response, it failed because of the code block markers.
The fix: Two options. Either tell the model explicitly:
“Return only the raw JSON object. Do not include markdown code block syntax, backticks, or any other text. The response should start with { and end with }.”
Or handle it in code — strip the code block markers before parsing. I do both now: tell the model not to wrap it, and strip it defensively in case it does anyway.
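The defensive half of that fix is a few lines. A sketch of the wrapper I use (the function name is mine; the regexes assume the only wrapping the model adds is a leading ```` ```json ```` or ```` ``` ```` fence and a trailing one):

```javascript
// Strip a markdown code fence, if present, then parse.
// Handles both "```json ... ```" and bare "``` ... ```",
// and passes unwrapped JSON through untouched.
function parseModelJson(response) {
  const stripped = response
    .trim()
    .replace(/^```(?:json)?\s*/i, "") // leading fence, with or without "json"
    .replace(/```\s*$/, "");          // trailing fence
  return JSON.parse(stripped);
}
```

If the model follows the prompt, the strip is a no-op; if it wraps the JSON anyway, the parse still succeeds.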
The instruction prompt that Claude followed too literally
The prompt:
“Answer the user’s question. If you do not know the answer, say you don’t know.”
This sounds reasonable. Here is what happened.
User: “What’s the capital of Australia?”
Bot: “I don’t know.”
The model knows the answer is Canberra. But the instruction primed it to hedge: “if you do not know” reads to the model as “if you are not completely certain,” and it applied that standard even to a well-known fact, choosing “I don’t know” because it was playing it safe.
The fix: Be precise about what “don’t know” means in your context.
“Answer the user’s question using your general knowledge. Only say you don’t know if the question requires real-time information (like current prices, live scores, today’s news) or information specific to this user’s account that you don’t have access to. For general knowledge questions, answer confidently.”
When you tell the model “say you don’t know if you’re not sure,” it interprets “not sure” very broadly. Define the threshold explicitly.
The meta-lesson
Every one of these failures had the same cause: I was writing prompts the way I would explain something to a person, and the model is not a person.
A person would hear “be empathetic” and understand it from lived experience. A model needs to know what empathy looks like in your specific context.
A person would figure out that “3 bullet points” means exactly 3 bullet points. A model interprets ambiguity in whatever direction seems most reasonable, and reasonable varies.
The prompts that work reliably have three things in common:
- Explicit constraints, not suggestions (“return exactly 3”, not “try to be brief”)
- Concrete examples of good output
- Explicit instructions for edge cases and failure modes
Write prompts like you are writing a spec, not a request. The more precisely you describe what you want, the more reliably you get it.
Related Reading
AI Mistakes When Building Apps (And How to Fix Them)
The mistakes developers make when integrating AI into their apps — not the obvious ones, but the ones that only show up after you've shipped and users are hitting them.
Designing AI-Native Features: What to Build vs What to Prompt
A practical decision framework for knowing which app features should use AI, which should stay in code, and how to avoid the most common trap in AI product design.
Prompt Engineering Is Dead. Long Live System Prompts.
The 2023 obsession with magic prompt tricks is over. What actually works in 2026: clear system prompts, examples over descriptions, explicit constraints, and evals.