12 Red Flags When Hiring an AI Development Agency (2026)

Muneeb Ashraf, CEO
Zahra A., Technical Writer

AI projects don’t usually fail because the technology is hard. They fail because the wrong agency was hired in the first place.

Anyone with a landing page can call themselves an “AI development agency” right now. The market is flooded: Clutch, Upwork, and DesignRush list thousands of vendors, the vast majority of which have never shipped a production system that survived more than three months. The bad ones cost you $50,000 to $300,000 and 6–12 months. The good ones turn AI into measurable revenue.

This guide is the AI vendor checklist we wish every buyer had before their first sales call. It covers the top red flags to avoid when implementing AI in your organization: 12 specific warning signs that predict project failure, drawn from patterns observed across the AI services market in 2025–2026.

Whether you’re learning how to choose an AI development agency for the first time or vetting your tenth proposal, each red flag includes the disaster scenario it leads to, what good agencies do instead, and a test question to expose it on your first sales call. By the end, you’ll be able to vet any AI vendor in under 30 minutes.

THE AI AGENCY RED-FLAG MAP

12 Red Flags When Hiring an AI Development Agency

Four categories. Each one predicts a different kind of project failure. Three or more flags = walk away.

CATEGORY 1 · PROCESS & PRICING

Sets you up to be overcharged or stalled
1. “Contact Us for Pricing”
2. 6-Month “AI Strategy” First
3. Unrealistic Timeline Promises

CATEGORY 2 · TECHNICAL CAPABILITY

They’ve never shipped what they’re selling
4. Zero Production Deployments
5. One-LLM-Only Approach
6. No Code, No Engineers

CATEGORY 3 · TRUST & GOVERNANCE

Compliance and legal landmines waiting
7. Vague Deliverables, No Criteria
8. No Privacy / Compliance Plan
9. Black-Box, No Explainability

CATEGORY 4 · COMMUNICATION & SUPPORT

You’ll be on your own when it matters
10. “Trust Us, We’re Experts”
11. “We Do Everything”
12. No Post-Launch Support / SLA
3+ flags in a single vendor = ~90% project failure probability · Walk before you sign.
Sources: FTC Operation AI Comply · EU AI Act Articles 13 & 14 · industry pattern analysis 2024–2026
Quick Answer

The biggest red flags when hiring an AI development agency are no transparent pricing, multi-month "AI strategy" engagements before any code is written, zero production deployments, only using one LLM, vague deliverables with no acceptance criteria, no data privacy plan, and no post-launch support. Most checklists cover 10 warning signs to watch for; this guide expands to 12 by adding two compliance-focused flags – data privacy and explainability – that matter most for regulated industries. In our experience, three or more of these flags in a single vendor is a near-certain sign the project will overrun or fail.

Key Takeaways

12 red flags account for most AI project failures. Organized into four categories: process and pricing (RF #1–3), technical capability (RF #4–6), trust and governance (RF #7–9), and communication and support (RF #10–12).

Red flags multiply, they don’t add. One flag is data; three flags are a pattern; five or more is a near-certain failure. Watch especially for toxic combinations: the consultant trap, the build-and-ghost, the outsourcing shell game.

Test for production proof first. Demos lie; live URLs don’t. A real AI development agency can show you three live systems handling real users, with real metrics: uptime, daily volume, cost per call.

Compliance and explainability are the two most underrated criteria. The EU AI Act (Articles 13 and 14, with high-risk obligations now phasing in through December 2027), GDPR, and HIPAA all require auditable AI. Vendors who can’t explain their data flow are a regulatory liability waiting to surface.

30 minutes of vetting saves $50K–300K of pain. Use the four test questions, count your flags, and pick the agency that already ships, not the one that promises to.

The 12 Red Flags at a Glance

Here’s the complete AI vendor evaluation matrix. Bookmark this section, print it, or pull it up during your next sales call.

| # | Red Flag | What It Costs You | Test Question |
| --- | --- | --- | --- |
| 1 | No transparent pricing | Priced based on your funding, not the work | What's a typical cost for an AI agent like this? |
| 2 | 6-month "AI strategy" before code | $100K–300K in PowerPoints, zero working AI | When will I see working code I can test? |
| 3 | Unrealistic timeline promises | Hidden corners cut; production hardening costs 2× more later | What's a realistic timeline for this scope? |
| 4 | Zero production deployments | They learn production AI on your money | Can you show me a live AI agent handling real users? |
| 5 | One-LLM-only approach | 30–50% worse cost and performance | Which LLM, and what are the alternatives? |
| 6 | No code samples, no engineers in calls | No-code resold at custom-dev rates | Can an engineer join our next call? |
| 7 | Vague deliverables, no acceptance criteria | Legal disputes when delivery doesn't match expectations | What exactly will I get, with what acceptance criteria? |
| 8 | No data privacy or compliance plan | GDPR / EU AI Act exposure, breach disclosure costs | Walk me through where my data lives at every step. |
| 9 | Black-box outputs, no explainability | EU AI Act Articles 13 & 14 violation in regulated use | How will I know why the AI made any specific decision? |
| 10 | "Trust us, we're experts" attitude | They build the wrong thing, you pay twice | How often will I see progress and give input? |
| 11 | "We do everything" full-service claim | Junior subcontractors at agency rates (50% margin) | What do you NOT do? Who do you refer out? |
| 12 | No post-launch support or SLA | 3–5× emergency rates when API deprecates | What happens if something breaks in month 2? |
How To Use This Matrix

Circle each red flag as it appears during your evaluation calls, then total at the end.

0–1 Safe to hire
2–3 Proceed with caution
4–6 High risk
7+ Walk away

Detailed scoring rules and toxic-combination patterns are in the section below.
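If you track vendors in a spreadsheet or script, the scoring bands above are easy to encode. The snippet below is a minimal sketch of that tally; the function name `verdict` and the example set of observed flags are our own illustration, not part of any tool.

```python
def verdict(flag_count: int) -> str:
    """Map a red-flag count (0-12) to the scoring bands listed above."""
    if not 0 <= flag_count <= 12:
        raise ValueError("flag_count must be between 0 and 12")
    if flag_count <= 1:
        return "Safe to hire"
    if flag_count <= 3:
        return "Proceed with caution"
    if flag_count <= 6:
        return "High risk"
    return "Walk away"


# Hypothetical tally from one evaluation call: flags #1, #4 and #12 showed up.
observed_flags = {1, 4, 12}
print(verdict(len(observed_flags)))
```

Storing the flag numbers as a set rather than a bare count keeps the record of which flags appeared, which is useful when you compare vendors later.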

Why Vetting an AI Agency Is Different From Vetting a Regular Dev Shop

AI development carries risks that regular software doesn’t. Models hallucinate. APIs change without warning. A demo that works in a controlled environment can crash under 50 concurrent users in production. Costs that look like $800 a month in a sales pitch can balloon to $8,000 once real traffic hits. And unlike a broken website, a hallucinating AI agent can quietly cost you customers and credibility before anyone notices.
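To see how a pitch-deck estimate and a production bill can drift apart, here is a rough back-of-envelope sketch. Every number in it is a placeholder we made up for illustration (traffic, token counts, and per-token prices are assumptions, not any provider's actual pricing), so it will not reproduce the $800 and $8,000 figures exactly; substitute the figures from your vendor's proposal.

```python
def monthly_api_cost(calls_per_day: int, input_tokens: int, output_tokens: int,
                     usd_per_m_input: float, usd_per_m_output: float,
                     days: int = 30) -> float:
    """Estimate monthly LLM API spend in USD from per-call token counts."""
    per_call = (input_tokens * usd_per_m_input
                + output_tokens * usd_per_m_output) / 1_000_000
    return calls_per_day * days * per_call


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
pitch = monthly_api_cost(2_000, 2_000, 500, 3.0, 15.0)
# Production: more traffic, plus longer prompts (chat history, retrieved context).
production = monthly_api_cost(5_000, 10_000, 800, 3.0, 15.0)

print(f"pitch estimate:      ${pitch:,.0f}/month")
print(f"production estimate: ${production:,.0f}/month")
print(f"ratio: {production / pitch:.1f}x")
```

The point of the exercise is that cost scales with traffic multiplied by prompt length, so a demo with short prompts and light traffic understates both factors at once.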

On top of that, AI sits in a regulatory grey zone. The EU AI Act, GDPR, HIPAA, SOC 2, and emerging state-level US legislation all touch any agency handling your data. A vendor who doesn’t have a clear answer for “where will our customer data live during model training?” is a liability the moment a regulator or auditor shows up.

That’s why the bar to hire an AI development agency has to be higher than for ordinary development work. The 12 red flags below are organized into four categories: process and pricing, technical capability, trust and governance, and communication and support.

Category 1: Process & Pricing

Red Flag #1: “Contact Us for Pricing”

Should I worry if an AI agency hides their pricing?

WHAT IT LOOKS LIKE: Website has no pricing information at all. Every pitch starts with “every project is unique.” Won’t even give a ballpark range until after a sales call. Pricing magically scales with your funding round.

Why it’s a problem. If an agency can’t articulate what their work typically costs, it almost always means one of two things: they don’t know their own costs (which is amateur), or they price based on what they think you’ll pay (which is predatory). Mature agencies with repeatable processes can quote ballparks confidently because they’ve done the work many times before. The ones who can’t have either never built before or are charging wildly different amounts to different clients for identical scopes.

DISASTER SCENARIO, Composite example. A SaaS founder asks for an AI chatbot. The agency quotes $180,000 after learning the company has just closed a Series A. The same agency had quoted a different founder $35,000 for an identical scope two weeks earlier. The only thing that changed was the perceived budget.

WHAT GOOD AGENCIES DO: Show pricing ranges on the website (e.g., $5K pilots, $25–50K production builds). Give a ballpark in the first conversation. Explain cost drivers clearly: complexity, integrations, compliance. Use a transparent pricing model: fixed-bid, time-and-materials, or hybrid.

How To Test

Ask:

“What’s a typical cost for an AI agent like the one we’re describing?”

Red Flag Answer

Every project is unique, we’d need a discovery call to determine that. Want to schedule one?

Green Flag Answer

$8K–15K, depending on integrations, typically 2–3 weeks. Compliance and SSO add $4–6K.

Red Flag #2: “Let’s Do a 6-Month AI Strategy First”

How long should an AI MVP take to build before I see working code?

WHAT IT LOOKS LIKE: First proposed phase is an “AI readiness assessment” or “AI maturity audit.” Deliverable for the first 3–6 months is slides and frameworks, not code. They want to map your “AI transformation roadmap” before doing anything. The pitch leans on jargon like “AI maturity model” and “value-discovery workshops.”

Why it’s a problem. Strategy without execution is expensive PDFs. By the time a 6-month strategy phase finishes, the model your roadmap recommends has been deprecated, your competitors have shipped, and the original consultants are nowhere to be found. Real AI partners build strategy through shipping: they get a working pilot in front of you in weeks, then iterate on what real users actually do, not what slide 47 of a deck says they should. A healthy AI MVP timeline is 4–8 weeks to a working pilot, not 6 months to a slide deck.

DISASTER SCENARIO, Composite example. A healthcare company spends $250,000 on an 8-month “AI strategy” engagement. The deliverable is a 60-page slide deck with recommendations. They have to hire a different agency to build any of it, and by then the recommendations are outdated. No working code, no production system, just a beautifully designed PDF that nobody opens twice.

WHAT GOOD AGENCIES DO: Start with a pilot or MVP within 4–8 weeks maximum. Strategy emerges during building, not before it. Show working code within the first one to two weeks. Iterate based on real usage, not theoretical frameworks. The default mindset is “learn by shipping.”

How To Test

Ask:

“When will I see working code that I can actually test?”

Red Flag Answer

After we complete the strategy phase, around month 3–6. We’ll have a clear roadmap by then.

Green Flag Answer

Week 1 or 2. We’ll have a basic version on staging that you can poke at.

Red Flag #3: Unrealistic Timeline Promises

WHAT IT LOOKS LIKE: “We can build that in 3 days, guaranteed.” “Production-ready AI agent in one week.” Promises that beat the market by 3–5× with no caveats. Their proposal is a single line item with a single date: no phases, no milestones.

Why it’s a problem. Fast timelines on AI work always come with hidden costs. Either they’re cutting corners on security, error handling, monitoring, and edge cases, or they’re going to miss the deadline and blame “unforeseen complexity.” Either way, you pay. Real engineers under-promise and over-deliver. They quote ranges with explanations, not absolute deadlines pulled from thin air.

DISASTER SCENARIO, Composite example. A real estate company is promised a voice agent in 3 days. They get something on day three but it has no CRM integration, can’t handle multiple callers, has no error handling, and takes 8–12 seconds to respond. Making it production-ready costs another $15,000 and three weeks. The original agency stops returning calls once the first invoice clears.

WHAT GOOD AGENCIES DO: Give realistic ranges (simple AI agents: 6–10 days; complex production systems: 3–6 weeks). Break the timeline into phases with what’s included in each. Explain why it takes that long, never hand-wave on complexity. Build in a buffer for the things that always break: API rate limits, edge cases, integration timeouts.

How To Test

Ask:

“What’s your typical timeline for an AI agent that handles inbound customer messages?”

Red Flag Answer

We can do that in 2–3 days if we move fast.

Green Flag Answer

10–14 days for a working MVP, including integrations, testing, and basic monitoring. Production hardening adds another 2 weeks.


Category 2: Technical Capability

Red Flag #4: Zero Production Deployments

WHAT IT LOOKS LIKE: Portfolio is full of “prototypes,” “demos,” or “POCs.” “Everything we’ve built is under NDA” for every single project, conveniently. No live URLs, no public case studies with real metrics. When you ask for proof, you get a demo video instead of a live system.

Why it’s a problem. Production is where AI gets hard. Demos work because they ignore everything that breaks at scale: rate limits, race conditions, hallucinations under unusual inputs, cost overruns at peak load, model deprecations. An agency that has only built demos has no battle scars. They will learn how production AI actually works on your project, on your timeline, with your money.

DISASTER SCENARIO, Composite example. An e-commerce company hires an agency with an impressive AI demo. In a controlled environment, it works perfectly. In production, it crashes under load, gives wrong product information 30% of the time, and burns through $8,000 a month in API fees instead of the projected $800. The agency has never dealt with real production traffic, and the founder ends up hiring a second team to clean up the mess.

WHAT GOOD AGENCIES DO: Show three or more live production systems handling real users right now. Share metrics: uptime, daily call volume, cost per interaction, response latency. Discuss specific production problems solved (hallucination rates, latency optimization, fallback design). Hand over live URLs you can test that day.
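When a vendor shares production numbers, it helps to know how such figures are typically derived from a request log. The sketch below is one simple way to do it, under our own assumptions: the `Request` record and its fields are hypothetical, the 95th percentile uses the nearest-rank method, and request success rate is only a rough proxy for uptime.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_s: float  # time to respond, in seconds
    cost_usd: float   # API spend attributed to this call
    ok: bool          # True if the call completed without error


def summarize(requests: List[Request]) -> Dict[str, float]:
    """Summarize a request log into the metrics a vendor should be able to share."""
    n = len(requests)
    if n == 0:
        raise ValueError("need at least one request")
    latencies = sorted(r.latency_s for r in requests)
    rank = (95 * n + 99) // 100  # nearest-rank p95: ceil(0.95 * n) in integer math
    return {
        "requests": n,
        "success_rate": sum(1 for r in requests if r.ok) / n,
        "p95_latency_s": latencies[rank - 1],
        "cost_per_call_usd": sum(r.cost_usd for r in requests) / n,
    }


# Hypothetical log of 20 calls; the slowest one also failed.
log = [Request(latency_s=i / 10, cost_usd=0.01, ok=(i != 20)) for i in range(1, 21)]
print(summarize(log))
```

Asking for p95 latency instead of the average is a useful habit, because averages hide the slow tail that real users actually notice.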

How To Test

Ask:

“Can you show me a live AI agent handling real users right now?”

Red Flag Answer

Everything we’ve built is under NDA, but here’s a demo video.

Green Flag Answer

Here are three live URLs. Test them now. Daily traffic, uptime, and cost per call: [shares actual data].

Red Flag #5: One-LLM-Only Approach

Why is one-LLM lock-in a problem in 2026?

WHAT IT LOOKS LIKE: “We’re an OpenAI partner” (or Anthropic or Google), and that’s their whole pitch. Every recommendation defaults to the same model regardless of use case. They can’t articulate why one model is better than another for your specific problem. No discussion of fallback or multi-model routing.

Why it’s a problem. No single model is best at everything. Claude tends to be stronger at reasoning and instruction-following. GPT-4 is often better for creative writing. Gemini is fast and cheap for simple lookups. Llama and Mistral run on your own hardware for privacy-sensitive workloads. An agency stuck on one provider is either bound by a partnership deal that prioritizes their margin over your outcome, or they don’t actually understand the AI landscape well enough to compare alternatives. Either way, you pay 30–50% more for worse performance.

DISASTER SCENARIO, Composite example. A financial services company needs fast, accurate data lookups inside their support workflow. The agency uses GPT-4 for everything (their partnership model). Responses take 2–3 seconds and occasionally include hallucinated numbers. A different agency rebuilds the system using Gemini for lookups (0.4-second responses, accurate), Claude for the harder analytical questions, and GPT-4 only for human-friendly explanations. Performance improves 5×, costs drop 40%.

WHAT GOOD AGENCIES DO: Multi-model approach: pick the right model for each step of the pipeline. Recommend models based on your latency, accuracy, and cost requirements, not commercial deals. Explain tradeoffs clearly (cost per token, response time, context window, hallucination rates). Build architecture that lets you swap models when better ones launch (which they do, every quarter).
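In code, “an architecture that lets you swap models” often comes down to keeping provider calls behind one small interface. The following is a minimal sketch of that idea, not a production router: the `ModelRouter` class and the three stand-in model functions are hypothetical, and no real provider SDK is called.

```python
from typing import Callable, Dict, List

# A "model" here is any function from prompt to text. Real provider SDK calls
# would be wrapped behind this signature; none are shown in this sketch.
Model = Callable[[str], str]


class ModelRouter:
    """Route each task type to an ordered list of models, with fallback."""

    def __init__(self, routes: Dict[str, List[Model]]):
        self.routes = routes

    def run(self, task: str, prompt: str) -> str:
        if task not in self.routes:
            raise ValueError(f"no route configured for task {task!r}")
        errors = []
        for model in self.routes[task]:
            try:
                return model(prompt)
            except Exception as exc:  # a failing provider triggers the fallback
                errors.append(exc)
        raise RuntimeError(f"all models failed for task {task!r}: {errors}")


# Stand-in models for illustration only.
def fast_lookup_model(prompt: str) -> str:
    return f"[fast-lookup] {prompt}"


def reasoning_model(prompt: str) -> str:
    return f"[reasoning] {prompt}"


def unavailable_model(prompt: str) -> str:
    raise TimeoutError("provider timed out")


router = ModelRouter({
    "lookup": [unavailable_model, fast_lookup_model],
    "analysis": [reasoning_model, fast_lookup_model],
})
print(router.run("lookup", "order status for #1042"))
```

With this shape, swapping in a newer model means editing the routing table, not rewriting the application. A vendor who builds this way can usually show you where that table lives.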

How To Test

Ask:

“Which LLM would you use for our use case, and what are the alternatives?”

Red Flag Answer

GPT-4, since we’re an OpenAI partner.

Green Flag Answer

Claude for the reasoning steps, Gemini for fast lookups, GPT-4 only where creative output matters. Here’s why each choice…

Red Flag #6: No Code Samples, No Engineers in the Conversation

Is it bad if an AI agency won’t share code samples?

WHAT IT LOOKS LIKE: The sales team does all the talking; no engineers ever join the calls. “We can’t share code due to IP concerns” (even anonymized samples). No GitHub presence, no architecture diagrams, no technical blog posts. Every technical question gets redirected to “we’ll cover that in the kickoff.”

Why it’s a problem. Real engineers love talking about how they build things. Hiding the technical layer behind a wall of salespeople usually means one of three things: the engineering work is outsourced to juniors who can’t speak to it, the “AI development” is actually no-code tools (Zapier, Make, basic LangChain wrappers) charged at engineering rates, or the code itself is so bad they’re embarrassed to show it. Any of those is a problem.

DISASTER SCENARIO, Composite example. A company hires an agency that refuses to share code throughout the build. At handoff, the deliverable is spaghetti code with hardcoded API keys, no tests, and zero documentation. The cost to refactor what should have been built correctly the first time is $45,000, three times the original budget. The original agency vanishes after the final invoice.

WHAT GOOD AGENCIES DO: Engineers are part of the sales process and answer technical questions directly. They can show anonymized code samples or open-source contributions. Architecture diagrams, testing approach, and code review process are all explained openly. You get repository access during the project, not just at handoff.

How To Test

Ask:

“Can an engineer join our next call to walk me through how you’d architect this?”

Red Flag Answer

Our project manager has all the technical details. Engineers are heads-down on builds.

Green Flag Answer

Yes, Sarah will join. She’s the lead on this kind of system and can show code from a similar build.

Category 3: Trust & Governance

Red Flag #7: Vague Deliverables, No Acceptance Criteria

WHAT IT LOOKS LIKE: Statement of work says things like “a working AI chatbot” with no specifics. No mention of accuracy targets, response time, concurrent user limits, or uptime. “We’ll figure it out as we go.” Scope keeps expanding informally over Slack, but the contract never updates.

Why it’s a problem. Without measurable acceptance criteria, every disagreement at delivery becomes a fight. You expected an AI agent that handles 80% of inquiries correctly with sub-2-second responses. They built one that handles 40% in 8 seconds. Both call it “a working AI chatbot.” Without numbers in writing, the legal default usually favors them, and you pay for the gap.

DISASTER SCENARIO, Composite example. A contract says “AI chatbot for customer support.” The agency delivers something that answers 40% of questions correctly, takes 8 seconds to respond, and crashes when more than 10 people use it at once. They claim it meets the contract: “it’s a chatbot, isn’t it?” The founder expected 80% accuracy, sub-2-second responses, and 100+ concurrent users. The dispute goes to legal.

WHAT GOOD AGENCIES DO: Detailed scope with measurable targets: accuracy %, latency ceilings, concurrency limits, uptime SLAs. Phased deliverables with checkpoints, not a single “final delivery.” Clear, written definition of “done” for each milestone. Change requests follow a documented process, not a Slack DM.
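Measurable acceptance criteria can be turned into an automated check that both sides run at each milestone. Here is a minimal sketch, assuming the 80% accuracy target and 2-second latency ceiling used in this section; the function name, the result format, and the choice to test the slowest response (your contract might specify an average or a 95th percentile instead) are our assumptions.

```python
from typing import Dict, List, Tuple


def acceptance_report(results: List[Tuple[bool, float]],
                      min_accuracy: float = 0.80,
                      max_latency_s: float = 2.0) -> Dict[str, object]:
    """Check sample-ticket results against written acceptance criteria.

    Each result is (answered_correctly, latency_in_seconds) for one ticket.
    """
    if not results:
        raise ValueError("need at least one sample ticket")
    accuracy = sum(1 for correct, _ in results if correct) / len(results)
    slowest = max(latency for _, latency in results)
    return {
        "accuracy": accuracy,
        "slowest_response_s": slowest,
        "passed": accuracy >= min_accuracy and slowest < max_latency_s,
    }


# Hypothetical run over 50 sample tickets: 42 correct, 8 wrong.
sample_run = [(True, 1.2)] * 42 + [(False, 1.5)] * 8
print(acceptance_report(sample_run))
```

Whatever the exact thresholds, the useful property is that “done” becomes a script output both parties can reproduce, not an opinion.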

How To Test

Ask:

“What exactly will I get, and what are the acceptance criteria?”

Red Flag Answer

A working AI system, fully tested and ready to use.

Green Flag Answer

An AI agent answering 80%+ of tickets correctly, <2s response time, 99.5% uptime SLA, integrated with your CRM. Acceptance test: 50 sample tickets we’ll define together.

Red Flag #8: No Data Privacy or Compliance Plan

WHAT IT LOOKS LIKE: No clear answer for where your customer data lives during training and inference. “We just send everything to OpenAI” with no thought about PII redaction. Doesn’t know what GDPR Article 22, HIPAA, or SOC 2 mean for your project. No mention of data retention, deletion rights, or DPA (data processing agreement).

Why it’s a problem. AI handling regulated data without a compliance plan is a lawsuit waiting to happen. If your AI agent reads customer emails, support tickets, medical records, or financial data, that data is now subject to GDPR, HIPAA, CCPA, or sector-specific regulation depending on your jurisdiction. Sending it to a third-party API without a DPA, without redaction, and without retention controls can trigger regulatory fines, breach disclosure, and customer trust damage that costs more than the entire AI project.

DISASTER SCENARIO, Composite example. A B2B SaaS company lets an AI agency send raw customer support tickets to OpenAI for fine-tuning, including PII. There’s no DPA and no redaction layer. A customer notices sensitive data appearing in unrelated AI responses and files a GDPR complaint. The investigation costs $180,000 in legal fees and triggers mandatory breach disclosures to 12,000 customers.

WHAT GOOD AGENCIES DO: Clear data flow diagram: where data goes, how long it stays, who can see it. PII redaction or pseudonymization before sending to third-party LLMs. Signed DPAs with all model providers and willingness to use enterprise (zero-retention) tiers. Self-hosting options for highly sensitive workloads (Llama, Mistral, on-prem deployment). Documented compliance posture for whatever regs apply (GDPR, HIPAA, SOC 2, ISO 27001).
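As a rough illustration of what “PII redaction before sending to third-party LLMs” means in practice, the sketch below masks email addresses and phone numbers with regular expressions. It is deliberately simplistic: the two patterns are our own and will miss names, addresses, account numbers, and many phone formats, so treat it as a picture of where the redaction step sits in the pipeline, not as a compliance control.

```python
import re

# Illustrative patterns only; real redaction layers cover far more PII types.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone numbers with placeholders before any API call."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


ticket = "Contact jane.doe@example.com or +1 (555) 123-4567 today"
print(redact(ticket))
```

A reasonable follow-up question for any vendor is which PII categories their redaction layer covers and how they test it.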

How To Test

Ask:

“Walk me through where my customer data lives at every step of this AI pipeline.”

Red Flag Answer

It just goes to OpenAI’s API. They handle it.

Green Flag Answer

Here’s the data flow diagram. PII is redacted at ingestion, stored in your VPC, and we use OpenAI’s zero-retention enterprise tier for inference. We have signed DPAs and SOC 2 docs ready.

Red Flag #9: Black-Box Outputs With No Explainability

What is AI explainability and why does it matter for compliance?

WHAT IT LOOKS LIKE: Can’t explain why the AI gave a particular answer or made a particular decision. No logging of inputs, outputs, or model reasoning. No human review queue for low-confidence outputs. When the AI hallucinates, the only fix offered is “try a different prompt.”

Why it’s a problem. AI systems that can’t explain themselves can’t be trusted in any setting that matters. Customer-facing AI needs to be auditable. Internal AI making business decisions needs an audit trail. The EU AI Act (Articles 13 and 14) legally requires transparency and human oversight for high-risk AI systems, and those high-risk obligations are now phasing in through December 2027 under the EU’s May 2026 Digital Omnibus, so this is a near-term planning concern, not a hypothetical. An agency that ships systems without confidence scoring, logging, and human-in-the-loop fallbacks is shipping a liability you’ll have to fix later.

DISASTER SCENARIO, Composite example. A logistics company deploys an AI agent that auto-approves supplier invoices below $10K. After three months, finance notices $43,000 in fraudulent approvals. There are no logs of why each invoice was approved, no confidence threshold, no human review queue. They have no way to even identify which approvals to investigate without re-reading every email manually.

WHAT GOOD AGENCIES DO: Every AI decision is logged with input, output, model version, and confidence score. Confidence thresholds route low-certainty outputs to human review. Hallucinations get caught by validation steps (cross-checking, ground truth comparison). Documentation explains how to interpret the system’s reasoning in plain English.
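The logging and routing pattern described above can be sketched in a few lines. This is an illustration under assumptions: the record fields, the 0.80 threshold (borrowed from the green-flag answer in this section), and the in-memory log are hypothetical, and how the confidence score is produced (a validator, a classifier, a cross-check against ground truth) is left open because it depends on the system.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class DecisionRecord:
    timestamp: str
    model_version: str
    input_text: str
    output_text: str
    confidence: float
    routed_to: str  # "auto" or "human_review"


def record_decision(input_text: str, output_text: str, confidence: float,
                    model_version: str, threshold: float = 0.80,
                    log: Optional[List[str]] = None) -> DecisionRecord:
    """Log one AI decision and route low-confidence outputs to a human queue."""
    routed_to = "auto" if confidence >= threshold else "human_review"
    record = DecisionRecord(
        timestamp=datetime.now(timezone.utc).isoformat(),
        model_version=model_version,
        input_text=input_text,
        output_text=output_text,
        confidence=confidence,
        routed_to=routed_to,
    )
    if log is not None:
        log.append(json.dumps(asdict(record)))  # one JSON line per decision
    return record


audit_log: List[str] = []
decision = record_decision("Invoice #881, $9,400, new supplier", "approve",
                           confidence=0.62, model_version="model-v1",
                           log=audit_log)
print(decision.routed_to)
```

In the invoice scenario above, a log like this would at least let finance filter for low-confidence approvals without re-reading every email.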

How To Test

Ask:

“How will I know why the AI made any specific decision after the fact?”

Red Flag Answer

It’s an LLM, they’re kind of black boxes by nature, you know?

Green Flag Answer

Every interaction is logged with input, output, confidence score, and model version. Below 80% confidence routes to a human queue. Here’s the dashboard.

Category 4: Communication & Support

Red Flag #10: “Trust Us, We’re Experts”

WHAT IT LOOKS LIKE: Pushback when you ask to see in-progress work. “You wouldn’t understand the technical details.” Dismissive of your input on product or domain decisions. Wants 6-week chunks of silence between updates.

Why it’s a problem. Good builders welcome questions and weekly demos. “Trust us” usually means hiding incompetence, hiding scope creep, or hiding the fact that they’ve barely started. You know your business better than they ever will. An agency that doesn’t want your input ends up building what they think the right product is, not what your customers need. By the time you see it, it’s too late to change direction without paying twice.

DISASTER SCENARIO, Composite example. A founder keeps asking to see the in-progress work. The agency says “trust our process, you’ll see it when it’s done.” After six weeks and $40,000, they unveil a completely wrong product. They built what they thought the founder wanted, ignoring every piece of input given during the kickoff. The team has to scrap and start over with a different agency.

WHAT GOOD AGENCIES DO: Weekly progress demos with working software, not slide updates. Welcomes your questions and treats your domain knowledge as an asset. Explains technical decisions in plain English, not jargon. Collaborative tone: you are a partner, not a wallet. Pushes back honestly on bad product decisions but explains why.

How To Test

Ask:

“How often will I see progress, and can I give input during development?”

Red Flag Answer

We’ll show you the result when it’s ready. Just trust the process.

Green Flag Answer

Weekly demos every Friday. Your input is critical: you know your customers, we know AI. We’re building this together.

Red Flag #11: “We Do Everything” Full-Service Claim

WHAT IT LOOKS LIKE: Claims expertise in AI, blockchain, AR/VR, IoT, mobile apps, design, marketing, SEO, and HR consulting. Pitches themselves as a one-stop shop for digital transformation. No clear specialization: “yes” to every technology you mention. The team page shows mostly project managers, not specialists.

Why it’s a problem. No agency is world-class at everything. Generalist shops survive by outsourcing the work to cheaper subcontractors and taking a 40–60% margin as the middleman. The AI portion of your project ends up with a junior contractor in another time zone who has never shipped production AI. You pay agency rates for freelancer-quality work, with an extra layer of broken-telephone communication on top.

DISASTER SCENARIO, Composite example. A SaaS company hires a “full-service” agency for AI plus design plus marketing plus DevOps. The AI work is done by a junior outsourced developer, the design uses stock templates, the marketing is generic. They eventually hire specialists for each function and learn the original agency was just a project manager taking a 50% margin on subcontractor work.

WHAT GOOD AGENCIES DO: Clear focus: “we build AI agents” or “we do enterprise LLM integration,” not “we do everything.” Honest about what they don’t do, with partner recommendations for adjacent needs. Deep expertise visible in their narrow lane (case studies, blog posts, conference talks). Senior engineers in the lane, not project managers reselling subcontractors.

How To Test

Ask:

“What do you NOT do? What would you outsource or refer to a partner?”

Red Flag Answer

We handle everything in-house: AI, design, marketing, mobile, the works.

Green Flag Answer

We don’t do native mobile or branding. For those, we recommend [partner names]. We focus exclusively on AI agents and integrations.

Red Flag #12: No Post-Launch Support or SLA

Does an AI agency need to give me an SLA, or can I add it later?

WHAT IT LOOKS LIKE: “We deliver, then you own it.” Support terms are “we’ll discuss it after launch.” No SLA, no bug-fix warranty, no maintenance retainer. “Just call us if something breaks”, but no formal agreement says they’ll answer.

Why it’s a problem. AI systems are not set-and-forget. Models get deprecated. APIs change without notice. Real users find edge cases that your testing didn’t. Without a support contract in writing, the agency has zero obligation to answer the phone the day OpenAI sunsets the API version your agent depends on. You’re left scrambling, often paying “emergency” rates that are 3–5× the original build cost, to keep your own production system alive.

DISASTER SCENARIO, Composite example. An agency builds a voice agent for a real estate firm with no support contract. Three weeks after launch, OpenAI deprecates the API version it depended on. The agent stops working overnight. The agency quotes $12,000 to “upgrade” for what is actually two hours of work. The founder has no SLA, no recourse, and loses a week of revenue while scrambling for another vendor.

WHAT GOOD AGENCIES DO: 30–90 day bug-fix warranty included in the build cost. Clear SLA for critical issues (e.g., 4-hour response for downtime, 24-hour for non-critical bugs). Monthly retainer options for ongoing improvements and model updates. Monitoring and alerting included from day one, not as an upsell. Support terms in the contract upfront, not negotiated after handoff.

How To Test

Ask:

“What happens if something breaks in month two post-launch?”

Red Flag Answer

We can take a look at that, that’d be a separate engagement.

Green Flag Answer

90-day bug-fix warranty included. Beyond that, we offer a $2K/month retainer covering monitoring, model updates, and a 4-hour response on critical issues.

How to Score an AI Agency: The Red-Flag Matrix

Counting flags is the easiest way to make a hiring decision quickly. Use this matrix during your evaluation calls, circle each red flag as it shows up, then total at the end. This is the AI agency due diligence framework we run on our own work, and it’s the same one we’d encourage you to run on us.

The 12-Flag Vendor Checklist

Tick each red flag as it shows up on your evaluation calls, grouped under the four categories: Process & Pricing, Technical Capability, Trust & Governance, and Communication & Support.

0 of 12 flags ticked: safe to hire (~85% success probability).
3+ flags in a single vendor: ~90% project failure probability. Walk before you sign.

Percentages reflect industry pattern observation, not Amplence-specific data.

A few things worth knowing about how this matrix works in practice.

Red flags don’t add, they multiply. One red flag is data: you can investigate, ask follow-up questions, and decide it’s a non-issue. Three red flags are a pattern, and patterns predict behavior. By the time you’re at five flags, the question isn’t whether the project will fail, it’s how much it will cost you when it does.

Some toxic combinations are especially predictive. “Six-month strategy + zero production track record + vague deliverables” is the consultant trap that costs $100K–300K and delivers PowerPoints and zero working code. “Unrealistic timelines + no support + trust us” is the build-and-ghost combo. “We do everything + no production + vague deliverables” is the outsourcing shell game.
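If you are already recording which flags appear for each vendor, checking for these combinations is a simple set comparison. The sketch below maps each combination to red-flag numbers from this guide. The mapping is our own reading of the descriptions, and pairing the “outsourcing shell game” name (from the Key Takeaways) with flags 4, 7, and 11 is an assumption.

```python
from typing import Dict, List, Set

# Flag numbers follow this guide's numbering (RF #1-12).
TOXIC_COMBOS: Dict[str, Set[int]] = {
    "consultant trap": {2, 4, 7},            # strategy-first, no production, vague deliverables
    "build-and-ghost": {3, 10, 12},          # unrealistic timeline, "trust us", no support
    "outsourcing shell game": {4, 7, 11},    # assumed pairing, see note above
}


def toxic_combos(observed: Set[int]) -> List[str]:
    """Return the names of every toxic combination fully present in the observed flags."""
    return [name for name, flags in TOXIC_COMBOS.items() if flags <= observed]


# Hypothetical vendor: flags #2, #4, #7 and #12 came up on the call.
print(toxic_combos({2, 4, 7, 12}))
```

A vendor can land in the “proceed with caution” band by count alone and still match a full combination, which is why it is worth checking both.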

THE 30-MINUTE VETTING CALL

Bad Agency vs Good Agency: Same Questions, Different Answers

Ask these 4 questions. Compare what you hear. Decide in under 30 minutes.

YOUR QUESTION
RED FLAG ANSWER
GREEN FLAG ANSWER

QUESTION 1 · PRICING

“What’s a typical cost for an AI agent like the one we’re describing?”

Tests Red Flag #1: opaque pricing

RED FLAG

DODGE

“Every project is unique. We’d need a discovery call to determine that. Want to schedule one?”

→ They’ll price based on what they think you’ll pay.

GREEN FLAG

SPECIFIC

“$8K-15K, depending on integrations. 2-3 weeks. Compliance and SSO add $4-6K. Want a rough scope?”

→ They’ve done this many times. Repeatable process.

QUESTION 2 · PROOF

“Can you show me a live AI agent handling real users right now?”

Tests Red Flag #4: zero production track record

RED FLAG

DEFLECT

“Everything we’ve built is under NDA, but here’s a demo video showing our capabilities.”

→ Probably never shipped to production.

GREEN FLAG

PROVEN

“Here are three live URLs. Test them now. Daily traffic, uptime, and cost per call (shares actual numbers).”

→ They’ve shipped real systems that survive real users.

QUESTION 3 · ARCHITECTURE

“Which LLM would you use, and what are the alternatives?”

Tests Red Flag #5: single-LLM lock-in

RED FLAG

LOCKED IN

“GPT-4. That’s what we use for everything since we’re an OpenAI partner.”

→ Commercial bias, not technical reasoning.

GREEN FLAG

MULTI-MODEL

“Claude for reasoning, Gemini for fast lookups, GPT-4 only where creative output matters. Here’s why...”

→ Picks tools by use case, not partnership deal.

QUESTION 4 · COMPLIANCE

“Walk me through where my customer data lives at every step.”

Tests Red Flag #8: no privacy / compliance plan

RED FLAG

HAND-WAVE

“It just goes to OpenAI’s API. They handle it from there.”

→ GDPR + EU AI Act exposure on day one.

GREEN FLAG

DOCUMENTED

“Here’s the data flow diagram. PII is redacted at ingestion, stored in your VPC. Zero-retention DPA...”

→ Compliance is a default, not an add-on afterthought.

Good agencies welcome these questions. Bad ones get visibly uncomfortable.

That’s the whole signal.


Use this side-by-side during your next sales call · Print and bring it with you.

The 30-Minute AI Agency Vetting Workflow

You can run any AI vendor through this in a single call. Here’s the five-step process.

1. Prep (5 min). Pull up the 12-flag matrix and the four test questions. Before the call, open the vendor’s site and note whether pricing is public. Opaque pricing is Red Flag #1, and you can score it for free.

2. Ask the four test questions (15 min). On the call: (a) What’s a typical cost for a build like ours? (b) When will I see working code I can test? (c) Can you show me a live AI agent handling real users right now? (d) Walk me through where our data lives at every step. Compare each answer to the red/green tables above.

3. Demand production proof (5 min). Ask for three live URLs with real metrics: uptime, daily volume, and cost per call. A demo video is a red flag; a system you can test that day is a green flag.

4. Count the flags (3 min). Tally against the matrix: 0–1 safe · 2–3 caution · 4–6 high risk · 7+ walk away. Flag any toxic combinations (the consultant trap, the build-and-ghost, the outsourcing shell game).

5. Decide (2 min). Green-light, request a small paid pilot to de-risk, or pass. If you pass, ask for a referral: good agencies refer out what they don’t do (Red Flag #11).
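If you keep your notes in a script or spreadsheet, the tally in step 4 can be sketched in a few lines of Python. This is a minimal illustration, not a tool we ship: the function name `verdict` and the sample tally are hypothetical, and the bands are simply the 0–1 / 2–3 / 4–6 / 7+ thresholds from step 4.

```python
def verdict(flag_count: int) -> str:
    """Map a red-flag tally (0-12) to the verdict bands from step 4."""
    if not 0 <= flag_count <= 12:
        raise ValueError("flag_count must be between 0 and 12")
    if flag_count <= 1:
        return "safe to hire"
    if flag_count <= 3:
        return "caution"
    if flag_count <= 6:
        return "high risk"
    return "walk away"


# Hypothetical tally from one call: Red Flags #1, #4 and #7 showed up.
flags_seen = {1, 4, 7}
print(verdict(len(flags_seen)))  # caution
```

Storing the flags as a set of numbers rather than a bare count keeps a record of which flags fired, which matters when you check for toxic combinations.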

THE RED-FLAG SCORING MATRIX

How to Score an AI Agency

Circle each flag as it appears during your evaluation calls. Total at the end. Decide.

0-1 FLAGS

SAFE TO HIRE

Strong candidate. Verify references and check one live system, then proceed.

~85% success probability

2-3 FLAGS

CAUTION

Patterns starting to emerge. Get two competing bids and demand contract specifics.

~50% project disruption risk

4-6 FLAGS

HIGH RISK

Likely to overrun on cost or timeline. Consider walking unless you have a backup plan.

~70% project failure risk

7+ FLAGS

WALK AWAY

This is not an agency. It’s a lottery ticket. End the call politely and move on.

~90% project failure

TOXIC COMBINATIONS: RED FLAGS DON’T ADD, THEY MULTIPLY

If you see any of these three patterns, the cost of waiting is greater than the cost of walking right now.

THE CONSULTANT TRAP

RF #2 + #4 + #7

6-month strategy + zero production + vague deliverables.

$100K-300K in PowerPoint, no code.

THE BUILD-AND-GHOST

RF #3 + #10 + #12

Unrealistic timeline + “trust us” + no SLA.

Ships broken, vanishes after invoice.

THE OUTSOURCING SHELL GAME

RF #4 + #7 + #11

No production + vague deliverables + “we do everything.”

Junior subcontractors at agency rates.

HOW TO USE THIS MATRIX

Print the 12 flags. Run any vendor through them in 30 minutes.

Good agencies welcome these questions. Bad ones dodge, deflect, or get visibly uncomfortable. That’s the whole signal.

Bar percentages reflect industry pattern observation, not Amplence-specific data.
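The three toxic combinations in the matrix can also be checked mechanically once you have written down which flags fired. The sketch below assumes you record flags by their number (1–12); the names `TOXIC_COMBINATIONS` and `toxic_combinations` are hypothetical and exist only for this illustration.

```python
# Flag numbers per combination, copied from the matrix above.
TOXIC_COMBINATIONS = {
    "the consultant trap": {2, 4, 7},
    "the build-and-ghost": {3, 10, 12},
    "the outsourcing shell game": {4, 7, 11},
}


def toxic_combinations(flags_seen: set) -> list:
    """Return the name of every combination fully contained in flags_seen."""
    return [
        name
        for name, combo in TOXIC_COMBINATIONS.items()
        if combo <= flags_seen  # subset test: every flag in the combo was seen
    ]


# Hypothetical call notes: Red Flags #2, #4, #7 and #11 showed up.
print(toxic_combinations({2, 4, 7, 11}))
# ['the consultant trap', 'the outsourcing shell game']
```

Note that a vendor with only four flags can still match two combinations, which is why the matrix treats these patterns as a separate signal from the raw count.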

How Amplence Stacks Up Against This Checklist

Fair question, and yes, it's the same test we run on our own work. Amplence is an AI automation agency: we build custom AI web apps, workflow systems, and business-process automation that ship to production and get used.

We'd rather show than tell, so here are three we'll stand behind line by line:

Amazon Appeal Wizard (ReinstateIQ), a RAG-powered app that turns a $3,500, multi-day legal process into a submission-ready Plan of Action in under 3 minutes. 2,000+ appeals generated, 87% reinstatement rate, ~$350 per appeal. Shipped in 4 months.

Learn more about Amazon Appeal Wizard - AI Legal Automation.

My Contractor Report (BuildVerify Pro), a contractor-verification platform that queries five government databases in parallel and returns a risk-scored PDF in 30 seconds, for $19.99 instead of $500+ and a 3–5 day wait. Nationwide, 1,200+ reports generated. 3-month build.

Learn more about AI Contractor Verification Case Study | Risk Scoring Platform.

CollageDepot, AI customer-service automation handling 5,000+ support emails a month across four languages, auto-resolving 65% with sub-60-second responses. 2-month build.

Learn more about AI Customer Service Case Study | E-commerce Support Automation.

Every number above sits on a published case study with a named client and the actual stack we used. That's the standard we hold to on this checklist: if we can't point to the proof, we don't print the number.

→ See the full case studies: https://amplence.com/case-studies

Frequently Asked Questions

1: What are the biggest red flags when hiring an AI development company?

The biggest red flags when hiring an AI development company are no transparent pricing, multi-month “AI strategy” engagements before any code is written, zero production deployments, only using one LLM, vague deliverables with no acceptance criteria, no data privacy or compliance plan, and no post-launch support agreement. In our experience, three or more of these flags in a single vendor almost always signal a project that will overrun or fail.

2: How can I tell if an AI agency is just reselling no-code tools?

Ask to see a code sample from a recent project, ask which programming languages and frameworks they use, and request that an engineer (not a salesperson) join your next technical call. If they refuse all three, push back on whether they’re actually building custom AI or assembling Zapier and Make automations charged at engineering rates. Both can be valuable, but you should pay no-code pricing for no-code work, not custom-development pricing.

3: What pricing should I expect from a legitimate AI agency in 2026?

As a rough industry baseline: simple AI agents and chatbots typically range from $5,000 to $25,000. Production-grade systems with integrations and compliance fall between $25,000 and $80,000. Multi-agent enterprise systems with custom data pipelines run $80,000 to $250,000+. If a vendor’s quote is more than 2× outside these ranges without a clear technical reason, treat that as a red flag and get two more bids before signing. Cross-check pricing against directories like Clutch or DesignRush to validate the range.
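To make the 2× rule concrete, here is a rough sketch of the check. It rests on one assumption that the baseline above does not spell out: we read “more than 2× outside these ranges” as below half the low end or above double the high end. The tier labels and the function name `quote_is_outlier` are hypothetical, and the enterprise tier has no upper bound because the range is quoted as $250,000+.

```python
import math

# (low, high) in USD, taken from the rough industry baseline above.
PRICE_BANDS_USD = {
    "simple agent or chatbot": (5_000, 25_000),
    "production-grade system": (25_000, 80_000),
    "multi-agent enterprise system": (80_000, math.inf),  # "$250,000+" is open-ended
}


def quote_is_outlier(tier: str, quote_usd: float, factor: float = 2.0) -> bool:
    """True if the quote is under low/factor or over high*factor (assumed reading)."""
    low, high = PRICE_BANDS_USD[tier]
    return quote_usd < low / factor or quote_usd > high * factor


# Hypothetical quotes for a simple chatbot build.
print(quote_is_outlier("simple agent or chatbot", 60_000))  # True: above 2 x $25,000
print(quote_is_outlier("simple agent or chatbot", 12_000))  # False: inside the band
```

An outlier result is a prompt to ask for the technical reason behind the number and to collect two more bids, not a verdict on its own.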

4: How important is data privacy when hiring an AI vendor?

Data privacy should be one of the first things you discuss, not the last. Any AI agency handling customer data must be able to explain where the data lives during training and inference, whether they redact PII before sending data to third-party LLMs, whether they have signed DPAs with model providers, and whether they offer self-hosted or zero-retention enterprise model tiers for sensitive workloads. If they can’t answer those four questions clearly, they’re not ready to handle regulated data.
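For a sense of what “redact PII before sending data to third-party LLMs” means at its simplest, here is a deliberately minimal sketch. It is an illustration only, not a description of any vendor’s pipeline: the patterns, placeholder tokens, and function name `redact_pii` are assumptions made for this example, and it only catches email addresses and phone-like digit runs.

```python
import re

# Illustrative patterns only; real redaction needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


# Hypothetical support message, redacted before it leaves your environment.
message = "Reach Ana at ana@example.com or +1 (555) 010-9999."
print(redact_pii(message))  # Reach Ana at [EMAIL] or [PHONE].
```

The customer’s name survives in the output, which is the point of asking the question: pattern matching alone misses names, addresses, and free-text identifiers, so a credible vendor should be able to explain what they use beyond it.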

5: Should I worry if an agency only uses one LLM like GPT-4?

Yes. No single model is best for every task. Claude tends to perform better on reasoning and instruction-following; GPT-4 is often stronger at creative output; Gemini is faster and cheaper for simple lookups; and self-hosted models like Llama or Mistral are essential for privacy-sensitive workloads. An agency locked into one provider is either commercially biased by a partnership deal or doesn’t have enough technical depth to compare alternatives. Either way, you’ll get worse performance and higher costs than a multi-model approach would deliver.

6: What is AI washing and how do I spot it?

AI washing is when a vendor labels a product or service as “AI-powered” when it’s really rule-based logic, basic automation, or third-party API calls dressed up in marketing language. Regulators are now taking enforcement action against it. In September 2024, the FTC launched Operation AI Comply with five cases, including DoNotPay’s “AI lawyer” service and the AI writing tool Rytr. Earlier, in March 2024, the SEC brought its first “AI washing” actions against investment advisers Delphia and Global Predictions for overstating their use of AI. To spot it, ask the vendor which specific models they use, what training data is involved, how the system behaves on edge cases, and what happens when the model fails. Real AI builders give precise technical answers; AI washers pivot to vague benefits language about “transformation” and “intelligence.”

7: How long should an AI MVP take to build?

A working AI MVP should take 4–8 weeks from kickoff to a usable pilot, not 4–8 months. Simple AI agents (form intake, basic classification, single-LLM Q&A) often ship in 2–3 weeks. Production-grade systems with custom integrations, multi-model routing, and compliance hardening take 6–12 weeks. Anything longer than 8 weeks to first working code usually means the agency is selling strategy slides instead of shipping software. That’s Red Flag #2 in disguise. Insist on working staging-environment code within the first two weeks, even if it’s minimal.

8: Should I hire an AI agency or build AI in-house?

Hire an AI agency when you need a working production system in under 8 weeks, you don’t have existing in-house ML or LLM engineering talent, or the project sits outside your team’s core competency (compliance, multi-model routing, edge-case handling). Build in-house when AI is core to your product roadmap, you have or can hire 2+ senior engineers with production LLM experience, and you can afford a 6–12 month ramp-up. A common hybrid approach: hire an agency to build and ship the first system, then have them transfer code, documentation, and operational runbooks to your in-house team for ongoing iteration. This is faster, cheaper, and lower-risk than either pure-build or pure-buy.

How to Use This Guide on Your Next Sales Call

Print the 12 flags, pull up the four test questions, and run any AI vendor through them in a single 30-minute call (the step-by-step workflow is above). By the end, you’ll have a flag count and a clear go/no-go signal.

The agencies worth hiring will welcome these questions. They’ve answered them dozens of times, the answers are crisp, and they’ll often start asking better questions back about your actual problem. The agencies you should walk away from, the ones exhibiting any combination of these red flags, will dodge, deflect, redirect to a sales pitch, or get visibly uncomfortable. That’s the whole signal.

AI is too important and too expensive to get wrong. Spend an extra week on vetting now and save yourself $50,000 to $300,000 and 6 to 12 months of pain later.

Pick the agency that already ships, not the one that promises to.
