AI projects don’t usually fail because the technology is hard. They fail because the wrong agency was hired in the first place.
Anyone with a landing page can call themselves an “AI development agency” right now. The market is flooded: Clutch, Upwork, and DesignRush list thousands of vendors, the vast majority of which have never shipped a production system that survived more than three months. The bad ones cost you $50,000 to $300,000 and 6–12 months. The good ones turn AI into measurable revenue.
This guide is the AI vendor checklist we wish every buyer had before their first sales call. It covers the top red flags to avoid when implementing AI in your organization: 12 specific warning signs that predict project failure with high accuracy, drawn from patterns observed across the AI services market in 2025–2026.
Whether you’re learning how to choose an AI development agency for the first time or vetting your tenth proposal, each red flag includes the disaster scenario it leads to, what good agencies do instead, and a test question to expose it on your first sales call. By the end, you’ll be able to vet any AI vendor in under 30 minutes.
Key Takeaways
• 12 red flags account for most AI project failures. Organized into four categories: process and pricing (RF #1–3), technical capability (RF #4–6), trust and governance (RF #7–9), and communication and support (RF #10–12).
• Red flags multiply, they don’t add. One flag is data; three flags are a pattern; five or more is a near-certain failure. Watch especially for toxic combinations: the consultant trap, the build-and-ghost, the outsourcing shell game.
• Test for production proof first. Demos lie; live URLs don’t. A real AI development agency can show you three live systems handling real users with real metrics: uptime, daily volume, cost per call.
• Compliance and explainability are the two most underrated criteria. The EU AI Act (Articles 13 and 14, with high-risk obligations now phasing in through December 2027), GDPR, and HIPAA all require auditable AI. Vendors who can’t explain their data flow are a regulatory liability waiting to surface.
• 30 minutes of vetting saves $50K–300K of pain. Use the four test questions, count your flags, and pick the agency that already ships, not the one that promises to.
The 12 Red Flags at a Glance
Here’s the complete AI vendor evaluation matrix. Bookmark this section, print it, or pull it up during your next sales call.
Why Vetting an AI Agency Is Different From Vetting a Regular Dev Shop
AI development carries risks that regular software doesn’t. Models hallucinate. APIs change without warning. A demo that works in a controlled environment can crash under 50 concurrent users in production. Costs that look like $800 a month in a sales pitch can balloon to $8,000 once real traffic hits. And unlike a broken website, a hallucinating AI agent can quietly cost you customers and credibility before anyone notices.
On top of that, AI sits in a regulatory grey zone. The EU AI Act, GDPR, HIPAA, SOC 2, and emerging state-level US legislation all touch any agency handling your data. A vendor who doesn’t have a clear answer for “where will our customer data live during model training?” is a liability the moment a regulator or auditor shows up.
That’s why the bar to hire an AI development agency has to be higher than for ordinary development work. The 12 red flags below are organized into four categories: process and pricing, technical capability, trust and governance, and communication and support.
Category 1: Process & Pricing
Red Flag #1: “Contact Us for Pricing”
Should I worry if an AI agency hides their pricing?
WHAT IT LOOKS LIKE: Website has no pricing information at all. Every pitch starts with “every project is unique.” Won’t even give a ballpark range until after a sales call. Pricing magically scales with your funding round.
Why it’s a problem. If an agency can’t articulate what their work typically costs, it almost always means one of two things: they don’t know their own costs (which is amateur), or they price based on what they think you’ll pay (which is predatory). Mature agencies with repeatable processes can quote ballparks confidently because they’ve done the work many times before. The ones who can’t have either never built before or are charging wildly different amounts to different clients for identical scopes.
DISASTER SCENARIO, Composite example. A SaaS founder asks for an AI chatbot. The agency quotes $180,000 after learning the company has just closed a Series A. The same agency had quoted a different founder $35,000 for an identical scope two weeks earlier. The only thing that changed was the perceived budget.
WHAT GOOD AGENCIES DO: Show pricing ranges on the website (e.g., $5K pilots, $25–50K production builds). Give a ballpark in the first conversation. Explain cost drivers clearly: complexity, integrations, compliance. Use a transparent pricing model: fixed-bid, time-and-materials, or hybrid.
Red Flag #2: “Let’s Do a 6-Month AI Strategy First”
How long should an AI MVP take to build before I see working code?
WHAT IT LOOKS LIKE: First proposed phase is an “AI readiness assessment” or “AI maturity audit.” Deliverable for the first 3–6 months is slides and frameworks, not code. They want to map your “AI transformation roadmap” before doing anything. The pitch leans on jargon like “AI maturity model” and “value-discovery workshops.”
Why it’s a problem. Strategy without execution is expensive PDFs. By the time a 6-month strategy phase finishes, the model your roadmap recommends has been deprecated, your competitors have shipped, and the original consultants are nowhere to be found. Real AI partners build strategy through shipping: they get a working pilot in front of you in weeks, then iterate on what real users actually do, not what slide 47 of a deck says they should. A healthy AI MVP timeline is 4–8 weeks to a working pilot, not 6 months to a slide deck.
DISASTER SCENARIO, Composite example. A healthcare company spends $250,000 on an 8-month “AI strategy” engagement. The deliverable is a 60-page slide deck with recommendations. They have to hire a different agency to build any of it, and by then the recommendations are outdated. No working code, no production system, just a beautifully designed PDF that nobody opens twice.
WHAT GOOD AGENCIES DO: Start with a pilot or MVP within 4–8 weeks maximum. Strategy emerges during building, not before it. Show working code within the first one to two weeks. Iterate based on real usage, not theoretical frameworks. The default mindset is “learn by shipping.”
Red Flag #3: Unrealistic Timeline Promises
WHAT IT LOOKS LIKE: “We can build that in 3 days, guaranteed.” “Production-ready AI agent in one week.” Promises that beat the market by 3–5× with no caveats. Their proposal is a single line item with a single date: no phases, no milestones.
Why it’s a problem. Fast timelines on AI work always come with hidden costs. Either they’re cutting corners on security, error handling, monitoring, and edge cases, or they’re going to miss the deadline and blame “unforeseen complexity.” Either way, you pay. Real engineers under-promise and over-deliver. They quote ranges with explanations, not absolute deadlines pulled from thin air.
DISASTER SCENARIO, Composite example. A real estate company is promised a voice agent in 3 days. They get something on day three but it has no CRM integration, can’t handle multiple callers, has no error handling, and takes 8–12 seconds to respond. Making it production-ready costs another $15,000 and three weeks. The original agency stops returning calls once the first invoice clears.
WHAT GOOD AGENCIES DO: Give realistic ranges (simple AI agents: 6–10 days; complex production systems: 3–6 weeks). Break the timeline into phases with what’s included in each. Explain why it takes that long, never hand-wave on complexity. Build in a buffer for the things that always break: API rate limits, edge cases, integration timeouts.
Category 2: Technical Capability
Red Flag #4: Zero Production Deployments
WHAT IT LOOKS LIKE: Portfolio is full of “prototypes,” “demos,” or “POCs.” “Everything we’ve built is under NDA” for every single project, conveniently. No live URLs, no public case studies with real metrics. When you ask for proof, you get a demo video instead of a live system.
Why it’s a problem. Production is where AI gets hard. Demos work because they ignore everything that breaks at scale: rate limits, race conditions, hallucinations under unusual inputs, cost overruns at peak load, model deprecations. An agency that has only built demos has no battle scars. They will learn how production AI actually works on your project, on your timeline, with your money.
DISASTER SCENARIO, Composite example. An e-commerce company hires an agency with an impressive AI demo. In a controlled environment, it works perfectly. In production, it crashes under load, gives wrong product information 30% of the time, and burns through $8,000 a month in API fees instead of the projected $800. The agency has never dealt with real production traffic, and the founder ends up hiring a second team to clean up the mess.
WHAT GOOD AGENCIES DO: Show three or more live production systems handling real users right now. Share metrics: uptime, daily call volume, cost per interaction, response latency. Discuss specific production problems solved (hallucination rates, latency optimization, fallback design). Hand over live URLs you can test that day.
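The gap between a projected $800 and an actual $8,000 a month usually comes down to traffic and token assumptions, so it helps to ask the vendor for the arithmetic behind their cost-per-interaction figure. A minimal sketch of that arithmetic follows. The function name, parameters, and every price and volume in it are hypothetical placeholders, not any provider's real rates.

```python
def monthly_api_cost(calls_per_day, input_tokens, output_tokens,
                     price_in_per_1k, price_out_per_1k, days=30):
    """Estimate monthly LLM API spend in dollars.

    Prices are per 1,000 tokens and are hypothetical; substitute the
    provider's current published rates before relying on the result.
    """
    cost_per_call = ((input_tokens / 1000) * price_in_per_1k
                     + (output_tokens / 1000) * price_out_per_1k)
    return calls_per_day * days * cost_per_call


# Sales-pitch assumptions: light traffic, short prompts (illustrative numbers).
pitch = monthly_api_cost(500, 1000, 300, 0.01, 0.03)

# Production assumptions: more traffic, longer context per call (illustrative).
production = monthly_api_cost(2000, 4000, 500, 0.01, 0.03)

print(f"pitch estimate:      ${pitch:,.2f} per month")
print(f"production estimate: ${production:,.2f} per month")
```

If a vendor cannot fill in these five inputs for your use case, the monthly figure in their proposal is a guess.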
Red Flag #5: One-LLM-Only Approach
Why is one-LLM lock-in a problem in 2026?
WHAT IT LOOKS LIKE: “We’re an OpenAI partner” (or Anthropic or Google), and that’s their whole pitch. Every recommendation defaults to the same model regardless of use case. They can’t articulate why one model is better than another for your specific problem. No discussion of fallback or multi-model routing.
Why it’s a problem. No single model is best at everything. Claude tends to be stronger at reasoning and instruction-following. GPT-4 is often better for creative writing. Gemini is fast and cheap for simple lookups. Llama and Mistral run on your own hardware for privacy-sensitive workloads. An agency stuck on one provider is either bound by a partnership deal that prioritizes their margin over your outcome, or they don’t actually understand the AI landscape well enough to compare alternatives. Either way, you pay 30–50% more for worse performance.
DISASTER SCENARIO, Composite example. A financial services company needs fast, accurate data lookups inside their support workflow. The agency uses GPT-4 for everything (their partnership model). Responses take 2–3 seconds and occasionally include hallucinated numbers. A different agency rebuilds the system using Gemini for lookups (0.4-second responses, accurate), Claude for the harder analytical questions, and GPT-4 only for human-friendly explanations. Performance improves 5×, costs drop 40%.
WHAT GOOD AGENCIES DO: Multi-model approach: pick the right model for each step of the pipeline. Recommend models based on your latency, accuracy, and cost requirements, not commercial deals. Explain tradeoffs clearly (cost per token, response time, context window, hallucination rates). Build architecture that lets you swap models when better ones launch (which they do, every quarter).
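One way to picture “architecture that lets you swap models” is a thin routing layer that maps each task type to a primary model and a list of fallbacks. The sketch below is an illustration under assumptions: `ModelRouter`, `Route`, and the three stub functions are hypothetical names, and the stubs stand in for real provider SDK calls, which are not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "model" here is any callable that takes a prompt and returns text.
ModelFn = Callable[[str], str]


@dataclass
class Route:
    primary: str
    fallbacks: List[str]


class ModelRouter:
    """Send each task type to its preferred model, falling back on failure."""

    def __init__(self, models: Dict[str, ModelFn], routes: Dict[str, Route]):
        self.models = models
        self.routes = routes

    def run(self, task_type: str, prompt: str) -> str:
        route = self.routes[task_type]
        errors = []
        for name in [route.primary, *route.fallbacks]:
            try:
                return self.models[name](prompt)
            except Exception as exc:  # outage, rate limit, deprecated version
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all models failed: " + "; ".join(errors))


# Stub models standing in for real provider clients (hypothetical).
def fast_lookup_model(prompt: str) -> str:
    return f"[fast] {prompt}"


def reasoning_model(prompt: str) -> str:
    raise TimeoutError("simulated provider outage")


def general_model(prompt: str) -> str:
    return f"[general] {prompt}"


router = ModelRouter(
    models={
        "fast": fast_lookup_model,
        "reasoning": reasoning_model,
        "general": general_model,
    },
    routes={
        "lookup": Route(primary="fast", fallbacks=["general"]),
        "analysis": Route(primary="reasoning", fallbacks=["general"]),
    },
)

print(router.run("lookup", "order status for #123"))
print(router.run("analysis", "why did churn rise?"))  # falls back to general
```

Because the routing table is data rather than code, swapping in a newer model is a configuration change instead of a rewrite.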
Red Flag #6: No Code Samples, No Engineers in the Conversation
Is it bad if an AI agency won’t share code samples?
WHAT IT LOOKS LIKE: The sales team does all the talking; no engineers ever join the calls. “We can’t share code due to IP concerns” (even anonymized samples). No GitHub presence, no architecture diagrams, no technical blog posts. Every technical question gets redirected to “we’ll cover that in the kickoff.”
Why it’s a problem. Real engineers love talking about how they build things. Hiding the technical layer behind a wall of salespeople usually means one of three things: the engineering work is outsourced to juniors who can’t speak to it, the “AI development” is actually no-code tools (Zapier, Make, basic LangChain wrappers) charged at engineering rates, or the code itself is so bad they’re embarrassed to show it. Any of those is a problem.
DISASTER SCENARIO, Composite example. A company hires an agency that refuses to share code throughout the build. At handoff, the deliverable is spaghetti code with hardcoded API keys, no tests, and zero documentation. The cost to refactor what should have been built correctly the first time is $45,000, three times the original budget. The original agency vanishes after the final invoice.
WHAT GOOD AGENCIES DO: Engineers are part of the sales process and answer technical questions directly. They can show anonymized code samples or open-source contributions. Architecture diagrams, testing approach, and code review process are all explained openly. You get repository access during the project, not just at handoff.
Category 3: Trust & Governance
Red Flag #7: Vague Deliverables, No Acceptance Criteria
WHAT IT LOOKS LIKE: Statement of work says things like “a working AI chatbot” with no specifics. No mention of accuracy targets, response time, concurrent user limits, or uptime. “We’ll figure it out as we go.” Scope keeps expanding informally over Slack, but the contract never updates.
Why it’s a problem. Without measurable acceptance criteria, every disagreement at delivery becomes a fight. You expected an AI agent that handles 80% of inquiries correctly with sub-2-second responses. They built one that handles 40% in 8 seconds. Both call it “a working AI chatbot.” Without numbers in writing, the legal default usually favors them, and you pay for the gap.
DISASTER SCENARIO, Composite example. A contract says “AI chatbot for customer support.” The agency delivers something that answers 40% of questions correctly, takes 8 seconds to respond, and crashes when more than 10 people use it at once. They claim it meets the contract, “it’s a chatbot, isn’t it?” The founder expected 80% accuracy, sub-2-second responses, and 100+ concurrent users. The dispute goes to legal.
WHAT GOOD AGENCIES DO: Detailed scope with measurable targets: accuracy %, latency ceilings, concurrency limits, uptime SLAs. Phased deliverables with checkpoints, not a single “final delivery.” Clear, written definition of “done” for each milestone. Change requests follow a documented process, not a Slack DM.
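Measurable acceptance criteria can be written down precisely enough that a script can check them. The sketch below uses the numbers from the scenario above (80% accuracy, sub-2-second responses, 100+ concurrent users); the class and function names are hypothetical, and how each metric is measured would still need to be agreed in the statement of work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceCriteria:
    min_accuracy: float          # fraction of test questions answered correctly
    max_latency_seconds: float   # agreed latency ceiling
    min_concurrent_users: int    # load the system must sustain


def check_acceptance(criteria, accuracy, latency_seconds, concurrent_users):
    """Return a list of human-readable failures; an empty list means 'done'."""
    failures = []
    if accuracy < criteria.min_accuracy:
        failures.append(
            f"accuracy {accuracy:.0%} is below target {criteria.min_accuracy:.0%}")
    if latency_seconds > criteria.max_latency_seconds:
        failures.append(
            f"latency {latency_seconds}s exceeds ceiling {criteria.max_latency_seconds}s")
    if concurrent_users < criteria.min_concurrent_users:
        failures.append(
            f"handles {concurrent_users} concurrent users, "
            f"target is {criteria.min_concurrent_users}")
    return failures


# Targets the founder expected, and what the agency delivered.
expected = AcceptanceCriteria(min_accuracy=0.80, max_latency_seconds=2.0,
                              min_concurrent_users=100)
delivered = check_acceptance(expected, accuracy=0.40, latency_seconds=8.0,
                             concurrent_users=10)

for failure in delivered:
    print("FAIL:", failure)
```

With targets in this form attached to each milestone, “is it done?” becomes a measurement rather than a negotiation.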
Red Flag #8: No Data Privacy or Compliance Plan
WHAT IT LOOKS LIKE: No clear answer for where your customer data lives during training and inference. “We just send everything to OpenAI” with no thought about PII redaction. Doesn’t know what GDPR Article 22, HIPAA, or SOC 2 mean for your project. No mention of data retention, deletion rights, or DPA (data processing agreement).
Why it’s a problem. AI handling regulated data without a compliance plan is a lawsuit waiting to happen. If your AI agent reads customer emails, support tickets, medical records, or financial data, that data is now subject to GDPR, HIPAA, CCPA, or sector-specific regulation depending on your jurisdiction. Sending it to a third-party API without a DPA, without redaction, and without retention controls can trigger regulatory fines, breach disclosure, and customer trust damage that costs more than the entire AI project.
DISASTER SCENARIO, Composite example. A B2B SaaS company lets an AI agency send raw customer support tickets to OpenAI for fine-tuning, including PII. There’s no DPA and no redaction layer. A customer notices sensitive data appearing in unrelated AI responses and files a GDPR complaint. The investigation costs $180,000 in legal fees and triggers mandatory breach disclosures to 12,000 customers.
WHAT GOOD AGENCIES DO: Clear data flow diagram: where data goes, how long it stays, who can see it. PII redaction or pseudonymization before sending to third-party LLMs. Signed DPAs with all model providers and willingness to use enterprise (zero-retention) tiers. Self-hosting options for highly sensitive workloads (Llama, Mistral, on-prem deployment). Documented compliance posture for whatever regs apply (GDPR, HIPAA, SOC 2, ISO 27001).
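To make “PII redaction before sending to third-party LLMs” concrete, here is a deliberately simple sketch that masks email addresses and phone-like numbers with regular expressions. The patterns and the `redact_pii` name are illustrative assumptions. Regexes alone miss names, addresses, account numbers, and many phone formats, so a production redaction layer would need considerably more than this.

```python
import re

# Simple illustrative patterns; they will not catch every real-world format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace emails and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


ticket = "Contact jane.doe@example.com or +1 (555) 010-9999 about order 42."
print(redact_pii(ticket))
```

The point to verify with a vendor is where this step sits in the data flow: redaction has to happen before the text leaves your infrastructure, not after.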
Red Flag #9: Black-Box Outputs With No Explainability
What is AI explainability and why does it matter for compliance?
WHAT IT LOOKS LIKE: Can’t explain why the AI gave a particular answer or made a particular decision. No logging of inputs, outputs, or model reasoning. No human review queue for low-confidence outputs. When the AI hallucinates, the only fix offered is “try a different prompt.”
Why it’s a problem. AI systems that can’t explain themselves can’t be trusted in any setting that matters. Customer-facing AI needs to be auditable. Internal AI making business decisions needs an audit trail. The EU AI Act (Articles 13 and 14) legally requires transparency and human oversight for high-risk AI systems, and those high-risk obligations are now phasing in through December 2027 under the EU’s May 2026 Digital Omnibus, so this is a near-term planning concern, not a hypothetical. An agency that ships systems without confidence scoring, logging, and human-in-the-loop fallbacks is shipping a liability you’ll have to fix later.
DISASTER SCENARIO, Composite example. A logistics company deploys an AI agent that auto-approves supplier invoices below $10K. After three months, finance notices $43,000 in fraudulent approvals. There are no logs of why each invoice was approved, no confidence threshold, no human review queue. They have no way to even identify which approvals to investigate without re-reading every email manually.
WHAT GOOD AGENCIES DO: Every AI decision is logged with input, output, model version, and confidence score. Confidence thresholds route low-certainty outputs to human review. Hallucinations get caught by validation steps (cross-checking, ground truth comparison). Documentation explains how to interpret the system’s reasoning in plain English.
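The logging and confidence-threshold pattern described above can be sketched in a few lines. Everything here is an assumption for illustration: the `DecisionRecord` fields, the `route_decision` name, the 0.85 threshold, and the in-memory list standing in for an append-only audit store. How a confidence score is actually produced depends on the model and is not shown.

```python
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class DecisionRecord:
    timestamp: float
    model_version: str
    input_text: str
    output_text: str
    confidence: float
    routed_to: str  # "auto" or "human_review"


def route_decision(input_text, output_text, confidence, model_version,
                   audit_log, threshold=0.85):
    """Log every decision, and send low-confidence ones to human review."""
    routed_to = "auto" if confidence >= threshold else "human_review"
    record = DecisionRecord(time.time(), model_version, input_text,
                            output_text, confidence, routed_to)
    # One JSON line per decision; a list stands in for durable storage here.
    audit_log.append(json.dumps(asdict(record)))
    return routed_to


audit_log = []
print(route_decision("invoice A, $4,200", "approve", 0.97, "model-v1", audit_log))
print(route_decision("invoice B, $9,900", "approve", 0.61, "model-v1", audit_log))
print(len(audit_log), "decisions logged")
```

In the invoice scenario above, a log like this is what would let finance pull every low-confidence approval in minutes instead of re-reading emails by hand.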
Category 4: Communication & Support
Red Flag #10: “Trust Us, We’re Experts”
WHAT IT LOOKS LIKE: Pushback when you ask to see in-progress work. “You wouldn’t understand the technical details.” Dismissive of your input on product or domain decisions. Wants 6-week chunks of silence between updates.
Why it’s a problem. Good builders welcome questions and weekly demos. “Trust us” usually means hiding incompetence, hiding scope creep, or hiding the fact that they’ve barely started. You know your business better than they ever will. An agency that doesn’t want your input ends up building what they think the right product is, not what your customers need. By the time you see it, it’s too late to change direction without paying twice.
DISASTER SCENARIO, Composite example. A founder keeps asking to see the in-progress work. The agency says “trust our process, you’ll see it when it’s done.” After six weeks and $40,000, they unveil a completely wrong product. They built what they thought the founder wanted, ignoring every piece of input given during the kickoff. The team has to scrap and start over with a different agency.
WHAT GOOD AGENCIES DO: Weekly progress demos with working software, not slide updates. Welcomes your questions and treats your domain knowledge as an asset. Explains technical decisions in plain English, not jargon. Collaborative tone: you are a partner, not a wallet. Pushes back honestly on bad product decisions but explains why.
Red Flag #11: “We Do Everything” Full-Service Claim
WHAT IT LOOKS LIKE: Claims expertise in AI, blockchain, AR/VR, IoT, mobile apps, design, marketing, SEO, and HR consulting. Pitches themselves as a one-stop shop for digital transformation. No clear specialization: “yes” to every technology you mention. The team page shows mostly project managers, not specialists.
Why it’s a problem. No agency is world-class at everything. Generalist shops survive by outsourcing the work to cheaper subcontractors and taking a 40–60% margin as the middleman. The AI portion of your project ends up with a junior contractor in another time zone who has never shipped production AI. You pay agency rates for freelancer-quality work, with an extra layer of broken-telephone communication on top.
DISASTER SCENARIO, Composite example. A SaaS company hires a “full-service” agency for AI plus design plus marketing plus DevOps. The AI work is done by a junior outsourced developer, the design uses stock templates, the marketing is generic. They eventually hire specialists for each function and learn the original agency was just a project manager taking a 50% margin on subcontractor work.
WHAT GOOD AGENCIES DO: Clear focus: “we build AI agents” or “we do enterprise LLM integration,” not “we do everything.” Honest about what they don’t do, with partner recommendations for adjacent needs. Deep expertise visible in their narrow lane (case studies, blog posts, conference talks). Senior engineers in the lane, not project managers reselling subcontractors.
Red Flag #12: No Post-Launch Support or SLA
Does an AI agency need to give me an SLA, or can I add it later?
WHAT IT LOOKS LIKE: “We deliver, then you own it.” Support terms are “we’ll discuss it after launch.” No SLA, no bug-fix warranty, no maintenance retainer. “Just call us if something breaks”, but no formal agreement says they’ll answer.
Why it’s a problem. AI systems are not set-and-forget. Models get deprecated. APIs change without notice. Real users find edge cases that your testing didn’t. Without a support contract in writing, the agency has zero obligation to answer the phone the day OpenAI sunsets the API version your agent depends on. You’re left scrambling, often paying “emergency” rates that are 3–5× the original build cost, to keep your own production system alive.
DISASTER SCENARIO, Composite example. An agency builds a voice agent for a real estate firm with no support contract. Three weeks after launch, OpenAI deprecates the API version it depended on. The agent stops working overnight. The agency quotes $12,000 to “upgrade” for what is actually two hours of work. The founder has no SLA, no recourse, and loses a week of revenue while scrambling for another vendor.
WHAT GOOD AGENCIES DO: 30–90 day bug-fix warranty included in the build cost. Clear SLA for critical issues (e.g., 4-hour response for downtime, 24-hour for non-critical bugs). Monthly retainer options for ongoing improvements and model updates. Monitoring and alerting included from day one, not as an upsell. Support terms in the contract upfront, not negotiated after handoff.
How to Score an AI Agency: The Red-Flag Matrix
Counting flags is the easiest way to make a hiring decision quickly. Use this matrix during your evaluation calls: circle each red flag as it shows up, then total at the end. This is the AI agency due diligence framework we run on our own work, and it’s the same one we’d encourage you to run on us.
A few things worth knowing about how this matrix works in practice.
Red flags don’t add, they multiply. One red flag is data: you can investigate, ask follow-up questions, and decide it’s a non-issue. Three red flags are a pattern, and patterns predict behavior. By the time you’re at five flags, the question isn’t whether the project will fail, it’s how much it will cost you when it does.
Some toxic combinations are especially predictive. “Six-month strategy + zero production track record + vague deliverables” is the consultant trap that costs $100K–300K in PowerPoints and zero working code. “Unrealistic timelines + no support + trust us” is the build-and-ghost combo. “We do everything + no production + vague deliverables” is the outsourcing shell game.
The 30-Minute AI Agency Vetting Workflow
You can run any AI vendor through this in a single call. Here’s the five-step process.
1. Prep (5 min). Pull up the 12-flag matrix and the four test questions. Open the vendor’s site and note whether pricing is public before the call: opaque pricing is Red Flag #1, and you can score it for free.
2. Ask the four test questions (15 min). On the call: (a) What’s a typical cost for a build like ours? (b) When will I see working code I can test? (c) Can you show me a live AI agent handling real users right now? (d) Walk me through where our data lives at every step. Compare each answer to the red/green tables above.
3. Demand production proof (5 min). Ask for three live URLs with real metrics: uptime, daily volume, cost per call. A demo video is a red flag; a system you can test that day is a green flag.
4. Count the flags (3 min). Tally against the matrix: 0–1 safe · 2–3 caution · 4–6 high risk · 7+ walk away. Flag any toxic combinations (the consultant trap, the build-and-ghost, the outsourcing shell game).
5. Decide (2 min). Green-light, request a small paid pilot to de-risk, or pass. If you pass, ask for a referral: good agencies refer out what they don’t do (Red Flag #11).
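The counting step can be written as a small scoring function. The thresholds (0–1 safe, 2–3 caution, 4–6 high risk, 7+ walk away) come from step 4. The mapping of each toxic combination to red-flag numbers is our reading of the scoring section, and the pairing of the third combination with the name “outsourcing shell game” is inferred from the order in which the guide lists them, so treat it as an assumption. All function and variable names are hypothetical.

```python
# Red-flag numbers for each toxic combination (mapping is an assumption;
# see the note above about the third entry).
TOXIC_COMBOS = {
    "consultant trap": {2, 4, 7},          # strategy-first + no production + vague deliverables
    "build-and-ghost": {3, 10, 12},        # unrealistic timeline + "trust us" + no support
    "outsourcing shell game": {4, 7, 11},  # no production + vague deliverables + "we do everything"
}


def score_vendor(flags):
    """Return (flag_count, verdict, toxic_combos) for the observed flag numbers."""
    flags = set(flags)
    invalid = flags - set(range(1, 13))
    if invalid:
        raise ValueError(f"unknown red flag numbers: {sorted(invalid)}")

    count = len(flags)
    if count <= 1:
        verdict = "safe"
    elif count <= 3:
        verdict = "caution"
    elif count <= 6:
        verdict = "high risk"
    else:
        verdict = "walk away"

    combos = sorted(name for name, members in TOXIC_COMBOS.items()
                    if members <= flags)
    return count, verdict, combos


print(score_vendor([2, 4, 7]))
```

Note that three flags scores only “caution” by count, but the combination check still surfaces the consultant trap, which is why the tally and the combinations are reported separately.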
How Amplence Stacks Up Against This Checklist
Fair question, and yes, it's the same test we run on our own work. Amplence is an AI automation agency: we build custom AI web apps, workflow systems, and business-process automation that ship to production and get used.
We'd rather show than tell, so here are three we'll stand behind line by line:
Amazon Appeal Wizard (ReinstateIQ), a RAG-powered app that turns a $3,500, multi-day legal process into a submission-ready Plan of Action in under 3 minutes. 2,000+ appeals generated, 87% reinstatement rate, ~$350 per appeal. Shipped in 4 months.
Learn more about Amazon Appeal Wizard - AI Legal Automation.
My Contractor Report (BuildVerify Pro), a contractor-verification platform that queries five government databases in parallel and returns a risk-scored PDF in 30 seconds, for $19.99 instead of $500+ and a 3–5 day wait. Nationwide, 1,200+ reports generated. 3-month build.
Learn more about AI Contractor Verification Case Study | Risk Scoring Platform.
CollageDepot, AI customer-service automation handling 5,000+ support emails a month across four languages, auto-resolving 65% with sub-60-second responses. 2-month build.
Learn more about AI Customer Service Case Study | E-commerce Support Automation.
Every number above sits on a published case study with a named client and the actual stack we used. That's the standard we hold to on this checklist: if we can't point to the proof, we don't print the number.
→ See the full case studies: https://amplence.com/case-studies
Frequently Asked Questions
1: What are the biggest red flags when hiring an AI development company?
The biggest red flags when hiring an AI development company are no transparent pricing, multi-month “AI strategy” engagements before any code is written, zero production deployments, only using one LLM, vague deliverables with no acceptance criteria, no data privacy or compliance plan, and no post-launch support agreement. In our experience, three or more of these flags in a single vendor almost always signals a project that will overrun or fail.
2: How can I tell if an AI agency is just reselling no-code tools?
Ask to see a code sample from a recent project, ask which programming languages and frameworks they use, and request that an engineer (not a salesperson) join your next technical call. If they refuse all three, push back on whether they’re actually building custom AI or assembling Zapier and Make automations charged at engineering rates. Both can be valuable, but you should pay no-code pricing for no-code work, not custom-development pricing.
3: What pricing should I expect from a legitimate AI agency in 2026?
As a rough industry baseline: simple AI agents and chatbots typically range from $5,000 to $25,000. Production-grade systems with integrations and compliance fall between $25,000 and $80,000. Multi-agent enterprise systems with custom data pipelines run $80,000 to $250,000+. If a vendor’s quote is more than 2× outside these ranges without a clear technical reason, treat that as a red flag and get two more bids before signing. Cross-check pricing against directories like Clutch or DesignRush to validate the range.
4: How important is data privacy when hiring an AI vendor?
Data privacy should be one of the first things you discuss, not the last. Any AI agency handling customer data must be able to explain where the data lives during training and inference, whether they redact PII before sending data to third-party LLMs, whether they have signed DPAs with model providers, and whether they offer self-hosted or zero-retention enterprise model tiers for sensitive workloads. If they can’t answer those four questions clearly, they’re not ready to handle regulated data.
5: Should I worry if an agency only uses one LLM like GPT-4?
Yes. No single model is best for every task. Claude tends to perform better on reasoning and instruction-following; GPT-4 is often stronger at creative output; Gemini is faster and cheaper for simple lookups; and self-hosted models like Llama or Mistral are essential for privacy-sensitive workloads. An agency locked into one provider is either commercially biased by a partnership deal or doesn’t have enough technical depth to compare alternatives. Either way, you’ll get worse performance and higher costs than a multi-model approach would deliver.
6: What is AI washing and how do I spot it?
AI washing is when a vendor labels a product or service as “AI-powered” when it’s really rule-based logic, basic automation, or third-party API calls dressed up in marketing language. Regulators are now enforcing against it. In September 2024, the FTC launched Operation AI Comply with five cases, including DoNotPay’s “AI lawyer” service and the AI writing tool Rytr. Earlier, in March 2024, the SEC brought its first “AI washing” actions against investment advisers Delphia and Global Predictions for overstating their use of AI. To spot it, ask the vendor which specific models they use, what training data is involved, how the system behaves on edge cases, and what happens when the model fails. Real AI builders give precise technical answers; AI washers pivot to vague benefits language about “transformation” and “intelligence.”
7: How long should an AI MVP take to build?
A working AI MVP should take 4–8 weeks from kickoff to a usable pilot, not 4–8 months. Simple AI agents (form intake, basic classification, single-LLM Q&A) often ship in 2–3 weeks. Production-grade systems with custom integrations, multi-model routing, and compliance hardening take 6–12 weeks. Anything longer than 8 weeks to first working code usually means the agency is selling strategy slides instead of shipping software, which is Red Flag #2 in disguise. Insist on working staging-environment code within the first two weeks, even if it’s minimal.
8: Should I hire an AI agency or build AI in-house?
Hire an AI agency when you need a working production system in under 8 weeks, you don’t have existing in-house ML or LLM engineering talent, or the project sits outside your team’s core competency (compliance, multi-model routing, edge-case handling). Build in-house when AI is core to your product roadmap, you have or can hire 2+ senior engineers with production LLM experience, and you can afford a 6–12 month ramp-up. A common hybrid approach: hire an agency to build and ship the first system, then have them transfer code, documentation, and operational runbooks to your in-house team for ongoing iteration. This is faster, cheaper, and lower-risk than either pure-build or pure-buy.
How to Use This Guide on Your Next Sales Call
Print the 12 flags, pull up the four test questions, and run any AI vendor through them in a single 30-minute call (the step-by-step workflow is above). By the end, you’ll have a flag count and a clear go/no-go signal.
The agencies worth hiring will welcome these questions. They’ve answered them dozens of times, the answers are crisp, and they’ll often start asking better questions back about your actual problem. The agencies you should walk away from, the ones exhibiting any combination of these red flags when hiring an AI development company, will dodge, deflect, redirect to a sales pitch, or get visibly uncomfortable. That’s the whole signal.
AI is too important and too expensive to get wrong. Spend an extra week on vetting now, save yourself $50,000 to $300,000 and 6 to 12 months of pain later.
Pick the agency that already ships, not the one that promises to.



