<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>pragmat.ai</title>
  <subtitle>Pragmatic GenAI. May contain traces of actual engineering.</subtitle>
  <link href="https://pragmat.ai/feed.xml" rel="self" />
  <link href="https://pragmat.ai/" />
  <updated>2026-02-15T12:00:00Z</updated>
  <id>https://pragmat.ai/</id>
  <author>
    <name>pragmat.ai</name>
  </author>
  <entry>
    <title>Why pragmat.ai?</title>
    <link href="https://pragmat.ai/why/" />
    <updated>2025-10-13T12:00:00Z</updated>
    <id>https://pragmat.ai/why/</id>
    <content type="html">&lt;div class=&quot;terminal-header&quot;&gt;&lt;span class=&quot;terminal-user&quot;&gt;pragmat@ai&lt;/span&gt;&lt;span class=&quot;terminal-punct&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;terminal-path&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;terminal-punct&quot;&gt;$&lt;/span&gt; &lt;span class=&quot;terminal-cmd&quot;&gt;find&lt;/span&gt; &lt;span class=&quot;terminal-path&quot;&gt;posts/&lt;/span&gt; &lt;span class=&quot;terminal-flag&quot;&gt;-name&lt;/span&gt; &lt;span class=&quot;terminal-string&quot;&gt;&quot;why*&quot;&lt;/span&gt; &lt;span class=&quot;terminal-punct&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;terminal-cmd&quot;&gt;head&lt;/span&gt; &lt;span class=&quot;terminal-flag&quot;&gt;-1&lt;/span&gt; &lt;span class=&quot;terminal-punct&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;terminal-cmd&quot;&gt;xargs&lt;/span&gt; &lt;span class=&quot;terminal-cmd&quot;&gt;cat&lt;/span&gt;&lt;/div&gt;&lt;p&gt;Around mid-2025, a CEO showed me a GenAI prototype their team built in two days. It was impressive. Natural language to SQL queries, instant insights from complex data, the works. “Why can’t we ship this next sprint?” they asked.&lt;/p&gt;&lt;img src=&quot;https://pragmat.ai/assets/img/posts/why-prototype-paradox.jpg&quot; alt=&quot;The Prototype Paradox - Developer with &#39;2 days to build&#39; on left, burning production issues on right&quot; class=&quot;article-image&quot;&gt;&lt;p&gt;I’ve heard this question before. Twenty-three years ago, during the dotcom boom, it was “Why can’t we just put the Access database on the web?” The tool had changed, but the gap remained: the distance between something that works on your laptop and something that survives contact with reality.&lt;/p&gt;&lt;p&gt;That prototype? No input validation. No rate limiting. No cost controls. Token costs that would balloon with real usage. Error handling that consisted of retry loops that could cascade into service outages. 
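&lt;/p&gt;&lt;p&gt;None of these gaps needs exotic machinery. As a sketch only (the names, limits, and numbers here are illustrative, not the prototype’s actual code), a bounded retry with backoff plus a hard spend cap looks roughly like this:&lt;/p&gt;

```javascript
// Sketch: bounded retry with exponential backoff and a hard daily spend cap.
// All names and limits are illustrative; tune them to your own system.
const MAX_RETRIES = 3;
const BASE_DELAY_MS = 250;
const DAILY_BUDGET_USD = 50;
let spentTodayUsd = 0;

async function guardedCall(callModel, estimatedCostUsd) {
  // Hard cost ceiling: refuse the call instead of silently running up the bill.
  if (spentTodayUsd + estimatedCostUsd > DAILY_BUDGET_USD) {
    throw new Error("Daily LLM budget exhausted; degrade gracefully");
  }
  let attemptsLeft = MAX_RETRIES;
  while (attemptsLeft > 0) {
    try {
      const result = await callModel();
      spentTodayUsd += estimatedCostUsd;
      return result;
    } catch (err) {
      attemptsLeft -= 1;
      // Bounded retries fail fast instead of cascading into an outage.
      if (attemptsLeft === 0) throw err;
      const delayMs = BASE_DELAY_MS * 2 ** (MAX_RETRIES - attemptsLeft);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

&lt;p&gt;A dozen lines, not a research project. The prototype had none of them.&lt;/p&gt;&lt;p&gt;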
The SQL it generated was functional but would have brought down their production database under real load.&lt;/p&gt;&lt;pre&gt;&lt;code class=&quot;language-javascript&quot;&gt;/*
 * TODO: Add input validation
 * TODO: Implement rate limiting
 * TODO: Add cost controls
 * TODO: Fix error handling
 * FIXME: Everything before production
 * Ship date: Next sprint
 */
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This is not a GenAI problem. This is an engineering problem as old as software itself.&lt;/p&gt;&lt;h3&gt;The Pattern Repeats&lt;/h3&gt;&lt;p&gt;During the dotcom years, everyone could suddenly build websites. HTML was approachable. FrontPage made it visual. The barrier to entry collapsed overnight. Leadership saw demos and assumed the hard part was done. Then systems crashed, data leaked, businesses failed. Not because the web was bad technology, but because the craft of engineering cannot be abstracted away by tools.&lt;/p&gt;&lt;p&gt;GenAI follows the same pattern, only faster and more seductive. A junior developer with ChatGPT can create in hours what used to take weeks. This is genuinely powerful. But they can also create time bombs that look like solutions.&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;The democratization of prototyping has convinced too many leaders that the engineering discipline required for production systems has somehow become optional.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;It hasn’t.&lt;/p&gt;&lt;h3&gt;Experience Changes Everything&lt;/h3&gt;&lt;p&gt;In experienced hands, GenAI changes the game. I use it daily. Code review acceleration. Documentation generation. Test case creation. Complex refactoring assistance. Real, measurable productivity gains. But I also know when to ignore its suggestions, how to validate its output, where its reasoning breaks down. Twenty years of debugging production failures teaches you to spot the disasters hiding in seemingly perfect code.&lt;/p&gt;&lt;p&gt;This is why &lt;a href=&quot;https://pragmat.ai&quot;&gt;pragmat.ai&lt;/a&gt; exists. Not to gatekeep or dismiss genuine innovation, but to share what two decades of building enterprise systems teaches you about technology transitions. 
To help engineering leaders navigate between the pressure of “everyone else is shipping AI features” and the reality of what it takes to build systems that don’t just demo well, but actually work when your business depends on them.&lt;/p&gt;&lt;h3&gt;Fundamentals Are Not Features&lt;/h3&gt;&lt;p&gt;The tools are more powerful than ever. The fundamentals remain unchanged. Systems still need to scale. Security still matters. Technical debt still compounds. And the distance between a compelling demo and a production-ready system is still measured in engineering discipline, not technological promises.&lt;/p&gt;&lt;p&gt;Here’s what I know: GenAI will transform how we build software, but not in the way most people think. The revolution isn’t in replacing engineers or making expertise obsolete. It’s in amplifying what experienced practitioners can accomplish. The gap between those who understand this and those who don’t will define the next generation of technical success and failure.&lt;/p&gt;&lt;p&gt;I’ve lived through enough technology waves to recognize the pattern. The winners aren’t the ones who adopt fastest or resist longest. They’re the ones who understand both the genuine potential and the genuine limitations. Who can separate signal from noise. Who know that sustainable advantage comes from engineering excellence, not from tool selection.&lt;/p&gt;&lt;p&gt;That’s the perspective you’ll find here. Not anti-AI. Not pro-AI. Just pragmatic engineering reality from someone who’s been building and breaking systems long enough to know the difference between promise and production.&lt;/p&gt;&lt;p&gt;If you’re an engineering leader trying to navigate GenAI adoption while keeping your systems stable and your budgets intact, you’re in the right place.&lt;/p&gt;&lt;p&gt;Welcome to &lt;a href=&quot;https://pragmat.ai&quot;&gt;pragmat.ai&lt;/a&gt;. 
Let’s build something that actually works.&lt;/p&gt;&lt;hr&gt;&lt;p&gt;&lt;code&gt;// pragmat.ai v1.0 - Compiled with experience, may contain traces of actual engineering&lt;/code&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Your GenAI POC Isn&#39;t Lying. It&#39;s Just Not Telling the Whole Truth.</title>
    <link href="https://pragmat.ai/posts/genai-poc-whole-truth/" />
    <updated>2025-11-03T12:00:00Z</updated>
    <id>https://pragmat.ai/posts/genai-poc-whole-truth/</id>
    <content type="html">&lt;p&gt;Your executive built a “working prototype” over the weekend with ChatGPT and a few company docs. It answered support questions beautifully. They demoed it Monday and now want to know why engineering needs two months to ship it.&lt;/p&gt;&lt;img src=&quot;https://pragmat.ai/assets/img/posts/genai-poc-whole-truth.jpg&quot; alt=&quot;A glowing cube with data visualizations breaking apart, representing the fragility of GenAI POCs under production pressure&quot; class=&quot;article-image&quot;&gt;&lt;p&gt;They enjoyed &lt;strong&gt;demo privilege&lt;/strong&gt;: curated inputs, single user, no logging, no audit, no cost caps, no failure paths, and a human babysitter. They proved the experience under perfect conditions. Your team has to make it work when those privileges vanish. That is most of the work.&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;&lt;strong&gt;Old problems; new accelerants.&lt;/strong&gt; These are well-known software realities: concurrency, edge cases, cost, failure modes. GenAI just lets an executive skip engineering during the demo, compressing weeks into a weekend and making the gap look trivial when it isn’t.&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;Demo privilege is real; production adds the friction&lt;/h2&gt;&lt;p&gt;The weekend demo answered one question: &lt;strong&gt;can this produce something useful under perfect conditions?&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Production asks different ones. Concurrency. Unanticipated inputs. Cost ceilings. Adversarial use. Bad upstreams. All at once. Those questions always existed; GenAI just surfaces them faster.&lt;/p&gt;&lt;h3&gt;The C-suite fantasy&lt;/h3&gt;&lt;p&gt;“I built a prototype in a weekend. If it’s this easy, why two months?”&lt;/p&gt;&lt;p&gt;Because you had demo privilege. You picked the one document that works. You asked questions you already knew the answers to. 
You didn’t test what happens when 50 people hit it at once, when someone pastes a 30-page PDF, when the model is confidently wrong, or when the token bill arrives.&lt;/p&gt;&lt;p&gt;Your prototype proved the &lt;strong&gt;interface&lt;/strong&gt; feels good. It didn’t prove the &lt;strong&gt;system&lt;/strong&gt; survives contact with reality.&lt;/p&gt;&lt;p&gt;It’s the team’s job to &lt;strong&gt;build without&lt;/strong&gt; those privileges. It’s your job to &lt;strong&gt;sponsor the time and scope&lt;/strong&gt; to do it right. That has always been most of the work; GenAI just made it easier to forget.&lt;/p&gt;&lt;h3&gt;Why this fantasy repeats&lt;/h3&gt;&lt;p&gt;&lt;strong&gt;Availability bias.&lt;/strong&gt; Demos are vivid. Tail risks are invisible until they’re expensive.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;No blast radius.&lt;/strong&gt; Prototypes carry zero operational accountability. Production carries all of it.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Success theater.&lt;/strong&gt; Shipping fast looks like leadership. Paying the interest later lands on someone else’s budget.&lt;/p&gt;&lt;p&gt;These patterns aren’t new. Executives have always underestimated the gap between working demo and working system. GenAI just shortened the time it takes to create that gap from weeks to a weekend, making the pattern repeat faster and with more confidence.&lt;/p&gt;&lt;h2&gt;Weekend demo → production delivery&lt;/h2&gt;&lt;p&gt;Software has always had these gaps. 
GenAI doesn’t change the destination; it just makes the starting point look deceptively close.&lt;/p&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Demo (Frictionless)&lt;/th&gt;&lt;th&gt;Production (Friction)&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Curated docs; one happy file&lt;/td&gt;&lt;td&gt;Every messy file; OCR errors, tables, encodings, contradictions&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Single request; warm model&lt;/td&gt;&lt;td&gt;Sustained RPS; cold starts; retries; pooling; caches&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;“Looks right”&lt;/td&gt;&lt;td&gt;Eval set, thresholds, drift detection, CI gates&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Swipe card&lt;/td&gt;&lt;td&gt;Per-step budgets; hard caps; truncation; cost regression; forecasts&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Provider defaults&lt;/td&gt;&lt;td&gt;Red-team suite; PII masking; retention map; runbooks&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;You watching a spinner&lt;/td&gt;&lt;td&gt;On-call; tracing; correlation IDs; dashboards; tuned alerts&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;blockquote&gt;&lt;p&gt;&lt;strong&gt;A demo is frictionless; production adds the friction that keeps systems honest.&lt;/strong&gt; You microwaved a snack; they have to run a kitchen.&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;What your team heard (and what you should say)&lt;/h2&gt;&lt;div class=&quot;translation-exchange&quot;&gt;&lt;p class=&quot;said&quot;&gt;&lt;span class=&quot;label&quot;&gt;What you said:&lt;/span&gt; &quot;I built this in a weekend.&quot;&lt;/p&gt;&lt;p class=&quot;heard&quot;&gt;&lt;span class=&quot;label&quot;&gt;What the team heard:&lt;/span&gt; No contracts. No tests. No guardrails. No cost controls. No failure modes. Please replicate the magic and take the blame when it doesn&#39;t scale.&lt;/p&gt;&lt;p class=&quot;fix&quot;&gt;&lt;span class=&quot;label&quot;&gt;What you should say:&lt;/span&gt; &quot;I validated the experience. 
Now harden it for scale, cost, and audit.&quot;&lt;/p&gt;&lt;/div&gt;&lt;hr&gt;&lt;div class=&quot;translation-exchange&quot;&gt;&lt;p class=&quot;said&quot;&gt;&lt;span class=&quot;label&quot;&gt;What you said:&lt;/span&gt; &quot;It worked great in the demo.&quot;&lt;/p&gt;&lt;p class=&quot;heard&quot;&gt;&lt;span class=&quot;label&quot;&gt;What the team heard:&lt;/span&gt; Golden path, single user, curated data, zero edge cases tested.&lt;/p&gt;&lt;p class=&quot;fix&quot;&gt;&lt;span class=&quot;label&quot;&gt;What you should say:&lt;/span&gt; &quot;Design validated. Define what production-ready means and show me the evidence.&quot;&lt;/p&gt;&lt;/div&gt;&lt;hr&gt;&lt;div class=&quot;translation-exchange&quot;&gt;&lt;p class=&quot;said&quot;&gt;&lt;span class=&quot;label&quot;&gt;What you said:&lt;/span&gt; &quot;We can optimize later.&quot;&lt;/p&gt;&lt;p class=&quot;heard&quot;&gt;&lt;span class=&quot;label&quot;&gt;What the team heard:&lt;/span&gt; We have no idea where tokens go or what this costs at scale.&lt;/p&gt;&lt;p class=&quot;fix&quot;&gt;&lt;span class=&quot;label&quot;&gt;What you should say:&lt;/span&gt; &quot;Set the cost ceiling we can defend to finance. 
Show me the controls and the analysis.&quot;&lt;/p&gt;&lt;/div&gt;&lt;hr&gt;&lt;div class=&quot;translation-exchange&quot;&gt;&lt;p class=&quot;said&quot;&gt;&lt;span class=&quot;label&quot;&gt;What you said:&lt;/span&gt; &quot;Let&#39;s scale after we launch.&quot;&lt;/p&gt;&lt;p class=&quot;heard&quot;&gt;&lt;span class=&quot;label&quot;&gt;What the team heard:&lt;/span&gt; We didn&#39;t test concurrency; we&#39;re hoping for the best.&lt;/p&gt;&lt;p class=&quot;fix&quot;&gt;&lt;span class=&quot;label&quot;&gt;What you should say:&lt;/span&gt; &quot;Define the performance target and prove we can hit it under load.&quot;&lt;/p&gt;&lt;/div&gt;&lt;hr&gt;&lt;div class=&quot;translation-exchange&quot;&gt;&lt;p class=&quot;said&quot;&gt;&lt;span class=&quot;label&quot;&gt;What you said:&lt;/span&gt; &quot;Safety is handled by the vendor.&quot;&lt;/p&gt;&lt;p class=&quot;heard&quot;&gt;&lt;span class=&quot;label&quot;&gt;What the team heard:&lt;/span&gt; Untested abuse scenarios and unclear retention.&lt;/p&gt;&lt;p class=&quot;fix&quot;&gt;&lt;span class=&quot;label&quot;&gt;What you should say:&lt;/span&gt; &quot;Nothing launches until we&#39;ve tested for abuse and mapped our data boundaries. What do you need?&quot;&lt;/p&gt;&lt;/div&gt;&lt;h2&gt;Executive reality check&lt;/h2&gt;&lt;p&gt;Before asking “why can’t we just ship this,” commit to:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;strong&gt;Scope:&lt;/strong&gt; Pick two on a tight timeline: features, accuracy, SLA.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Evidence:&lt;/strong&gt; Authorize time for load tests, cost analysis, and accuracy baselines. 
Data, not vibes.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Budget:&lt;/strong&gt; Set a token ceiling and accept automatic degradation when it’s exceeded.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Risk:&lt;/strong&gt; Define user-visible failure modes for wrong, slow, and down before launch.&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ownership:&lt;/strong&gt; Name on-call and escalation today.&lt;/li&gt;&lt;/ul&gt;&lt;h3&gt;Direct Message to the Executive&lt;/h3&gt;&lt;p&gt;Your weekend demo proved the experience is worth pursuing. Your job isn’t to dismiss engineering reality; your job is to recognize that &lt;strong&gt;“works in my hands under perfect conditions”&lt;/strong&gt; and &lt;strong&gt;“works at scale, under load, within budget, when things break, and when auditors ask”&lt;/strong&gt; are different contracts that require different evidence.&lt;/p&gt;&lt;p&gt;GenAI compresses prototyping from weeks to a weekend. Treat that velocity as proof you can bypass engineering and the gains invert. Costs rise. Accuracy drifts. Tech debt compounds faster than you can staff it. Applaud the demo; &lt;strong&gt;underwrite the hardening&lt;/strong&gt;. That’s the difference between momentum and mess.&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;&lt;strong&gt;POCs are theater; production is surgery. Measure like a surgeon.&lt;/strong&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;code&gt;// Pragmatic GenAI. May contain traces of actual engineering&lt;/code&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Lines of Code Are Dead. Tokens Aren&#39;t the Answer Either.</title>
    <link href="https://pragmat.ai/posts/lines-of-code-tokens-vanity-metrics/" />
    <updated>2025-12-22T12:00:00Z</updated>
    <id>https://pragmat.ai/posts/lines-of-code-tokens-vanity-metrics/</id>
    <content type="html">&lt;p&gt;The board wants GenAI wins. Not exploration. Not ideation. Wins. They want competitive advantage, visible progress, and proof they are not sleepwalking through the biggest shift since the early internet.&lt;/p&gt;&lt;p&gt;C-suites need momentum on a slide. By next quarter. The pressure is legitimate. Show that the money spent on ChatGPT licenses and vendor briefings is producing value.&lt;/p&gt;&lt;p&gt;So leadership measures what they can see.&lt;/p&gt;&lt;p&gt;Spoiler: success is not what they are measuring.&lt;/p&gt;&lt;h2&gt;The Celebrated Metrics&lt;/h2&gt;&lt;div class=&quot;translation-exchange&quot;&gt;&lt;p class=&quot;celebration&quot;&gt;&quot;AI adoption is soaring. Ten thousand prompts this month.&quot;&lt;/p&gt;&lt;p class=&quot;rationale&quot;&gt;Sales, marketing, operations, support. Everyone&#39;s using it. Usage graphs are climbing. The quarterly deck looks impressive. Even the CFO is using ChatGPT to better understand the growing GPT invoice.&lt;/p&gt;&lt;/div&gt;&lt;div class=&quot;translation-exchange&quot;&gt;&lt;p class=&quot;celebration&quot;&gt;&quot;Our top performer needed increased token allocation. Leading the charge!&quot;&lt;/p&gt;&lt;p class=&quot;rationale&quot;&gt;Developer X is all-in on GenAI. Burning through tokens faster than anyone on the team. Clearly embracing the future. Worthy of recognition. Finance approved the budget increase to keep the momentum going.&lt;/p&gt;&lt;/div&gt;&lt;div class=&quot;translation-exchange&quot;&gt;&lt;p class=&quot;celebration&quot;&gt;&quot;Technical excellence: Engineer running 30+ MCP servers simultaneously. Mastering the tools!&quot;&lt;/p&gt;&lt;p class=&quot;rationale&quot;&gt;Model Context Protocol servers for everything. Documentation, APIs, databases, code repos, internal wikis; comprehensive coverage. The setup is impressive. 
The team lead is documenting it as a best practice for others to follow.&lt;/p&gt;&lt;/div&gt;&lt;h2&gt;What’s Actually Being Measured&lt;/h2&gt;&lt;p&gt;What do these have in common? They’re all measuring activity. None of them measure outcomes.&lt;/p&gt;&lt;p&gt;Welcome to Metrics Theater. The sequel to &lt;a href=&quot;https://pragmat.ai/posts/genai-poc-whole-truth/&quot;&gt;Success Theater&lt;/a&gt;: we built demos that looked good but didn’t survive production. Now we’re measuring numbers that look good but don’t measure value.&lt;/p&gt;&lt;img src=&quot;https://pragmat.ai/assets/img/posts/lines-of-code-tokens-vanity-metrics.jpg&quot; alt=&quot;Lines of code vs tokens processed - same vanity metrics, different era&quot; class=&quot;article-image&quot;&gt;&lt;p&gt;Lines of code didn’t measure quality twenty years ago. We learned that lesson the hard way: more code meant more bugs, more maintenance debt, more complexity. Volume was a vanity metric. The industry moved on.&lt;/p&gt;&lt;p&gt;All of this has happened before, and all of this is happening again.&lt;/p&gt;&lt;p&gt;Prompt counts, token consumption, MCP server proliferation. Different wrappers, same delusion. We’re celebrating activity and calling it progress. We’re rewarding consumption and calling it performance. We’re documenting waste and calling it best practice.&lt;/p&gt;&lt;h2&gt;Let’s Dismantle This&lt;/h2&gt;&lt;h3&gt;The Activity Trap&lt;/h3&gt;&lt;p&gt;Ten thousand prompts in a month. Fourteen hundred hours of employee time. Sounds impressive until you ask one question: how many of those outputs were actually used?&lt;/p&gt;&lt;p&gt;Prompt counts measure how many times someone asked an LLM something. They don’t measure whether the answer was useful, whether it accelerated work, or whether it went anywhere at all. A thousand questions that produce zero usable answers is just expensive noise.&lt;/p&gt;&lt;p&gt;Activity has never equaled productivity. We knew this. 
Somehow we forgot.&lt;/p&gt;&lt;h3&gt;The Complexity Trap&lt;/h3&gt;&lt;p&gt;Running 30+ MCP servers simultaneously isn’t mastery. It’s complexity worship.&lt;/p&gt;&lt;p&gt;You know the type. Seven layers of abstraction where two would do. Microservices for a CRUD app. It looks sophisticated. It sounds impressive in architecture reviews. And when that engineer presented this setup, nobody in leadership asked what problem 30 servers solved that 3 couldn’t.&lt;/p&gt;&lt;p&gt;Here’s what that sprawl actually creates: context windows drowning in redundant information, premature context rot as signal dies in noise, token burn at conversation startup before any work begins. Every server adds overhead. Every source adds noise.&lt;/p&gt;&lt;p&gt;The pragmatic engineer running 3 carefully chosen MCP servers, getting clean context and shipping quality code? Invisible. The inefficient one burning 5x the tokens for equivalent results? Documented as best practice.&lt;/p&gt;&lt;p&gt;MCP isn’t the problem. Sprawl is the problem. And leadership just institutionalized it.&lt;/p&gt;&lt;h3&gt;The Consumption Trap&lt;/h3&gt;&lt;p&gt;This is where the threads converge. Token usage is up 300%. Your highest consumer needed increased allocation; clearly leading the charge on AI adoption. Budget approved. Recognition given. The quarterly deck shows the upward trend.&lt;/p&gt;&lt;p&gt;But nobody asked why. Nobody asked what you got for it. You have the report. You know who your top token consumers are. You celebrated them. You increased their budgets. You never asked what they produced.&lt;/p&gt;&lt;p&gt;High token consumption might mean poorly designed prompts stuffing unnecessary context. It might mean inefficient workflows repeating failed attempts. It might mean context bloat drowning signal in noise. It might mean plausible-sounding nonsense that got discarded after burning tokens. 
It might mean burning money on garbage outputs nobody used.&lt;/p&gt;&lt;p&gt;You didn’t measure cost per outcome. You didn’t ask if they’re shipping quality code efficiently or just burning cloud budget theatrically.&lt;/p&gt;&lt;p style=&quot;font-size: var(--font-size-lg);&quot;&gt;You saw a big number and handed out the trophy. &lt;em&gt;What the fuck are we actually doing?&lt;/em&gt;&lt;/p&gt;&lt;p&gt;The CFO isn’t using ChatGPT to understand the invoice because you’re winning. The CFO is using ChatGPT because the numbers don’t make sense and nobody can explain what you got for them. Finance didn’t approve the budget increase because your top consumer is creating value. Finance approved it because leadership said “this is important” and finance doesn’t have the technical depth to push back.&lt;/p&gt;&lt;p&gt;You rewarded your least efficient developers. And now every other engineer sees what gets celebrated: token burn gets visibility, complexity gets rewarded, efficiency gets ignored.&lt;/p&gt;&lt;p style=&quot;margin-top: var(--spacing-2xl);&quot;&gt;Three metrics. Three traps. One pattern: rewarding visibility over value.&lt;/p&gt;&lt;p&gt;We spent twenty years teaching the industry that volume is a vanity metric!!!&lt;/p&gt;&lt;p&gt;&lt;em&gt;I’m frustrated because this is predictable.&lt;/em&gt;&lt;/p&gt;&lt;div class=&quot;breath-moment&quot;&gt;(Breathe.)&lt;/div&gt;&lt;h2&gt;Pattern Recognition&lt;/h2&gt;&lt;p&gt;We’ve been here before. Lines of code. Function points. Velocity without context. Sprint completion rates that meant nothing. The industry has a long history of measuring the wrong things, realizing it, and correcting course.&lt;/p&gt;&lt;div class=&quot;history-repeating&quot;&gt;&lt;div class=&quot;era&quot;&gt;&lt;span class=&quot;year&quot;&gt;2004&lt;/span&gt; &lt;span class=&quot;metric&quot;&gt;Lines of Code&lt;/span&gt; &lt;span class=&quot;result&quot;&gt;Bloat. Bugs. 
Complexity.&lt;/span&gt;&lt;/div&gt;&lt;div class=&quot;era current&quot;&gt;&lt;span class=&quot;year&quot;&gt;2025&lt;/span&gt; &lt;span class=&quot;metric&quot;&gt;Token Count&lt;/span&gt; &lt;span class=&quot;result&quot;&gt;Bloat. Bugs. Complexity.&lt;br&gt;&lt;span class=&quot;result-punchline&quot;&gt;Now With More Cost.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;p&gt;Board pressure is legitimate. The question was never whether to measure GenAI usage. It’s what to measure. And right now, you’re reaching for metrics that feel familiar but don’t fit.&lt;/p&gt;&lt;p&gt;Here’s the thing about those twenty years of learning: engineers learned that lesson. The debates about LOC happened in code reviews, methodology arguments, engineering retrospectives. The industry moved on because engineers stopped using volume as a proxy for quality.&lt;/p&gt;&lt;p&gt;But most execs weren’t in those rooms. From their vantage point, activity metrics looked like progress. And progress translates to momentum in the boardroom.&lt;/p&gt;&lt;p&gt;So when GenAI arrived and the board wanted metrics, you reached for familiar shapes. Token counts feel like lines of code. Usage graphs feel like adoption curves. Big numbers feel like progress. The pattern match wasn’t malice or laziness. It was reaching for tools that looked right from a distance.&lt;/p&gt;&lt;p&gt;But GenAI doesn’t map cleanly to those shapes. Volume and value are decoupled. High consumption can mean high waste just as easily as high productivity. The metrics that look right are measuring the wrong thing.&lt;/p&gt;&lt;p&gt;Here’s the feedback loop you’ve created: celebrate activity metrics, and engineers optimize for activity. Token consumption goes up. Prompt counts climb. The numbers look good, so you report progress. But the underlying value never moved. Outcomes, efficiency, velocity stayed flat.&lt;/p&gt;&lt;p&gt;The incentive structure is the problem. Reward consumption, get consumption. 
Reward outcomes, get outcomes. Right now, you’re rewarding the wrong thing.&lt;/p&gt;&lt;h2&gt;What Actually Matters&lt;/h2&gt;&lt;p&gt;If prompts, tokens, and complexity don’t measure success, what does?&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;&lt;em&gt;Stop measuring how much GenAI your organization consumes. Start measuring what you got for it.&lt;/em&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h4&gt;Adoption Rate&lt;/h4&gt;&lt;p&gt;Has GenAI been absorbed into actual workflows, or does it remain adjacent to the real work?&lt;/p&gt;&lt;p&gt;Look for workflow displacement, not just usage:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Did time-to-completion decrease, or are you just doing more steps now?&lt;/li&gt;&lt;li&gt;Are cycle times decreasing (code reviews, documentation, debugging)?&lt;/li&gt;&lt;li&gt;Is GenAI in the critical path, or is it a side tool people use occasionally?&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;em&gt;The test:&lt;/em&gt; If you turned off GenAI tomorrow, would velocity drop significantly, or would work continue largely unchanged? If the answer is “largely unchanged,” you don’t have adoption, you have activity.&lt;/p&gt;&lt;h4&gt;Cost Per Outcome&lt;/h4&gt;&lt;p&gt;What did you spend per successful business result? Developer A burns 500K tokens and ships three features. Developer B burns 50K tokens and ships three features. One of them understands efficiency. The other one got a participation trophy.&lt;/p&gt;&lt;h4&gt;Time Saved&lt;/h4&gt;&lt;p&gt;Measurable reduction in task completion time compared to manual process. If your “AI-accelerated” workflow takes the same amount of time as the old process, you didn’t accelerate anything. You added complexity and a cloud bill.&lt;/p&gt;&lt;h4&gt;Quality Improvement&lt;/h4&gt;&lt;p&gt;Fewer errors, higher accuracy, better outcomes compared to baseline. GenAI that produces work requiring the same level of human review and correction as the manual process isn’t creating value. 
It’s creating busywork.&lt;/p&gt;&lt;h4&gt;Business Impact&lt;/h4&gt;&lt;p&gt;Which specific KPI moved because of this GenAI initiative? Revenue up? Costs down? Customer satisfaction improved? Deployment time reduced? If you can’t connect the GenAI spend to a business metric that matters to the board, you’re not showing wins, you’re showing activity.&lt;/p&gt;&lt;p&gt;These aren’t aspirational metrics. They’re the minimum bar for knowing whether GenAI is creating value or consuming budget.&lt;/p&gt;&lt;h3&gt;Direct Message to the Executive&lt;/h3&gt;&lt;p&gt;Board pressure is real. You need to show GenAI wins. But manufacturing metrics that look good in quarterly decks while actual value stays flat isn’t strategy. It’s borrowed time until someone asks the ROI question you can’t answer.&lt;/p&gt;&lt;p&gt;Not because GenAI failed. Because you measured theater instead of outcomes.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Start with the metrics that matter.&lt;/strong&gt; Adoption rate. Cost per outcome. Time saved. Quality improvement. Business impact. Put these in your next review.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Ask the uncomfortable questions.&lt;/strong&gt; What did we actually get for this spend? Who’s shipping value efficiently versus burning tokens theatrically?&lt;/p&gt;&lt;p&gt;You’ll know you’re measuring the right things when the numbers make you uncomfortable. When workflow displacement reveals that most manual processes are still running in parallel. When cost per outcome shows who’s efficient and who’s burning budget. When time saved demonstrates that most workflows haven’t actually accelerated. That discomfort is data. Don’t smooth it over. Use it.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Champion the engineers who understand cost per outcome.&lt;/strong&gt; The ones who push back when big numbers get celebrated. The ones whose instinct is “what did we get for it?” rather than “look how much we’re using.” They exist. 
They’ve been doing the work while the metrics theater played out around them.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Reward outcomes, not consumption. Measure adoption, not generation. Celebrate efficiency, not complexity.&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;The board wants GenAI wins. Give them real ones.&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;&lt;strong&gt;Measure like an engineer. Report like an executive. Stop confusing the two.&lt;/strong&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;code&gt;// Pragmatic GenAI. Where vanity metrics go to die.&lt;/code&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>You&#39;re Not Adopting GenAI. You&#39;re Coping With It.</title>
    <link href="https://pragmat.ai/posts/youre-not-adopting-genai-youre-coping-with-it/" />
    <updated>2025-12-29T12:00:00Z</updated>
    <id>https://pragmat.ai/posts/youre-not-adopting-genai-youre-coping-with-it/</id>
    <content type="html">&lt;p&gt;Same input. Same output. Every time.&lt;/p&gt;&lt;p&gt;For decades, this pattern has been our mantra. We used words like &lt;em&gt;idempotent&lt;/em&gt; and &lt;em&gt;deterministic&lt;/em&gt; to describe it. This behavior isn’t incidental; it sits at the core of what we consider good software.&lt;/p&gt;&lt;p&gt;The expectation isn’t naïve. It’s discipline. It’s how reliable systems are built, tested, scaled, and trusted. Determinism is what lets teams debug confidently, automate safely, and hold engineering accountable. When something changes unexpectedly, you investigate. You don’t shrug. You don’t accept “close enough.” You fix it.&lt;/p&gt;&lt;p&gt;Most of us have spent our entire careers building on this foundation. Query a database twice with the same parameters and you expect identical results. Call an API with the same payload and you expect the same response. The deterministic behavior isn’t even a feature, it’s &lt;em&gt;the&lt;/em&gt; contract.&lt;/p&gt;&lt;p&gt;Regression suites exist because the contract can be proven. Same input. Same output. This is engineering discipline.&lt;/p&gt;&lt;p&gt;And it’s non-negotiable. Systems violating the contract don’t ship. Engineers producing flaky outputs get PRs rejected. Leaders tolerating inconsistent behavior are on the receiving end of incident tickets.&lt;/p&gt;&lt;p&gt;This is why modern software works so well.&lt;/p&gt;&lt;img src=&quot;https://pragmat.ai/assets/img/posts/youre-not-adopting-genai-youre-coping-with-it.jpg&quot; alt=&quot;Visual representation of the collision between deterministic expectations and probabilistic GenAI systems&quot; class=&quot;article-image&quot;&gt;&lt;p&gt;So when support reports the RAG system is producing inconsistent guidance for the same customer question, the reaction is predictable.&lt;/p&gt;&lt;p&gt;Same customer query. Varying outputs: structure reorganized, supporting details changed, emphasis shifted. 
Close enough to recognize the topic. Off enough to report as buggy.&lt;/p&gt;&lt;p&gt;This is how disciplined organizations are supposed to react. Unpredictable outputs trigger investigation. Something feels wrong. The system is behaving erratically. It gets escalated. Engineering is asked to take a look.&lt;/p&gt;&lt;p&gt;The instinct came from decades of deterministic software. Unreliable behavior means something broke. Reliable systems repeat their answers. This reflex has kept production systems stable for years.&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;And yet nothing is actually broken here.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The RAG system is behaving under a different contract. One producing plausible variation instead of identical repetition. Same question, defensible answers with shifting emphasis and framing. Probabilistic systems don’t honor deterministic contracts.&lt;/p&gt;&lt;p&gt;The collision starts when outputs are evaluated using deterministic standards.&lt;/p&gt;&lt;h2&gt;The Collision&lt;/h2&gt;&lt;p&gt;When GenAI tooling is rolled out, the familiarity comes along for the ride. New licenses are procured, usage tracked, and early wins celebrated.&lt;/p&gt;&lt;p&gt;The paradigm shift goes unrecognized, though. So support, sales, and operations do what they’ve always done.&lt;/p&gt;&lt;p&gt;Support escalates: customers are getting inconsistent answers to the same question. Yesterday’s explanation doesn’t match today’s. The knowledge base used to be consistent. Now it depends on what, exactly? Timing? Phrasing? The phase of the moon?&lt;/p&gt;&lt;p&gt;Sales feels it next. The proposal generator produces different justifications for the same deal structure. A client asks a clarifying question and gets a slightly different explanation than the one they heard yesterday. Now credibility is at risk.&lt;/p&gt;&lt;p&gt;Operations flags it too. The document summarizer highlights different takeaways from the same quarterly report. 
Which version goes into the board deck? Do we run it three times and pick one? What does “best” even mean when the system won’t commit?&lt;/p&gt;&lt;p&gt;These are not technical complaints. They are operational breakdowns.&lt;/p&gt;&lt;p&gt;The business runs on repeatable processes. GenAI introduces variation by design. Engineering applies deterministic discipline. The new tooling refuses deterministic contracts.&lt;/p&gt;&lt;p&gt;Repeatable processes don’t absorb plausible variation. They fracture. And when every team reports the same friction, the question leadership asks is always the same.&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;“Why can’t this be made consistent?”&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;The Coping Mechanisms&lt;/h2&gt;&lt;p&gt;When the paradigm doesn’t shift, organizations cope. The patterns are predictable.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Engineering absorbs unspoken expectations first.&lt;/strong&gt; Support defines success one way. Sales defines it differently. Operations has a third interpretation. None of these definitions are written down. None are reconciled. Engineering builds to expectations that shift by department, by stakeholder, by mood. When outputs don’t match what someone imagined, engineering absorbs the blame. The variation isn’t the bug. The lack of shared understanding is.&lt;/p&gt;&lt;p&gt;But engineering can’t absorb forever. So the coping spreads.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Prompts get rewritten to chase consistency.&lt;/strong&gt; Three sentences become three paragraphs. Instructions are added, reordered, expanded to “lock in” behavior. Each revision feels like progress because the output changes. Whether it improves is a different question. The prompt grows longer, more fragile, harder to reason about. Confidence erodes quietly.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Token budgets inflate next.&lt;/strong&gt; “We have unlimited tokens” becomes justification for the cycle. 
Outputs vary, so teams respond with more context, more examples, more retries. The assumption is simple: more tokens should mean better answers. Cost rises. Clarity does not.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;And evaluation stays stuck in the old paradigm.&lt;/strong&gt; Multiple outputs get compared side by side. Differences in wording, tone, emphasis become the focus. Conversations drift toward which answer is “correct” instead of whether any answer is useful. Variation becomes the problem to solve.&lt;/p&gt;&lt;p&gt;The system never passes a test it was never designed to take.&lt;/p&gt;&lt;h2&gt;What the Paradigm Shift Looks Like&lt;/h2&gt;&lt;p&gt;Here’s the shift, stated plainly.&lt;/p&gt;&lt;p&gt;Traditional software rewards determinism. Same input. Same output. Deviations are defects to be eliminated.&lt;/p&gt;&lt;p&gt;GenAI does not work that way. It produces &lt;em&gt;plausible&lt;/em&gt; outputs within a range. Variation is expected. Consistency comes from framing and evaluation, not from forcing identical responses. When the mental model changes, the behavior around the system changes with it.&lt;/p&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;When You Think This…&lt;/th&gt;&lt;th&gt;The Shift Is…&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;“It gave different answers; something’s broken”&lt;/td&gt;&lt;td&gt;Nothing broke. Expect a range of defensible answers, not one canonical response.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;“I need to rewrite this prompt until it gives the same results”&lt;/td&gt;&lt;td&gt;The prompt isn’t the control surface. Define what success looks like separately; tune for usefulness, not sameness.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;“We need more tokens to make the outputs more reliable”&lt;/td&gt;&lt;td&gt;More tokens won’t force consistency. 
Curate context deliberately; spend on precision, not volume.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;“Which of these three outputs is the correct one?”&lt;/td&gt;&lt;td&gt;Maybe all three. Evaluate against requirements, not against each other.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;“Engineering needs to fix why this keeps changing”&lt;/td&gt;&lt;td&gt;Engineering builds to outcomes, not expectations. Align on acceptable variation first; they’ll deliver to that.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;h4&gt;Prompts Stop Being the Control Surface&lt;/h4&gt;&lt;p&gt;When the mental model is correct, prompts stabilize early. They stop growing with every unexpected output. Instructions are no longer treated as enforcement mechanisms, but as framing tools. Success is defined outside the prompt, so variation in phrasing no longer triggers rewrites. The prompt stops being the thing teams fight over.&lt;/p&gt;&lt;h4&gt;Tokens Become a Design Choice, Not a Crutch&lt;/h4&gt;&lt;p&gt;With the right model, adding tokens is intentional. Context is curated, not inflated. Retries are purposeful, not reflexive. Teams can explain why additional context is necessary and when it is not. Token usage becomes part of system design instead of a substitute for clarity.&lt;/p&gt;&lt;h4&gt;Evaluation Shifts from Sameness to Usefulness&lt;/h4&gt;&lt;p&gt;Outputs are no longer judged against each other line by line. Evaluation moves to outcomes. Multiple answers can be acceptable if they satisfy the same objective. Differences in tone or emphasis are tolerated as long as the result works. The question stops being “Which one is correct?” and becomes “Did this do what we needed?”&lt;/p&gt;&lt;h4&gt;Expectations Get Defined, Not Assumed&lt;/h4&gt;&lt;p&gt;Success criteria are established early and evolve as understanding improves. What “good output” means gets articulated—including acceptable variation. Engineering builds to defined outcomes instead of inferred intent. 
When outputs vary within agreed parameters, that’s expected behavior, not an escalation. Alignment happens through iteration, not comprehensive upfront specification.&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;None of this requires new tooling. It requires a different contract.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;GenAI does not replace discipline. It relocates it. The discipline moves from enforcing sameness to defining bounds, from controlling outputs to agreeing on outcomes. Once that happens, the coping mechanisms disappear on their own.&lt;/p&gt;&lt;h3&gt;Direct Message to the Executive&lt;/h3&gt;&lt;p&gt;If you mandated GenAI adoption, you need to hold yourself to account.&lt;/p&gt;&lt;p&gt;You didn’t just approve a tool. You changed the contract your organization has operated under for decades. And you did it without saying so out loud.&lt;/p&gt;&lt;p&gt;You rolled GenAI out the way you’ve rolled out every other system. Licenses procured. Usage tracked. Early wins celebrated. On paper, it looked disciplined.&lt;/p&gt;&lt;p&gt;And that’s the problem.&lt;/p&gt;&lt;p&gt;GenAI is not deterministic software with new features. It’s a probabilistic system operating under a different contract. You deployed probabilistic tooling using deterministic playbooks. No explicit acknowledgment of the shift. No shared definition of what “good output” means when variation is expected. No guidance on how to evaluate systems that don’t guarantee repetition.&lt;/p&gt;&lt;p&gt;That frustration you’re seeing is not a tooling failure. It’s a leadership gap.&lt;/p&gt;&lt;p&gt;Here’s the paradox: you mandated adoption. Engineering deployed it. What about the paradigm shift?&lt;/p&gt;&lt;p&gt;You probably assigned someone. And they delivered your mandate: adoption metrics, use cases, governance. The visible work.&lt;/p&gt;&lt;p&gt;The shift never made the mandate. 
Who’s left holding the bag?&lt;/p&gt;&lt;p&gt;When you deploy probabilistic systems under deterministic contracts, every downstream team is set up to fail. Support escalates behavior that looks broken. Sales loses confidence in outputs that won’t repeat verbatim. Operations stalls trying to pick the “right” answer. Engineering absorbs the blame for a mismatch it didn’t create.&lt;/p&gt;&lt;p&gt;None of this is surprising. It’s the predictable outcome of introducing a new paradigm without the shift ever making the mandate.&lt;/p&gt;&lt;p&gt;You don’t fix this by buying better models. You don’t fix it by adding guardrails. You don’t fix it by asking engineering to stabilize behavior that is working as designed.&lt;/p&gt;&lt;p&gt;You fix it by doing the part only leadership can do.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Own the shift.&lt;/strong&gt; Explicitly. Publicly. Repeatedly.&lt;/p&gt;&lt;p&gt;Be clear about where consistency is required and where variation is acceptable. Be clear about how outputs should be evaluated. And be clear that engineering is not responsible for enforcing guarantees that were never defined.&lt;/p&gt;&lt;p&gt;Until that happens, your organization will keep coping. Prompts will grow. Tokens will inflate. Reviews will stall. Confidence will erode quietly. And you’ll keep hearing the same question framed a dozen different ways.&lt;/p&gt;&lt;p&gt;“Why can’t this be made consistent?”&lt;/p&gt;&lt;p&gt;Because consistency was never the promise. You just never told anyone.&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;If you want GenAI to work, stop treating it like software you already understand. Change the contract first. Everything else follows.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;code&gt;// Pragmatic GenAI. May contain traces of paradigm shifts&lt;/code&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>The Silent Cost</title>
    <link href="https://pragmat.ai/posts/genai-ruining-your-reputation-part-1/" />
    <updated>2026-02-15T12:00:00Z</updated>
    <id>https://pragmat.ai/posts/genai-ruining-your-reputation-part-1/</id>
    <content type="html">&lt;p&gt;I recently spent real time on a colleague’s pull request. My reviews tend to be thorough, methodical, and tempered with mentorship. I lean toward hints at what could be improved rather than prescriptive fixes. Over the years I’ve found this approach far more effective for developing engineers who solve problems rather than engineers who follow instructions.&lt;/p&gt;&lt;p&gt;This review was no different. I traced logic through services, questioned why a new dependency existed when the project already had a wrapper for the same thing, and wrote feedback meant to teach. In my own words. Because I take pride in our team’s success rate, our low bug frequency, and I respect my colleague enough to invest real effort.&lt;/p&gt;&lt;p&gt;The code was GenAI-driven. You could tell. Not because the code was bad. Because the code was &lt;em&gt;generic&lt;/em&gt;. Existing project patterns, ignored. A dependency we already had a wrapper for, reintroduced. The model defaulted to training-data assumptions about how things &lt;em&gt;should&lt;/em&gt; work rather than how things &lt;em&gt;actually&lt;/em&gt; work in our codebase, with our business context, under our constraints. A well-prompted model can evaluate existing patterns. This one was not asked to.&lt;/p&gt;&lt;img src=&quot;https://pragmat.ai/assets/img/posts/genai-ruining-your-reputation-part-1.jpg&quot; alt=&quot;An open ledger showing trust deposits from authentic human feedback versus withdrawals from AI-generated responses, with the withdrawal balance declining to near zero&quot; class=&quot;article-image&quot;&gt;&lt;p&gt;A conversation for another day. What matters here is what happened next.&lt;/p&gt;&lt;p&gt;The response to my review came back in minutes. Three structured paragraphs. Each one referenced specific points from my feedback. Acknowledged the architectural concern. Noted the dependency overlap. 
Closed with something along the lines of, “I’d be eager to better understand the established patterns here so I can align more closely going forward.”&lt;/p&gt;&lt;div class=&quot;breath-moment&quot;&gt;Read that again.&lt;/div&gt;&lt;p&gt;Sounds like comprehension. Like someone absorbing feedback and signaling growth. Exactly the kind of response a senior engineer hopes to get after investing real effort in a review.&lt;/p&gt;&lt;p&gt;I read enough LLM output daily to recognize the patterns. Authentic feedback went in. A cold LLM response came back. The signal was there, but the wrong signal was sent.&lt;/p&gt;&lt;h2&gt;What the Reviewer Actually Reads&lt;/h2&gt;&lt;p&gt;Here’s the thing about plausible-sounding nonsense. The person sending the message thinks the message lands. They read the response back, see structure and specificity, and feel confident they’ve responded adequately. The LLM flatters the sender. Always does.&lt;/p&gt;&lt;h3&gt;The recipient reads something different&lt;/h3&gt;&lt;div class=&quot;voice-contrast&quot;&gt;&lt;div class=&quot;voice-artificial&quot;&gt;&lt;span class=&quot;voice-label&quot;&gt;What was sent (AI-generated)&lt;/span&gt;&lt;p&gt;&quot;Thank you for the detailed review. You raise a valid point about the dependency overlap; I can see how consolidating would improve maintainability. The architectural concern around the service boundary is well-taken and I&#39;ll revisit the approach. I&#39;d be eager to better understand the established patterns here so I can align more closely going forward.&quot;&lt;/p&gt;&lt;/div&gt;&lt;div class=&quot;voice-subtext&quot;&gt;&lt;span class=&quot;voice-label&quot;&gt;What the reviewer actually reads&lt;/span&gt;&lt;p&gt;&quot;I pasted your feedback into the model and received something resembling comprehension. I didn&#39;t engage with any of this. I don&#39;t know why we have the abstractions we have. 
But the response looks professional enough to close the loop.&quot;&lt;/p&gt;&lt;/div&gt;&lt;div class=&quot;voice-human&quot;&gt;&lt;span class=&quot;voice-label&quot;&gt;What 90 seconds of actual human effort produces&lt;/span&gt;&lt;p&gt;&quot;Good catch on the dependency, honestly didnt realize we had a wrapper. Wheres the documentation? Also I went back and forth on the service boundary. My concern was where the validation layer sits. Does the existing pattern handle the case or is the split intentional? happy to pair on this if async isn&#39;t working.&quot;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;p&gt;The human version has typos in its soul. Imperfect. Asks a question revealing a gap in understanding. And precisely why the human version builds trust. Someone &lt;em&gt;thinking about the problem&lt;/em&gt; rather than performing comprehension.&lt;/p&gt;&lt;h2&gt;The Conversation&lt;/h2&gt;&lt;p&gt;Here’s where I’m supposed to tell you I quietly updated my mental model of this person, started giving less feedback, and moved on. The expected pattern. What most people do.&lt;/p&gt;&lt;p&gt;But this colleague and I had something most professional relationships don’t. I’m several levels above him on the org chart. The kind of gap where honest feedback typically flows one direction or not at all. I’ve intentionally built bidirectional feedback dynamics with several developers on my team. Not by accident. By practice. This colleague was one of them. He’d called me out on things. I’d called him out. We’d built the kind of trust where directness across a level gap isn’t a threat. Where the hierarchy exists on paper but doesn’t govern how we talk to each other.&lt;/p&gt;&lt;p&gt;So I told him. Directly. Why the response was hollow. What the response missed. Why the response mattered. Not a power move. Not a gotcha. 
A continuation of an honest working relationship between two people who had already agreed candid feedback is how you get better.&lt;/p&gt;&lt;p&gt;He received the feedback well and started being more intentional with his responses and PR interactions.&lt;/p&gt;&lt;p&gt;About a month later, he told me he appreciated the feedback. Said he feels like he’s actually learning more now by engaging with feedback himself instead of routing responses through a model producing the appearance of engagement.&lt;/p&gt;&lt;p&gt;The story is real. And the story ended well.&lt;/p&gt;&lt;h2&gt;The Evaluation Gap&lt;/h2&gt;&lt;p&gt;The code review was one exchange. The pattern is everywhere. Every interaction builds or erodes a running assessment of who you are in someone else’s head. Not your resume. Not your title. Whether you’re someone worth investing in. Worth mentoring. Worth being honest with.&lt;/p&gt;&lt;p&gt;&lt;span style=&quot;color: var(--color-text-dim);&quot;&gt;The sender judges polish.&lt;/span&gt; &lt;span style=&quot;color: var(--color-accent);&quot;&gt;The recipient judges substance.&lt;/span&gt;&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;Plausible-sounding nonsense does not stop being nonsense because you signed your name to it.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The corrosion is quiet. Your manager does not pull you aside and say, “I can tell you’re using AI for your messages and investing less in your development because of the lack of effort.” Your colleague does not reply with, “Your response reads like an LLM and I’m going to stop giving you real feedback.” They just give you less. Less mentorship. Less honesty. Less of the messy, human, imperfect investment actually building careers.&lt;/p&gt;&lt;p&gt;Every AI-generated message is a withdrawal from a trust account you cannot see. The balance silently declines. Nobody sends you a statement.&lt;/p&gt;&lt;p&gt;You’re not saving time. 
You’re spending credibility.&lt;/p&gt;&lt;h2&gt;The Quiet Part&lt;/h2&gt;&lt;p&gt;My colleague’s story ended well because we could have the conversation. Two people with enough mutual trust to be comfortable with the uncomfortable feedback.&lt;/p&gt;&lt;p&gt;This did not happen overnight. The bidirectional feedback culture on my team took deliberate effort to build. Not just between me and individual developers. Peer to peer. Across level gaps. The kind of environment where anyone can say “your response missed the point” without navigating politics first. Years of practice, not a policy memo.&lt;/p&gt;&lt;p&gt;Most teams do not have this level of maturity. Professional relationships require years of trust before directness feels safe. Up the org chart? Rarely! The conversation carries career risk nearly everyone avoids for their own self-preservation. Across peer lines without established trust? The conversation stalls out before it can even start.&lt;/p&gt;&lt;p&gt;The uncomfortable truth: people sending AI-generated communication know exactly when &lt;em&gt;they&lt;/em&gt; receive AI-generated responses. They recognize the patterns in everyone else’s messages. The structured acknowledgment, the curiosity closer, the polished emptiness. They just have not connected the recognition to their own behavior. The same patterns they spot in others are the patterns others spot in them.&lt;/p&gt;&lt;p&gt;Which means most people sending AI-generated communication into their professional relationships are eroding something they cannot see, cannot measure, and will not be warned about. Their colleagues already know; they are just not saying anything. Quietly recalibrating how much of themselves to invest in someone who apparently cannot be bothered to invest in return.&lt;/p&gt;&lt;p&gt;If nobody has had this conversation with you yet, consider why.&lt;/p&gt;&lt;p&gt;My colleague got lucky. Someone cared enough to say the uncomfortable part. 
He’s a better engineer for receiving the feedback. And I’m better for having built the kind of team where the conversation was even possible.&lt;/p&gt;&lt;blockquote&gt;&lt;p&gt;Not everyone has this. Maybe you do. Maybe the cost of finding out is worth the conversation.&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;&lt;code&gt;// Pragmatic GenAI. Credibility sold separately.&lt;/code&gt;&lt;/p&gt;</content>
  </entry>
</feed>
