The rush to embed AI into every product roadmap has created a paradox. Product managers have more tools than ever to build intelligent features, yet fewer guardrails to determine when AI actually serves the customer versus when it becomes expensive theater. I have been asked many times, in various discussions, whether AI can do the job of a Product Manager. The better question, in my mind, is how a seasoned Product Manager can employ AI agents to get their tasks done faster. AI in product management is not a replacement for the Product Manager; it helps the Product Manager get through tasks like research, data synthesis, and competitor analysis by deploying AI tools and agents. This brings the time for mundane, routine, and research work down from days to hours, and from hours to minutes. You will burn the new currency, tokens, at an alarming pace depending on how fast you want to go, but there is an opportunity for outsized ROI through precision prompting, targeted output definition, and continuous training!
I have framed my thinking about Product Management with AI as The ONE CARD rubric. It aims to cut through the current AI hyperbole and noise with precision questioning, one question per letter, that forces rigor before a single model is trained or API is called.
Clarity: Could a Customer Act on This?
AI outputs are worthless if they sit idle. Before greenlighting any AI-driven feature, ask whether the insight, prediction, or automation produced can be directly operationalized by the person receiving it.
• A churn prediction model that flags at-risk accounts but offers no recommended intervention path fails this test
• A SQL query generator that returns syntactically valid but semantically wrong code actively harms the user
• However, a demand forecast that feeds automatically into procurement workflows passes with distinction
The clarity standard demands that the AI’s output terminates in a concrete action or decision, not merely in information for its own sake. If your customer must perform additional translation, interpretation, or guesswork to benefit, your product has not delivered clarity—it has delegated cognitive burden.
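One way to make this test concrete: force every AI output to carry its operational payload, not just a prediction. The sketch below is illustrative only; the `AIInsight` type and field names are hypothetical, not from any real product.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AIInsight:
    """A hypothetical AI output paired with its operational payload."""
    prediction: str                            # e.g. "account 4417 likely to churn"
    confidence: float                          # model score in [0, 1]
    recommended_action: Optional[str] = None   # e.g. "offer retention discount"
    target_workflow: Optional[str] = None      # system that executes the action

def passes_clarity_test(insight: AIInsight) -> bool:
    # An insight is actionable only if it names both what to do
    # and where that action gets executed.
    return insight.recommended_action is not None and insight.target_workflow is not None

# A bare churn flag fails; the same flag wired into a retention workflow passes.
bare_flag = AIInsight("account 4417 likely to churn", 0.91)
wired_flag = AIInsight("account 4417 likely to churn", 0.91,
                       "offer retention discount", "crm.retention_queue")
```

If the schema of your AI feature cannot be filled in without a recommended action and a consuming workflow, the feature has not passed the clarity test.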
Ask: What Specific Outcome Are You Improving?
Vague aspirations like “enhance user experience” or “leverage AI capabilities” doom products to measurement ambiguity. This is what I call “generic AI slop”: the kind of output any AI chatbot will happily generate for you. The Ask question forces precision (and potentially saves costly token-burning chat sessions :D): what metric moves, by how much, and for whom?
• Baseline the current non-AI state with empirical rigor. It is extremely hard to measure progress without it.
• Define the target outcome in business terms (revenue per user, support ticket resolution time, forecast error reduction)
• Specify the population segment where improvement is expected
Without this anchor, teams optimize model accuracy while the business bleeds. A recommendation engine achieving 94% precision means nothing if it increases cart abandonment because the latency of inference degrades checkout flow. The outcome question keeps product and model objectives aligned.
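The recommendation-engine failure above can be encoded as a pre-registered ship/no-ship rule: the primary outcome must clear a target lift, and no guardrail metric may regress. This is a toy decision rule of my own construction, with made-up thresholds, not a standard framework.

```python
def ship_decision(primary_lift: float, guardrail_change: float,
                  min_lift: float = 0.05, max_guardrail_drop: float = 0.0) -> str:
    """Toy ship/no-ship rule: positive values mean improvement.
    Thresholds here are illustrative assumptions."""
    if primary_lift < min_lift:
        return "no-ship: primary outcome did not move enough"
    if guardrail_change < max_guardrail_drop:
        return "no-ship: guardrail regressed"
    return "ship"

# The high-precision recommender from above: revenue lift looks healthy,
# but checkout conversion (the guardrail) drops due to inference latency.
verdict = ship_decision(primary_lift=0.07, guardrail_change=-0.02)
```

The point is not the code but the discipline: the rule exists, with numbers in it, before the model ships, so nobody argues the thresholds after seeing the results.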
Risk: What Breaks When the AI Is Wrong?
Every AI system fails. The Risk question compares failure modes between AI and non-AI approaches with eyes wide open.
• False positive costs: Automated fraud flags blocking legitimate transactions; medical triage models sending healthy patients to emergency care
• False negative costs: Missed defects in manufacturing inspection; undetected security anomalies in log analysis
• Systemic risks: Training data drift degrading performance silently; adversarial manipulation of input features; regulatory non-compliance in automated decision-making
The non-AI baseline matters critically. Human-operated processes have failure rates too—often higher, but differently distributed. The product manager must weigh whether AI errors are more corrigible, more frequent, or more catastrophic than the status quo, and whether the product design includes appropriate human-in-the-loop or human-on-the-loop safeguards.
Data: How Will You Prove ROI Was Justified?
AI investments consume compute, talent, and customer trust. The Data question mandates the evidentiary framework before deployment, not after. “Trust, but verify” takes on a different meaning in the agentic AI world, and doing that verification at machine speed, in a consistent and repeatable way, remains an unsolved problem!
• Counterfactual infrastructure: Can you run controlled experiments isolating AI impact from confounding variables?
• Longitudinal tracking: Are you measuring sustained improvement or novelty effects that decay?
• Total cost accounting: Does your ROI calculation include model retraining, monitoring infrastructure, incident response, and compliance auditing?
The data requirement is particularly acute for product managers serving technical audiences. Your users will scrutinize whether “AI-powered” represents genuine capability augmentation or marketing veneer. Specify the telemetry, the experimental design, and the decision criteria for continuation or sunsetting.
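The continuation-or-sunset criteria can themselves be pre-registered as a simple payback rule: if cumulative benefit has not covered cumulative cost within an agreed window, the feature sunsets. The rule and numbers below are a sketch, not a prescription.

```python
def continuation_decision(monthly_benefit, monthly_cost, payback_months=12):
    """Pre-registered sunset rule (illustrative): continue only if cumulative
    benefit exceeds cumulative cost within the payback window."""
    horizon = min(payback_months, len(monthly_benefit))
    net = sum(monthly_benefit[:horizon]) - sum(monthly_cost[:horizon])
    return "continue" if net > 0 else "sunset"

# Six months of (hypothetical) telemetry, in $K per month.
healthy  = continuation_decision(monthly_benefit=[10] * 6, monthly_cost=[5] * 6)
decaying = continuation_decision(monthly_benefit=[2] * 6, monthly_cost=[5] * 6)
```

Writing this down before launch is what prevents sunk-cost entrenchment: the sunset trigger fires mechanically instead of through a politically fraught review.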
Putting ONE CARD Into Practice
The framework’s power lies in its sequential discipline. Clarity prevents solutionism. Ask prevents metric confusion. Risk prevents blind optimism. Data prevents sunk-cost entrenchment. Each question builds a decision record that product managers can defend to engineering skeptics, finance controllers, and customers alike.
For data platform products specifically—query optimization, schema recommendation, anomaly detection—the rubric offers particular value. These domains suffer from plausible-sounding AI applications that collapse under operational scrutiny. A query planner using learned cardinality estimates must pass the clarity test (can the DBA override and trust the plan?), the ask test (what’s the p95 latency improvement versus the legacy optimizer?), the risk test (when estimates fail, does the system degrade gracefully or catastrophically?), and the data test (how do we attribute performance changes to the model versus hardware upgrades or data skew shifts?).
The Hard Truth
Not every product deserves AI. The ONE CARD rubric makes this determination explicit rather than politically fraught. A feature that fails multiple dimensions may indicate that traditional deterministic approaches, improved data quality, or simpler heuristic methods deliver superior customer value at lower risk.
The product manager who internalizes this framework becomes the credible voice for when not to use AI—a rarer and more valuable stance than cheerleading every neural network trend. Your roadmap gains integrity. Your engineering partnerships deepen. And your customers receive products that solve problems rather than showcase technology.
ONE CARD in Action: NLP-to-SQL with a Frontier Model
Theoretical frameworks collapse without concrete application. Below is a worked example of the ONE CARD rubric applied to a real product decision: whether to embed a frontier LLM (GPT-5.3, Claude 4 Sonnet, or equivalent) into your data platform to let business users write natural language questions and receive executable SQL queries.
The Product Context
Your organization runs a cloud data warehouse (Snowflake, Databricks, BigQuery). Analysts and business users currently file tickets with a centralized BI team to get answers. Average turnaround is 2-3 days. The product proposal: integrate a frontier model via API to generate SQL from natural language, cutting latency to minutes and democratizing data access.
Clarity: Can the Customer Act Correctly?
The generated SQL must be executable and semantically faithful to the user’s intent. This is where text-to-SQL systems most commonly fail.
Syntax-only success is insufficient. A query that runs but joins orders to members on the wrong key produces a plausible-looking result set that misleads the business decision it was meant to inform.
Schema hallucinations remain unsolved. Frontier models occasionally reference tables or columns that do not exist, especially on large enterprise schemas with 200+ columns per table. This inflates review time for AI-generated artifacts across the board: code, PRDs, and SQL queries alike.
Business semantics require translation. The model must know that “active user” at your company means last_login > 7_days_ago and account_status != 'disabled'; it cannot infer that from column names alone. This is where a single semantic source of truth becomes essential, and where data foundation maturity is underscored yet again: a trusted data foundation is the prerequisite for trusted AI workloads.
As an example, Uber’s QueryGPT addressed this by adding a “Table Agent” that surfaces proposed tables to the user for acknowledgment before generation, plus a “Column Prune Agent” that strips irrelevant schema metadata to reduce hallucination surface area. Even with these safeguards, 22% of users reported they still needed to modify generated queries before execution.
Clarity verdict: Passes only with human-in-the-loop confirmation and explicit semantic layer integration. Raw frontier-model-to-SQL fails this test for non-technical users.
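A minimal sketch of what that semantic layer integration might look like: business terms resolve to vetted SQL predicates, and unrecognized terms are surfaced for human review rather than guessed at. The dictionary, table names, and predicates below are hypothetical.

```python
# Hypothetical semantic layer: business terms map to vetted SQL predicates,
# so the model selects a definition instead of inventing one.
SEMANTIC_LAYER = {
    "active_user": "last_login > CURRENT_DATE - INTERVAL '7 days' "
                   "AND account_status != 'disabled'",
    "churned_user": "last_login <= CURRENT_DATE - INTERVAL '90 days'",
}

def expand_business_terms(question_terms):
    """Return the vetted predicate for each recognized term; unknown terms
    are escalated for review rather than left to model inference."""
    unknown = [t for t in question_terms if t not in SEMANTIC_LAYER]
    if unknown:
        raise ValueError(f"No vetted definition for: {unknown}")
    return [SEMANTIC_LAYER[t] for t in question_terms]
```

The design choice here is the hard failure on unknown terms: the system refuses to generate rather than hallucinate a definition, which is exactly the human-in-the-loop posture the clarity verdict requires.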
Ask: What Specific Outcome Improves?
Vague goals like “democratize data access” obscure measurement. The Ask question forces specificity.
Baseline Metrics vs Targets:
• Average query authoring time (technical users): 10 min → 3 min
• Business user ticket-to-insight time: 2.5 days → 15 min
• Self-service query volume (non-technical users): 5% of total queries → 25%
• Analyst time spent on ad-hoc requests: 40% of week → 15%
The outcome must be tied to a cohort. If the product targets Operations managers at Uber, the metric is their monthly interactive query volume and the productivity gain per query. If targeting finance analysts, the metric might be forecast cycle time reduction.
Critical discipline: Define what does not improve. Text-to-SQL will not reduce data governance overhead. It will not eliminate the need for semantic modeling. It will not make complex multi-table joins reliable without investment in context infrastructure.
Risk: What Breaks When the AI Is Wrong?
Text-to-SQL failure modes differ materially from the non-AI baseline (human analyst writes query).
Wrong aggregation (SUM vs COUNT):
• AI: Silent—query runs, returns wrong number (Higher severity due to silent failure)
• Non-AI: Caught in code review or QA
Incorrect join path:
• AI: Returns inflated/deflated result set (Higher on undocumented schemas)
• Non-AI: Analyst knows schema relationships
Schema hallucination:
• AI: Query fails with “column not found” (Obvious, recoverable)
• Non-AI: Human does not hallucinate schema
Semantic drift:
• AI: Model uses stale training context (Higher without live context refresh)
• Non-AI: Analyst reads updated documentation
Data exposure:
• AI: Model includes sensitive columns if not filtered (Depends on prompt engineering rigor)
• Non-AI: Role-based access controls enforce limits
Uber explicitly tracks “Run Has Output” as a safety metric—queries that execute successfully but return zero rows often indicate hallucinated filter values (e.g., WHERE status = 'Finished' instead of WHERE status = 'Completed').
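A “Run Has Output” check of this kind is cheap to implement. The sketch below uses an in-memory SQLite database as a stand-in for the warehouse; the schema and status values are invented to mirror the WHERE-clause example above.

```python
import sqlite3

def run_has_output(conn: sqlite3.Connection, sql: str) -> bool:
    """Safety check in the spirit of Uber's 'Run Has Output' metric (this
    implementation is our sketch): a query that executes but returns zero
    rows often signals a hallucinated filter value."""
    return len(conn.execute(sql).fetchall()) > 0

# Toy schema: the real status value is 'Completed', not 'Finished'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO trips VALUES (?, ?)",
                 [(1, "Completed"), (2, "Completed"), (3, "Cancelled")])

ok  = run_has_output(conn, "SELECT * FROM trips WHERE status = 'Completed'")
bad = run_has_output(conn, "SELECT * FROM trips WHERE status = 'Finished'")
```

Both queries are syntactically valid; only the telemetry distinguishes the correct one from the hallucinated filter.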
The dbt Labs 2026 benchmark reveals a structural insight: for queries covered by a well-modeled Semantic Layer, accuracy approaches 100% because the LLM cannot produce subtly wrong joins or aggregations—the deterministic engine handles SQL generation. The risk gap between raw text-to-SQL and semantic-layer-mediated approaches is enormous.
Risk mitigation design:
• Intent Agent to classify user questions into bounded business domains (workspaces)
• Validation Agent that executes generated SQL against test data and checks result plausibility
• Mandatory human approval for queries targeting financial or customer PII tables
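The Validation Agent and PII restriction from the list above can be combined into a single pre-execution gate. The following is a sketch under stated assumptions: the blocked-table list, the substring-based restriction check, and the zero-row heuristic are all simplifications of what a production gate would need.

```python
import sqlite3

BLOCKED_TABLES = {"customer_pii", "payments"}   # hypothetical restricted tables

def validate_generated_sql(test_conn, sql):
    """Sketch of a Validation Agent: run the candidate query against a small
    test database and collect every reason it must go to human review."""
    findings = []
    if any(table in sql.lower() for table in BLOCKED_TABLES):
        findings.append("touches a restricted table: mandatory human approval")
    try:
        rows = test_conn.execute(sql).fetchall()
        if not rows:
            findings.append("zero rows on test data: possible hallucinated filter")
    except sqlite3.Error as exc:
        findings.append(f"execution failed: {exc}")
    return findings

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'EMEA')")

clean   = validate_generated_sql(conn, "SELECT * FROM orders WHERE region = 'EMEA'")
flagged = validate_generated_sql(conn, "SELECT * FROM customer_pii")
```

An empty findings list means the query may proceed to the warehouse; anything else routes to a human, which is the whole point of the mitigation design.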
Data: How Will You Prove ROI?
AI infrastructure costs are non-trivial. Frontier model API calls for complex schemas can consume 40-60K tokens per request. Without pre-defined measurement, the project becomes a faith-based initiative.
Cost Categories (Annual Estimates):
• Frontier model API (per 1K queries/month): $15K-$45K
• Semantic layer construction: 3-4 engineer-months
• Evaluation set curation: 2 analyst-months
• Monitoring infrastructure: $8K-$12K
• Incident response (bad query reaches production): Unquantified liability
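The API line item above is straightforward to sanity-check with back-of-envelope arithmetic. The blended per-token price below is an assumption for illustration; plug in your provider's actual rates.

```python
def annual_api_cost(queries_per_month: int, tokens_per_query: int,
                    usd_per_million_tokens: float) -> float:
    """Back-of-envelope frontier-model spend. All inputs are assumptions."""
    monthly_tokens = queries_per_month * tokens_per_query
    return 12 * monthly_tokens / 1_000_000 * usd_per_million_tokens

# 1K queries/month at 50K tokens each, at an assumed blended $30 per
# million tokens, lands at $18K/year -- inside the $15K-$45K band above.
estimate = annual_api_cost(queries_per_month=1_000, tokens_per_query=50_000,
                           usd_per_million_tokens=30.0)
```

The useful part of the exercise is the sensitivity: token consumption per request varies by a factor of two or more with schema size, so the cost model should be re-run whenever the schema retrieval strategy changes.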
Benefit Measurement:
• Analyst time reclaimed: Pre/post time allocation surveys; sprint velocity change
• Business decision velocity: Time from question to board-ready metric
• Query volume shift: Ratio of self-service to ticketed queries
• Error rate reduction: Comparison of production incident root causes (AI vs human)
Uber’s evaluation framework is instructive: they run standardized question sets through “Vanilla” (full AI) and “Decoupled” (human-in-the-loop) product flows, tracking intent accuracy, table overlap score, execution success, and qualitative SQL similarity via LLM-as-judge. This enables component-level debugging—knowing whether failure originates in intent classification, schema retrieval, or SQL generation.
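One of those component metrics, table overlap between generated and golden SQL, is easy to formalize. The Jaccard-style formula below is my assumption about a reasonable implementation, not Uber's published definition.

```python
def table_overlap_score(predicted: set, golden: set) -> float:
    """Overlap between the tables referenced by the generated query and
    those in the golden SQL (Jaccard index; exact formula is an assumption).
    1.0 means the table sets match exactly."""
    if not predicted and not golden:
        return 1.0
    return len(predicted & golden) / len(predicted | golden)

# Hypothetical case: the model pulled in an extra table the golden query
# never needed, halving the overlap score.
score = table_overlap_score({"trips", "drivers"}, {"trips"})
```

Scoring each pipeline stage separately, intent, retrieval, generation, is what makes failures debuggable: a low table overlap with a high execution success rate points at schema retrieval, not SQL synthesis.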
The Decision Record
Applying ONE CARD to this NLP-to-SQL proposal yields a conditional go using the example rubric below:
• Clarity: Go with Table Agent confirmation and semantic layer pre-modeling
• Ask: Go with Operations analyst cohort, 10→3 minute target, quarterly evaluation
• Risk: Go with workspace isolation, validation agent, and PII table restrictions
• Data: Go with token cost telemetry, golden SQL evaluation set, and analyst time-tracking integration
Without these four commitments, the product manager would be shipping a prototype disguised as production infrastructure. The frontier model is capable—but capability without guardrails is liability.
Summary Table: ONE CARD Applied to NLP-to-SQL
Clarity: Could a customer act or leverage the insight provided by AI correctly?
Application: Can the analyst execute the generated SQL and trust the result without manual rewrite?
Ask: What is the specific outcome that you are looking to improve or achieve?
Application: Reduce query authoring time from 10 min to 3 min; increase self-service analytics adoption
Risk: What is the risk in using AI for this approach vs non-AI way of achieving the output?
Application: Wrong joins/aggregations return plausible but incorrect results; schema hallucinations on large databases
Data: What data will you need to collect to verify that the ROI was justified?
Application: Query execution success rate, semantic accuracy vs golden SQL, time-to-insight, support ticket reduction
Note: AI was used to research and author parts of this post.