Natural language to SQL in Production

Categories: Architecture, Dev, data, text_sql, Design, HLD, LLD
Author: Mukesh

Published: April 6, 2026

Design (High- and Low-Level) and Architecture of a Text-to-SQL Agent

This blog covers the high- and low-level design of a system that turns natural language into an executed SQL query and returns the output in a human-understandable format.

Why Not a Single LLM Call?

Lack of context

An LLM (Large Language Model) doesn't understand your business logic (what each table and column actually means), so it will fail in production. Accuracy from a single call is also only around 50-60%.

Hallucination

The LLM confidently generates imaginary tables, column names, and flawed joins.

Performance Disaster

A query that performs a full scan of a billion-row table can cripple the entire warehouse, and some databases charge based on query runtime.

Security

Directly executing LLM-generated SQL is an open invitation for SQL injection attacks or unintentional data exposure.

How do we make it production grade?

Build a pipeline with safeguards, context awareness, and validation at every step. Skipping validation leads to inefficient queries at best and, in the worst case, massive security vulnerabilities. The pipeline should:

  • Handle complex queries (enterprise schemas)
  • Self-correct failing queries
  • Produce safe, executable SQL
  • Enforce security permissions
  • Provide explainable results

High Level Design

Modularity and Specialization

One LLM doesn't do everything; each stage gets a specialized component.

Agentic Workflow

  • Stateful graph
  • Conditional rerouting
  • Retry loops

Safety First

Read replicas and read-only DB access -> SQL guard -> high-risk queries go to a human in the loop.

Observability / Monitoring

Every step is traced (logging / evals / LLM-as-judge).

Optimizer

We optimize performance with a feedback loop: negative signals paired with ground-truth corrections build a golden dataset.

Scalability

Async execution, caching, serverless deployment, load balancing, etc.

Text to SQL Agent Architecture
Note

We expose every actionable item as a tool; the orchestrator acts as the task giver.

Intent Classifier as a Tool

  • Generic queries (hi / hello / thank you / chit-chat, etc.) go to the generic query responder tool.
  • Questions like "how many schemas are present?" go to the schema searcher tool.
  • Data questions such as "last week's sales" go to the metadata retriever tool.
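As a minimal sketch of this routing (a production system would ask a small LLM for a structured, e.g. Pydantic, label; the keyword rules below are only an illustrative stand-in):

```python
def classify_intent(query: str) -> str:
    """Route a user query to the right tool. Keyword rules are a
    stand-in for an LLM classifier with structured output."""
    q = query.lower()
    words = set(q.replace("?", " ").replace("!", " ").replace(",", " ").split())
    if words & {"hi", "hello", "hey", "thanks", "thank"}:
        return "generic_responder"      # chit-chat
    if "schema" in q or "table" in q:
        return "schema_searcher"        # metadata questions
    return "metadata_retriever"         # data questions -> SQL pipeline
```

The tool names here are illustrative; what matters is that each branch hands off to a dedicated tool rather than one monolithic prompt.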

Metadata Retriever as a Tool

We create embeddings for the names and, more importantly, the detailed descriptions of every table and column in a vector database. When the user asks a question, we embed the question and run a vector search to find the top-k most similar table and column descriptions. Only this small, relevant slice of the schema is passed to the SQL generation prompt, which improves accuracy.
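A runnable sketch of the retrieval step, where a toy bag-of-words "embedding" stands in for a real embedding model and the schema descriptions are hypothetical:

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; production would use a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical column descriptions indexed in the vector store.
SCHEMA_DOCS = {
    "orders.order_date": "date the customer order was placed",
    "orders.total_amount": "total sales amount of the order in usd",
    "customers.signup_date": "date the customer signed up",
}

def top_k(question: str, k: int = 2) -> list:
    """Return the k schema entries most similar to the question."""
    q = embed(question)
    ranked = sorted(SCHEMA_DOCS,
                    key=lambda d: cosine(q, embed(SCHEMA_DOCS[d])),
                    reverse=True)
    return ranked[:k]
```

Only the returned top-k descriptions are injected into the SQL generation prompt, keeping the context small and relevant.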

SQL Generator and Validator

Output from the SQL generator is passed to the SQL validator, which is critical in production. Before reaching the executor, every query must pass through the validator, which checks the generated SQL against a list of dangerous keywords (DROP, DELETE, UPDATE, etc.) and ensures it does not access unauthorized tables.

Use the EXPLAIN command to analyze the query plan. If the plan involves a full scan of a massive table or has a high estimated cost, route the query to a human in the loop or reject it.
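A sketch of this gate using SQLite's EXPLAIN QUERY PLAN (server warehouses such as Postgres or Snowflake expose richer cost estimates via their own EXPLAIN variants; the table and index here are hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
con.execute("CREATE INDEX idx_region ON sales(region)")

def needs_review(sql: str) -> bool:
    """Flag queries whose plan contains an unindexed full-table SCAN,
    so they can be rejected or sent to a human in the loop."""
    plan = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each plan row is the human-readable detail,
    # e.g. "SCAN sales" vs "SEARCH sales USING INDEX idx_region (region=?)".
    return any("SCAN" in row[-1] and "USING" not in row[-1] for row in plan)
```

In a real warehouse you would compare the planner's estimated cost or row count against a budget instead of string-matching the plan text.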

Expert insight

Feedback loop: how you perfect it

Benevolent Dictator: every time a user gives a negative signal, pass the query to a data analyst / SQL admin to correct it manually, and trace it. This becomes golden data for future training and optimization of the system.

Low Level Design

Query Classification and Metadata Retrieval

  • Query rewriting + intent classification via structured Pydantic output (few-shot examples + business glossary).
  • Schema retrieval: embed the schema as documents (table name + columns + sample rows + schema description).
  • Hybrid search (keyword/BM25 + cosine similarity), plus a graph for joins.
  • Retrieve only the top-k tables and columns to avoid context bloat.
  • Cache embeddings.

SQL Generation and Validation

  • Validator + guard (SQLGlot) -> check for dangerous operations (DELETE, UPDATE, etc.)
  • LLM-as-judge or rule-based limits, PII filters
  • Dry run with sample data
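A minimal sketch of the guard. A production validator should parse the SQL properly (e.g. with SQLGlot's parser) rather than tokenize it; the allow-list and table names below are assumptions for illustration:

```python
import re

DANGEROUS = {"drop", "delete", "update", "insert", "alter", "truncate", "grant"}
ALLOWED_TABLES = {"orders", "customers"}   # hypothetical allow-list

def validate(sql: str) -> list:
    """Return a list of policy violations; an empty list means the
    query may proceed to the executor."""
    problems = []
    tokens = set(re.findall(r"[a-z_]+", sql.lower()))
    hits = tokens & DANGEROUS
    if hits:
        problems.append("dangerous operation: " + ", ".join(sorted(hits)))
    # Naive table extraction from FROM/JOIN clauses.
    for pair in re.findall(r"\bfrom\s+([a-z_]+)|\bjoin\s+([a-z_]+)", sql.lower()):
        for tbl in pair:
            if tbl and tbl not in ALLOWED_TABLES:
                problems.append("unauthorized table: " + tbl)
    return problems
```

The important property is that the validator returns structured violations the pipeline can act on (reject, rewrite, or escalate) rather than a bare yes/no.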

Execution

  • Parameterized execution (no string concatenation)
  • Timeouts, row limits, read-only connections
  • Return results along with metadata
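A sketch of the executor using SQLite; the in-memory table is illustrative setup, and statement timeouts would be configured server-side in a real warehouse (e.g. Postgres's `statement_timeout`):

```python
import sqlite3

def run_readonly(sql: str, params: tuple, row_limit: int = 1000):
    """Execute with bound parameters (never string concatenation)
    and cap the number of rows returned."""
    # A file-backed DB can be opened read-only via a URI, e.g.
    #   sqlite3.connect("file:analytics.db?mode=ro", uri=True)
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)",
                    [("EU", 10.0), ("US", 20.0), ("EU", 5.0)])
    cur = con.execute(sql, params)          # params are bound, not spliced
    rows = cur.fetchmany(row_limit)         # enforce the row limit
    meta = {"row_count": len(rows), "truncated": len(rows) == row_limit}
    return rows, meta
```

Returning metadata alongside the rows lets downstream steps (the explainer, the monitor) know whether the result was truncated.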

Corrector

  • Self-refinement loop: trigger on errors or empty results
  • Feed the error + original SQL back to the generator
  • Conditional edge with a max of 3-4 retries
  • Exponential backoff
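The loop above can be sketched as follows; the generator here is a stand-in that "self-corrects" on the second attempt, whereas a real implementation would put the failed SQL and the database error back into the LLM prompt:

```python
import sqlite3
import time

def generate_sql(question, prior_error=None):
    # Stand-in for the LLM generator. With feedback it returns valid
    # SQL; the first draft is deliberately broken to exercise the loop.
    return "SELECT 1" if prior_error else "SELEC 1"

def generate_with_retries(question, max_retries=4):
    error = None
    for attempt in range(max_retries):
        sql = generate_sql(question, error)
        try:
            return sqlite3.connect(":memory:").execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)                         # feed error back
            time.sleep(min(0.01 * 2 ** attempt, 1.0))  # exponential backoff
    raise RuntimeError(f"gave up after {max_retries} attempts: {error}")
```

Capping retries (the conditional edge) prevents an unrecoverable query from looping forever.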

Orchestration and Framework

  • LangChain / LangGraph / Pydantic AI / Smolagents / CrewAI
  • For stateful graphs / human in the loop
  • SQLAlchemy for DB access; vector stores such as Milvus, Pinecone, ChromaDB, etc.
  • API layer via FastAPI (against read-only replicas)
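To make the stateful-graph idea concrete, here is a hand-rolled sketch of nodes with a conditional edge (LangGraph provides this pattern out of the box; the node names and stub logic are illustrative):

```python
# Each node takes and returns the shared state dict.
def classify(state):  state["route"] = "generate"; return state
def generate(state):  state["sql"] = "SELECT 1";   return state
def validate(state):  state["ok"] = "drop" not in state["sql"].lower(); return state
def execute(state):   state["result"] = [(1,)];    return state

NODES = {"classify": classify, "generate": generate,
         "validate": validate, "execute": execute}
EDGES = {"classify": "generate", "generate": "validate"}

def run_graph(state):
    node = "classify"
    while node:
        state = NODES[node](state)
        if node == "validate":                 # conditional edge
            node = "execute" if state["ok"] else "generate"
        else:
            node = EDGES.get(node)             # None after execute -> stop
    return state
```

The conditional edge after `validate` is what enables rerouting and retry loops without restarting the whole pipeline.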

Explain and Answer

  • Natural-language summary + visualization (next level)

Prompt Engineering Best Practices

  • System prompt with role, constraints, and dialect
  • Few-shot prompts with chain of thought
  • Structured output (Pydantic output parser)
  • Self-critique refinement; memory: short-term session
  • The prompt includes the rewritten query + relevant schemas + few-shot examples + plan + dialect + specific rules, with structured output
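The assembly in the last bullet can be sketched as a simple template; the wording and field names are illustrative, not a fixed API:

```python
def build_prompt(rewritten_query, schema_snippets, few_shots, dialect="postgres"):
    """Assemble the SQL-generation prompt from the pieces listed above."""
    system = (
        f"You are a senior data analyst. Write a single {dialect} SELECT "
        "statement. Never modify data. Use only the tables provided."
    )
    schema = "\n".join(schema_snippets)
    examples = "\n\n".join(f"Q: {q}\nSQL: {s}" for q, s in few_shots)
    return (f"{system}\n\n# Schema\n{schema}\n\n"
            f"# Examples\n{examples}\n\n# Question\n{rewritten_query}")
```

Keeping assembly in one place makes it easy to trace exactly what the model saw when a query goes wrong.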

Mixed LLMs

Choose the LLM based on the task. Intent classification often only requires a smaller model such as GPT-4o-mini or Haiku, while complex tasks such as SQL generation call for a larger model like Llama 4, GPT-5, Qwen 3.5, Grok 4, Claude Opus 4.5, etc.
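In code this is just a routing table; the task names and model choices below are examples, not recommendations:

```python
# Hypothetical task -> model routing; swap in whatever models you use.
MODEL_FOR_TASK = {
    "intent_classification": "gpt-4o-mini",   # cheap and fast
    "sql_generation": "claude-opus-4.5",      # strongest reasoning
    "summarization": "gpt-4o-mini",
}

def pick_model(task: str) -> str:
    """Return the model for a task, with a cheap safe default."""
    return MODEL_FOR_TASK.get(task, "gpt-4o-mini")
```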

Evaluation

  • Use Spider and BIRD as benchmarks for text-to-SQL validation.
  • Validate against different models.
  • Negative signals are captured with MLflow / LangSmith, etc., and used to build the golden dataset.

Deployment

Docker and Kubernetes.