Design(High and Low Level) and Architecture of text to SQL Agent

This blog is to understand high and low level design of natual languagae to Sql Query Execution and get the output in human understable format.

Why Not Single LLM Call?

Lack of context

LLM(Large Language Model) doesn’t understand business log - what is what?. And it will fail in production. And the accuracy is also around 50 - 60%.

Hallucination

LLM confidently generated imaginary tables, column names and flawed joins.

Performance Disaster

Query that perform a entire table scan on billion rows that criple the entire warehouse. Some database charge based on the query runtime.

Security

Direct execution of llm generated sql is an open invitation for sql ingesion attack or an unitentional data exposure.

How can I go with production grade?

Pipeline with safeguard, context aware and validation at every step. Ignoring the validation leads to inefficient query the worst case - massive security vulnerablities

Handle complex queries(Enterprise scheme)
self correction queries
safe and executable ones
Enforcing Security permission
providing explainable result

High Level Design

Modularity and Specilization

One LLM Doesn’t do everything

Agentic Workflow

statefull graph
conditional rerouting
retry loops

Safety First

Read Replicas and Read Only DB Acess - SQL Guard - > High Risk queries human in the loop

Observablity/Monitoring

Every Step is traced (Logging / Eval/ LLM as Judge)

Optimizer

We optimize the performace like feedback loop, negative signaling with ground truth creates a golden data set out of it.

Scalablity

Async, caching, serverless, loadbalacing, etc

Note

We put all the actionable item as a tool - Task giver.

Intent classifier as a tool

Based on the user query like hi/hello/thank you/chit chat, etc. Generic query goes to generic query responder tool
How many schema are present goes to schema searcher tool and
Data query as last week sales are go the Meta data retreiver tool

Meta data Retreiver as tool

We create a embedding for the names and more importantly detailed description of every table column in one vector database. When the user ask the question we create the embedding of that question and perform the vector search to find the top k similiar table and column description. Only this small relevant schema is passed to the SQL Generation prompt this improves the accuracy.

SQL Generator and validator

After SQL Generator passed to SQl validator very important for the production. Before executor must pass through the validator Check the generatod SQL against list of dangerous keyword(Drop, delete, update, etc,) and also make sure it not acesss unauthorized table.

Use explain command to analze the query plan if the plan involves full table scan or massive table or has an estimated high cost, put that in the human in the loop or reject it.

Expert insight

Feedback loop how you perfect it

Benovalant Dictator: Every time user indicates negative signal pass it to data analyst/ SQL query Admin correctly it manaually and trace it . this this golden data for future training & optimization of the current system.

Low Level Design

Query Classification and metadata retreival

Classified query + intent classification via structured pydantic output(few shot + business glosarry).
Scheme Retreivel: Embed scheme as document(tablename + column + sample rows + description of schema)
Hybrid Search(keyword + BM25, cosine similiarity), graph for joins
top k tables and column to avoid bloat
cache embedding

SQL Generation and validation

Validator + guard (SQL GLot) - > check for dangerous operation delete, update, etc
LLM As Judge or rule based limits, PII Filters
Dry run with sample data

Execution

Parametrized execution(No string concatination)
timeouts, row limit, read only connection
return result along with metadata

Corrector

Self refinement Loop -if error or empty loop
feed error + original sql back to generator
Conditional edge max 3-4 retries
Exponential back off

Orchestration and frame work

Langchain/Langgraph/ Pydatic AI/ Smolagent/Crew ai
For statefull graph / Human in the loop
Langchain/ Langgraph/ crew AI/ SQL ALchemy/ Vector Store - Milvus, pipnecone/chroma db, etc
API / FastAPI (Read only replicas)

Explain and answer

Natural language summay + visualization(Next level)

Prompt Engineering best practice

System prompt with role and constrains + dailects
few shot prompt with chain of thoughts
structured output(pydantic output parser)
self critique refinement) - Memory - short time session
prompt includes rewritten query + relevent schemas + few shot example + plan + dialect + specific rules with Structured output

Mixed LLM

Use LLM based on the task- for example intent classification often require smaller model such as GPT 4o-mini, haiku, etc for complex task such as higher model LLAMA 4, GPT 5, Qwen 3.5, Grok 4, Claude Opus 4.5 + ,etc

Evaluation

Using Spider and Bird as the benchmark for text to Sql Validation.
Validating against different model.
Negative signals are capture with MLflow / Langsmith, etc used for golden dataset

Deployment

Docker and kubernetes