AI Workflow Orchestration Case Study: Remote MCP Server + Human-Reviewed Support Drafts

TLDR

This case study uses a classify-then-draft AI workflow instead of a single prompt.
A remote MCP server lets the model retrieve prior sent replies during drafting.
Humans stay in control because the system creates Gmail drafts only and never auto-sends.
Prompt management, model routing, and audit logs are treated as production concerns.
The architecture emphasizes observability, idempotency, and safe skip paths over flashy automation claims.

When support teams answer the same questions over and over, the real cost is not only time. It is also inconsistency, context switching, and the mental load of reconstructing a careful answer that the organization has likely written before. That was the business problem behind this project for Civil PE Surveying Review (CPESR), an online exam-prep business whose inbox regularly receives repeat questions about course access, quiz behavior, billing, device compatibility, and enrollment logistics.

Instead of chasing a risky "fully autonomous" support bot, we built a production AI workflow that does something more practical: it classifies inbound email, decides whether a reply draft should be generated, and then uses a capable model plus a remote MCP server to retrieve useful precedent from prior sent mail. The result is a draft in Gmail for staff to review, edit if needed, and send manually. Nothing is auto-sent.

This matters because the project is really a case study in AI workflow orchestration, OpenAI Responses API usage, tool calling, and human-in-the-loop system design. The Gmail integration is important, but it is not the headline. The headline is how to run a multi-step LLM pipeline safely in production.

The support problem: why we needed an AI workflow

CPESR's support inbox contains many legitimate questions, but a large share of them are variations on themes the team has already handled well. A student may ask whether progress is retained after access expires. Another wants to know whether quizzes should be treated like timed exam practice. Another asks whether the course works across a laptop and tablet. These are not trivial questions, but they are often answerable by adapting prior responses rather than starting from zero.

That made the opportunity clear. The goal was not to replace support staff or turn the inbox into a ticketing platform. The goal was to shift time from writing routine replies from scratch to reviewing high-quality drafts and handling edge cases with human judgment.

The deployed workflow runs on a schedule, roughly every 15 minutes, inspects eligible unread messages, and either skips them with a logged reason or creates a reply draft in Gmail. This keeps the process conservative and traceable. Spam, automated receipts, outreach, already-handled threads, and classifier-declined messages do not get a "best effort" draft. They are explicitly skipped.

For readers who want the business context behind the deployment, the broader CPESR engagement is also reflected in Endertech's Civil PE Surveying Review project case study.

The workflow at a glance

This is a two-stage workflow with conditional branching, model routing, and tool use only where it adds value. That distinction is important. Many AI demos collapse everything into one prompt. Production systems usually should not.

Stage	Model tier	Tools	Outcome
1. Classification	Small and fast model	None	Structured decision: generate draft or skip
2. Drafting	More capable model	Remote MCP tool: `search_sent_mail`	Validated JSON subject and body, then Gmail draft creation

The first step is cheap and strict. The second step is richer, slower, and tool-augmented. That is a classic production pattern: use a smaller model for triage, and reserve the more expensive path for messages that deserve it.

flowchart TD
  subgraph trigger["Workflow trigger"]
    Schedule["Scheduled job"]
  end
  subgraph orchestration["AI workflow orchestration"]
    Poll["Ingest inbound messages"]
    Classify["Step 1: Classification model"]
    Draft["Step 2: Drafting model plus tools"]
  end
  subgraph mcp["MCP server POST /mcp"]
    ToolSearch["Tool: search_sent_mail"]
  end
  subgraph integrations["Integrations"]
    Gmail["Gmail API drafts and inbox"]
    OpenAI["OpenAI Responses API"]
  end
  subgraph ops["Operations layer"]
    Prompts["Prompt registry"]
    Logs["Activity audit log"]
  end
  Schedule --> Poll
  Poll --> Classify
  Classify -->|eligible| Draft
  Draft --> OpenAI
  OpenAI -->|"tools/call"| ToolSearch
  ToolSearch --> Gmail
  ToolSearch --> Draft
  Draft --> Gmail
  Prompts --> Classify
  Prompts --> Draft
  Classify --> Logs
  Draft --> Logs

The architectural takeaway is that the model provider is not only supplying text generation. During drafting, the Responses API acts as an MCP client and calls the team's server-side tool when precedent retrieval is needed.

If you are exploring similar systems, this is the kind of problem that benefits from thoughtful application architecture, not just prompt experimentation. Endertech approaches these projects as custom software development for production AI workflows, where orchestration, validation, and operational controls matter as much as model choice.

Designing a production AI workflow: classify, branch, then draft with tools

The core design decision was to model the process as a workflow, not a chatbot. That means explicit stages, explicit status changes, and explicit ownership of side effects.

Pattern	Why it matters
Multi-step pipeline	Lets each step have its own model, prompt, and validation rules
Conditional routing	Prevents ineligible messages from reaching the draft step
Idempotency	Stops scheduled runs from reprocessing finished messages
Structured I/O	Makes classification and draft payloads machine-checkable before side effects
Side-effect ownership	Keeps Gmail draft creation in application logic rather than inside model output assumptions
Fail-closed behavior	Avoids fallback drafts when classification, tooling, or validation fails

flowchart TD
  polling["Ingest inbound messages"] --> classify["Step 1: Classification"]
  classify --> eligible{"Eligible?"}
  eligible -->|no| skip["skipped_ineligible plus audit log"]
  eligible -->|yes| draft["Step 2: Drafting with remote MCP"]
  draft --> mcpHttp["MCP tools/call search_sent_mail"]
  mcpHttp --> draft
  draft --> validate["Validate JSON output"]
  validate --> gmailDraft["Create Gmail draft for review"]

In practice, the workflow loads thread context, checks whether the support team has already replied, resolves the active classification prompt, and calls the model without any tools. If the result says no draft should be produced, the job ends there with a concrete status and a logged reason.

Classification branch excerpt

$classification = $this->runClassifier($job, $classifierPrompt, $email, $threadMessages, $settings);

if (!($classification['should_generate_draft'] ?? false)) {
    return $this->skip(
        $job,
        $classification['skip_reason'] ?? $classification['reason'] ?? 'not_eligible',
        'classifier_ineligible',
        EmailJob::STATUS_SKIPPED_INELIGIBLE,
        $dryRun,
    );
}

$job->setStatus(EmailJob::STATUS_CLASSIFIED_ELIGIBLE);
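
The branch above reads `should_generate_draft` and `skip_reason` from the classifier result. One way to keep that gate strict is to parse the raw model output fail-closed, so anything malformed becomes a skip rather than a draft. The following is a minimal sketch under that assumption; `parseClassification` and the `invalid_json` / `invalid_schema` reason codes are hypothetical names, not taken from the deployed code.

```php
<?php
declare(strict_types=1);

/**
 * Parse the classifier's raw JSON output into a decision array.
 * Anything malformed is treated as "skip" (fail closed), never as "draft".
 * Field names mirror the excerpt above; the function itself is hypothetical.
 */
function parseClassification(string $rawOutput): array
{
    $skip = static fn (string $reason): array => [
        'should_generate_draft' => false,
        'skip_reason' => $reason,
    ];

    try {
        $data = json_decode($rawOutput, true, 16, JSON_THROW_ON_ERROR);
    } catch (JsonException) {
        return $skip('invalid_json');
    }

    // Require a real boolean: a string such as "yes" must not pass the gate.
    if (!is_array($data) || !is_bool($data['should_generate_draft'] ?? null)) {
        return $skip('invalid_schema');
    }

    if ($data['should_generate_draft'] === false) {
        $reason = $data['skip_reason'] ?? null;
        return $skip(is_string($reason) && $reason !== '' ? $reason : 'not_eligible');
    }

    return ['should_generate_draft' => true, 'skip_reason' => null];
}
```

With this shape, the orchestrator only ever sees a well-formed decision, and a broken classifier response is logged as a skip with a concrete reason.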

Only after a message is classified as eligible does the orchestration layer move into drafting. At that point it sends a Responses API request with a remote MCP tool attached. The model can call that tool zero or more times in the same run, which allows precedent retrieval to be model-driven instead of hard-coded in advance.

Drafting request with remote MCP tool

$response = $this->responses->send(new ResponsesRequest(
    workflow: WorkflowType::Responses,
    operation: 'email_drafting',
    prompt: $prompt,
    input: $input,
    options: [
        'max_output_tokens' => 2200,
        'reasoning' => ['effort' => 'low'],
        'tools' => [$this->mcpPublicConfig->openAiRemoteToolDefinition()],
    ],
    metadata: [
        'email_job_id' => (string) $job->getId(),
        'gmail_message_id' => $email->messageId,
        'mcp_tool' => EmailMcpConfig::TOOL_SEARCH_SENT_MAIL,
    ],
));
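
The drafting call above returns model output that the application validates before any Gmail side effect. As a rough illustration of that validation step, the sketch below checks for a JSON object with non-empty `subject` and `body` strings. The function name, the length limit, and the subject cleanup are assumptions for illustration, not the project's actual rules.

```php
<?php
declare(strict_types=1);

/**
 * Validate the drafting model's JSON output before creating a Gmail draft.
 * Returns ['subject' => ..., 'body' => ...] or throws, so the caller can
 * mark the job as failed instead of creating a partial draft.
 * The length limit is an illustrative assumption.
 */
function validateDraftOutput(string $rawOutput, int $maxBodyLength = 20000): array
{
    try {
        $data = json_decode($rawOutput, true, 16, JSON_THROW_ON_ERROR);
    } catch (JsonException $e) {
        throw new InvalidArgumentException('Draft output is not valid JSON.', 0, $e);
    }

    if (!is_array($data)) {
        throw new InvalidArgumentException('Draft output must be a JSON object.');
    }

    foreach (['subject', 'body'] as $field) {
        $value = $data[$field] ?? null;
        if (!is_string($value) || trim($value) === '') {
            throw new InvalidArgumentException(sprintf('Draft field "%s" is missing or empty.', $field));
        }
    }

    if (strlen($data['body']) > $maxBodyLength) {
        throw new InvalidArgumentException('Draft body exceeds the allowed length.');
    }

    // Collapse line breaks in the subject so it stays a single header line.
    $subject = trim((string) preg_replace('/[\r\n]+/', ' ', $data['subject']));

    return ['subject' => $subject, 'body' => trim($data['body'])];
}
```

Throwing on invalid output keeps the fail-closed behavior: a bad response ends the job with an error instead of producing a best-effort draft.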

The workflow statuses tell the coarse story: pending, processing, skipped_ineligible, classified_eligible, response_generated, draft_created, and failure states. That may sound mundane, but it is one of the biggest differences between a production workflow and a demo. Jobs need durable state.
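
To make the idea of durable job state concrete, here is a small sketch of the status list as a PHP 8.1 enum with an explicit notion of terminal states. The source uses `EmailJob::STATUS_*` constants, so this enum, the single `failed` case standing in for the unnamed failure states, and the transition rules are all assumptions for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical enum mirroring the status names listed above.
 * "failed" stands in for the failure states, which the article does not enumerate.
 */
enum JobStatus: string
{
    case Pending = 'pending';
    case Processing = 'processing';
    case SkippedIneligible = 'skipped_ineligible';
    case ClassifiedEligible = 'classified_eligible';
    case ResponseGenerated = 'response_generated';
    case DraftCreated = 'draft_created';
    case Failed = 'failed';

    /** Terminal jobs should never be picked up again by a later scheduled run. */
    public function isTerminal(): bool
    {
        return match ($this) {
            self::SkippedIneligible, self::DraftCreated, self::Failed => true,
            default => false,
        };
    }

    /** Assumed forward-only transitions; anything else is rejected. */
    public function canTransitionTo(self $next): bool
    {
        if ($next === self::Failed) {
            return !$this->isTerminal();
        }

        return match ($this) {
            self::Pending => $next === self::Processing,
            self::Processing => in_array($next, [self::SkippedIneligible, self::ClassifiedEligible], true),
            self::ClassifiedEligible => $next === self::ResponseGenerated,
            self::ResponseGenerated => $next === self::DraftCreated,
            default => false,
        };
    }
}
```

A scheduler that checks `isTerminal()` before touching a job gets a simple idempotency guard: finished work stays finished, even if the same message is seen twice.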

Building a remote MCP server the model calls mid-draft

The Model Context Protocol piece is what makes this workflow especially interesting. Instead of pre-fetching a large blob of prior replies in application code and stuffing them into the prompt, the team exposed a narrow tool boundary over HTTP: a remote MCP server with a single model-facing tool, search_sent_mail.

That choice changes the shape of the architecture in three useful ways. First, retrieval logic lives in one place instead of being duplicated across orchestrators. Second, the model can decide when precedent is relevant and what kind of query to run. Third, credentials remain server-side. The model never receives mailbox tokens; it only gets normalized read-only snippets returned by the tool.

Protocol detail	Implementation
Protocol version	`2025-06-18`
Transport	HTTP POST with JSON-RPC 2.0
Methods	`initialize`, `notifications/initialized`, `tools/list`, `tools/call`
Health check	`GET /health`
Auth	Bearer token on `POST /mcp`, configured to fail closed
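
The "configured to fail closed" detail in the Auth row is worth illustrating, because the easy mistake is to treat a missing token as "no auth required". The sketch below shows one plausible shape for that check. `isAuthorizedMcpRequest` is a hypothetical helper and the header parsing is an assumption; only `hash_equals` and `preg_match` are standard PHP.

```php
<?php
declare(strict_types=1);

/**
 * Fail-closed bearer check for POST /mcp (hypothetical helper).
 * An empty configured token rejects every request instead of allowing all.
 */
function isAuthorizedMcpRequest(?string $authorizationHeader, string $configuredToken): bool
{
    $configuredToken = trim($configuredToken);
    if ($configuredToken === '' || $authorizationHeader === null) {
        return false;
    }

    if (preg_match('/^Bearer\s+(\S+)$/i', trim($authorizationHeader), $matches) !== 1) {
        return false;
    }

    // hash_equals compares in constant time to avoid leaking the token via timing.
    return hash_equals($configuredToken, $matches[1]);
}
```

The important property is the first branch: a deployment with a blank or unset token refuses all traffic rather than quietly exposing the tool.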

MCP controller dispatch excerpt

return match ($method) {
    'initialize' => $this->initialize($id),
    'notifications/initialized' => new Response('', Response::HTTP_ACCEPTED),
    'tools/list' => $this->jsonRpcResult($id, ['tools' => $this->toolDefinitions()]),
    'tools/call' => $this->callTool($id, $params),
    default => $this->jsonRpcError($id, -32601, sprintf('Method "%s" not found.', $method), 404),
};
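
For readers less familiar with the wire format, the following sketch shows roughly what a `tools/call` exchange looks like as JSON-RPC 2.0. The envelope fields and the `content` list follow the MCP specification; the helper names `rpcResult` and `rpcError`, the argument names `query` and `max_results`, and the sample snippet text are placeholders, not the project's real schema or data.

```php
<?php
declare(strict_types=1);

/** Wrap a successful result in a JSON-RPC 2.0 envelope (hypothetical helper). */
function rpcResult(int|string|null $id, array $result): array
{
    return ['jsonrpc' => '2.0', 'id' => $id, 'result' => $result];
}

/** Wrap an error in a JSON-RPC 2.0 envelope; -32601 means "method not found". */
function rpcError(int|string|null $id, int $code, string $message): array
{
    return ['jsonrpc' => '2.0', 'id' => $id, 'error' => ['code' => $code, 'message' => $message]];
}

// A tools/call request as an MCP client would send it.
// The argument names "query" and "max_results" are assumptions.
$request = json_decode(
    '{"jsonrpc":"2.0","id":7,"method":"tools/call",'
    .'"params":{"name":"search_sent_mail","arguments":{"query":"access expired","max_results":3}}}',
    true,
    16,
    JSON_THROW_ON_ERROR
);

// MCP tool results carry a "content" list; text items hold the tool output.
$response = rpcResult($request['id'], [
    'content' => [['type' => 'text', 'text' => 'placeholder snippet text']],
    'isError' => false,
]);

echo json_encode($response, JSON_PRETTY_PRINT), PHP_EOL;
```

Echoing the request `id` back in the response is what lets the client pair each tool result with the call that produced it.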

The design of search_sent_mail is intentionally narrow. It accepts a Gmail query and an optional result count, rejects unsafe query tokens, and wraps the query with in:sent. It then searches starred sent mail first and general sent mail second, deduplicates by message ID, and returns normalized snippets such as subject, date, and excerpt. That is enough to support grounded drafting without exposing broad mailbox operations to the model.
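
The query handling described above can be sketched in a few lines. This is an illustration only: the list of blocked operators, the parenthesized wrapping, and both function names are assumptions, and the real tool may reject a different set of tokens.

```php
<?php
declare(strict_types=1);

/**
 * Build the two Gmail queries for a precedent search: starred sent mail first,
 * then all sent mail. The blocked-operator list is an illustrative assumption.
 */
function buildSentQueries(string $query): array
{
    $query = trim($query);
    if ($query === '') {
        throw new InvalidArgumentException('Query must not be empty.');
    }

    // Reject operators and grouping that could widen the search beyond Sent.
    if (preg_match('/\b(?:in|is|label):/i', $query) === 1 || preg_match('/[(){}]/', $query) === 1) {
        throw new InvalidArgumentException('Query contains a disallowed token.');
    }

    return [
        sprintf('in:sent is:starred (%s)', $query),
        sprintf('in:sent (%s)', $query),
    ];
}

/**
 * Merge result lists in priority order, keeping the first hit per message ID.
 * Each result is assumed to be an array with an "id" key.
 */
function mergeUniqueById(array $resultLists, int $limit): array
{
    $seen = [];
    foreach ($resultLists as $results) {
        foreach ($results as $result) {
            if (count($seen) >= $limit) {
                return array_values($seen);
            }
            $seen[$result['id']] ??= $result;
        }
    }

    return array_values($seen);
}
```

Because the starred list is merged first, a reply that appears in both searches keeps its starred position, which is how the starred-first preference survives deduplication.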

Remote MCP tool definition excerpt

return [
    'type' => 'mcp',
    'server_label' => 'cpesr_gmail_sent',
    'server_description' => 'Read-only Gmail Sent search for draft precedents. '
        .'Call search_sent_mail when prior sent replies are needed.',
    'server_url' => rtrim(trim($this->publicUrl), '/'),
    'authorization' => trim($this->bearerToken),
    'allowed_tools' => ['search_sent_mail'],
    'require_approval' => 'never',
];

sequenceDiagram
  participant Workflow as AI workflow orchestrator
  participant OAI as OpenAI Responses API
  participant Mcp as MCP server
  participant Data as Sent mail integration
  Workflow->>OAI: Draft request with remote MCP tool
  OAI->>Mcp: tools/call search_sent_mail
  Mcp->>Data: Search prior replies
  Data-->>Mcp: Normalized snippets
  Mcp-->>OAI: Tool result
  OAI-->>Workflow: Draft JSON subject and body
  Workflow->>Data: Create Gmail draft for human review

A practical implementation detail is that the MCP endpoint lives inside the same production web application as the orchestration and operations interface. That keeps deployment simpler and avoids the assumption that MCP always requires a separate Node process.

For organizations thinking about secure tool boundaries, governed retrieval, and provider-callable endpoints, Endertech's MCP server development services page describes the broader category of work this project represents.

Orchestrating OpenAI Responses API calls in a multi-step workflow

The OpenAI Responses API is the execution engine inside the workflow, not a replacement for orchestration. That distinction is important because branching, idempotency, retries, validation, and human handoff all remain responsibilities of the application.

Each request is assembled through a single transport boundary. That boundary builds the payload, resolves the model, attaches metadata, applies options such as tools only when appropriate, enforces global and per-workflow enable flags, and checks for an API key. It also retries transient failures such as HTTP 408, 429, and 5xx responses.
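
The retry rule for 408, 429, and 5xx responses can be sketched as a small helper. The attempt count, the backoff schedule, and both function names are illustrative assumptions; the source does not state how many retries the transport performs or how long it waits.

```php
<?php
declare(strict_types=1);

/** Statuses worth retrying: request timeout, rate limit, and server errors. */
function isRetryableStatus(int $status): bool
{
    return $status === 408 || $status === 429 || ($status >= 500 && $status <= 599);
}

/**
 * Call $send until it returns a non-retryable status or attempts run out.
 * $send returns an HTTP status code; $sleep is injectable so tests stay fast.
 * Attempt count and backoff values are illustrative assumptions.
 */
function sendWithRetry(callable $send, int $maxAttempts = 3, ?callable $sleep = null): int
{
    $sleep ??= static fn (int $ms) => usleep($ms * 1000);

    for ($attempt = 1; ; $attempt++) {
        $status = $send();
        if (!isRetryableStatus($status) || $attempt >= $maxAttempts) {
            return $status;
        }
        // Exponential backoff: 250 ms, 500 ms, 1000 ms, and so on.
        $sleep(250 * 2 ** ($attempt - 1));
    }
}
```

Client errors such as 400 or 401 are deliberately not retried, since repeating an invalid or unauthorized request only burns quota.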

Payload assembly excerpt

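// Core fields are always set by the transport layer.
$payload = [
    'model' => $this->resolveModel($request),
    'input' => $this->buildInputMessages($request),
];
// $metadata is assumed to be prepared earlier in the method (not shown in this excerpt).
if ($metadata !== []) {
    $payload['metadata'] = $metadata;
}
// Pass through the remaining options (tools, reasoning, token limits)
// without letting them overwrite the reserved keys above.
foreach ($request->options as $key => $value) {
    if (!\in_array($key, ['input', 'metadata', 'model'], true)) {
        $payload[$key] = $value;
    }
}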

The pattern here is mature and portable. Treat prompts, metadata, and tool availability as part of an execution contract. Treat the workflow layer as the place where business rules live. That is how you keep an AI pipeline understandable six months later.

Operating prompts as production data in an AI workflow

Another strong production choice in this project is that prompts are managed as operational data rather than buried in code. The system stores prompts in a database table, supports one active prompt per workflow type, binds each workflow type to an approved model, and records prompt references in audit logs so operators can see exactly which prompt influenced a run.

Prompt operation concept	What it enables
Separate workflow types	Different prompts for classification and drafting
Single active prompt	Predictable production behavior
Model binding and allowlists	Prevents accidental use of unsupported model IDs
Prompt references in logs	Supports debugging and change tracking

Single-active prompt enforcement excerpt

public function enforceSingleActivePrompt(Prompt $prompt): void
{
    if ($prompt->getStatus() !== PromptStatus::Active) {
        return;
    }
    // Deactivate every other prompt of the same workflow type.
    $this->entityManager->createQueryBuilder()
        ->update(Prompt::class, 'prompt')
        ->set('prompt.status', ':status')
        ->where('prompt.workflowType = :workflowType')
        ->andWhere('prompt.id != :id')
        ->setParameter('status', PromptStatus::Inactive)
        ->setParameter('workflowType', $prompt->getWorkflowType())
        ->setParameter('id', $prompt->getId())
        ->getQuery()
        ->execute();
}
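
The model binding and allowlist idea from the table above can be illustrated with a small resolver. This is a sketch under assumptions: `resolveApprovedModel` is a hypothetical name, and the workflow names and model IDs used with it are placeholders rather than the deployed values.

```php
<?php
declare(strict_types=1);

/**
 * Resolve the model for a workflow type, refusing anything off the allowlist.
 * Workflow names and model IDs passed in are placeholders, not deployed values.
 */
function resolveApprovedModel(string $workflowType, array $bindings, array $allowlist): string
{
    $model = $bindings[$workflowType] ?? null;

    if (!is_string($model) || !in_array($model, $allowlist, true)) {
        throw new RuntimeException(sprintf('No approved model bound to workflow "%s".', $workflowType));
    }

    return $model;
}

// Example wiring with placeholder identifiers.
$bindings = ['email_classification' => 'small-model', 'email_drafting' => 'large-model'];
$allowlist = ['small-model', 'large-model'];

echo resolveApprovedModel('email_drafting', $bindings, $allowlist), PHP_EOL;
```

Failing loudly on an unknown or unapproved model ID means a typo in configuration stops the run instead of silently routing traffic to the wrong model.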

For teams new to production AI, this is a valuable lesson: prompt ops is real ops. If prompts can change behavior, then prompt activation, model binding, and auditability are part of the platform.

Observability: logging every step of an AI workflow

Good AI systems need more than provider dashboards. They need application-native observability. This workflow uses two layers: operational logs for engineering diagnostics and a job-linked activity audit log for step-by-step operator forensics.

The activity log captures operations such as classification, drafting, MCP sent-mail searches, and Gmail draft creation. It stores identifiers, request and response summaries, prompt references, statuses, and error details. That makes it possible to answer practical questions like: Why was this message skipped? Which prompt version was active? Did the model call the sent-mail tool? Did Gmail draft creation fail after a good response?

A thoughtful privacy pattern also appears here: MCP sent-search queries are logged as SHA-256 hashes instead of raw text. That preserves forensic usefulness without unnecessarily storing sensitive query contents.
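
Hashing a query for the audit log takes one call to PHP's built-in `hash` function. The sketch below assumes the query is normalized (whitespace collapsed, lowercased) before hashing so that equivalent queries share a fingerprint. That normalization and the function name are assumptions, since the source only states that SHA-256 is used.

```php
<?php
declare(strict_types=1);

/**
 * Produce a log-safe fingerprint of a sent-mail search query.
 * Normalizing case and whitespace first is an assumption, made so that
 * equivalent queries map to the same hash.
 */
function hashQueryForLog(string $query): string
{
    $normalized = strtolower(trim((string) preg_replace('/\s+/', ' ', $query)));

    // SHA-256 yields a 64-character lowercase hex digest.
    return hash('sha256', $normalized);
}
```

One design consideration: short, guessable queries can be recovered from a plain hash by trying candidates, so a keyed variant such as `hash_hmac('sha256', $normalized, $secret)` may be worth weighing. That is an option to consider, not something the source describes.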

Audit log writer excerpt

$log = (new AILog())
    ->setEmailJob($entry->emailJob)
    ->setWorkflowType($entry->workflow)
    ->setOperation($entry->operation)
    ->setStatus($entry->status)
    ->setIdentifiers($entry->identifiers)
    ->setPromptReference([/* id, name, model, workflow_type */])
    ->setRequestSummary($entry->requestSummary)
    ->setResponseSummary($entry->responseSummary);
$this->entityManager->persist($log);

This is also where the project shows healthy honesty. A runtime logging_level setting exists, but it is not yet enforced in code, so production currently writes full audit rows. Calling out that limitation is a sign of real engineering practice.

Human review handoff in Gmail

Gmail is the approval surface, not the intelligence layer. The system polls unread inbox messages, loads thread context through the Gmail API, and, after a successful eligible run, creates a reply draft in the original thread using users.drafts.create. Humans still review and send the message from Gmail itself. After terminal outcomes such as draft creation, skip, or failure, the inbound message is marked read so the scheduler does not rediscover it.

Draft contract excerpt

/**
 * Create a reply draft in an existing thread (human review in Gmail; no send).
 * $draftPayload keys: "to" (required), "text" and/or "html", optional "subject", etc.
 */
public function createReplyDraft(string $threadId, array $draftPayload, ?MailboxAccessContext $context = null): string;

The precedent corpus for drafting is the mailbox's Sent folder, searched starred-first. That is a subtle but useful signal: some historical replies are more trustworthy than others, and the tool design reflects that.

Lessons for teams building or hiring AI workflow and MCP expertise

Start with workflow semantics. Define branching, idempotency, validation, failure paths, and human approval before debating model settings.
Use MCP for clean tool boundaries. A narrow, server-authoritative tool is often better than preloading giant context blobs into a prompt.
Classify before you draft. A strict gate saves cost and reduces the chance of low-value drafts on spam, receipts, or already-handled threads.
Treat prompts and logs like production assets. Prompt activation, request metadata, and per-job audit trails are part of the system, not side notes.
Keep humans in the loop until the business is ready for more risk. Draft-only workflows preserve trust and make adoption easier.

There is also a broader architectural lesson here. The project does not overclaim. It does not pretend to be a CRM replacement. It does not promise instant autonomous support. And it does not frame retrieval as magic. Instead, it shows what responsible business automation looks like when reliability matters.

Next steps

This case study reflects a practical way to deploy AI in support and operations: orchestrated stages, model-driven retrieval, secure remote tools, and a human-reviewed handoff. It is a stronger pattern than forcing everything into one prompt or jumping straight to auto-send.

If your team is planning a support, operations, or customer-success workflow, Endertech can help design the surrounding application logic, integrations, and governance through custom software development for production AI workflows. And if the harder part is exposing internal capabilities safely to models, their MCP server development services are directly aligned with the kind of tool architecture shown here.

For readers interested in adjacent implementations, Endertech has also written about another MCP-driven production build in a case study on replacing Shopify search with AI. Different business problem, same larger theme: production AI succeeds when orchestration, retrieval, and operational discipline are designed together.
