Microsoft Azure Prompt Flow · 2023 · B2B Tooling

When AI workflows are script-heavy, iteration breaks at scale

Led end-to-end design of Prompt Flow’s visual system: Author → Evaluate → Compare.

Key shift: from scattered scripts & tools → one continuous build → test → compare loop.

Demo: build → test → compare without leaving the editor

3× faster iteration

(integrated bulk test + eval in-flow)

20% less setup overhead

(unified configuration + fewer context switches)

Adopted as a standard workflow pattern

(rolled into Azure AI tooling)

THE CHALLENGE

The Complexity of Scaling AI Workflows

Iterative but Fragmented: Prompt engineering evolves through constant iteration—but rarely in one place.

The Visibility Gap: Scripts, notebooks, and configs scatter decisions into a black box.

At scale, AI workflows fail not because of model quality,
but because iteration becomes invisible and unmanageable.

The standard AI lifecycle:

A loop that becomes unmanageable when scaled across cross-functional teams.

THE STAKEHOLDERS

Who feels the pain?

AI Engineers

Struggling with complex DAG logic.

Data Scientists

Lacking systematic evaluation tools.

Product Managers

Having zero visibility into prompt performance.

BREAKDOWN

Understanding the Problem Space

Why Does Existing AI Tooling Fail in Practice?

Evidence-Based Insights

12 Product Logs Analyzed (To understand behavioral patterns)

8 Expert Interviews (In-depth 1:1 sessions with DS & Engineers)

5 Contextual Inquiries (Observing real-world AI development workflows)


Fragmented Workflow


  • Design lives in Jupyter or local notebooks

  • Evaluation relies on ad-hoc scripts or manual scoring

  • Deployment handled separately in production pipelines

Result: workflows become manual, fragmented, and hard to reproduce.

The Tooling Gap: Usability vs. Control


Playground-style tools (usability)

Pros: Easy to start

Cons: Lacks engineering rigor; subjective evaluation only

Code-centric tools (control)

Pros: Flexible for developers

Cons: Siloed environment; reproducibility challenges

Result: No tool today connects design, evaluation, and deployment into a single, shared workflow.

VALIDATION

3 Core Pain Points

Invisible Logic

Hard to trace or debug complex execution flows buried in scattered scripts.

Unscalable Quality

Manual, one-off testing fails to provide systematic metrics across datasets at scale.

Siloed Handoffs

Significant friction between roles due to the lack of a shared visual workspace.

How might we transform fragmented prompt experiments into a visible, evaluable, and shareable workflow system that scales across teams?

DESIGN STRATEGY

The Visual DAG

Core Strategy

Represent prompt workflows as first-class, visual system objects — not hidden code.

By modeling workflows as a Directed Acyclic Graph (DAG), execution logic becomes explicit, traceable, and reusable. This shared structure enables consistent evaluation, reliable collaboration, and scalable deployment across teams.
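To make the idea concrete, here is a minimal sketch of a flow expressed as an explicit DAG in Python. This is not Prompt Flow's actual engine, and the node names are hypothetical; the point is that once dependencies are declared as data, execution order and tracing come from the graph itself rather than from script order.

```python
from graphlib import TopologicalSorter  # stdlib topological sort (Python 3.9+)

# Hypothetical three-node flow: fetch context -> build prompt -> call model.
# Each node declares its upstream dependencies, so the execution logic is
# explicit, inspectable data instead of behavior hidden inside a script.
flow = {
    "fetch_context": {"deps": [], "run": lambda inp: "retrieved docs"},
    "build_prompt": {
        "deps": ["fetch_context"],
        "run": lambda inp: f"Answer using: {inp['fetch_context']}",
    },
    "call_model": {
        "deps": ["build_prompt"],
        "run": lambda inp: f"<model output for {inp['build_prompt']!r}>",  # stub
    },
}

# Derive a valid execution order from the graph itself.
order = TopologicalSorter({n: s["deps"] for n, s in flow.items()}).static_order()

results = {}
for name in order:
    upstream = {d: results[d] for d in flow[name]["deps"]}
    results[name] = flow[name]["run"](upstream)
    print(f"[trace] {name} -> {results[name]!r}")  # every step is observable
```

Because the same structure drives authoring, testing, and evaluation, any run over it can be traced node by node and reproduced exactly.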


SOLUTION 1

Making Prompt Logic Visible

Turn invisible execution into a visible workflow.

A visual editor that exposes prompt execution as a structured flow—so teams can understand, debug, and iterate with confidence.

DAG-based workflow model

Graph-based workflow view

Node-level inspection

Inline testing

SOLUTION 2

Unified Evaluation Flow

From subjective testing to consistent results.

A centralized evaluation flow that enables bulk testing, standardized metrics, and side-by-side comparison.

1. Run large-scale prompt tests

2. Score outputs with standardized metrics

3. Compare results side by side

Evaluation runs on the same underlying workflow model
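For illustration, the public promptflow SDK (2023) exposes this same bulk-test-then-evaluate loop programmatically. The sketch below is hedged: exact names can vary by SDK version, and the flow and data paths are hypothetical.

```python
from promptflow import PFClient

pf = PFClient()

# Bulk run: execute the authored flow against a whole test dataset.
base_run = pf.run(
    flow="./my_chat_flow",        # hypothetical flow directory
    data="./test_cases.jsonl",    # hypothetical test dataset
    column_mapping={"question": "${data.question}"},
)

# Evaluation run: score the bulk run with a reusable evaluation flow,
# rather than with ad-hoc scripts or manual inspection.
eval_run = pf.run(
    flow="./eval_accuracy_flow",  # hypothetical evaluation flow
    data="./test_cases.jsonl",
    run=base_run,                 # reuse the bulk run's outputs
    column_mapping={"answer": "${run.outputs.answer}"},
)

print(pf.get_details(eval_run))  # per-row scores, ready for comparison
```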

SOLUTION 3

Collaboration & Deployment

From handoffs to continuity

A shared workspace that carries flows from authoring into team reuse and production, so finished work becomes a team asset rather than a handoff artifact.

Shared flow gallery

One-click deployment

Design, evaluation, and deployment stay connected

EXPLORATION

User Flow & Early Wireframe

Resolving system complexity before committing to UI decisions

Defining the ideal workflow across experimentation, evaluation, and production.

From Abstract Logic to Tangible Interfaces

EXPLORATION

System Architecture & Navigation Model

A unified system that supports authoring, evaluation, and production handoff.

Highlighted paths show the most common authoring → evaluation → deployment flow.

Early Validation in a Code-Centric Environment

Key Insight

Code-centric workflows scaled execution, but collapsed collaboration. They worked for individual engineers — and failed across roles.

This limitation led us to design a shared, visual workflow in Azure ML Studio.

ITERATION

Improving Graph & Node Readability

As flows grew larger, readability — not functionality — became the primary bottleneck.

Before

After

1. Fragmented execution context → Unified into a single flow

2. Panel-dependent understanding → Flow-first layout

3. Flat visual hierarchy → Clear node hierarchy

ITERATION

Integrating Evaluation Flow

Before: Separate setup for Bulk Run and Evaluation

After: Unified configuration, with less switching and faster testing


ITERATION

Unifying the “View Result” Flow for Multi-Variant Evaluation

Iteration Focus: Turning fragmented evaluation results into a first-class, comparable system object.

Before

After

FINAL DESIGN

Visual Prompt Flow Editor

Unified authoring interface with inline configuration and visualized node status, improving clarity and iteration speed.

FINAL DESIGN

Evaluation & Result Review

Watch the seamless flow: triggering a bulk test directly from the authoring view in just 2 clicks.

DESIGN DETAIL

Key Interaction Decisions

Selective deep dives into critical interaction decisions that improved clarity and flexibility.

Detail 1 — Clear Input–Output Association

Problem
Multiple inputs made data flow hard to trace in complex nodes.

Decision
Enable linking directly from input fields to reveal data flow.

Detail 2 — No-Lock-in Authoring (Code-First Mode)

Problem
Advanced users preferred code-level control over visual-only editing.

Decision
Enable instant switching between visual and Python modes with full sync.
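For reference, a Python node in code view is just a decorated function using the promptflow SDK's @tool decorator. The function below is a hypothetical example for illustration, not a shipped node.

```python
from promptflow import tool

@tool
def normalize_answer(answer: str, max_chars: int = 400) -> str:
    """Hypothetical Python node: clean the model output before the next node."""
    return " ".join(answer.split())[:max_chars]
```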

DESIGN ALIGNMENT

Adapting Fluent 2 for AI Workflows

The system is built on Fluent 2; we adapted its supporting UI patterns to handle the density and interactivity that AI workflow authoring requires.

Key Adaptation

  • High-density command surfaces

  • Adaptive panels for multi-context workflows

  • AI-specific interaction patterns


IMPACT

Impact & Adoption

Adopted as a standard workflow pattern
Used across Azure AI tooling for prompt authoring & evaluation.

3× faster iteration speed
Enabled by in-editor bulk testing and integrated evaluation flows.

~20% reduction in setup overhead
Fewer environment switches and unified configuration.

IMPACT

Key Takeaways

Systems beat features
I learned that consistency is infrastructure — shared workflow logic matters more than isolated UI polish when scaling AI tools across teams.

Validation over speculation
High-fidelity prototypes aligned engineers faster than documentation or meetings, reducing decision risk early.

Design needs governance to scale
Design systems without contribution models eventually bottleneck teams — governance is a force multiplier, not overhead.