@weiyuanyue weiyuanyue commented Dec 15, 2025

Summary

This PR resolves a critical production-breaking bug that blocked all FoundryLocal model downloads in AI Dev Gallery after upgrading to Foundry Local v0.8.x+. We migrate from fragile custom HTTP API calls to the official Microsoft.AI.Foundry.Local.WinML SDK (v0.8.2.1), restoring full functionality and establishing a resilient foundation for future compatibility.

Impact: Users can now reliably download, prepare, and use FoundryLocal models. The system is significantly more robust against upstream API changes.


Background & Root Cause

In July 2025, Foundry Local changed the internal format of a critical Name field in its HTTP API response. While Foundry Local handled this change internally, it was never communicated externally, silently breaking AIDG's direct HTTP-based integration. (ADO)

After upgrading to Foundry Local v0.8.x:

  • All model download requests failed silently due to field format mismatch
  • Downstream workflows (model preparation → chat/inference) were completely blocked
  • Users could not download or use any FoundryLocal models

Business Impact:

  • Disrupted critical developer workflows
  • Increased support burden and user frustration
  • Eroded trust in AIDG's stability and reliability

Solution: SDK Migration

To eliminate this entire class of failures and future-proof the integration, we migrate to the official SDK:

Key Benefits

  1. The native SDK is fully self-contained and requires neither the Foundry Local CLI nor any external services.
  2. The native SDK path replaces the HTTP service, eliminating cross-process overhead.
  3. Explicit model preparation removes cold-start latency and ensures deterministic loading.
  4. Unified cache management exposes all models and enables precise storage control.
  5. Typed native streaming replaces SSE parsing for higher reliability and lower latency.
  6. Alias-based model identity standardizes lookup and eliminates naming ambiguity.
  7. Removing the web service layer reduces failure surfaces and simplifies maintenance.

Screenshots

  1. Foundry Local running normally with no UX change.
  2. Systems that do not support Foundry Local, showing the new in-app message presented to users. (Previously, the UI incorrectly instructed users to install Foundry Local.)
  3. The updated Settings page, which now includes a management interface for Foundry Local's cached models.
  4. Warning notification shown when a downloaded model fails to load during EnsureModelLoadedAsync validation.

Technical Changes

1. Architecture Migration: From Web Service to Native SDK

Migrated from HTTP-based integration (launching external service process + OpenAI-compatible REST API) to direct SDK integration:

Before: Application → HTTP Client → Foundry Local Service → Model
After:  Application → FoundryClient → FoundryLocalManager (SDK) → IModel → Model

Benefits:

  • In-process execution eliminates HTTP/SSE overhead
  • Direct access to model metadata (e.g., MaxOutputTokens)
  • Native async streaming without parsing
  • No external process lifecycle management
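
The initialization path can be sketched roughly as follows. FoundryLocalManager.CreateAsync and GetCatalogAsync are the SDK entry points named in this PR; the FoundryClient constructor shape and the Configuration contents are assumptions for illustration, not the shipped code:

```csharp
// Illustrative sketch only: FoundryClient wraps the in-process SDK singleton,
// replacing the old external web-service process.
public sealed class FoundryClient
{
    private readonly FoundryLocalManager _manager;
    private readonly ICatalog _catalog;

    private FoundryClient(FoundryLocalManager manager, ICatalog catalog)
    {
        _manager = manager;
        _catalog = catalog;
    }

    public static async Task<FoundryClient> CreateAsync()
    {
        // SDK calls as named in this PR; Configuration contents are assumed.
        var manager = await FoundryLocalManager.CreateAsync(new Configuration());
        var catalog = await manager.GetCatalogAsync();
        return new FoundryClient(manager, catalog);
    }
}
```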

2. Explicit Model Lifecycle Management

Replaced implicit model loading (triggered by first inference request) with explicit two-phase lifecycle:

Phase 1 - Preparation:

  • EnsureModelReadyAsync() in FoundryLocalModelProvider calls EnsureModelLoadedAsync() in FoundryClient
  • SDK loads model into memory (GPU/NPU/CPU) and caches in _loadedModels dictionary
  • Thread-safe loading via SemaphoreSlim prevents duplicate loads

Phase 2 - Usage:

  • GetIChatClient() retrieves loaded model via GetLoadedModel(alias)
  • Returns IChatClient adapter wrapping the SDK's native OpenAIChatClient

This enables predictable loading behavior with progress indication and avoids inference-time delays.
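
The preparation phase can be sketched as follows. GetModelAsync, IsCachedAsync, and LoadAsync are SDK calls named in this PR, and _loadedModels and the SemaphoreSlim lock are described above; the exact control flow and error text are illustrative assumptions:

```csharp
// Sketch of the thread-safe, idempotent load path (not the exact shipped code).
private readonly SemaphoreSlim _loadLock = new(1, 1);
private readonly Dictionary<string, IModel> _loadedModels = new();

public async Task EnsureModelLoadedAsync(string alias, CancellationToken ct = default)
{
    // Fast path: already loaded, skip the lock.
    if (_loadedModels.ContainsKey(alias))
    {
        return;
    }

    await _loadLock.WaitAsync(ct);
    try
    {
        // Double-check inside the lock so concurrent callers don't load twice.
        if (_loadedModels.ContainsKey(alias))
        {
            return;
        }

        IModel model = await _catalog.GetModelAsync(alias); // variant selection is automatic
        if (!await model.IsCachedAsync())
        {
            throw new InvalidOperationException($"Model '{alias}' is not downloaded.");
        }

        await model.LoadAsync(); // loads into GPU/NPU/CPU memory
        _loadedModels[alias] = model;
    }
    finally
    {
        _loadLock.Release();
    }
}
```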


3. SDK Adapter for Microsoft.Extensions.AI

FoundryLocalChatClientAdapter bridges the SDK's OpenAIChatClient to our IChatClient abstraction:

  • Maps ChatOptions parameters (Temperature, TopP, MaxOutputTokens, etc.) to SDK's ChatSettings
  • Enforces MaxTokens configuration (required for output generation) with fallback defaults
  • Converts streaming responses to IAsyncEnumerable<ChatResponseUpdate>
  • Validates non-empty output and provides actionable error messages
  • Transforms message formats between abstractions (text-only; multi-modal not yet supported)
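
The adapter's streaming path has roughly this shape. ApplyChatOptions, ConvertToOpenAIMessages, and the non-empty-output check are described in this PR; ExtractText and the exact ChatResponseUpdate construction are simplified assumptions (usings omitted for brevity):

```csharp
// Illustrative sketch of the adapter's streaming bridge (not the exact shipped code).
public async IAsyncEnumerable<ChatResponseUpdate> GetStreamingResponseAsync(
    IEnumerable<ChatMessage> messages,
    ChatOptions? options = null,
    [EnumeratorCancellation] CancellationToken ct = default)
{
    ApplyChatOptions(options);               // maps Temperature, TopP, MaxOutputTokens to ChatSettings
    var openAiMessages = ConvertToOpenAIMessages(messages); // text-only today

    int chunkCount = 0;
    await foreach (var chunk in _chatClient.CompleteChatStreamingAsync(openAiMessages).WithCancellation(ct))
    {
        chunkCount++;
        yield return new ChatResponseUpdate(ChatRole.Assistant, ExtractText(chunk));
    }

    // Empty output is treated as a failure with an actionable message.
    if (chunkCount == 0)
    {
        throw new InvalidOperationException(
            "Model produced no output; verify MaxTokens is set and the model is loaded.");
    }
}
```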

4. Stable Model Identity

Adopted SDK's two-tier model identification:

  • Alias - Stable family identifier (e.g., "qwen2.5-0.5b") used in fl://<alias> URLs
  • ModelId - Precise variant identifier (e.g., "qwen2.5-0.5b-instruct-generic-cpu:3") for operations

ICatalog.GetModelAsync(alias) handles automatic variant selection. The Task field enables filtering by capability ("chat-completion", "automatic-speech-recognition"). This approach insulates the app from variant naming changes.
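
Resolving the alias out of an fl:// URL is a small, self-contained step; a minimal sketch (helper name and error text are illustrative, not the shipped code):

```csharp
// Sketch: extract the stable alias from an fl://<alias> URL.
public static string GetAliasFromUrl(string url)
{
    const string Scheme = "fl://";
    if (string.IsNullOrEmpty(url) || !url.StartsWith(Scheme, StringComparison.OrdinalIgnoreCase))
    {
        throw new ArgumentException($"Expected an fl:// URL, got '{url}'.");
    }

    return url.Substring(Scheme.Length).TrimEnd('/');
}
```

For example, "fl://qwen2.5-0.5b" yields the alias "qwen2.5-0.5b", which ICatalog.GetModelAsync(alias) then resolves to a concrete variant such as "qwen2.5-0.5b-instruct-generic-cpu:3".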


5. Unified Cache Management

Integrated SDK's cache (managed in ModelCacheDir) with application's cache UI:

  • GetCachedModelsWithDetails() queries SDK cache via ICatalog.GetCachedModelsAsync()
  • DeleteCachedModelAsync() unloads active models, removes from SDK cache, and cleans internal tracking
  • ClearAllCacheAsync() deletes all Foundry Local models with proper state cleanup
  • Settings UI now displays all cached models with unified management
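
The delete flow described above can be sketched as follows. The unload-before-delete ordering and the _downloadedModels cleanup come from this PR; the SDK call names for unloading and cache deletion are assumptions for illustration:

```csharp
// Illustrative sketch of single-model deletion (exact SDK call names assumed).
public async Task<bool> DeleteCachedModelAsync(string alias)
{
    // 1. Unload first so the SDK releases device memory and file handles.
    if (_loadedModels.TryGetValue(alias, out var model))
    {
        await model.UnloadAsync();           // assumed SDK call
        _loadedModels.Remove(alias);
    }

    // 2. Remove the files from the SDK-managed cache (ModelCacheDir).
    await _catalog.DeleteCachedModelAsync(alias); // assumed SDK call

    // 3. Keep application-side tracking consistent with the cache.
    _downloadedModels.Remove(alias);
    return true;
}
```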

6. Telemetry

Added comprehensive telemetry in FoundryLocalEvents.cs:

  • FoundryLocalOperationEvent - Success tracking with duration metrics (Info level)
  • FoundryLocalDownloadEvent - Download success/failure, size, duration (Info/Critical)
  • FoundryLocalErrorEvent - Operation failures with phase and error details (Critical)

Covers initialization, loading, downloads, deletions, and inference operations.


7. Package Dependencies

Added:

  • Microsoft.AI.Foundry.Local.WinML 0.8.2.1

Upgraded:

  • Microsoft.ML.OnnxRuntimeGenAI.Managed 0.10.1 → 0.11.4
  • Microsoft.ML.OnnxRuntimeGenAI.WinML 0.10.1 → 0.11.4

Removed:

  • HTTP-based OpenAI client dependencies
  • Web service management utilities

Build Configuration:

  • Updated nuget.config, ExcludeExtraLibs.props, and Directory.Packages.props for proper packaging and conflict resolution

8. Removed Components

  • Web service process management (FoundryServiceManager.cs)
  • HTTP client and SSE parsing logic
  • JSON serialization/deserialization for HTTP payloads
  • Localhost service discovery and port management

Eliminated failure modes: process crashes, HTTP timeouts, connection issues, SSE parsing errors.


📊 Execution Flow Comparison

Before: Web Service Architecture

sequenceDiagram
    participant App as Application
    participant Provider as FoundryLocalModelProvider
    participant ServiceMgr as FoundryServiceManager
    participant HTTP as HttpClient (OpenAI SDK)
    participant Service as Foundry Local Web Service
    participant Model as Model Inference

    rect rgb(240, 240, 240)
        Note over App,ServiceMgr: Initialization (once at startup)
        Provider->>ServiceMgr: StartService()
        ServiceMgr->>Service: Launch Web Service Process
        Service-->>ServiceMgr: Service URL (http://localhost:xxxx)
        Provider->>Provider: Store service URL
    end

    rect rgb(250, 250, 240)
        Note over App,HTTP: Get Chat Client (no model loading yet)
        App->>Provider: GetIChatClient(url)
        Provider->>HTTP: new OpenAIClient(serviceUrl)
        HTTP-->>Provider: OpenAI IChatClient wrapper
        Provider-->>App: IChatClient (pointing to service)
    end

    rect rgb(240, 250, 240)
        Note over App,Model: Inference (model loaded on first request)
        App->>HTTP: GetStreamingResponseAsync(messages)
        HTTP->>Service: POST /v1/chat/completions
        Service->>Model: Load model (if not loaded) + Inference
        Model-->>Service: Response Stream (SSE format)
        Service-->>HTTP: text/event-stream chunks
        HTTP->>HTTP: Parse SSE: "data: {...}\n\n"
        HTTP-->>App: ChatResponseUpdate chunks
    end

    Note over Provider,Service: Issues:<br/>- HTTP/SSE overhead for local calls<br/>- Implicit model loading (no control)<br/>- Service process management<br/>- SSE parsing complexity

After: Native SDK Architecture

sequenceDiagram
    participant App as Application
    participant Provider as FoundryLocalModelProvider
    participant Client as FoundryClient
    participant Manager as FoundryLocalManager (SDK)
    participant Catalog as ICatalog
    participant Model as IModel
    participant ChatClient as OpenAIChatClient (SDK)
    participant Adapter as FoundryLocalChatClientAdapter

    rect rgb(240, 240, 240)
        Note over App,Manager: Initialization (on first use)
        App->>Provider: GetModelsAsync() / IsAvailable()
        Provider->>Provider: InitializeAsync()
        Provider->>Client: FoundryClient.CreateAsync()
        Client->>Manager: FoundryLocalManager.CreateAsync(config)
        Manager-->>Client: Instance + IsInitialized
        Client->>Manager: GetCatalogAsync()
        Manager-->>Client: ICatalog
        Client-->>Provider: FoundryClient instance
    end

    rect rgb(250, 240, 240)
        Note over App,Model: Model Preparation (explicit, before inference)
        App->>Provider: EnsureModelReadyAsync(url)
        Provider->>Provider: Check _foundryManager.GetLoadedModel(alias)
        Provider->>Client: EnsureModelLoadedAsync(alias)
        Client->>Client: Acquire _loadLock, check _loadedModels
        Client->>Catalog: GetModelAsync(alias)
        Catalog-->>Client: IModel (variant selected)
        Client->>Model: IsCachedAsync() / LoadAsync()
        Note over Model: Load into GPU/NPU/CPU memory
        Model-->>Client: Loaded
        Client->>Client: Cache in _loadedModels[alias]
        Client-->>Provider: Completed
        Provider-->>App: Ready
    end

    rect rgb(240, 250, 240)
        Note over App,Adapter: Get Chat Client (model must be loaded)
        App->>Provider: GetIChatClient(url)
        Provider->>Client: GetLoadedModel(alias)
        Client-->>Provider: IModel (from _loadedModels)
        Provider->>Model: GetChatClientAsync()
        Model-->>Provider: SDK's OpenAIChatClient
        Provider->>Provider: GetModelMaxOutputTokens(alias)
        Provider->>Adapter: new FoundryLocalChatClientAdapter(chatClient, modelId, maxTokens)
        Adapter-->>Provider: IChatClient wrapper
        Provider-->>App: IChatClient ready for inference
    end

    rect rgb(240, 240, 250)
        Note over App,Model: Inference (direct in-process calls)
        App->>Adapter: GetStreamingResponseAsync(messages, options)
        Adapter->>Adapter: ApplyChatOptions(options) → Set MaxTokens, Temperature, etc.
        Adapter->>Adapter: ConvertToOpenAIMessages(messages)
        Adapter->>ChatClient: CompleteChatStreamingAsync(openAIMessages)
        ChatClient->>Model: Direct API call (in-process)
        Model-->>ChatClient: Streaming chunks
        loop For each chunk
            ChatClient-->>Adapter: StreamingResponse chunk
            Adapter->>Adapter: Extract content, create ChatResponseUpdate
            Adapter-->>App: Yield ChatResponseUpdate
        end
        Adapter->>Adapter: Validate chunkCount > 0
    end

    Note over Provider,Model: Benefits:<br/>- No HTTP/network overhead<br/>- Explicit model lifecycle control<br/>- Direct in-process calls<br/>- Native streaming (no parsing)<br/>- Access to model metadata

Key Flow Differences

| Aspect | Before (Web Service) | After (Native SDK) |
| --- | --- | --- |
| Initialization | Service process start → get service URL | Lazy SDK initialization → get catalog (on first use) |
| Model loading | Implicit (on first HTTP request to the service) | Explicit via EnsureModelReadyAsync() before use |
| Chat client creation | OpenAI HTTP client wrapper (lightweight) | Native SDK chat client from a loaded IModel |
| Communication | HTTP POST to localhost service | Direct in-process method calls |
| Streaming | SSE text parsing ("data: {...}\n\n") | Native async enumerable (IAsyncEnumerable) |
| Error handling | HTTP status codes (400, 500, etc.) | Typed exceptions (InvalidOperationException, etc.) |
| Resource management | External service process + HTTP connections | SDK singleton + IModel instances in memory |
| Model state | Unknown to application (service internal) | Explicit tracking in _loadedModels dictionary |

Known Limitations & Follow-up Work

IDisposableAnalyzers Build Warnings

  • Root Cause: Microsoft.AI.Foundry.Local.WinML (v0.8.2.1) includes IDisposableAnalyzers (v4.0.8) as a transitive dependency
  • Impact: 237+ analyzer violations across the codebase related to improper IDisposable pattern usage
  • Temporary Solution: Suppressed IDISP001, IDISP003, IDISP017 in Directory.Build.props and project files
  • Planned Remediation: Dedicated follow-up PR to address all violations project-wide

Code/Project Generation Features - Deferred to Future PR

  1. Focus on Core Integration: The priority is establishing stable FoundryLocal SDK integration, which requires careful design for complex scenarios like chat clients, model lifecycle, and error handling. Code generation can be added once the foundation is solid.

  2. Project Generator Redesign: Our existing project generator requires refactoring to support multiple AI sources (GitHub Models, Azure OpenAI, FoundryLocal). Rather than retrofit FoundryLocal into the current system, we'll redesign the architecture to cleanly support all providers in a follow-up PR.

FoundryLocal SDK Bug: cancel button doesn't stop model downloads

When users click "Cancel" during a large model download (multi-GB files), the UI shows "canceled" but the download continues to 100% completion in the background. AI Dev Gallery users therefore cannot stop accidental downloads, perceive the cancel functionality as broken, and are left with a confusing, negative experience. The SDK ignores standard .NET cancellation patterns (CancellationToken) during download operations. issue link


Functional Testing

1. Initialization & Service Availability

  • Application Startup - Foundry Local client auto-initializes
  • Connection Status Detection - Verify Foundry Local availability in model picker
  • IsAvailable() Returns False - When Foundry Local is not installed/initialized
  • SDK Initialization Failure - User messaging when service fails to start
  • Error Recovery - Error messages and retry mechanism after initialization failure

2. Model Catalog & Discovery

  • List Complete Catalog - GetAllModelsInCatalog() returns full catalog including non-downloaded models
  • Model Information Integrity - DisplayName, Alias, FileSizeMb display correctly
  • Cached Status Identification - Downloaded models correctly marked
  • Runtime Info Filtering - Models with null Runtime are correctly filtered
  • ExecutionProvider Display - Correctly shows "CPU", "WebGPU", "QNN", etc.
  • File Size Display - Model file sizes correctly displayed (MB)
  • License Information - License info correctly displayed

3. Model Download Flow

  • Basic Download - Download uncached models from catalog
  • Progress Reporting - Progress callback fires correctly (0% → 100%)
  • Progress Update Frequency - Progress callback fires at reasonable intervals
  • [-] Download Cancellation - CancellationToken correctly cancels download (SDK issue)
  • Skip Re-download - Already-downloaded models return success immediately
  • Download Failure Telemetry - Failure scenarios log telemetry with error messages
  • Concurrent Download Handling - Multiple concurrent downloads without race conditions
  • Network Interruption Handling - Retries x times, then allows clean restart after reconnection
  • Disk Space Exhaustion - Handling when disk is full during large model download
  • Auto-prepare After Download - Model is automatically prepared (loaded) after download completes
  • Download Telemetry Event - FoundryLocalDownloadEvent includes ModelAlias, Success, ErrorMessage, FileSizeMb, Duration
  • Failure Log Level - Download failures logged with LogLevel.Critical

4. Model Preparation & Loading

  • First-time Preparation - EnsureModelReadyAsync() successfully prepares model
  • No Deadlock on UI Thread - Preparing model on UI thread context doesn't deadlock
  • Idempotency Check - Already-prepared models skip re-preparation on EnsureModelReadyAsync() call
  • Concurrent Preparation Safety - Multiple concurrent calls for same model handled safely via semaphore lock
  • Unprepared Model Query - GetPreparedModel() returns null for unprepared models
  • Prepared Model Query - GetPreparedModel() returns valid IModel for prepared models
  • x64 Model Loading - LoadAsync completes successfully on x64 platform
  • ARM64 Model Loading - LoadAsync completes successfully on ARM64 platform
  • Error on Unprepared Use - Calling GetIChatClient without preparation throws exception
  • Use After Unload - Using after ClearPreparedModelsAsync requires re-preparation

5. Chat Inference & Streaming

  • Unprepared Exception - GetIChatClient() throws clear exception when model not prepared
  • End-to-End Streaming Inference - Chat streaming correctly returns response chunks
  • MaxOutputTokens Limit - MaxTokens parameter correctly limits output length
  • Default MaxTokens - Uses model's MaxOutputTokens or default 1024 when not set
  • Temperature Parameter - ChatOptions.Temperature correctly applied
  • Model Stop Conditions - Streaming handles model-generated stop conditions gracefully
  • Empty Output Detection - Model with no output throws InvalidOperationException and logs telemetry
  • Empty/Null Message Input - Empty or null message inputs throw appropriate errors
  • System Prompts - System Message correctly passed
  • Multi-turn Conversation - Consecutive messages sent successfully
  • Long Conversation History - Long conversation history processed without truncation

6. Model Cache Management 🆕

  • Settings Page Display - List all Foundry Local cached models in Settings > Storage
  • Integrated List - GetAllModelsAsync() returns both CacheStore and Foundry Local models
  • Cache Size Calculation - Correctly displays Foundry Local model storage usage
  • Total Size Aggregation - Total cache size includes Foundry Local models
  • Delete Single Model - DeleteCachedModelAsync() correctly deletes specific model
  • Unload Before Delete - Loaded models are unloaded before deletion
  • Update List After Delete - _downloadedModels correctly updated after deletion
  • Clear All Cache - ClearAllCacheAsync() deletes all Foundry Local models
  • Clear Telemetry - Clear operation logs deletion count
  • Disable Folder Button - "Open Folder" is disabled for Foundry Local models (storage is managed by the SDK)
  • Delete Error Handling - Delete failures log telemetry events

7. Code Generation 🔜 TODO: Next PR

  • [-] GetIChatClientString() returns valid compilable C# code
  • [-] Generated code includes correct SDK initialization pattern
  • [-] Generated code uses correct Alias (not deprecated Name)
  • [-] Generated code includes necessary using statements
  • [-] Generated code runs directly

Regression Testing

8. Project Generation 🔜 TODO: Next PR

  • [-] Foundry Local models generate correct code samples
  • [-] Reference correct NuGet packages (Microsoft.AI.Foundry.Local.WinML, Microsoft.Extensions.AI)
  • [-] Generated projects compile without errors
  • [-] NugetPackageReferences property returns correct package list

9. Multi-Provider Compatibility

  • Other Providers Unaffected - OpenAI, Ollama, GitHub Models, HuggingFace work normally
  • Provider Switching - Switching between Foundry Local and other providers works seamlessly
  • Model Picker UI - All provider types correctly displayed

10. UI/UX Flows

  • Foundry Local Picker View - Displays correctly
  • Download Progress UI - Smooth updates (no flickering or freezing)
  • Preparation Status Indication - User shown status when model not ready
  • Error Message Display - Download/preparation failures shown to user
  • Model Details Rendering - Size, License, Description correctly rendered
  • ExecutionProvider Short Labels - GetShortExecutionProvider() displays correctly

11. Cross-Sample Testing

Test Foundry Local models in the following Samples:

  • Generate (Language Models) - Basic text generation (with MaxOutputTokens setting)
  • Chat - Multi-turn conversation
  • RAG - Retrieval Augmented Generation
  • Other IChatClient Samples - All samples using chat client

12. Platform-Specific Validation

  • Windows x64
  • Windows ARM64 QNN
  • Intel NPU
  • No APPX1101 Errors - No duplicate DLL errors on either platform

⚠️ Edge Cases & Error Handling

13. Invalid Inputs

  • Invalid URL Format - GetIChatClient() throws clear exception
  • Wrong Model Type - DownloadModel() with non-FoundryCatalogModel returns false
  • Non-existent Alias - EnsureModelReadyAsync() throws clear exception
  • Empty Alias - GetIChatClient() with empty/null URL throws exception

14. Resource Management

  • FoundryClient Disposal - Dispose() called correctly (no resource leaks)
  • Semaphore Disposal - _prepareLock disposed correctly
  • SDK Singleton Management - Models managed by SDK singleton, no manual disposal

Reference

Foundry Local SDK reference

@weiyuanyue weiyuanyue closed this Dec 29, 2025