@weiyuanyue weiyuanyue commented Dec 15, 2025

Summary

This PR resolves a critical production-breaking bug that blocked all FoundryLocal model downloads in AI Dev Gallery after upgrading to Foundry Local v0.8.x+. We migrate from fragile custom HTTP API calls to the official Microsoft.AI.Foundry.Local.WinML SDK (v0.8.2.1), restoring full functionality and establishing a resilient foundation for future compatibility.

Impact: Users can now reliably download, prepare, and use FoundryLocal models. The system is significantly more robust against upstream API changes.


Background & Root Cause

In July 2025, Foundry Local changed the internal format of a critical Name field in its HTTP API response. While Foundry Local handled this change internally, it was never communicated externally, silently breaking AIDG's direct HTTP-based integration. (ADO)

After upgrading to Foundry Local v0.8.x:

  • All model download requests failed silently due to field format mismatch
  • Downstream workflows (model preparation → chat/inference) were completely blocked
  • Users could not download or use any FoundryLocal models

Business Impact:

  • Disrupted critical developer workflows
  • Increased support burden and user frustration
  • Eroded trust in AIDG's stability and reliability

Solution: SDK Migration

To eliminate this entire class of failures and future-proof the integration, we migrate to the official SDK:

Key Benefits

  1. The native SDK is fully self-contained and requires neither the Foundry Local CLI nor any external services.
  2. The native SDK path replaces the HTTP service, eliminating cross-process overhead.
  3. Explicit model preparation removes cold-start latency and ensures deterministic loading.
  4. Unified cache management exposes all models and enables precise storage control.
  5. Typed native streaming replaces SSE parsing for higher reliability and lower latency.
  6. Alias-based model identity standardizes lookup and eliminates naming ambiguity.
  7. Removing the web service layer reduces failure surfaces and simplifies maintenance.

Screenshots

  1. Foundry Local running normally with no UX change.
  2. Systems that do not support Foundry Local, showing the new in-app message presented to users. (Previously, the UI incorrectly instructed users to install Foundry Local.)
  3. The updated Settings page, which now includes a management interface for Foundry Local's cached models.
  4. Warning notification shown when a downloaded model fails to load during EnsureModelLoadedAsync validation.

Technical Changes

1. Architecture Migration: From Web Service to Native SDK

Migrated from HTTP-based integration (launching external service process + OpenAI-compatible REST API) to direct SDK integration:

Before: Application → HTTP Client → Foundry Local Service → Model
After:  Application → FoundryClient → FoundryLocalManager (SDK) → IModel → Model

Benefits:

  • In-process execution eliminates HTTP/SSE overhead
  • Direct access to model metadata (e.g., MaxOutputTokens)
  • Native async streaming without parsing
  • No external process lifecycle management
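
The initialization path can be sketched roughly as follows. FoundryLocalManager.CreateAsync and GetCatalogAsync are the SDK entry points named in this PR; the FoundryClient constructor shape and the Configuration contents are assumptions for illustration, not the shipped code:

```csharp
// Illustrative sketch only: FoundryClient wraps the in-process SDK singleton,
// replacing the old external web-service process.
public sealed class FoundryClient
{
    private readonly FoundryLocalManager _manager;
    private readonly ICatalog _catalog;

    private FoundryClient(FoundryLocalManager manager, ICatalog catalog)
    {
        _manager = manager;
        _catalog = catalog;
    }

    public static async Task<FoundryClient> CreateAsync()
    {
        // SDK calls as named in this PR; Configuration contents are assumed.
        var manager = await FoundryLocalManager.CreateAsync(new Configuration());
        var catalog = await manager.GetCatalogAsync();
        return new FoundryClient(manager, catalog);
    }
}
```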

2. Explicit Model Lifecycle Management

Replaced implicit model loading (triggered by first inference request) with explicit two-phase lifecycle:

Phase 1 - Preparation:

  • EnsureModelReadyAsync() in FoundryLocalModelProvider calls EnsureModelLoadedAsync() in FoundryClient
  • SDK loads model into memory (GPU/NPU/CPU) and caches in _loadedModels dictionary
  • Thread-safe loading via SemaphoreSlim prevents duplicate loads

Phase 2 - Usage:

  • GetIChatClient() retrieves loaded model via GetLoadedModel(alias)
  • Returns IChatClient adapter wrapping the SDK's native OpenAIChatClient

This enables predictable loading behavior with progress indication and avoids inference-time delays.
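
The preparation phase can be sketched as follows. GetModelAsync, IsCachedAsync, and LoadAsync are SDK calls named in this PR, and _loadedModels and the SemaphoreSlim lock are described above; the exact control flow and error text are illustrative assumptions:

```csharp
// Sketch of the thread-safe, idempotent load path (not the exact shipped code).
private readonly SemaphoreSlim _loadLock = new(1, 1);
private readonly Dictionary<string, IModel> _loadedModels = new();

public async Task EnsureModelLoadedAsync(string alias, CancellationToken ct = default)
{
    // Fast path: already loaded, skip the lock.
    if (_loadedModels.ContainsKey(alias))
    {
        return;
    }

    await _loadLock.WaitAsync(ct);
    try
    {
        // Double-check inside the lock so concurrent callers don't load twice.
        if (_loadedModels.ContainsKey(alias))
        {
            return;
        }

        IModel model = await _catalog.GetModelAsync(alias); // variant selection is automatic
        if (!await model.IsCachedAsync())
        {
            throw new InvalidOperationException($"Model '{alias}' is not downloaded.");
        }

        await model.LoadAsync(); // loads into GPU/NPU/CPU memory
        _loadedModels[alias] = model;
    }
    finally
    {
        _loadLock.Release();
    }
}
```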


3. SDK Adapter for Microsoft.Extensions.AI

FoundryLocalChatClientAdapter bridges the SDK's OpenAIChatClient to our IChatClient abstraction:

  • Maps ChatOptions parameters (Temperature, TopP, MaxOutputTokens, etc.) to SDK's ChatSettings
  • Enforces MaxTokens configuration (required for output generation) with fallback defaults
  • Converts streaming responses to IAsyncEnumerable<ChatResponseUpdate>
  • Validates non-empty output and provides actionable error messages
  • Transforms message formats between abstractions (text-only; multi-modal not yet supported)
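
The adapter's streaming path has roughly this shape. ApplyChatOptions, ConvertToOpenAIMessages, and the non-empty-output check are described in this PR; ExtractText and the exact ChatResponseUpdate construction are simplified assumptions (usings omitted for brevity):

```csharp
// Illustrative sketch of the adapter's streaming bridge (not the exact shipped code).
public async IAsyncEnumerable<ChatResponseUpdate> GetStreamingResponseAsync(
    IEnumerable<ChatMessage> messages,
    ChatOptions? options = null,
    [EnumeratorCancellation] CancellationToken ct = default)
{
    ApplyChatOptions(options);               // maps Temperature, TopP, MaxOutputTokens to ChatSettings
    var openAiMessages = ConvertToOpenAIMessages(messages); // text-only today

    int chunkCount = 0;
    await foreach (var chunk in _chatClient.CompleteChatStreamingAsync(openAiMessages).WithCancellation(ct))
    {
        chunkCount++;
        yield return new ChatResponseUpdate(ChatRole.Assistant, ExtractText(chunk));
    }

    // Empty output is treated as a failure with an actionable message.
    if (chunkCount == 0)
    {
        throw new InvalidOperationException(
            "Model produced no output; verify MaxTokens is set and the model is loaded.");
    }
}
```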

4. Stable Model Identity

Adopted SDK's two-tier model identification:

  • Alias - Stable family identifier (e.g., "qwen2.5-0.5b") used in fl://<alias> URLs
  • ModelId - Precise variant identifier (e.g., "qwen2.5-0.5b-instruct-generic-cpu:3") for operations

ICatalog.GetModelAsync(alias) handles automatic variant selection. The Task field enables filtering by capability ("chat-completion", "automatic-speech-recognition"). This approach insulates the app from variant naming changes.
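
Resolving the alias out of an fl:// URL is a small, self-contained step; a minimal sketch (helper name and error text are illustrative, not the shipped code):

```csharp
// Sketch: extract the stable alias from an fl://<alias> URL.
public static string GetAliasFromUrl(string url)
{
    const string Scheme = "fl://";
    if (string.IsNullOrEmpty(url) || !url.StartsWith(Scheme, StringComparison.OrdinalIgnoreCase))
    {
        throw new ArgumentException($"Expected an fl:// URL, got '{url}'.");
    }

    return url.Substring(Scheme.Length).TrimEnd('/');
}
```

For example, "fl://qwen2.5-0.5b" yields the alias "qwen2.5-0.5b", which ICatalog.GetModelAsync(alias) then resolves to a concrete variant such as "qwen2.5-0.5b-instruct-generic-cpu:3".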


5. Unified Cache Management

Integrated SDK's cache (managed in ModelCacheDir) with application's cache UI:

  • GetCachedModelsWithDetails() queries SDK cache via ICatalog.GetCachedModelsAsync()
  • DeleteCachedModelAsync() unloads active models, removes from SDK cache, and cleans internal tracking
  • ClearAllCacheAsync() deletes all Foundry Local models with proper state cleanup
  • Settings UI now displays all cached models with unified management
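
The delete flow described above can be sketched as follows. The unload-before-delete ordering and the _downloadedModels cleanup come from this PR; the SDK call names for unloading and cache deletion are assumptions for illustration:

```csharp
// Illustrative sketch of single-model deletion (exact SDK call names assumed).
public async Task<bool> DeleteCachedModelAsync(string alias)
{
    // 1. Unload first so the SDK releases device memory and file handles.
    if (_loadedModels.TryGetValue(alias, out var model))
    {
        await model.UnloadAsync();           // assumed SDK call
        _loadedModels.Remove(alias);
    }

    // 2. Remove the files from the SDK-managed cache (ModelCacheDir).
    await _catalog.DeleteCachedModelAsync(alias); // assumed SDK call

    // 3. Keep application-side tracking consistent with the cache.
    _downloadedModels.Remove(alias);
    return true;
}
```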

6. Telemetry

Added comprehensive telemetry in FoundryLocalEvents.cs:

  • FoundryLocalOperationEvent - Success tracking with duration metrics (Info level)
  • FoundryLocalDownloadEvent - Download success/failure, size, duration (Info/Critical)
  • FoundryLocalErrorEvent - Operation failures with phase and error details (Critical)

Covers initialization, loading, downloads, deletions, and inference operations.


7. Package Dependencies

Added:

  • Microsoft.AI.Foundry.Local.WinML 0.8.2.1

Upgraded:

  • Microsoft.ML.OnnxRuntimeGenAI.Managed 0.10.1 → 0.11.4
  • Microsoft.ML.OnnxRuntimeGenAI.WinML 0.10.1 → 0.11.4

Removed:

  • HTTP-based OpenAI client dependencies
  • Web service management utilities

Build Configuration:

  • Updated nuget.config, ExcludeExtraLibs.props, and Directory.Packages.props for proper packaging and conflict resolution

8. Removed Components

  • Web service process management (FoundryServiceManager.cs)
  • HTTP client and SSE parsing logic
  • JSON serialization/deserialization for HTTP payloads
  • Localhost service discovery and port management

Eliminated failure modes: process crashes, HTTP timeouts, connection issues, SSE parsing errors.


📊 Execution Flow Comparison

Before: Web Service Architecture

sequenceDiagram
    participant App as Application
    participant Provider as FoundryLocalModelProvider
    participant ServiceMgr as FoundryServiceManager
    participant HTTP as HttpClient (OpenAI SDK)
    participant Service as Foundry Local Web Service
    participant Model as Model Inference

    rect rgb(240, 240, 240)
        Note over App,ServiceMgr: Initialization (once at startup)
        Provider->>ServiceMgr: StartService()
        ServiceMgr->>Service: Launch Web Service Process
        Service-->>ServiceMgr: Service URL (http://localhost:xxxx)
        Provider->>Provider: Store service URL
    end

    rect rgb(250, 250, 240)
        Note over App,HTTP: Get Chat Client (no model loading yet)
        App->>Provider: GetIChatClient(url)
        Provider->>HTTP: new OpenAIClient(serviceUrl)
        HTTP-->>Provider: OpenAI IChatClient wrapper
        Provider-->>App: IChatClient (pointing to service)
    end

    rect rgb(240, 250, 240)
        Note over App,Model: Inference (model loaded on first request)
        App->>HTTP: GetStreamingResponseAsync(messages)
        HTTP->>Service: POST /v1/chat/completions
        Service->>Model: Load model (if not loaded) + Inference
        Model-->>Service: Response Stream (SSE format)
        Service-->>HTTP: text/event-stream chunks
        HTTP->>HTTP: Parse SSE: "data: {...}\n\n"
        HTTP-->>App: ChatResponseUpdate chunks
    end

    Note over Provider,Service: Issues:<br/>- HTTP/SSE overhead for local calls<br/>- Implicit model loading (no control)<br/>- Service process management<br/>- SSE parsing complexity

After: Native SDK Architecture

sequenceDiagram
    participant App as Application
    participant Provider as FoundryLocalModelProvider
    participant Client as FoundryClient
    participant Manager as FoundryLocalManager (SDK)
    participant Catalog as ICatalog
    participant Model as IModel
    participant ChatClient as OpenAIChatClient (SDK)
    participant Adapter as FoundryLocalChatClientAdapter

    rect rgb(240, 240, 240)
        Note over App,Manager: Initialization (on first use)
        App->>Provider: GetModelsAsync() / IsAvailable()
        Provider->>Provider: InitializeAsync()
        Provider->>Client: FoundryClient.CreateAsync()
        Client->>Manager: FoundryLocalManager.CreateAsync(config)
        Manager-->>Client: Instance + IsInitialized
        Client->>Manager: GetCatalogAsync()
        Manager-->>Client: ICatalog
        Client-->>Provider: FoundryClient instance
    end

    rect rgb(250, 240, 240)
        Note over App,Model: Model Preparation (explicit, before inference)
        App->>Provider: EnsureModelReadyAsync(url)
        Provider->>Provider: Check _foundryManager.GetLoadedModel(alias)
        Provider->>Client: EnsureModelLoadedAsync(alias)
        Client->>Client: Acquire _loadLock, check _loadedModels
        Client->>Catalog: GetModelAsync(alias)
        Catalog-->>Client: IModel (variant selected)
        Client->>Model: IsCachedAsync() / LoadAsync()
        Note over Model: Load into GPU/NPU/CPU memory
        Model-->>Client: Loaded
        Client->>Client: Cache in _loadedModels[alias]
        Client-->>Provider: Completed
        Provider-->>App: Ready
    end

    rect rgb(240, 250, 240)
        Note over App,Adapter: Get Chat Client (model must be loaded)
        App->>Provider: GetIChatClient(url)
        Provider->>Client: GetLoadedModel(alias)
        Client-->>Provider: IModel (from _loadedModels)
        Provider->>Model: GetChatClientAsync()
        Model-->>Provider: SDK's OpenAIChatClient
        Provider->>Provider: GetModelMaxOutputTokens(alias)
        Provider->>Adapter: new FoundryLocalChatClientAdapter(chatClient, modelId, maxTokens)
        Adapter-->>Provider: IChatClient wrapper
        Provider-->>App: IChatClient ready for inference
    end

    rect rgb(240, 240, 250)
        Note over App,Model: Inference (direct in-process calls)
        App->>Adapter: GetStreamingResponseAsync(messages, options)
        Adapter->>Adapter: ApplyChatOptions(options) → Set MaxTokens, Temperature, etc.
        Adapter->>Adapter: ConvertToOpenAIMessages(messages)
        Adapter->>ChatClient: CompleteChatStreamingAsync(openAIMessages)
        ChatClient->>Model: Direct API call (in-process)
        Model-->>ChatClient: Streaming chunks
        loop For each chunk
            ChatClient-->>Adapter: StreamingResponse chunk
            Adapter->>Adapter: Extract content, create ChatResponseUpdate
            Adapter-->>App: Yield ChatResponseUpdate
        end
        Adapter->>Adapter: Validate chunkCount > 0
    end

    Note over Provider,Model: Benefits:<br/>- No HTTP/network overhead<br/>- Explicit model lifecycle control<br/>- Direct in-process calls<br/>- Native streaming (no parsing)<br/>- Access to model metadata

Key Flow Differences

| Aspect | Before (Web Service) | After (Native SDK) |
| --- | --- | --- |
| Initialization | Service process start → get service URL | Lazy SDK initialization → get catalog (on first use) |
| Model loading | Implicit (on first HTTP request to the service) | Explicit via EnsureModelReadyAsync() before use |
| Chat client creation | OpenAI HTTP client wrapper (lightweight) | Native SDK chat client from a loaded IModel |
| Communication | HTTP POST to localhost service | Direct in-process method calls |
| Streaming | SSE text parsing ("data: {...}\n\n") | Native async enumerable (IAsyncEnumerable) |
| Error handling | HTTP status codes (400, 500, etc.) | Typed exceptions (InvalidOperationException, etc.) |
| Resource management | External service process + HTTP connections | SDK singleton + IModel instances in memory |
| Model state | Unknown to application (service internal) | Explicit tracking in _loadedModels dictionary |

Known Limitations & Follow-up Work

IDisposableAnalyzers Build Warnings

  • Root Cause: Microsoft.AI.Foundry.Local.WinML (v0.8.2.1) includes IDisposableAnalyzers (v4.0.8) as a transitive dependency
  • Impact: 237+ analyzer violations across the codebase related to improper IDisposable pattern usage
  • Temporary Solution: Suppressed IDISP001, IDISP003, IDISP017 in Directory.Build.props and project files
  • Planned Remediation: Dedicated follow-up PR to address all violations project-wide

Code/Project Generation Features - Deferred to Future PR

  1. Focus on Core Integration: The priority is establishing stable FoundryLocal SDK integration, which requires careful design for complex scenarios like chat clients, model lifecycle, and error handling. Code generation can be added once the foundation is solid.

  2. Project Generator Redesign: Our existing project generator requires refactoring to support multiple AI sources (GitHub Models, Azure OpenAI, FoundryLocal). Rather than retrofit FoundryLocal into the current system, we'll redesign the architecture to cleanly support all providers in a follow-up PR.

FoundryLocal SDK Bug: cancel button doesn't stop model downloads

When users click "Cancel" during a large model download (multi-GB files), the UI shows "canceled" but the download continues to 100% completion in the background. AI Dev Gallery users therefore cannot stop accidental downloads, perceive the cancel functionality as broken, and are left with a confusing, negative experience. The SDK ignores standard .NET cancellation patterns (CancellationToken) during download operations. issue link


Functional Testing

1. Initialization & Service Availability

  • Application Startup - Foundry Local client auto-initializes
  • Connection Status Detection - Verify Foundry Local availability in model picker
  • IsAvailable() Returns False - When Foundry Local is not installed/initialized
  • SDK Initialization Failure - User messaging when service fails to start
  • Error Recovery - Error messages and retry mechanism after initialization failure

2. Model Catalog & Discovery

  • List Complete Catalog - GetAllModelsInCatalog() returns full catalog including non-downloaded models
  • Model Information Integrity - DisplayName, Alias, FileSizeMb display correctly
  • Cached Status Identification - Downloaded models correctly marked
  • Runtime Info Filtering - Models with null Runtime are correctly filtered
  • ExecutionProvider Display - Correctly shows "CPU", "WebGPU", "QNN", etc.
  • File Size Display - Model file sizes correctly displayed (MB)
  • License Information - License info correctly displayed

3. Model Download Flow

  • Basic Download - Download uncached models from catalog
  • Progress Reporting - Progress callback fires correctly (0% → 100%)
  • Progress Update Frequency - Progress callback fires at reasonable intervals
  • [-] Download Cancellation - CancellationToken correctly cancels download (SDK issue)
  • Skip Re-download - Already-downloaded models return success immediately
  • Download Failure Telemetry - Failure scenarios log telemetry with error messages
  • Concurrent Download Handling - Multiple concurrent downloads without race conditions
  • Network Interruption Handling - Retries x times, then allows clean restart after reconnection
  • Disk Space Exhaustion - Handling when disk is full during large model download
  • Auto-prepare After Download - Model is automatically prepared (loaded) after download completes
  • Download Telemetry Event - FoundryLocalDownloadEvent includes ModelAlias, Success, ErrorMessage, FileSizeMb, Duration
  • Failure Log Level - Download failures logged with LogLevel.Critical

4. Model Preparation & Loading

  • First-time Preparation - EnsureModelReadyAsync() successfully prepares model
  • No Deadlock on UI Thread - Preparing model on UI thread context doesn't deadlock
  • Idempotency Check - Already-prepared models skip re-preparation on EnsureModelReadyAsync() call
  • Concurrent Preparation Safety - Multiple concurrent calls for same model handled safely via semaphore lock
  • Unprepared Model Query - GetPreparedModel() returns null for unprepared models
  • Prepared Model Query - GetPreparedModel() returns valid IModel for prepared models
  • x64 Model Loading - LoadAsync completes successfully on x64 platform
  • ARM64 Model Loading - LoadAsync completes successfully on ARM64 platform
  • Error on Unprepared Use - Calling GetIChatClient without preparation throws exception
  • Use After Unload - Using after ClearPreparedModelsAsync requires re-preparation

5. Chat Inference & Streaming

  • Unprepared Exception - GetIChatClient() throws clear exception when model not prepared
  • End-to-End Streaming Inference - Chat streaming correctly returns response chunks
  • MaxOutputTokens Limit - MaxTokens parameter correctly limits output length
  • Default MaxTokens - Uses model's MaxOutputTokens or default 1024 when not set
  • Temperature Parameter - ChatOptions.Temperature correctly applied
  • Model Stop Conditions - Streaming handles model-generated stop conditions gracefully
  • Empty Output Detection - Model with no output throws InvalidOperationException and logs telemetry
  • Empty/Null Message Input - Empty or null message inputs throw appropriate errors
  • System Prompts - System Message correctly passed
  • Multi-turn Conversation - Consecutive messages sent successfully
  • Long Conversation History - Long conversation history processed without truncation

6. Model Cache Management 🆕

  • Settings Page Display - List all Foundry Local cached models in Settings > Storage
  • Integrated List - GetAllModelsAsync() returns both CacheStore and Foundry Local models
  • Cache Size Calculation - Correctly displays Foundry Local model storage usage
  • Total Size Aggregation - Total cache size includes Foundry Local models
  • Delete Single Model - DeleteCachedModelAsync() correctly deletes specific model
  • Unload Before Delete - Loaded models are unloaded before deletion
  • Update List After Delete - _downloadedModels correctly updated after deletion
  • Clear All Cache - ClearAllCacheAsync() deletes all Foundry Local models
  • Clear Telemetry - Clear operation logs deletion count
  • Disable Folder Button - "Open Folder" is disabled for Foundry Local models (storage is managed by the SDK)
  • Delete Error Handling - Delete failures log telemetry events

7. Code Generation 🔜 TODO: Next PR

  • [-] GetIChatClientString() returns valid compilable C# code
  • [-] Generated code includes correct SDK initialization pattern
  • [-] Generated code uses correct Alias (not deprecated Name)
  • [-] Generated code includes necessary using statements
  • [-] Generated code runs directly

Regression Testing

8. Project Generation 🔜 TODO: Next PR

  • [-] Foundry Local models generate correct code samples
  • [-] Reference correct NuGet packages (Microsoft.AI.Foundry.Local.WinML, Microsoft.Extensions.AI)
  • [-] Generated projects compile without errors
  • [-] NugetPackageReferences property returns correct package list

9. Multi-Provider Compatibility

  • Other Providers Unaffected - OpenAI, Ollama, GitHub Models, HuggingFace work normally
  • Provider Switching - Switching between Foundry Local and other providers works seamlessly
  • Model Picker UI - All provider types correctly displayed

10. UI/UX Flows

  • Foundry Local Picker View - Displays correctly
  • Download Progress UI - Smooth updates (no flickering or freezing)
  • Preparation Status Indication - User shown status when model not ready
  • Error Message Display - Download/preparation failures shown to user
  • Model Details Rendering - Size, License, Description correctly rendered
  • ExecutionProvider Short Labels - GetShortExecutionProvider() displays correctly

11. Cross-Sample Testing

Test Foundry Local models in the following Samples:

  • Generate (Language Models) - Basic text generation (with MaxOutputTokens setting)
  • Chat - Multi-turn conversation
  • RAG - Retrieval Augmented Generation
  • Other IChatClient Samples - All samples using chat client

12. Platform-Specific Validation

  • Windows x64
  • Windows ARM64 QNN
  • Intel NPU
  • No APPX1101 Errors - No duplicate DLL errors on either platform

⚠️ Edge Cases & Error Handling

13. Invalid Inputs

  • Invalid URL Format - GetIChatClient() throws clear exception
  • Wrong Model Type - DownloadModel() with non-FoundryCatalogModel returns false
  • Non-existent Alias - EnsureModelReadyAsync() throws clear exception
  • Empty Alias - GetIChatClient() with empty/null URL throws exception

14. Resource Management

  • FoundryClient Disposal - Dispose() called correctly (no resource leaks)
  • Semaphore Disposal - _prepareLock disposed correctly
  • SDK Singleton Management - Models managed by SDK singleton, no manual disposal

Reference

Foundry Local SDK reference

@weiyuanyue weiyuanyue closed this Dec 29, 2025