# [Fix][Refactor] Migrate FoundryLocal integration to official Microsoft.AI.Foundry.Local.WinML SDK #535
## Summary
This PR resolves a critical production-breaking bug that blocked all FoundryLocal model downloads in AI Dev Gallery after upgrading to Foundry Local v0.8.x+. We migrate from fragile custom HTTP API calls to the official Microsoft.AI.Foundry.Local.WinML SDK (v0.8.2.1), restoring full functionality and establishing a resilient foundation for future compatibility.

**Impact:** Users can now reliably download, prepare, and use FoundryLocal models. The system is significantly more robust against upstream API changes.
## Background & Root Cause
In July 2025, Foundry Local changed the internal format of a critical `Name` field in its HTTP API response. The change was handled internally by Foundry Local but never communicated externally, silently breaking AIDG's direct HTTP-based integration (tracked in ADO). After upgrading to Foundry Local v0.8.x, all FoundryLocal model downloads in AI Dev Gallery failed.

**Business Impact:** FoundryLocal model downloads were completely blocked, breaking a core scenario for AI Dev Gallery users.
## Solution: SDK Migration
To eliminate this entire class of failures and future-proof the integration, we migrate to the official SDK:
### Key Benefits

- Direct in-process calls with no HTTP/network overhead
- Explicit model lifecycle control
- Native streaming with no SSE parsing
- Access to model metadata
## Technical Changes

### 1. Architecture Migration: From Web Service to Native SDK
Migrated from an HTTP-based integration (launching an external service process and calling its OpenAI-compatible REST API) to direct SDK integration.

Benefits:

- No HTTP/SSE overhead for local calls
- Explicit control over when models load
- Access to model metadata (e.g., `MaxOutputTokens`)

A minimal sketch of the new startup path follows.
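The sketch below strings together only the calls named in the sequence diagrams later in this PR. The namespace, the config type (`FoundryLocalConfig`), and the exact method signatures are assumptions for illustration, not the SDK's verbatim API.

```csharp
// Minimal sketch of the new in-process startup path. No external service
// process is launched and no localhost URL is involved.
using System.Threading.Tasks;
using Microsoft.AI.Foundry.Local; // assumed namespace for the WinML SDK package

public static class FoundryStartupSketch
{
    public static async Task<IModel> PrepareAsync(string alias)
    {
        // Initialization (once, on first use).
        var manager = await FoundryLocalManager.CreateAsync(new FoundryLocalConfig()); // config type assumed
        ICatalog catalog = await manager.GetCatalogAsync();

        // Alias-based lookup; the SDK selects the concrete variant.
        IModel model = await catalog.GetModelAsync(alias);

        // Explicit preparation before any inference call.
        if (!await model.IsCachedAsync())
        {
            // Download path omitted here; see the download flow in the testing section.
        }
        await model.LoadAsync(); // load into GPU/NPU/CPU memory

        return model;
    }
}
```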
### 2. Explicit Model Lifecycle Management

Replaced implicit model loading (triggered by the first inference request) with an explicit two-phase lifecycle:
**Phase 1 - Preparation:**

- `EnsureModelReadyAsync()` in `FoundryLocalModelProvider` calls `EnsureModelLoadedAsync()` in `FoundryClient`
- Loaded models are tracked in the `_loadedModels` dictionary
- A `SemaphoreSlim` prevents duplicate loads

**Phase 2 - Usage:**

- `GetIChatClient()` retrieves the loaded model via `GetLoadedModel(alias)`
- Returns an `IChatClient` adapter wrapping the SDK's native `OpenAIChatClient`

This enables predictable loading behavior with progress indication and avoids inference-time delays (sketched below).
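A sketch of the two-phase pattern inside `FoundryClient`, assuming the member names referenced in this PR (`_loadLock`, `_loadedModels`, and the SDK's `ICatalog`/`IModel` types); illustrative, not the verbatim implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class FoundryClientLifecycleSketch
{
    private readonly SemaphoreSlim _loadLock = new(1, 1);
    private readonly Dictionary<string, IModel> _loadedModels = new();
    private readonly ICatalog _catalog;

    public FoundryClientLifecycleSketch(ICatalog catalog) => _catalog = catalog;

    // Phase 1: preparation. Safe to call concurrently; the semaphore plus the
    // dictionary check prevent duplicate loads of the same alias.
    public async Task EnsureModelLoadedAsync(string alias, CancellationToken ct = default)
    {
        await _loadLock.WaitAsync(ct);
        try
        {
            if (_loadedModels.ContainsKey(alias))
            {
                return; // already prepared
            }

            IModel model = await _catalog.GetModelAsync(alias); // SDK picks the variant
            await model.LoadAsync();                            // load into memory
            _loadedModels[alias] = model;
        }
        finally
        {
            _loadLock.Release();
        }
    }

    // Phase 2: usage. Callers that skipped preparation fail fast.
    public IModel GetLoadedModel(string alias) =>
        _loadedModels.TryGetValue(alias, out var model)
            ? model
            : throw new InvalidOperationException($"Model '{alias}' has not been prepared.");
}
```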
### 3. SDK Adapter for Microsoft.Extensions.AI

`FoundryLocalChatClientAdapter` bridges the SDK's `OpenAIChatClient` to our `IChatClient` abstraction:

- Maps `ChatOptions` parameters (Temperature, TopP, MaxOutputTokens, etc.) to the SDK's `ChatSettings`
- Handles `MaxTokens` configuration (required for output generation) with fallback defaults
- Streams responses as `IAsyncEnumerable<ChatResponseUpdate>` (see the sketch below)
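A sketch of the option-mapping step inside the adapter. `ChatOptions` comes from Microsoft.Extensions.AI; `ChatSettings` is the SDK-side type named above, and its property names here are assumptions. The 1024-token fallback matches the default noted in the testing checklist.

```csharp
using Microsoft.Extensions.AI;

internal static class ChatOptionsMappingSketch
{
    private const int DefaultMaxTokens = 1024; // fallback noted in the testing section

    public static ChatSettings ToChatSettings(ChatOptions? options, int? modelMaxOutputTokens) =>
        new ChatSettings
        {
            // MaxTokens is required for output generation, so a value is always
            // resolved: explicit option -> model metadata -> default.
            MaxTokens = options?.MaxOutputTokens ?? modelMaxOutputTokens ?? DefaultMaxTokens,
            Temperature = options?.Temperature,
            TopP = options?.TopP,
        };
}
```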
### 4. Stable Model Identity

Adopted the SDK's two-tier model identification:

- **Alias** (e.g., `"qwen2.5-0.5b"`): the stable identifier used in `fl://<alias>` URLs
- **Variant ID** (e.g., `"qwen2.5-0.5b-instruct-generic-cpu:3"`): the concrete identifier used for operations

`ICatalog.GetModelAsync(alias)` handles automatic variant selection. The `Task` field enables filtering by capability (`"chat-completion"`, `"automatic-speech-recognition"`). This approach insulates the app from variant naming changes.
### 5. Unified Cache Management

Integrated the SDK's cache (managed in `ModelCacheDir`) with the application's cache UI:

- `GetCachedModelsWithDetails()` queries the SDK cache via `ICatalog.GetCachedModelsAsync()`
- `DeleteCachedModelAsync()` unloads active models, removes them from the SDK cache, and cleans internal tracking
- `ClearAllCacheAsync()` deletes all Foundry Local models with proper state cleanup
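A sketch of the query side, built around the `ICatalog.GetCachedModelsAsync()` call named above; the return shape of that SDK call is an assumption.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public sealed class FoundryClientCacheSketch
{
    private readonly ICatalog _catalog;

    public FoundryClientCacheSketch(ICatalog catalog) => _catalog = catalog;

    public async Task<IReadOnlyList<IModel>> GetCachedModelsWithDetails()
    {
        // Query the SDK-managed cache (ModelCacheDir) instead of scanning the
        // disk, so the app's cache UI and the SDK never disagree about what
        // is actually present.
        IEnumerable<IModel> cached = await _catalog.GetCachedModelsAsync();
        return cached.ToList();
    }
}
```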
### 6. Telemetry

Added comprehensive telemetry in `FoundryLocalEvents.cs`, covering initialization, loading, downloads, deletions, and inference operations.
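A sketch of the download event's shape, using the field names listed in the testing checklist below; the field types and the logging plumbing are assumptions/omissions.

```csharp
using System;

// One of the events emitted from FoundryLocalEvents.cs (shape illustrative).
public sealed record FoundryLocalDownloadEvent(
    string ModelAlias,
    bool Success,
    string? ErrorMessage,
    double FileSizeMb,
    TimeSpan Duration);
```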
### 7. Package Dependencies

Added:

- `Microsoft.AI.Foundry.Local.WinML` 0.8.2.1

Upgraded:

- `Microsoft.ML.OnnxRuntimeGenAI.Managed` 0.10.1 → 0.11.4
- `Microsoft.ML.OnnxRuntimeGenAI.WinML` 0.10.1 → 0.11.4

Removed:
Build Configuration:

- Updated `nuget.config`, `ExcludeExtraLibs.props`, and `Directory.Packages.props` for proper packaging and conflict resolution (a sketch of the pinned versions follows)
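A sketch of the `Directory.Packages.props` entries implied by the lists above, assuming central package management; the versions match this PR, everything else is illustrative.

```xml
<ItemGroup>
  <PackageVersion Include="Microsoft.AI.Foundry.Local.WinML" Version="0.8.2.1" />
  <PackageVersion Include="Microsoft.ML.OnnxRuntimeGenAI.Managed" Version="0.11.4" />
  <PackageVersion Include="Microsoft.ML.OnnxRuntimeGenAI.WinML" Version="0.11.4" />
</ItemGroup>
```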
### 8. Removed Components

- Web service manager (`FoundryServiceManager.cs`)

Eliminated failure modes: process crashes, HTTP timeouts, connection issues, SSE parsing errors.
## 📊 Execution Flow Comparison

### Before: Web Service Architecture
```mermaid
sequenceDiagram
  participant App as Application
  participant Provider as FoundryLocalModelProvider
  participant ServiceMgr as FoundryServiceManager
  participant HTTP as HttpClient (OpenAI SDK)
  participant Service as Foundry Local Web Service
  participant Model as Model Inference
  rect rgb(240, 240, 240)
    Note over App,ServiceMgr: Initialization (once at startup)
    Provider->>ServiceMgr: StartService()
    ServiceMgr->>Service: Launch Web Service Process
    Service-->>ServiceMgr: Service URL (http://localhost:xxxx)
    Provider->>Provider: Store service URL
  end
  rect rgb(250, 250, 240)
    Note over App,HTTP: Get Chat Client (no model loading yet)
    App->>Provider: GetIChatClient(url)
    Provider->>HTTP: new OpenAIClient(serviceUrl)
    HTTP-->>Provider: OpenAI IChatClient wrapper
    Provider-->>App: IChatClient (pointing to service)
  end
  rect rgb(240, 250, 240)
    Note over App,Model: Inference (model loaded on first request)
    App->>HTTP: GetStreamingResponseAsync(messages)
    HTTP->>Service: POST /v1/chat/completions
    Service->>Model: Load model (if not loaded) + Inference
    Model-->>Service: Response Stream (SSE format)
    Service-->>HTTP: text/event-stream chunks
    HTTP->>HTTP: Parse SSE: "data: {...}\n\n"
    HTTP-->>App: ChatResponseUpdate chunks
  end
  Note over Provider,Service: Issues:<br/>- HTTP/SSE overhead for local calls<br/>- Implicit model loading (no control)<br/>- Service process management<br/>- SSE parsing complexity
```

### After: Native SDK Architecture
```mermaid
sequenceDiagram
  participant App as Application
  participant Provider as FoundryLocalModelProvider
  participant Client as FoundryClient
  participant Manager as FoundryLocalManager (SDK)
  participant Catalog as ICatalog
  participant Model as IModel
  participant ChatClient as OpenAIChatClient (SDK)
  participant Adapter as FoundryLocalChatClientAdapter
  rect rgb(240, 240, 240)
    Note over App,Manager: Initialization (on first use)
    App->>Provider: GetModelsAsync() / IsAvailable()
    Provider->>Provider: InitializeAsync()
    Provider->>Client: FoundryClient.CreateAsync()
    Client->>Manager: FoundryLocalManager.CreateAsync(config)
    Manager-->>Client: Instance + IsInitialized
    Client->>Manager: GetCatalogAsync()
    Manager-->>Client: ICatalog
    Client-->>Provider: FoundryClient instance
  end
  rect rgb(250, 240, 240)
    Note over App,Model: Model Preparation (explicit, before inference)
    App->>Provider: EnsureModelReadyAsync(url)
    Provider->>Provider: Check _foundryManager.GetLoadedModel(alias)
    Provider->>Client: EnsureModelLoadedAsync(alias)
    Client->>Client: Acquire _loadLock, check _loadedModels
    Client->>Catalog: GetModelAsync(alias)
    Catalog-->>Client: IModel (variant selected)
    Client->>Model: IsCachedAsync() / LoadAsync()
    Note over Model: Load into GPU/NPU/CPU memory
    Model-->>Client: Loaded
    Client->>Client: Cache in _loadedModels[alias]
    Client-->>Provider: Completed
    Provider-->>App: Ready
  end
  rect rgb(240, 250, 240)
    Note over App,Adapter: Get Chat Client (model must be loaded)
    App->>Provider: GetIChatClient(url)
    Provider->>Client: GetLoadedModel(alias)
    Client-->>Provider: IModel (from _loadedModels)
    Provider->>Model: GetChatClientAsync()
    Model-->>Provider: SDK's OpenAIChatClient
    Provider->>Provider: GetModelMaxOutputTokens(alias)
    Provider->>Adapter: new FoundryLocalChatClientAdapter(chatClient, modelId, maxTokens)
    Adapter-->>Provider: IChatClient wrapper
    Provider-->>App: IChatClient ready for inference
  end
  rect rgb(240, 240, 250)
    Note over App,Model: Inference (direct in-process calls)
    App->>Adapter: GetStreamingResponseAsync(messages, options)
    Adapter->>Adapter: ApplyChatOptions(options) → Set MaxTokens, Temperature, etc.
    Adapter->>Adapter: ConvertToOpenAIMessages(messages)
    Adapter->>ChatClient: CompleteChatStreamingAsync(openAIMessages)
    ChatClient->>Model: Direct API call (in-process)
    Model-->>ChatClient: Streaming chunks
    loop For each chunk
      ChatClient-->>Adapter: StreamingResponse chunk
      Adapter->>Adapter: Extract content, create ChatResponseUpdate
      Adapter-->>App: Yield ChatResponseUpdate
    end
    Adapter->>Adapter: Validate chunkCount > 0
  end
  Note over Provider,Model: Benefits:<br/>- No HTTP/network overhead<br/>- Explicit model lifecycle control<br/>- Direct in-process calls<br/>- Native streaming (no parsing)<br/>- Access to model metadata
```

### Key Flow Differences
- Models must now be prepared explicitly via `EnsureModelReadyAsync()` before use
- Loaded models are tracked in the `_loadedModels` dictionary

## Known Limitations & Follow-up Work
### IDisposableAnalyzers Build Warnings

- `Microsoft.AI.Foundry.Local.WinML` (v0.8.2.1) includes `IDisposableAnalyzers` (v4.0.8) as a transitive dependency
- The analyzer flags `IDisposable` pattern usage across the solution
- Warnings are handled via `Directory.Build.props` and project files (one possible shape is sketched below)
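A sketch of one way to quiet transitive analyzer noise in `Directory.Build.props`. `IDISP001`/`IDISP002` are representative IDisposableAnalyzers rule IDs; the exact set handled in this PR is not reproduced here.

```xml
<PropertyGroup>
  <NoWarn>$(NoWarn);IDISP001;IDISP002</NoWarn>
</PropertyGroup>
```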
### Code/Project Generation Features - Deferred to Future PR

Focus on Core Integration: The priority is establishing stable FoundryLocal SDK integration, which requires careful design for complex scenarios like chat clients, model lifecycle, and error handling. Code generation can be added once the foundation is solid.
Project Generator Redesign: Our existing project generator requires refactoring to support multiple AI sources (GitHub Models, Azure OpenAI, FoundryLocal). Rather than retrofit FoundryLocal into the current system, we'll redesign the architecture to cleanly support all providers in a follow-up PR.
### FoundryLocal SDK Bug: cancel button doesn't stop model downloads
When users click "Cancel" during large model downloads (multi-GB files), the UI shows "canceled" but the download continues to 100% completion in the background. This affects AI Dev Gallery users, who cannot stop accidental downloads; they perceive the cancel functionality as broken, creating confusion and a negative experience. The SDK ignores standard .NET cancellation patterns during download operations (issue link).
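A sketch of the caller-side pattern the SDK currently ignores. `DownloadModel` is the app-level method named in the testing section below; its signature here, along with the `IModelProvider`/`ModelDetails` types, are assumptions for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

internal static class CancelReproSketch
{
    public static async Task ReproduceAsync(IModelProvider provider, ModelDetails model)
    {
        using var cts = new CancellationTokenSource();

        // Start a multi-GB download with a cancellation token attached.
        Task<bool> download = provider.DownloadModel(model, cts.Token);

        cts.Cancel(); // UI flips to "canceled"...

        // ...but the SDK ignores the token: the task completes successfully
        // instead of throwing OperationCanceledException, and the file lands
        // on disk anyway.
        bool finished = await download;
        Console.WriteLine($"Download finished despite cancel: {finished}");
    }
}
```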
## Functional Testing
### 1. Initialization & Service Availability

- `IsAvailable()` returns `false` when Foundry Local is not installed/initialized

### 2. Model Catalog & Discovery
- `GetAllModelsInCatalog()` returns the full catalog, including non-downloaded models
- `DisplayName`, `Alias`, `FileSizeMb` display correctly

### 3. Model Download Flow

- `CancellationToken` correctly cancels downloads (blocked by the SDK issue above)
- `FoundryLocalDownloadEvent` includes `ModelAlias`, `Success`, `ErrorMessage`, `FileSizeMb`, `Duration`
- Download failures are logged at `LogLevel.Critical`

### 4. Model Preparation & Loading
- `EnsureModelReadyAsync()` successfully prepares a model
- A duplicate `EnsureModelReadyAsync()` call does not reload the model
- `GetPreparedModel()` returns `null` for unprepared models
- `GetPreparedModel()` returns a valid `IModel` for prepared models
- `LoadAsync` completes successfully on x64
- `LoadAsync` completes successfully on ARM64
- `GetIChatClient` without preparation throws an exception
- Models cleared via `ClearPreparedModelsAsync` require re-preparation

### 5. Chat Inference & Streaming

- `GetIChatClient()` throws a clear exception when the model is not prepared
- `MaxTokens` parameter correctly limits output length
- Uses the model's `MaxOutputTokens`, or the default of 1024 when not set
- `ChatOptions.Temperature` is correctly applied
- An empty response stream throws `InvalidOperationException` and logs telemetry

### 6. Model Cache Management 🆕
- `GetAllModelsAsync()` returns both CacheStore and Foundry Local models
- `DeleteCachedModelAsync()` correctly deletes a specific model
- `_downloadedModels` is correctly updated after deletion
- `ClearAllCacheAsync()` deletes all Foundry Local models

### 7. Code Generation 🔜 TODO: Next PR

- `GetIChatClientString()` returns valid, compilable C# code
- Generated code includes the correct SDK initialization pattern
- Generated code uses the correct `Alias` (not the deprecated `Name`)
- Generated code includes the necessary `using` statements
- Generated code runs directly

## Regression Testing
### 8. Project Generation 🔜 TODO: Next PR

- Foundry Local models generate correct code samples
- Generated code references the correct NuGet packages (`Microsoft.AI.Foundry.Local.WinML`, `Microsoft.Extensions.AI`)
- Generated projects compile without errors
- The `NugetPackageReferences` property returns the correct package list

### 9. Multi-Provider Compatibility
### 10. UI/UX Flows

- `GetShortExecutionProvider()` displays correctly

### 11. Cross-Sample Testing
Test Foundry Local models in the following samples:
### 12. Platform-Specific Validation
### 13. Invalid Inputs

- `GetIChatClient()` throws a clear exception
- `DownloadModel()` with a non-`FoundryCatalogModel` returns `false`
- `EnsureModelReadyAsync()` throws a clear exception
- `GetIChatClient()` with an empty/null URL throws an exception

### 14. Resource Management
- `Dispose()` is called correctly (no resource leaks)
- `_prepareLock` is disposed correctly

## Reference
Foundry Local SDK reference