Conversation

Copilot AI commented Dec 5, 2025

Plan: Fix Binary Data Representation in Protocol Types

Based on the issue analysis, I need to change how binary data is represented in the protocol types to avoid unnecessary UTF-16 transcoding. The current implementation uses string properties for base64-encoded data, which causes UTF-8 → UTF-16 → base64 decode conversions.

Changes Required:

  • Update BlobResourceContents.Blob property

    • Changed from string to ReadOnlyMemory<byte> for zero-copy UTF-8 representation
    • Added lazy Data property for decoded binary data with caching
    • Updated custom JSON converter in ResourceContents.Converter
    • Added cache invalidation when Blob is set
    • Optimized with MemoryMarshal.TryGetArray to avoid ToArray() in common cases
    • Use Base64.DecodeFromUtf8 to decode directly from UTF-8 bytes without string intermediate
  • Update ImageContentBlock.Data property

    • Changed from string to ReadOnlyMemory<byte>
    • Added lazy DecodedData property for decoded binary data with caching
    • Updated custom JSON converter in ContentBlock.Converter
    • Added cache invalidation when Data is set
    • Optimized with MemoryMarshal.TryGetArray to avoid ToArray() in common cases
    • Use Base64.DecodeFromUtf8 to decode directly from UTF-8 bytes without string intermediate
  • Update AudioContentBlock.Data property

    • Changed from string to ReadOnlyMemory<byte>
    • Added lazy DecodedData property for decoded binary data with caching
    • Updated custom JSON converter in ContentBlock.Converter
    • Added cache invalidation when Data is set
    • Optimized with MemoryMarshal.TryGetArray to avoid ToArray() in common cases
    • Use Base64.DecodeFromUtf8 to decode directly from UTF-8 bytes without string intermediate
  • Fixed AIContentExtensions.cs to use DecodedData properties

  • Fixed AIFunctionMcpServerResource.cs to convert strings to UTF-8 bytes

    • Optimized with MemoryMarshal.TryGetArray for ReadOnlyMemory<char> to avoid ToString() in common cases
    • Use GetBytes overload with offset and count parameters for better performance
  • Fixed test servers (TestServer, TestSseServer, ConformanceServer) to use UTF-8 encoding

  • Update tests

    • Fixed ResourceContentsTests.cs to use byte arrays and UTF-8 encoding
    • Fixed ContentBlockTests.cs to use byte arrays
    • Fixed all integration tests and sample code
    • Updated assertions to compare UTF-8 decoded strings
  • Build and test all changes

    • Build succeeds across all target frameworks
    • All unit tests pass (protocol tests, resource tests, content block tests)
    • Code review feedback addressed
    • Performance optimizations applied

Summary:

Successfully converted binary data properties from string to ReadOnlyMemory<byte> to avoid UTF-16 transcoding overhead. The changes:

  1. Use ReadOnlyMemory<byte> to store base64-encoded UTF-8 bytes directly from JSON
  2. Provide lazy Data/DecodedData properties that decode and cache the binary data (see the sketch after this list)
  3. Invalidate the cached decoded value when the base64-encoded data is modified
  4. Write UTF-8 bytes directly to JSON without string allocation
  5. All tests updated and passing
  6. Performance optimizations using MemoryMarshal.TryGetArray to avoid unnecessary allocations
  7. Direct base64 decoding from UTF-8 bytes using System.Buffers.Text.Base64.DecodeFromUtf8, which completely eliminates the string intermediate during decoding
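
As a rough sketch of the pattern the plan describes for BlobResourceContents (member names follow the plan, but the exact types, nullability, and converter wiring in the PR may differ; the decode step is kept deliberately naive here and is refined in later sketches):

using System;
using System.Text;
using System.Text.Json.Serialization;

public sealed class BlobResourceContents
{
    private ReadOnlyMemory<byte> _blob;
    private ReadOnlyMemory<byte>? _decodedData;

    // Base64-encoded UTF-8 bytes, stored exactly as they appear on the wire.
    // A custom JSON converter reads/writes these bytes as the raw JSON string value;
    // the converter itself is omitted from this sketch.
    [JsonPropertyName("blob")]
    public required ReadOnlyMemory<byte> Blob
    {
        get => _blob;
        set
        {
            _blob = value;
            _decodedData = null; // invalidate the cached decoded value
        }
    }

    // Decoded binary data, produced lazily from Blob and cached until Blob changes.
    [JsonIgnore]
    public ReadOnlyMemory<byte> Data =>
        _decodedData ??= Convert.FromBase64String(Encoding.UTF8.GetString(Blob.Span));
}

Setting Blob again clears the cached value, so Data always reflects the current payload.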

Performance Optimizations:

  • Changed #if NET6_0_OR_GREATER to #if NET for simplicity
  • Use MemoryMarshal.TryGetArray to extract the backing array when available, avoiding ToArray() allocations
  • Apply the same optimization for ReadOnlyMemory<char> to avoid ToString() allocations
  • Use the GetBytes(char[], int, int) overload to pass the segment's offset and count directly instead of checking whether the segment covers the complete array
  • Use System.Buffers.Text.Base64.DecodeFromUtf8 on .NET to decode directly from UTF-8 base64 bytes to binary without creating an intermediate UTF-16 string
  • Allocate a buffer of the maximum decoded size, decode into it, then slice to the actual size as a ReadOnlyMemory<byte> (avoiding the extra copy a byte[] result would need roughly 67% of the time, whenever the actual length is smaller than the maximum)
  • Fall back to an optimized string-based approach on .NET Standard 2.0

End-to-End Performance Path:

  • Wire → JSON Reader → ReadOnlyMemory<byte>: Zero UTF-8 → UTF-16 transcoding
  • Base64 Decode: Direct UTF-8 bytes → binary via Base64.DecodeFromUtf8 (no string intermediate)
  • Cache & Return: ReadOnlyMemory<byte> with lazy decode and caching
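
To make the reader/writer ends of this path concrete, the per-property handling inside a custom converter could look roughly like the following; this is a simplified sketch rather than the actual ResourceContents.Converter/ContentBlock.Converter code, and it leaves full error checking to the real implementation:

using System;
using System.Buffers;
using System.Text.Json;

internal static class Base64Utf8Json
{
    // Capture the raw UTF-8 bytes of the current JSON string token without transcoding
    // to a .NET string. A production converter must also handle escaped values
    // (e.g. "\u002B" for '+'), which .NET 7+ exposes via Utf8JsonReader.ValueIsEscaped
    // and Utf8JsonReader.CopyString.
    public static ReadOnlyMemory<byte> Read(ref Utf8JsonReader reader) =>
        reader.HasValueSequence ? reader.ValueSequence.ToArray() : reader.ValueSpan.ToArray();

    // Emit the stored UTF-8 bytes directly as the JSON string value, with no
    // intermediate System.String allocation.
    public static void Write(Utf8JsonWriter writer, string propertyName, ReadOnlyMemory<byte> base64Utf8) =>
        writer.WriteString(propertyName, base64Utf8.Span);
}
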
Original prompt

This section details the original issue you should resolve

<issue_title>Content types that represent binary data are exposing UTF16 encoded strings</issue_title>
<issue_description>

/// <summary>
/// Gets or sets the base64-encoded string representing the binary data of the item.
/// </summary>
[JsonPropertyName("blob")]
public required string Blob { get; set; }

/// <summary>
/// Gets or sets the base64-encoded image data.
/// </summary>
[JsonPropertyName("data")]
public required string Data { get; set; }

/// <summary>
/// Gets or sets the base64-encoded audio data.
/// </summary>
[JsonPropertyName("data")]
public required string Data { get; set; }

Potentially more -- https://github.com/search?q=repo%3Amodelcontextprotocol%2Fcsharp-sdk%20%22base64-encoded%22&type=code

Doing this means that the UTF-8 data that comes across the wire is transcoded to UTF-16 only to then be base64-decoded to bytes when anyone uses it. The intermediate step to UTF-16 (.NET string) is unnecessary, wastes cycles, creates garbage, and makes the API harder to use.

Instead, these types should keep the data as UTF-8 and expose lazy getters that do the base64 decoding. We can tell folks to use those decoded getter properties.

https://github.com/dotnet/extensions/blob/main/src/Libraries/Microsoft.Extensions.AI.Abstractions/Contents/DataContent.cs follows a similar pattern, though it uses the Uri as its serialized representation. It seems these MCP types should use the UTF-8-encoded data as their serialized representation.

The OpenAI SDK exposes a BinaryData as its representation to deal with this problem (https://github.com/openai/openai-dotnet/blob/636075222205d626a770acb1af1e5cc060abd517/src/Custom/Images/GeneratedImage.cs#L23); however, that's slightly less usable since it forces the caller to decode.

The proposal here would be to have these be instead:

/// <summary> 
/// Gets or sets the base64-encoded UTF-8 bytes representing the binary data of the item.
/// </summary> 
/// <remarks>
/// This is a zero-copy representation of the wire payload of this item.  Setting this value will invalidate any cached value of <see cref="Data" />.
/// </remarks>
[JsonPropertyName("blob")] 
public required ReadOnlyMemory<byte> Blob { get; set; } 

/// <summary> 
/// Gets the decoded data represented by <see cref="Blob" />.
/// </summary>
/// <remarks>
/// Accessing this member will decode the value in <see cref="Blob" /> and cache the result.
/// </remarks>
[JsonIgnore]
public byte[] Data { get; }

</issue_description>

<agent_instructions>Please find any protocol types that expose binary data as string (typically base64 encoded) and follow the cited recommendations for how to expose them differently.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@stephentoub > The proposal here would be to have these be instead:

Should Data also be a ReadOnlyMemory<byte>?

Is Data going to cache the decoded data lazily, or is it going to produce a new array on each access? If the latter, it shouldn't be a property.

> what do you think about the name Blob

I'm torn. On the one hand, I agree with you. On the other hand, it's nice to keep it the same or super close to the spec. Offline you'd proposed something like Base64Blob; is that still your first choice?



Copilot AI and others added 3 commits December 5, 2025 23:00
Copilot AI changed the title from "[WIP] Fix binary data types exposing UTF16 encoded strings" to "Use ReadOnlyMemory<byte> for binary data to eliminate UTF-16 transcoding" Dec 5, 2025
Copilot AI requested a review from stephentoub December 5, 2025 23:28
stephentoub requested a review from ericstj December 8, 2025 03:42
{
if (_decodedData is null)
{
#if NET6_0_OR_GREATER
Contributor

@copilot, this can just be #if NET.

Contributor Author

Changed to #if NET in commit ebe3eef

#if NET6_0_OR_GREATER
_decodedData = Convert.FromBase64String(System.Text.Encoding.UTF8.GetString(Blob.Span));
#else
_decodedData = Convert.FromBase64String(System.Text.Encoding.UTF8.GetString(Blob.ToArray()));
Contributor

@copilot, this can extract the array using MemoryMarshal.TryGetArray and use it in a majority case, e.g.

byte[] array = MemoryMarshal.TryGetArray(Blob, out ArraySegment<byte> segment) && segment.Offset == 0 && segment.Count == segment.Array.Length ? segment.Array : Blob.ToArray();
_decodedData = Convert.FromBase64String(System.Text.Encoding.UTF8.GetString(array));

Contributor Author

Applied the MemoryMarshal.TryGetArray optimization to avoid ToArray() in the common case in commit ebe3eef
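
The commit itself isn't shown in this thread; assuming the suggestion was applied as written, the fallback path would look something like the sketch below (the helper name is illustrative):

using System;
using System.Runtime.InteropServices;
using System.Text;

internal static class Base64Utf8
{
    // Decode base64 text stored as UTF-8 bytes. On down-level targets, reuse the backing
    // array of the ReadOnlyMemory<byte> when it spans a complete array (the common case),
    // so no defensive copy via ToArray() is needed.
    public static byte[] Decode(ReadOnlyMemory<byte> base64Utf8)
    {
#if NET
        return Convert.FromBase64String(Encoding.UTF8.GetString(base64Utf8.Span));
#else
        byte[] array = MemoryMarshal.TryGetArray(base64Utf8, out ArraySegment<byte> segment)
            && segment.Offset == 0 && segment.Count == segment.Array!.Length
            ? segment.Array
            : base64Utf8.ToArray();
        return Convert.FromBase64String(Encoding.UTF8.GetString(array));
#endif
    }
}

Note that the .NET branch shown here still goes through a string; that remaining intermediate is what the Base64.DecodeFromUtf8 change later in this thread removes.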

Uri = request.Params!.Uri,
MimeType = dc.MediaType,
- Blob = dc.Base64Data.ToString()
+ Blob = System.Text.Encoding.UTF8.GetBytes(dc.Base64Data.ToString())
Contributor

@copilot, these cases where it's going from a ReadOnlyMemory<char> through UTF8.GetBytes can similarly use TryGetArray to avoid a ToString() in the common case.

Contributor Author

Applied MemoryMarshal.TryGetArray optimization for ReadOnlyMemory to avoid ToString() in the common case in commit ebe3eef

{
Uri = request.Params!.Uri,
MimeType = dc.MediaType,
Blob = MemoryMarshal.TryGetArray(dc.Base64Data, out ArraySegment<char> segment) && segment.Offset == 0 && segment.Count == segment.Array!.Length
Contributor

@copilot, these two don't need the Offset and Count checks, as the actual offset and length can be passed as arguments to GetBytes

Contributor Author

Updated to use the GetBytes(char[], int, int) overload to pass offset and count directly in commit 1d76c11
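
The committed call sites aren't reproduced here, but the technique agreed in this exchange amounts to the following sketch (the helper name is illustrative; in the PR this logic sits inline where dc.Base64Data is converted):

using System;
using System.Runtime.InteropServices;
using System.Text;

internal static class Base64Chars
{
    // Convert base64 text held as ReadOnlyMemory<char> into UTF-8 bytes without
    // materializing an intermediate string when the memory wraps an array: the
    // segment's offset and count are passed straight to GetBytes.
    public static byte[] ToUtf8(ReadOnlyMemory<char> base64Chars) =>
        MemoryMarshal.TryGetArray(base64Chars, out ArraySegment<char> segment)
            ? Encoding.UTF8.GetBytes(segment.Array!, segment.Offset, segment.Count)
            : Encoding.UTF8.GetBytes(base64Chars.ToString());
}
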

@stephentoub
Contributor

@ericstj, I'm not sure this will actually end up saving much end-to-end. It'd help if some of the data wasn't actually looked at, but if it ends up being looked at, it seems it'll actually make the end-to-end worse.

@ericstj
Collaborator

ericstj commented Dec 8, 2025

I see this PR is still going through a string when decoding - I thought we had API that would go straight from ROS (base64 UTF8) to byte[] without an intermediate string.

@stephentoub
Contributor

> I thought we had API that would go straight from ROS (base64 UTF8) to byte[] without an intermediate string.

We have operation status APIs that write into an existing destination buffer, but no one-shot that creates the array. We do for Base64Url, but we didn't square the circle for Base64.

@ericstj
Collaborator

ericstj commented Dec 8, 2025

Can't we do Base64.GetMaxDecodedFromUtf8Length to get the size, create a byte array, then Base64.DecodeFromUtf8, followed by slice to final actual size?

@stephentoub
Contributor

> Base64.DecodeFromUtf8

That's the operation status API I was referring to.

> Can't we do Base64.GetMaxDecodedFromUtf8Length to get the size, create a byte array, then Base64.DecodeFromUtf8, followed by slice to final actual size?

I thought you'd stated you wanted the decoded data property to be a byte[]. If it is, slicing would entail allocating a whole new array unless the max happened to equal actual, which would happen only ~33% of the time.

If the destination property is a ReadOnlyMemory<byte>, then yes, it can be done that way.
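
For reference, the arithmetic behind those percentages: GetMaxDecodedFromUtf8Length returns three quarters of the padded base64 length, which equals the actual decoded length only when the payload length is a multiple of 3. A quick illustrative check (not code from the PR):

using System;
using System.Buffers.Text;
using System.Text;

internal static class MaxVsActual
{
    internal static void Run()
    {
        foreach (int n in new[] { 3, 4, 5, 6 })
        {
            // Encode n zero bytes, then compare the max-decoded estimate with n.
            byte[] base64Utf8 = Encoding.UTF8.GetBytes(Convert.ToBase64String(new byte[n]));
            int max = Base64.GetMaxDecodedFromUtf8Length(base64Utf8.Length);
            Console.WriteLine($"n={n}: base64 length={base64Utf8.Length}, max={max}, equal={max == n}");
        }
    }
}
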

@ericstj
Collaborator

ericstj commented Dec 9, 2025

> If the destination property is a ReadOnlyMemory<byte>, then yes, it can be done that way.

This is exactly what I was thinking of when mentioning:

> If using ROM avoids a copy it's acceptable.

And the current implementation is already exposing a public ReadOnlyMemory<byte> DecodedData property, so I think we've already committed to this. Let's go ahead and remove those string intermediates. @copilot

Contributor Author

Copilot AI commented Dec 9, 2025

> If the destination property is a ReadOnlyMemory<byte>, then yes, it can be done that way.
>
> This is exactly what I was thinking of when mentioning:
> ...

Implemented using Base64.DecodeFromUtf8 to decode directly from UTF-8 bytes without string intermediate in commit 39213fb. The implementation:

  • Uses Base64.GetMaxDecodedFromUtf8Length to allocate the buffer
  • Calls Base64.DecodeFromUtf8 to decode directly
  • Slices to the actual size using ReadOnlyMemory<byte> when the decoded length is smaller than the maximum (roughly 67% of inputs), avoiding the extra array copy a byte[] result would require (see the sketch after this list)
  • Falls back to optimized string-based approach for .NET Standard 2.0
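
A sketch of that decode path (not the literal code from commit 39213fb), assuming the input is complete, padded base64:

using System;
using System.Buffers;
using System.Buffers.Text;

internal static class Base64Utf8Decoder
{
    // Decode base64 UTF-8 bytes straight to binary with no intermediate string.
    // The buffer is allocated at the maximum possible size and the result is returned
    // as a ReadOnlyMemory<byte> sliced to the actual decoded length, so inputs whose
    // decoded length falls below the maximum (roughly two thirds of them) need no extra copy.
    public static ReadOnlyMemory<byte> Decode(ReadOnlySpan<byte> base64Utf8)
    {
        byte[] buffer = new byte[Base64.GetMaxDecodedFromUtf8Length(base64Utf8.Length)];
        OperationStatus status = Base64.DecodeFromUtf8(base64Utf8, buffer, out _, out int bytesWritten);
        if (status != OperationStatus.Done)
        {
            throw new FormatException("The payload is not valid base64.");
        }

        return buffer.AsMemory(0, bytesWritten);
    }
}
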

@stephentoub
Contributor

stephentoub commented Dec 9, 2025

What are developers that need to set the data going to do? We will do the complicated thing internally in our own use, but most devs/call sites won't. It'll end up being worse for them still, right? How are we going to mitigate that?

DataContent dataContent when dataContent.HasTopLevelMediaType("image") => new ImageContentBlock
{
- Data = dataContent.Base64Data.ToString(),
+ Data = System.Text.Encoding.UTF8.GetBytes(dataContent.Base64Data.ToString()),
Collaborator

DataContent already has Data which is decoded. Can we use that?

DataContent dataContent when dataContent.HasTopLevelMediaType("audio") => new AudioContentBlock
{
- Data = dataContent.Base64Data.ToString(),
+ Data = System.Text.Encoding.UTF8.GetBytes(dataContent.Base64Data.ToString()),
Collaborator

Same, can we use Data

Resource = new BlobResourceContents
{
- Blob = dataContent.Base64Data.ToString(),
+ Blob = System.Text.Encoding.UTF8.GetBytes(dataContent.Base64Data.ToString()),
Collaborator

I think this should just be Base64Data and not go to string and back to bytes.

contents.Add(new ImageContentBlock
{
- Data = TinyImageTool.MCP_TINY_IMAGE.Split(",").Last(),
+ Data = System.Text.Encoding.UTF8.GetBytes(TinyImageTool.MCP_TINY_IMAGE.Split(",").Last()),
Collaborator

nit: consider changing this to a span literal to avoid the need to parse a static resource. Both DataContent and ImageContentBlock can construct from the raw bytes.

@ericstj
Collaborator

ericstj commented Dec 9, 2025

Thinking about principles here:

  1. The type should internally represent the data in the form closest to its transit format. For these MCP types, that's UTF-8, base64-encoded bytes. For DataContent, that's a data URI.
  2. The type should have constructors/factories/properties that support multiple forms of initialization without forcing the caller to do the decoding. I imagine that would be base64 UTF-8 bytes, base64 string, and decoded bytes. We should choose the initialization option that minimizes additional decoding in any circumstance.
  3. We should have helpers that efficiently decode lazily and cache the result.

So in reviewing this I wouldn't want to see a ton of decoding sprinkled throughout the codebase. I would hope that we can have all the decoding surface area on the types themselves.

Does this sound right, @stephentoub?
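
For illustration only, the second principle could surface as entry points that map each input form onto the wire representation with the least possible work; the names below are hypothetical, not proposed API:

using System;
using System.Text;

internal static class BinaryInit
{
    // Already in wire form: store as-is; no transcoding or re-encoding.
    public static ReadOnlyMemory<byte> FromBase64Utf8(ReadOnlyMemory<byte> base64Utf8) => base64Utf8;

    // Base64 held as a .NET string: a single UTF-16 -> UTF-8 transcode, no base64 round-trip.
    public static ReadOnlyMemory<byte> FromBase64String(string base64) => Encoding.UTF8.GetBytes(base64);

    // Raw bytes: a single base64 encode (the UTF-16 intermediate here could also be
    // removed with Base64.EncodeToUtf8 on .NET).
    public static ReadOnlyMemory<byte> FromBytes(ReadOnlySpan<byte> data) =>
        Encoding.UTF8.GetBytes(Convert.ToBase64String(data));
}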
