-
Notifications
You must be signed in to change notification settings - Fork 586
Use ReadOnlyMemory<byte> for binary data to eliminate UTF-16 transcoding #1070
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
| { | ||
| if (_decodedData is null) | ||
| { | ||
| #if NET6_0_OR_GREATER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot, this can just be #if NET.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to #if NET in commit ebe3eef
| #if NET6_0_OR_GREATER | ||
| _decodedData = Convert.FromBase64String(System.Text.Encoding.UTF8.GetString(Blob.Span)); | ||
| #else | ||
| _decodedData = Convert.FromBase64String(System.Text.Encoding.UTF8.GetString(Blob.ToArray())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot, this can extract the array using MemoryMarshal.TryGetArray and use it in a majority case, e.g.
byte[] array = MemoryMarshal.TryGetArray(Blob, out ArraySegment<byte> segment) && segment.Offset == 0 && segment.Count == segment.Array.Length ? segment.Array : Blob.ToArray();
_decodedData = Convert.FromBase64String(array);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applied the MemoryMarshal.TryGetArray optimization to avoid ToArray() in the common case in commit ebe3eef
| Uri = request.Params!.Uri, | ||
| MimeType = dc.MediaType, | ||
| Blob = dc.Base64Data.ToString() | ||
| Blob = System.Text.Encoding.UTF8.GetBytes(dc.Base64Data.ToString()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot, these cases where it's going from a ReadOnlyMemory<char> through UTF8.GetBytes can similarly use TryGetArray to avoid a ToString() in the common case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applied MemoryMarshal.TryGetArray optimization for ReadOnlyMemory to avoid ToString() in the common case in commit ebe3eef
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
| { | ||
| Uri = request.Params!.Uri, | ||
| MimeType = dc.MediaType, | ||
| Blob = MemoryMarshal.TryGetArray(dc.Base64Data, out ArraySegment<char> segment) && segment.Offset == 0 && segment.Count == segment.Array!.Length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot, these two don't need to Offset and Count, as the actual offset and length can be passed as arguments to GetBytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to use the GetBytes(char[], int, int) overload to pass offset and count directly in commit 1d76c11
|
@ericstj, I'm not sure this will actually end up saving much end-to-end. It'd help if some of the data wasn't actually looked at, but if it ends up being looked at, it seems it'll actually make the end-to-end worse. |
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
|
I see this PR is still going though a string when decoding - I thought we had API that would go straight from ROS (base64 UTF8) to byte[] without an intermediate string. |
We have operation status APIs that write into an existing destinatiom buffer, but no one-shot that creates the array. We do for Base64Url but we didn't square the circle for Base64. |
|
Can't we do Base64.GetMaxDecodedFromUtf8Length to get the size, create a byte array, then Base64.DecodeFromUtf8, followed by slice to final actual size? |
That's the operation status API I was referring to.
I thought you'd stated you wanted the decoded data property to be a byte[]. If it is, slicing would entail allocating a whole new array unless the max happened to equal actual, which would happen only ~33% of the time. If the destination property is a |
This is exactly what I was thinking of when mentioning:
And the current implementation is already exposing as |
Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
Implemented using Base64.DecodeFromUtf8 to decode directly from UTF-8 bytes without string intermediate in commit 39213fb. The implementation:
|
|
What are developers that need to set the data going to do? We will do the complicated thing internally in our own use, but most devs/call sites won't. It'll end up being worse for them still, right? How are we going to mitigate that? |
| DataContent dataContent when dataContent.HasTopLevelMediaType("image") => new ImageContentBlock | ||
| { | ||
| Data = dataContent.Base64Data.ToString(), | ||
| Data = System.Text.Encoding.UTF8.GetBytes(dataContent.Base64Data.ToString()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataContent already has Data which is decoded. Can we use that?
| DataContent dataContent when dataContent.HasTopLevelMediaType("audio") => new AudioContentBlock | ||
| { | ||
| Data = dataContent.Base64Data.ToString(), | ||
| Data = System.Text.Encoding.UTF8.GetBytes(dataContent.Base64Data.ToString()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same, can we use Data
| Resource = new BlobResourceContents | ||
| { | ||
| Blob = dataContent.Base64Data.ToString(), | ||
| Blob = System.Text.Encoding.UTF8.GetBytes(dataContent.Base64Data.ToString()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should just be Base64Data and not go to string and back to bytes.
| contents.Add(new ImageContentBlock | ||
| { | ||
| Data = TinyImageTool.MCP_TINY_IMAGE.Split(",").Last(), | ||
| Data = System.Text.Encoding.UTF8.GetBytes(TinyImageTool.MCP_TINY_IMAGE.Split(",").Last()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: consider changing this to a span literal to avoid the need to parse a static resource. Both DataContent and ImageContentBlock can construct from the raw bytes.
|
Thinking about principles here:
So in reviewing this I wouldn't want to see a ton of decoding sprinkled throughout the codebase. I would hope that we can have all the decoding surface area on the types themselves. Does this sound right @stephentoub |
Plan: Fix Binary Data Representation in Protocol Types
Based on the issue analysis, I need to change how binary data is represented in the protocol types to avoid unnecessary UTF-16 transcoding. The current implementation uses
stringproperties for base64-encoded data, which causes UTF-8 → UTF-16 → base64 decode conversions.Changes Required:
Update
BlobResourceContents.BlobpropertystringtoReadOnlyMemory<byte>for zero-copy UTF-8 representationDataproperty for decoded binary data with cachingResourceContents.ConverterUpdate
ImageContentBlock.DatapropertystringtoReadOnlyMemory<byte>DecodedDataproperty for decoded binary data with cachingContentBlock.ConverterUpdate
AudioContentBlock.DatapropertystringtoReadOnlyMemory<byte>DecodedDataproperty for decoded binary data with cachingContentBlock.ConverterFixed AIContentExtensions.cs to use DecodedData properties
Fixed AIFunctionMcpServerResource.cs to convert strings to UTF-8 bytes
Fixed test servers (TestServer, TestSseServer, ConformanceServer) to use UTF-8 encoding
Update tests
ResourceContentsTests.csto use byte arrays and UTF-8 encodingContentBlockTests.csto use byte arraysBuild and test all changes
Summary:
Successfully converted binary data properties from
stringtoReadOnlyMemory<byte>to avoid UTF-16 transcoding overhead. The changes:ReadOnlyMemory<byte>to store base64-encoded UTF-8 bytes directly from JSONData/DecodedDataproperties that decode and cache the binary dataSystem.Buffers.Text.Base64.DecodeFromUtf8- completely eliminates string intermediate during decodingPerformance Optimizations:
#if NET6_0_OR_GREATERto#if NETfor simplicitySystem.Buffers.Text.Base64.DecodeFromUtf8on .NET to decode directly from UTF-8 base64 bytes to binary without creating an intermediate UTF-16 stringEnd-to-End Performance Path:
Original prompt
💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.