|
| 1 | +--- |
| 2 | +RFC: Default File Encoding |
| 3 | +Author: James Truher |
| 4 | +Status: Draft |
| 5 | +Area: FileSystem |
| 6 | +Comments Due: 4/16/2017 |
| 7 | +--- |
| 8 | + |
| 9 | +# Default file encoding which optionally includes Byte Order Mark (BOM) |
| 10 | + |
| 11 | +This RFC proposes that files be created with an encoding appropriate for the platform, including whether a BOM is written. |
| 12 | + |
| 13 | +## Motivation |
| 14 | + |
| 15 | +By default, PowerShell currently writes a BOM whenever it creates a file in an encoding that uses one. |
| 16 | +This is a problem on Linux systems, where the default encoding is UTF8 and files are conventionally created without a BOM. |
| 17 | +Creating files on Linux with a BOM makes it difficult to interact with the native tools, as the following example illustrates. |
| 18 | + |
| 19 | +```powershell |
| 20 | +PS> "ĝoo" > file.txt |
| 21 | +PS> get-content file.txt |
| 22 | +ĝoo |
| 23 | +PS> exit |
| 24 | +james@jimtru-ops2:~$ /bin/cat file.txt |
| 25 | +▒▒oo |
| 26 | +``` |
| 27 | + |
| 28 | +This is due to the BOM being written into the file: |
| 29 | + |
| 30 | +```powershell |
| 31 | +PS /home/james> format-hex file.txt |
| 32 | +
|
| 33 | + Path: /home/james/file.txt |
| 34 | + 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F |
| 35 | +00000000 FF FE 1D 01 6F 00 6F 00 0A 00 .þ..o.o... |
| 36 | + ^^ ^^ |
| 37 | +``` |
| 38 | +The native tools on Linux try to render the BOM as actual content, which results in mistranslated characters. |
| 39 | +If the BOM were written only when the platform expects it, interaction with native tools would be less problematic. |
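
As a rough illustration of where those leading bytes come from, the sketch below prints the preamble (BOM) that a few .NET encodings report. It assumes the standard `System.Text` encoding classes; the labels in the table are only descriptive names chosen for this example.

```powershell
# Illustrative only: list the BOM (preamble) bytes reported by a few .NET encodings.
$encodings = [ordered]@{
    'UTF8 (with BOM)' = [System.Text.UTF8Encoding]::new($true)            # emit BOM
    'UTF8 (no BOM)'   = [System.Text.UTF8Encoding]::new($false)           # no BOM
    'Unicode (UTF16)' = [System.Text.UnicodeEncoding]::new($false, $true) # little-endian, emit BOM
}

foreach ($name in $encodings.Keys) {
    # GetPreamble() returns the bytes written at the start of a file, or an empty array.
    $bom = $encodings[$name].GetPreamble()
    $hex = ($bom | ForEach-Object { $_.ToString('X2') }) -join ' '
    '{0,-16} BOM bytes: {1}' -f $name, $hex
}
```

The little-endian Unicode preamble is `FF FE`, which matches the first two bytes in the `format-hex` output above, and the UTF8 preamble is `EF BB BF`. An encoding constructed without a BOM reports an empty preamble.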
| 40 | + |
| 41 | +## Specification |
| 42 | + |
| 43 | +A new global variable `$PSDefaultFileEncoding` shall be available which allows the user to define the encoding for their system. |
| 44 | +The allowed values for this variable shall be defined by the `Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding` enum, with the following additions: |
| 45 | + |
| 46 | +* UTF8NoBOM |
| 47 | +* Legacy |
| 48 | + |
| 49 | +The following is the complete list of `FileSystemCmdletProviderEncoding` members: |
| 50 | +* Ascii |
| 51 | +* BigEndianUnicode |
| 52 | +* BigEndianUTF32 |
| 53 | +* Byte |
| 54 | +* Default |
| 55 | +* Legacy |
| 56 | +* Oem |
| 57 | +* String |
| 58 | +* Unicode |
| 59 | +* Unknown |
| 60 | +* UTF32 |
| 61 | +* UTF7 |
| 62 | +* UTF8 |
| 63 | +* UTF8NoBOM |
| 64 | + |
| 65 | +When `$PSDefaultFileEncoding` is set to `UTF8NoBOM`, the file shall be created with UTF8 encoding and no BOM shall be written. |
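
The intended result can be approximated today by calling .NET directly. This is a minimal sketch of the byte-level outcome and not the proposed implementation; the temporary file name is arbitrary.

```powershell
# Illustrative only: write UTF8 text without a BOM via .NET and inspect the bytes.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'utf8nobom-demo.txt'

# Passing $false asks UTF8Encoding not to emit a BOM.
$utf8NoBom = [System.Text.UTF8Encoding]::new($false)

# [char]0x00A9 is the copyright sign; `n is a line feed.
[System.IO.File]::WriteAllText($path, "$([char]0x00A9)opyright`n", $utf8NoBom)

$bytes = [System.IO.File]::ReadAllBytes($path)
$hex = ($bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
$hex   # C2 A9 6F 70 79 72 69 67 68 74 0A

Remove-Item $path
```

The file starts directly with `C2 A9`, the UTF8 encoding of the copyright sign, with no `EF BB BF` prefix.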
| 66 | + |
| 67 | +When `$PSDefaultFileEncoding` is set to `Legacy`, the behavior shall be: |
| 68 | + |
| 69 | +``` |
| 70 | +CmdletName Encoding |
| 71 | +---------- -------- |
| 72 | +Add-Content ASCII |
| 73 | +Export-Clixml UTF16 |
| 74 | +Export-CSV ASCII |
| 75 | +Out-File UTF16 |
| 76 | +Set-Content ASCII |
| 77 | +Export-PSSession UTF8 (with BOM) |
| 78 | +Redirection UTF16 |
| 79 | +``` |
| 80 | +This preserves the existing, inconsistent per-cmdlet encodings when `Legacy` is selected on non-Windows platforms, and allows files created on Linux to be used on Windows with the same encoding as in previous releases of PowerShell. |
| 81 | + |
| 82 | +The default on Windows systems shall remain unchanged (the value for `$PSDefaultFileEncoding` shall be set to `Legacy`); non-Windows platforms shall set `$PSDefaultFileEncoding` to `UTF8NoBOM`. |
| 83 | +If the `$PSDefaultFileEncoding` is not set, `UTF8NoBOM` shall be the default for non-Windows systems, and `Legacy` (the current behavior) on Windows. |
| 84 | + |
| 85 | +Naturally, specific use of the `-encoding` parameter when invoking the cmdlet shall override `$PSDefaultFileEncoding`. |
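
The precedence described above can be sketched as follows. `Resolve-FileEncodingName` and its parameters are hypothetical names used only for this illustration; they are not part of PowerShell or of this proposal.

```powershell
# Hypothetical helper: illustrates the precedence only, not an actual implementation.
function Resolve-FileEncodingName {
    param(
        [string] $ExplicitEncoding,  # value passed via -Encoding, if any
        [string] $SessionDefault,    # value of $PSDefaultFileEncoding, if set
        [bool]   $OnWindows          # whether the platform is Windows
    )
    if ($ExplicitEncoding) { return $ExplicitEncoding }  # 1. -Encoding always wins
    if ($SessionDefault)   { return $SessionDefault }    # 2. then $PSDefaultFileEncoding
    if ($OnWindows)        { return 'Legacy' }           # 3. then the platform default
    return 'UTF8NoBOM'
}

Resolve-FileEncodingName -ExplicitEncoding 'Unicode' -SessionDefault 'UTF8NoBOM' -OnWindows $false  # Unicode
Resolve-FileEncodingName -SessionDefault 'UTF8' -OnWindows $true                                    # UTF8
Resolve-FileEncodingName -OnWindows $false                                                          # UTF8NoBOM
Resolve-FileEncodingName -OnWindows $true                                                           # Legacy
```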
| 86 | + |
| 87 | +### Exclusions |
| 88 | + |
| 89 | +Cmdlets which do not create a file are excluded from this change, so the `*-WebRequest` and `*-RestMethod` cmdlets shall not be changed. |
| 90 | +Only those cmdlets listed in the table above are to be changed; any other cmdlets which create files with a specific encoding are out of scope. |
| 91 | +Remoting protocol cmdlets shall also be unaffected by this change. |
| 92 | + |
| 93 | +### Optional |
| 94 | + |
| 95 | +We should take this opportunity to rationalize our use of the `Encoding` parameter, and change the cmdlets which use Encoding as `string` or `System.Text.Encoding` type to use `Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding`. |
| 96 | +The following cmdlets use various types for the parameter `Encoding`: |
| 97 | + |
| 98 | +```PowerShell |
| 99 | +PS> Get-Command -type cmdlet | ?{$_.parameters} |?{$_.source -match "microsoft"}|ft name,{$_.parameters['encoding'].ParameterType} |
| 100 | +
|
| 101 | +Name $_.parameters['encoding'].ParameterType |
| 102 | +---- --------------------------------------- |
| 103 | +Add-Content Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding |
| 104 | +Export-Clixml System.String |
| 105 | +Export-Csv System.String |
| 106 | +Export-PSSession System.String |
| 107 | +Get-Content Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding |
| 108 | +Import-Csv System.String |
| 109 | +Out-File System.String |
| 110 | +Select-String System.String |
| 111 | +Send-MailMessage System.Text.Encoding |
| 112 | +Set-Content Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding |
| 113 | +``` |
| 114 | + |
| 115 | +Using a single type will make these cmdlets easier to maintain over time. |
| 116 | + |
| 117 | +### Examples |
| 118 | +--- |
| 119 | +Creating a file without a BOM on a Linux system (the default): |
| 120 | +```powershell |
| 121 | +PS> "ĝoo" > file.txt |
| 122 | +PS> get-content file.txt |
| 123 | +ĝoo |
| 124 | +PS> exit |
| 125 | +james@jimtru-ops2:~$ cat file.txt |
| 126 | +ĝoo |
| 127 | +``` |
| 128 | + |
| 129 | +Additional details: |
| 130 | +```powershell |
| 131 | +PS /home/james> "©opyright" > c.txt |
| 132 | +PS /home/james> format-hex c.txt |
| 133 | + Path: /home/james/c.txt |
| 134 | + 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F |
| 135 | +00000000 C2 A9 6F 70 79 72 69 67 68 74 0A ©opyright. |
| 136 | +
|
| 137 | +PS /home/james> /bin/cat c.txt |
| 138 | +©opyright |
| 139 | +PS /home/james> get-content c.txt |
| 140 | +©opyright |
| 141 | +PS /home/james> bash |
| 142 | +james@jimtru-ops2:~$ date >> c.txt |
| 143 | +james@jimtru-ops2:~$ cat c.txt |
| 144 | +©opyright |
| 145 | +Thu Feb 16 15:02:58 PST 2017 |
| 146 | +james@jimtru-ops2:~$ exit |
| 147 | +exit |
| 148 | +PS /home/james> get-content -Encoding utf8 c.txt |
| 149 | +©opyright |
| 150 | +Thu Feb 16 15:02:58 PST 2017 |
| 151 | +``` |
| 152 | + |
| 153 | +Creating a file with a BOM on a Linux system. This explicitly puts the BOM in the file and makes the file problematic for native Linux tools: |
| 154 | +```powershell |
| 155 | +PS> $PSDefaultFileEncoding = "UTF8" |
| 156 | +PS> "ĝoo" > file.txt |
| 157 | +PS> get-content file.txt |
| 158 | +ĝoo |
| 159 | +PS> exit |
| 160 | +james@jimtru-ops2:~$ cat file.txt |
| 161 | +▒▒oo |
| 162 | +``` |
| 163 | + |
| 164 | +This mimics our current behavior and is due to the BOM being written into the file. |
| 165 | +This file _would_ be suitable for use on a Windows system. |
| 166 | + |
| 167 | +Creating a file without a BOM on Windows: |
| 168 | +```powershell |
| 169 | +PS> "ĝoo" |out-file -encoding UTF8NoBOM file.txt |
| 170 | +``` |
| 171 | + |
| 172 | +### Commentary |
| 173 | + |
| 174 | +`UTF8NoBOM` and `Legacy` are, of course, not actual encodings, but neither are a number of the other values for `FileSystemCmdletProviderEncoding`. |
| 175 | +However, they are reasonably descriptive of our behavior. |
| 176 | + |
| 177 | +### Alternate Approaches |
| 178 | +The setting need not be a PowerShell variable; it could be an environment variable or part of the configuration proposed by [PowerShell-StartupConfig](https://github.com/PowerShell/PowerShell-RFC/blob/master/1-Draft/RFC0015-PowerShell-StartupConfig.md). |
| 179 | +However, a variable is the simplest approach, and these alternatives can be done at a later time. |