Commit 4b65091

Merge pull request #70 from JamesWTruher/jameswtruher/FileEncodingRFC
rationalizing the file encoding story for cmdlets
2 parents d8d4d76 + 0c7b82c commit 4b65091

1 file changed

+179
-0
lines changed

1-Draft/DefaultFileEncoding.md

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
---
RFC: Default File Encoding
Author: James Truher
Status: Draft
Area: FileSystem
Comments Due: 4/16/2017
---

# Default file encoding which optionally includes Byte Order Mark (BOM)

This RFC proposes that file creation follow the conventions of the platform, including whether a BOM is written.

## Motivation

Currently, PowerShell writes a BOM by default whenever it creates a file in an encoding that uses one.
This is a problem on Linux systems, where the default encoding is UTF8 and files are conventionally created without a BOM.
Creating files on Linux with a BOM makes it difficult to interact with the native tools, as the following example illustrates.

```powershell
PS> "ĝoo" > file.txt
PS> get-content file.txt
ĝoo
PS> exit
james@jimtru-ops2:~$ /bin/cat file.txt
▒▒oo
```

This is due to the file being written as UTF-16 with a leading BOM:

```powershell
PS /home/james> format-hex file.txt

           Path: /home/james/file.txt
           00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000   FF FE 1D 01 6F 00 6F 00 0A 00                    .þ..o.o...
           ^^ ^^
```
The native tools on Linux try to render the BOM as actual content, which results in garbled characters.
If the BOM were written only when the platform expects it, interaction with native tools would be less problematic.
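
As a cross-check of the dump above, the BOM and payload bytes can be reproduced directly with the .NET encoding classes. This is only an illustrative sketch, not part of the proposal; `ConvertTo-HexString` is a hypothetical helper name introduced here.

```powershell
# Hypothetical helper (not part of the RFC): render bytes as spaced hex.
function ConvertTo-HexString {
    param([byte[]]$Bytes)
    ($Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
}

# U+011D followed by "oo", built from a code point so the result does not
# depend on the encoding of the script file itself.
$text = [string][char]0x011D + 'oo'

$utf16     = [System.Text.Encoding]::Unicode          # UTF-16LE, with BOM
$utf8NoBom = [System.Text.UTF8Encoding]::new($false)  # UTF-8, without BOM

# GetPreamble() returns the BOM a writer would emit at the start of a file.
$utf16Bom  = ConvertTo-HexString ($utf16.GetPreamble())
$utf16Body = ConvertTo-HexString ($utf16.GetBytes($text))
$utf8Bom   = ConvertTo-HexString ($utf8NoBom.GetPreamble())
$utf8Body  = ConvertTo-HexString ($utf8NoBom.GetBytes($text))

"UTF-16 BOM : $utf16Bom"    # FF FE
"UTF-16 body: $utf16Body"   # 1D 01 6F 00 6F 00
"UTF-8 BOM  : $utf8Bom"     # (empty)
"UTF-8 body : $utf8Body"    # C4 9D 6F 6F
```

The UTF-16 bytes match the start of the `format-hex` output above; the BOM-less UTF-8 form is what native Linux tools expect.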

## Specification

A new global variable `$PSDefaultFileEncoding` shall be available which allows the user to define the encoding for their system.
The allowed values for this variable shall be defined by the `Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding` enum, with the following additions:

* UTF8NoBOM
* Legacy

The following is the complete list of `FileSystemCmdletProviderEncoding` members:
* Ascii
* BigEndianUnicode
* BigEndianUTF32
* Byte
* Default
* Legacy
* Oem
* String
* Unicode
* Unknown
* UTF32
* UTF7
* UTF8
* UTF8NoBOM

When `$PSDefaultFileEncoding` is set to `UTF8NoBOM`, the file shall be created with UTF8 encoding and no BOM shall be written.
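
For reference, the on-disk result expected from `UTF8NoBOM` can already be produced today by calling .NET directly. The sketch below only illustrates the target byte layout; it is not how the cmdlets would implement the setting, and the temporary file name is an arbitrary choice.

```powershell
# Arbitrary scratch file in the temp directory (name chosen for this sketch).
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'utf8nobom-sketch.txt'

# U+00A9 (copyright sign) + "opyright" + LF, built from a code point.
$text = [string][char]0x00A9 + "opyright`n"

# UTF8Encoding($false) means "do not emit a BOM".
$encoding = [System.Text.UTF8Encoding]::new($false)
[System.IO.File]::WriteAllText($path, $text, $encoding)

# Read the raw bytes back and show them as hex.
$bytes = [System.IO.File]::ReadAllBytes($path)
$hex   = ($bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
$hex   # C2 A9 6F 70 79 72 69 67 68 74 0A

Remove-Item $path
```

The file starts directly with the two-byte UTF8 encoding of `©`, with no `EF BB BF` prefix.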

When `$PSDefaultFileEncoding` is set to `Legacy`, the behavior shall be:

```
CmdletName         Encoding
----------         --------
Add-Content        ASCII
Export-Clixml      UTF16
Export-CSV         ASCII
Out-File           UTF16
Set-Content        ASCII
Export-PSSession   UTF8 (with BOM)
Redirection        UTF16
```
This preserves the existing, irregular per-cmdlet encodings on non-Windows platforms, so that files created on Linux can be used on Windows with the same encodings as in previous releases of PowerShell.
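
The `Legacy` table above amounts to a per-cmdlet lookup. A minimal sketch of that lookup follows; `$legacyEncoding` and `Get-LegacyEncoding` are hypothetical names used only for illustration, and the values are copied verbatim from the table.

```powershell
# Encodings copied from the Legacy table above, keyed by cmdlet name.
$legacyEncoding = @{
    'Add-Content'      = 'ASCII'
    'Export-Clixml'    = 'UTF16'
    'Export-CSV'       = 'ASCII'
    'Out-File'         = 'UTF16'
    'Set-Content'      = 'ASCII'
    'Export-PSSession' = 'UTF8 (with BOM)'
    'Redirection'      = 'UTF16'
}

# Hypothetical helper: return the legacy encoding for a cmdlet,
# or $null when the cmdlet is not covered by the table.
function Get-LegacyEncoding {
    param([Parameter(Mandatory)][string]$CmdletName)
    if ($legacyEncoding.ContainsKey($CmdletName)) {
        return $legacyEncoding[$CmdletName]
    }
    return $null
}

Get-LegacyEncoding -CmdletName 'Out-File'      # UTF16
Get-LegacyEncoding -CmdletName 'Set-Content'   # ASCII
```

Returning `$null` for unlisted cmdlets mirrors the scope rule that only the cmdlets in the table are affected.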

The default on Windows systems shall remain unchanged (the value of `$PSDefaultFileEncoding` shall be set to `Legacy`); non-Windows platforms shall set `$PSDefaultFileEncoding` to `UTF8NoBOM`.
If `$PSDefaultFileEncoding` is not set, `UTF8NoBOM` shall be the default on non-Windows systems, and `Legacy` (the current behavior) on Windows.

Naturally, specific use of the `-Encoding` parameter when invoking a cmdlet shall override `$PSDefaultFileEncoding`.
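
The precedence described above (explicit `-Encoding`, then `$PSDefaultFileEncoding`, then the platform default) can be sketched as a small pure function. `Resolve-SketchEncoding` and its parameter names are hypothetical; the actual logic would live inside the affected cmdlets, and the platform is passed in explicitly here so the sketch behaves the same everywhere.

```powershell
# Hypothetical sketch of the resolution order proposed in this RFC.
function Resolve-SketchEncoding {
    param(
        [string]$Explicit,     # value passed to -Encoding, if any
        [string]$Preference,   # value of $PSDefaultFileEncoding, if set
        [bool]$OnWindows       # whether the platform is Windows
    )
    if ($Explicit)   { return $Explicit }     # 1. -Encoding always wins
    if ($Preference) { return $Preference }   # 2. then the variable
    if ($OnWindows)  { return 'Legacy' }      # 3. then the platform default
    return 'UTF8NoBOM'
}

Resolve-SketchEncoding -OnWindows $false                       # UTF8NoBOM
Resolve-SketchEncoding -OnWindows $true                        # Legacy
Resolve-SketchEncoding -Preference 'UTF8' -OnWindows $false    # UTF8
Resolve-SketchEncoding -Explicit 'Unicode' -Preference 'UTF8' -OnWindows $true   # Unicode
```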

### Exclusions

Cmdlets which do not create a file are excluded from this change, so the `*-WebRequest` and `*-RestMethod` cmdlets shall not be changed.
Only those cmdlets listed in the table above are to be changed; any other cmdlets which create files with a specific encoding are out of scope.
Remoting protocol cmdlets shall also be unaffected by this change.

### Optional

We should take this opportunity to rationalize our use of the `Encoding` parameter, and change the cmdlets which declare `Encoding` as `string` or `System.Text.Encoding` to use `Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding`.
The following cmdlets use various types for the `Encoding` parameter:

```PowerShell
PS> Get-Command -type cmdlet | ?{$_.parameters} | ?{$_.source -match "microsoft"} | ft name,{$_.parameters['encoding'].ParameterType}

Name             $_.parameters['encoding'].ParameterType
----             ---------------------------------------
Add-Content      Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding
Export-Clixml    System.String
Export-Csv       System.String
Export-PSSession System.String
Get-Content      Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding
Import-Csv       System.String
Out-File         System.String
Select-String    System.String
Send-MailMessage System.Text.Encoding
Set-Content      Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding
```

Using a single type will make these cmdlets easier to maintain over time.
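
To illustrate what an enum-typed `Encoding` parameter buys over a plain `string`, here is a minimal sketch. `SketchFileEncoding` and `Get-SketchEncodingName` are hypothetical stand-ins, not the real `FileSystemCmdletProviderEncoding` type or any existing cmdlet, and the member list is deliberately shortened.

```powershell
# Hypothetical, shortened stand-in for the provider encoding enum.
enum SketchFileEncoding {
    Ascii
    Unicode
    UTF8
    UTF8NoBOM
}

# Hypothetical function with an enum-typed parameter: the parameter binder
# converts strings to enum members and rejects unknown names.
function Get-SketchEncodingName {
    param([SketchFileEncoding]$Encoding = [SketchFileEncoding]::UTF8NoBOM)
    $Encoding.ToString()
}

# String input is matched case-insensitively against the enum members.
$accepted = Get-SketchEncodingName -Encoding utf8nobom

# An unknown name fails during parameter binding, before the body runs.
$rejected = $false
try {
    Get-SketchEncodingName -Encoding 'NotAnEncoding' | Out-Null
} catch {
    $rejected = $true
}

"accepted: $accepted"
"rejected: $rejected"
```

With a `string` parameter, each cmdlet has to validate and map the value itself, which is where the current inconsistencies come from.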

### Examples
---
Creating a file without a BOM on a Linux system (the default):
```powershell
PS> "ĝoo" > file.txt
PS> get-content file.txt
ĝoo
PS> exit
james@jimtru-ops2:~$ cat file.txt
ĝoo
```

Additional details:
```powershell
PS /home/james> "©opyright" > c.txt
PS /home/james> format-hex c.txt
           Path: /home/james/c.txt
           00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000   C2 A9 6F 70 79 72 69 67 68 74 0A                 ©opyright.

PS /home/james> /bin/cat c.txt
©opyright
PS /home/james> get-content c.txt
©opyright
PS /home/james> bash
james@jimtru-ops2:~$ date >> c.txt
james@jimtru-ops2:~$ cat c.txt
©opyright
Thu Feb 16 15:02:58 PST 2017
james@jimtru-ops2:~$ exit
exit
PS /home/james> get-content -Encoding utf8 c.txt
©opyright
Thu Feb 16 15:02:58 PST 2017
```

Creating a file with a BOM on a Linux system (this explicitly writes the BOM and makes the file problematic for native Linux tools):
```powershell
PS> $PSDefaultFileEncoding = "UTF8"
PS> "ĝoo" > file.txt
PS> get-content file.txt
ĝoo
PS> exit
james@jimtru-ops2:~$ cat file.txt
▒▒oo
```

This mimics our current behavior; the garbled output is due to the BOM being written into the file.
This file _would_ be suitable for use on a Windows system.

Creating a file without a BOM on Windows:
```powershell
PS> "ĝoo" | out-file -encoding UTF8NoBOM file.txt
```

### Commentary

`UTF8NoBOM` and `Legacy` are, of course, not actual encodings, but neither are a number of the other values of `FileSystemCmdletProviderEncoding`.
However, they are somewhat descriptive of our behavior.

### Alternate Approaches
The setting need not be a PowerShell variable; it could be an environment variable or part of the configuration proposed by [PowerShell-StartupConfig](https://github.com/PowerShell/PowerShell-RFC/blob/master/1-Draft/RFC0015-PowerShell-StartupConfig.md).
However, a variable is the simplest approach, and these alternatives can be implemented at a later time.
