# NAME

Sys::Binmode - A fix for Perl’s system call character encoding

<div>
    <a href='https://coveralls.io/github/FGasper/p5-Sys-Binmode?branch=master'><img src='https://coveralls.io/repos/github/FGasper/p5-Sys-Binmode/badge.svg?branch=master' alt='Coverage Status' /></a>
</div>

# SYNOPSIS

    use Sys::Binmode;

    my $foo = "\xff";
    $foo .= "\x{100}";
    chop $foo;

    # Prints a single octet (0xFF) and a newline:
    print $foo, $/;

    # In Perl 5.32 this may print the same single octet, or it may
    # print UTF-8-encoded U+00FF. With Sys::Binmode, though, it always
    # gives the single octet, just like print:
    exec 'echo', $foo;

# DESCRIPTION

tl;dr: Use this module in **all** new code.

# BACKGROUND

Ideally, a Perl application doesn’t need to know how the interpreter stores a given string internally. Perl can thus store any Unicode code point while still optimizing for size and speed when storing “bytes-compatible” strings, i.e., strings whose code points all lie below 256. Perl’s “optimized” string storage format is faster and less memory-hungry, but it can only store code points 0-255. The “unoptimized” format, on the other hand, can store any Unicode code point.

Of course, Perl doesn’t _always_ optimize “bytes-compatible” strings; Perl can also, if it wants, store such strings “unoptimized” (i.e., in Perl’s internal “loose UTF-8” format), too. For code points 0-127 (ASCII printables, controls, and DEL) there’s actually no difference between the two forms, but for 128-255 the formats differ. (cf. ["The "Unicode Bug"" in perlunicode](https://metacpan.org/pod/perlunicode#The-Unicode-Bug))

This means that anything that reads Perl’s internals **MUST** differentiate between the two forms in order to use the string correctly. Alas, that differentiation doesn’t always happen. When it doesn’t, Perl outputs code points 128-255 differently depending on whether the containing string is “optimized” or not.

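
The two storage forms described above can be observed from Perl itself. The following is a minimal sketch, not part of Sys::Binmode: it uses the core `utf8::upgrade()` and `utf8::is_utf8()` functions purely to inspect storage, and the variable names are illustrative.

```perl
use strict;
use warnings;

# Two strings holding the same single code point, U+00FF.
my $downgraded = "\xff";
my $upgraded   = "\xff";
utf8::upgrade($upgraded);    # switch this copy to the "unoptimized" storage

# To Perl code the two strings are indistinguishable ...
my $same_content = ( $downgraded eq $upgraded ) ? 1 : 0;
my $same_length  = ( length($downgraded) == length($upgraded) ) ? 1 : 0;

# ... but their internal storage differs. utf8::is_utf8() reports the
# internal storage flag; it is used here for illustration only.
my $down_flag = utf8::is_utf8($downgraded) ? 1 : 0;
my $up_flag   = utf8::is_utf8($upgraded)   ? 1 : 0;

printf "equal=%d same_length=%d downgraded_flag=%d upgraded_flag=%d\n",
    $same_content, $same_length, $down_flag, $up_flag;
# prints: equal=1 same_length=1 downgraded_flag=0 upgraded_flag=1
```

Code that hands the internal buffer to the OS without checking that flag would send one byte for the first string and two for the second, even though the application sees them as equal.
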
Remember, though: Perl applications _should not care_ about Perl’s string storage internals like optimized/unoptimized. (This is why, for example, the [bytes](https://metacpan.org/pod/bytes) pragma is discouraged.) The catch, though, is that without that knowledge, **the application can’t know what it actually says to the outside world!**

Thus, applications must either monitor Perl’s string-storage internals or accept unpredictable behavior, both of which are categorically bad.

(Perl’s documentation calls the “unoptimized” format “upgraded”, while it calls the “optimized” format “downgraded”. The rest of this document will favor Perl’s terms.)

# HOW THIS MODULE (PARTLY) FIXES THE PROBLEM

This module provides predictable behavior for Perl’s built-in functions by downgrading all strings before giving them to the operating system. It’s equivalent to (but faster than!) prefixing your system calls with `utf8::downgrade()` (cf. [utf8](https://metacpan.org/pod/utf8)) on all arguments.

Predictable behavior is **always** a good thing; ergo, you should use this module in **all** new code.

# CAVEAT: CHARACTER ENCODING

If you apply this module injudiciously to existing code you may see exceptions or character corruption where previously things worked fine. This can happen if you’ve neglected to encode one or more strings before sending them to the OS.

Without Sys::Binmode, Perl sends upgraded strings to the OS in UTF-8 encoding. In essence, it’s an implicit UTF-8 auto-encode, which is kind of nice, except that it depends on Perl’s internals, which are unpredictable.

Sys::Binmode removes that implicit UTF-8 auto-encode, which of course will break things that need it. The fix is to apply an explicit UTF-8 encode prior to the system call that throws the error. This is what we should do _anyway_; Sys::Binmode just enforces that better.

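
As a rough illustration of why an explicit encode matters, the sketch below checks whether a string can be downgraded before and after encoding. It does not load Sys::Binmode; `os_bytes` is a hypothetical helper name invented for this example, and `utf8::downgrade()` is called with its fail-OK flag so that it returns false rather than dying.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Sys::Binmode): turn a character string
# into the UTF-8 bytes that we want the OS to receive.
sub os_bytes {
    my ($chars) = @_;
    return Encode::encode( 'UTF-8', $chars );
}

# 6 characters: "caf", U+00E9, a space, and U+263A (above 255).
my $name = "caf\x{e9} \x{263a}";

# A string containing a code point above 255 cannot be downgraded.
# With a true second argument utf8::downgrade() returns false instead of dying.
my $copy          = $name;
my $can_downgrade = utf8::downgrade( $copy, 1 ) ? 1 : 0;

# After explicit encoding every element is a byte, so downgrading succeeds.
my $bytes                 = os_bytes($name);
my $encoded_can_downgrade = utf8::downgrade( $bytes, 1 ) ? 1 : 0;

printf "chars=%d bytes=%d raw_ok=%d encoded_ok=%d\n",
    length($name), length($bytes), $can_downgrade, $encoded_can_downgrade;
# prints: chars=6 bytes=9 raw_ok=0 encoded_ok=1
```

A string like `$name` is the kind of argument that worked by accident under the implicit auto-encode; passing `os_bytes($name)` instead gives the OS the same bytes regardless of how Perl stores the string.
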
## Example: The [utf8](https://metacpan.org/pod/utf8) Pragma

The widely-used [utf8](https://metacpan.org/pod/utf8) pragma particularly exemplifies this problem. If you have code like this:

    use utf8;
    mkdir "épée";

… then adding this module will change your program’s behavior in ways you’ll probably dislike.

Consider the string `épée`. Without the `utf8` pragma (but assuming that the code _is_ actually written in UTF-8) this is 6 characters because the two `é`s are 2 bytes each (so 2 + 1 + 2 + 1), and without the `utf8` pragma each byte in a string constant becomes its own character, even if multiple bytes make up a single UTF-8 character. Since nothing _probably_ upgrades that string on its way to `mkdir()`, the OS will receive the intended 6 bytes and create a directory with a UTF-8-encoded name.

_With_ `utf8`, though, `épée` is **4** characters, not 6, because this string is now UTF-8-decoded. Those 4 characters all lie beneath 256, so the string is still bytes-compatible. Thus, if you `print()` that string you’ll get 4 bytes of Latin-1, which probably **isn’t** what you want.

`mkdir()`, though, _probably_ still creates a directory with a 6-byte (UTF-8) name. This happens when Perl itself stores `épée` in upgraded (i.e., “unoptimized”) form. If that’s the case, that means Perl’s _internal_ buffer of `épée` is still the 6 bytes of UTF-8, even though to the Perl _application_ it’s a 4-character string. Perl’s `mkdir()` doesn’t care about characters, though; it just gives Perl’s internal buffer to the OS’s create-directory function.

So by violating its own abstraction, Perl happens to achieve something that is _sometimes_ useful. There are still two problems, though:

1. Inconsistency: `print()` sends 4 bytes to the OS while `mkdir()` (again, _probably_) outputs 6.
2. Uncertainty: `épée` _could_ be stored downgraded rather than upgraded, which would cause `mkdir()` to send 4 bytes instead.

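
The character and byte counts discussed above can be checked with a short sketch. It writes the string with escapes so that it does not depend on the source file’s encoding, simulates the effect of `use utf8` with `Encode::decode()`, and uses the discouraged [bytes](https://metacpan.org/pod/bytes) pragma only to peek at the internal buffers. All variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

# The 6 UTF-8 bytes of the example string, written with escapes.
my $raw = "\xc3\xa9p\xc3\xa9e";

# Without "use utf8", a UTF-8 source literal is seen as these 6 characters.
my $without_pragma = $raw;

# With "use utf8", the literal is decoded to 4 characters.
my $with_pragma = Encode::decode( 'UTF-8', $raw );

# Force each storage form of the same 4-character string.
my $downgraded = $with_pragma;
utf8::downgrade($downgraded);
my $upgraded = $with_pragma;
utf8::upgrade($upgraded);

# Byte count of each internal buffer. The "bytes" pragma is discouraged in
# applications; it appears here only to inspect Perl's internals.
my ( $down_buf, $up_buf );
{
    use bytes;
    $down_buf = length($downgraded);
    $up_buf   = length($upgraded);
}

printf "chars_without=%d chars_with=%d downgraded_buffer=%d upgraded_buffer=%d\n",
    length($without_pragma), length($with_pragma), $down_buf, $up_buf;
# prints: chars_without=6 chars_with=4 downgraded_buffer=4 upgraded_buffer=6
```

The last two numbers are the 4-versus-6 discrepancy: a built-in that passes the internal buffer straight to the OS sends whichever of the two happens to exist.
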
`print()`’s outputting of 4 bytes here is actually the **correct** behavior because it doesn’t depend on whether Perl stores the string upgraded or downgraded. Sys::Binmode extends that correct behavior to `mkdir()` and other such Perl commands.

Of course, in the end, we want `mkdir()` to receive 6 bytes of UTF-8, not 4 bytes of Latin-1. To achieve that, just do as you normally do with `print()`: encode your string before you give it to the OS.

    use utf8;
    use Encode;
    mkdir encode("UTF-8", "épée");

This is what your code should look like, regardless of Sys::Binmode; the omitted encoding step was a bug that Perl’s own abstraction-violation bug _might_ have obscured for you. Sys::Binmode fixes Perl’s bug, which makes you fix your own bug, too.

## Non-POSIX Operating Systems (e.g., Windows)

In a POSIX operating system, an application’s communication with the OS happens entirely through byte strings. Thus, treating all OS-destined strings as byte strings is good and natural.

In Windows, though, things are weirder. For example, Windows exposes multiple APIs for creating a directory, and the one Perl uses (as of 5.32, anyway) only accepts code points 0-255. In this context Sys::Binmode doesn’t _break_ anything, but it does reinforce one of Perl’s unfortunate limitations on Windows.

Sys::Binmode is a good idea anywhere that Perl sends byte strings to the OS. For now, as far as I know, that’s everywhere that Perl runs. If that’s not true, please file a bug.

# WHERE ELSE THIS PROBLEM CAN APPEAR

The unpredictable-behavior problem that this module fixes in core Perl is also common in [CPAN](http://cpan.org)’s XS modules due to rampant use of [the SvPV macro](https://perldoc.perl.org/perlapi#SvPV) and variants. SvPV is basically Perl’s [bytes](https://metacpan.org/pod/bytes) pragma in C: it gives you the string’s internal bytes with no regard for what those bytes represent.
This, of course, is problematic for the same reason why the [bytes](https://metacpan.org/pod/bytes) pragma is. XS authors _generally_ should prefer [SvPVbyte](https://perldoc.perl.org/perlapi#SvPVbyte) or [SvPVutf8](https://perldoc.perl.org/perlapi#SvPVutf8) in lieu of SvPV unless the C code in question handles Perl’s encoding abstraction.

Note in particular that, as of Perl 5.32, the default XS typemap converts scalars to C `char *` and `const char *` via an SvPV variant. This means that any module that uses that conversion logic also has this problem. So XS authors should also avoid the default typemap for such conversions. (Again, though, use of the default typemap in this context is regrettably commonplace.)

Before Perl 5.18 this problem also affected %ENV. 5.18 introduced an auto-downgrade when setting %ENV similar to what this module does.

# LEXICAL SCOPING

If, for some reason, you _want_ Perl’s unpredictable default behavior, you can disable this module for a given block via `no Sys::Binmode`, thus:

    use Sys::Binmode;

    system 'echo', $foo;        # predictable/sane/happy

    {
        # You should probably explain here why you’re doing this.
        no Sys::Binmode;

        system 'echo', $foo;    # nasal demons
    }

# AFFECTED BUILT-INS

- `exec`, `system`, and `readpipe`
- `do` and `require`
- File tests (e.g., `-e`) and the following: `chdir`, `chmod`, `chown`, `chroot`, `ioctl`, `link`, `lstat`, `mkdir`, `open`, `opendir`, `readlink`, `rename`, `rmdir`, `stat`, `symlink`, `sysopen`, `truncate`, `unlink`, `utime`
- `bind`, `connect`, `setsockopt`, and `send` (last argument)
- `syscall`

## Omissions

- `crypt` already does as Sys::Binmode would make it do.
- `select` (the 4-argument one) has the bug that Sys::Binmode fixes, but since it’s a performance-sensitive call where upgraded strings are unlikely, this library doesn’t wrap it.

# KNOWN ISSUES

[autodie](https://metacpan.org/pod/autodie) creates functions named, e.g., `chmod` in the namespace of the module that `import()`s it.
Those functions lack the compiler “hint” that tells Sys::Binmode to do its work; thus, [autodie “clobbers” Sys::Binmode](https://github.com/pjf/autodie/issues/113).

`CORE::*` functions will still have Sys::Binmode, but of course they won’t throw exceptions.

# TODO

- `dbmopen` and the System V IPC functions aren’t covered here. If you’d like them, ask.
- There’s room for optimization, if that’s gainful.
- Ideally this behavior should be in Perl’s core distribution.
- Even more ideally, Perl should adopt this behavior as _default_. Maybe someday!

# ACKNOWLEDGEMENTS

Thanks to Leon Timmermans (LEONT) and Paul Evans (PEVANS) for some debugging and design help.

# LICENSE & COPYRIGHT

Copyright 2021 Gasper Software Consulting. All rights reserved.

This library is licensed under the same license as Perl.