# NAME

Sys::Binmode - A fix for Perl’s system call character encoding

<div>
    <a href='https://coveralls.io/github/FGasper/p5-Sys-Binmode?branch=master'><img src='https://coveralls.io/repos/github/FGasper/p5-Sys-Binmode/badge.svg?branch=master' alt='Coverage Status' /></a>
</div>

# SYNOPSIS

    use Sys::Binmode;

    my $foo = "\xff";
    $foo .= "\x{100}";
    chop $foo;

    # Prints a single octet (0xFF) and a newline:
    print $foo, $/;

    # In Perl 5.32 this may print the same single octet, or it may
    # print UTF-8-encoded U+00FF. With Sys::Binmode, though, it always
    # gives the single octet, just like print:
    exec 'echo', $foo;

# DESCRIPTION

tl;dr: Use this module in **all** new code.

# BACKGROUND

Ideally, a Perl application doesn’t need to know how the interpreter stores
a given string internally. Perl can thus store any Unicode code point while
still optimizing for size and speed when storing “bytes-compatible”
strings, i.e., strings whose code points all lie below 256. Perl’s
“optimized” string storage format is faster and less memory-hungry, but it
can only store code points 0-255. The “unoptimized” format, on the other
hand, can store any Unicode code point.

Of course, Perl doesn’t _always_ optimize “bytes-compatible” strings;
Perl can also, if it wants, store such strings “unoptimized” (i.e., in
Perl’s internal “loose UTF-8” format). For code points 0-127 (ASCII
printables, controls, and DEL) there’s actually no
difference between the two forms, but for 128-255 the formats differ (cf.
["The "Unicode Bug"" in perlunicode](https://metacpan.org/pod/perlunicode#The-Unicode-Bug)). This means that anything that reads
Perl’s internals **MUST** differentiate between the two forms in order to
use the string correctly.

Alas, that differentiation doesn’t always happen. When it doesn’t, Perl
outputs code points 128-255 differently depending on whether the
containing string is “optimized” or not.
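To make the two storage forms concrete, here is a small sketch that uses only core Perl. The variable names are illustrative, and `bytes::length()` appears purely as a diagnostic for peeking at the internal buffer; application code shouldn’t rely on it.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without enabling the pragma

# Two copies of the same one-character string (code point 0xFF).
my $downgraded = "\xff";
my $upgraded   = "\xff";
utf8::upgrade($upgraded);    # force the "unoptimized" internal format

# To Perl code the two are indistinguishable:
my $same_string = ( $downgraded eq $upgraded ) ? 1 : 0;
my $same_length = ( length($downgraded) == length($upgraded) ) ? 1 : 0;

# Internally, though, the buffers differ in size:
my $downgraded_bytes = bytes::length($downgraded);    # 1
my $upgraded_bytes   = bytes::length($upgraded);      # 2

printf "eq: %d, internal bytes: %d vs. %d\n",
    $same_string, $downgraded_bytes, $upgraded_bytes;
```

Anything that hands the internal buffer to the outside world without checking which form it is in will emit one byte in the first case and two in the second.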

Remember, though: Perl applications _should not care_ about
Perl’s string storage internals like optimized/unoptimized. (This is why,
for example, the [bytes](https://metacpan.org/pod/bytes)
pragma is discouraged.) The catch, though, is that without that knowledge,
**the application can’t know what it actually says to the outside world!**

Thus, applications must either monitor Perl’s string-storage internals
or accept unpredictable behavior, both of which are categorically bad.

(Perl’s documentation calls the “unoptimized” format “upgraded”, while
it calls the “optimized” format “downgraded”. The rest of this document
will favor Perl’s terms.)

# HOW THIS MODULE (PARTLY) FIXES THE PROBLEM

This module provides predictable behavior for Perl’s built-in functions by
downgrading all strings before giving them to the operating system. It’s
equivalent to (but faster than!) prefixing your system calls with
`utf8::downgrade()` (cf. [utf8](https://metacpan.org/pod/utf8)) on all arguments.
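As a rough illustration of that manual equivalent, the sketch below downgrades a list of arguments by hand. `downgrade_all()` is a hypothetical helper written for this example; it is not part of Sys::Binmode, and it says nothing about how the module is implemented internally.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Sys::Binmode): returns downgraded
# copies of its arguments. utf8::downgrade() dies if a string contains
# a code point above 255.
sub downgrade_all {
    my @copies = @_;
    utf8::downgrade($_) for @copies;
    return @copies;
}

# Build "\xff" in a way that leaves it stored upgraded.
my $foo = "\xff";
$foo .= "\x{100}";
chop $foo;

my $was_upgraded = utf8::is_utf8($foo) ? 1 : 0;

# The copies are stored downgraded; $foo itself is left alone.
my @argv = downgrade_all( 'echo', $foo );
my $still_upgraded = utf8::is_utf8( $argv[1] ) ? 1 : 0;
```

Passing `@argv` to `system` or `exec` then hands the OS the single octet 0xFF regardless of how `$foo` happened to be stored.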

Predictable behavior is **always** a good thing; ergo, you should
use this module in **all** new code.

# CAVEAT: CHARACTER ENCODING

If you apply this module injudiciously to existing code you may see
exceptions or character corruption where previously things worked fine.

This can
happen if you’ve neglected to encode one or more strings before
sending them to the OS. Without Sys::Binmode, Perl sends upgraded
strings to the OS in UTF-8 encoding. In essence, it’s an implicit
UTF-8 auto-encode, which is kind of nice, except that it depends on
Perl’s internals, which are unpredictable. Sys::Binmode removes
that implicit UTF-8 auto-encode, which of course will break things
that need it.

The fix is to apply an explicit UTF-8 encode prior to the system call
that throws the error. This is what we should do _anyway_;
Sys::Binmode just enforces that better.
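Here is a minimal sketch of that fix using core Perl only. It assumes the troublesome string holds a code point above 255 (the snowman is just an example), and it uses `utf8::downgrade()` with its fail-OK argument as a stand-in check for whether a string is a plain byte string.

```perl
use strict;
use warnings;
use Encode ();

# A string with a code point above 255 cannot be downgraded.
my $name = "snowman-\x{2603}";

# With a true second argument, utf8::downgrade() returns false on
# failure instead of dying. Work on a copy so $name is untouched.
my $probe        = $name;
my $downgradable = utf8::downgrade( $probe, 1 ) ? 1 : 0;

# The explicit encode yields a plain byte string ...
my $encoded = Encode::encode( 'UTF-8', $name );

# ... which always downgrades cleanly.
my $encoded_probe        = $encoded;
my $encoded_downgradable = utf8::downgrade( $encoded_probe, 1 ) ? 1 : 0;
my $encoded_length       = length $encoded;
```

The encoded string is the one to pass to the OS; the decoded `$name` stays around for anything that works with characters.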

## Example: The [utf8](https://metacpan.org/pod/utf8) Pragma

The widely-used [utf8](https://metacpan.org/pod/utf8) pragma particularly exemplifies this problem.

If you have code like this:

    use utf8;

    mkdir "épée";

… then adding this module will change your program’s behavior in ways you’ll
probably dislike.

Consider the string `épée`. Without the `utf8` pragma (but assuming that
the code _is_ actually written in UTF-8) this is 6
characters because the two `é`s are 2 bytes each (so 2 + 1 + 2 + 1),
and without the `utf8` pragma each byte in a string constant becomes its own
character, even if multiple bytes make up a single UTF-8 character. Since
_probably_ nothing upgrades that string on its way to
`mkdir()`, the OS will receive the intended 6 bytes and create a directory
with a UTF-8-encoded name.

_With_ `utf8`, though, `épée` is **4** characters, not 6, because
this string is now UTF-8-decoded. Those 4 characters all lie beneath 256,
so the string is still bytes-compatible. Thus, if you `print()` that string
you’ll get 4 bytes of Latin-1, which probably **isn’t** what you want.
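The 6-versus-4 arithmetic can be checked with a short sketch. To keep it independent of the source file’s own encoding, it spells out the bytes with escapes and uses `Encode::decode()` as a stand-in for what the `utf8` pragma does to a literal; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();
use List::Util qw(max);

# The 6 bytes that "épée" occupies in a UTF-8 source file. Without
# `use utf8`, a string literal holds these as 6 one-byte characters.
my $source_bytes     = "\xc3\xa9p\xc3\xa9e";
my $undecoded_length = length $source_bytes;    # 6

# Decoding, as `use utf8` effectively does, yields 4 characters.
my $decoded        = Encode::decode( 'UTF-8', $source_bytes );
my $decoded_length = length $decoded;           # 4

# All 4 code points lie below 256, so the string is bytes-compatible.
my $max_code_point = max map { ord($_) } split //, $decoded;
```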

`mkdir()`, though, _probably_ still creates a directory with a 6-byte (UTF-8)
name. This happens when Perl itself stores `épée` in upgraded (i.e.,
“unoptimized”) form. If that’s the case, that means Perl’s _internal_ buffer
of `épée` is still the 6 bytes of UTF-8, even though to the Perl
_application_ it’s a 4-character string. Perl’s `mkdir()` doesn’t care
about characters, though; it just gives Perl’s internal buffer to the
OS’s create-directory function. So by violating its own abstraction, Perl
happens to achieve something that is _sometimes_ useful.

There are still two problems, though:

1. Inconsistency: `print()` sends 4 bytes to the OS while
`mkdir()` (again, _probably_) outputs 6.
2. Uncertainty: `épée` _could_ be stored downgraded rather than
upgraded, which would cause `mkdir()` to send 4 bytes instead.

`print()`’s outputting of 4 bytes here is actually the **correct** behavior
because it doesn’t depend on whether Perl stores the string upgraded or
downgraded. Sys::Binmode extends that correct behavior to `mkdir()` and
other such Perl commands.

Of course, in the end, we want `mkdir()` to receive 6 bytes of UTF-8, not
4 bytes of Latin-1. To achieve that, just do as you normally do with
`print()`: encode your string before you give it to the OS.

    use utf8;
    use Encode;

    mkdir encode("UTF-8", "épée");

This is what your code should look like, regardless of Sys::Binmode;
the omitted encoding step was a bug that Perl’s own abstraction-violation
bug _might_ have obscured for you. Sys::Binmode fixes Perl’s bug,
which makes you fix your own bug, too.
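One way to see why the explicit encode is the dependable route: its output doesn’t depend on how Perl stored the input. The sketch below forces both storage forms by hand, which real code wouldn’t normally do, and again writes `épée` with escapes so that the example doesn’t depend on the file’s encoding.

```perl
use strict;
use warnings;
use Encode ();

# "épée" as `use utf8` would give it to us: 4 characters.
my $name = "\xe9p\xe9e";

# The same string in each internal storage form.
my ( $downgraded, $upgraded ) = ( $name, $name );
utf8::downgrade($downgraded);
utf8::upgrade($upgraded);

# The explicit encode gives identical 6-byte output either way.
my $from_downgraded = Encode::encode( 'UTF-8', $downgraded );
my $from_upgraded   = Encode::encode( 'UTF-8', $upgraded );

my $identical   = ( $from_downgraded eq $from_upgraded ) ? 1 : 0;
my $byte_length = length $from_upgraded;    # 6
```

Handing either result to `mkdir()` under Sys::Binmode sends those same 6 bytes to the OS.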

## Non-POSIX Operating Systems (e.g., Windows)

In a POSIX operating system, an application’s communication with the
OS happens entirely through byte strings. Thus, treating all
OS-destined strings as byte strings is good and natural.

In Windows, though, things are weirder. For example, Windows
exposes multiple APIs for creating a directory, and the one Perl uses (as of
5.32, anyway) only accepts code points 0-255. In this context Sys::Binmode
doesn’t _break_ anything, but it does reinforce one of Perl’s unfortunate
limitations on Windows.

Sys::Binmode is a good idea anywhere that Perl sends byte strings to the OS.
For now, as far as I know, that’s everywhere that Perl runs. If that’s not
true, please file a bug.

# WHERE ELSE THIS PROBLEM CAN APPEAR

The unpredictable-behavior problem that this module fixes in core Perl is
also common in [CPAN](http://cpan.org)’s XS modules due to rampant
use of [the SvPV macro](https://perldoc.perl.org/perlapi#SvPV) and
variants. SvPV is basically Perl’s [bytes](https://metacpan.org/pod/bytes) pragma in C: it gives
you the string’s
internal bytes with no regard for what those bytes represent. This, of course,
is problematic for the same reason that the [bytes](https://metacpan.org/pod/bytes) pragma is. XS authors
_generally_ should prefer
[SvPVbyte](https://perldoc.perl.org/perlapi#SvPVbyte)
or [SvPVutf8](https://perldoc.perl.org/perlapi#SvPVutf8) in lieu of
SvPV unless the C code in question handles Perl’s encoding abstraction.

Note in particular that, as of Perl 5.32, the default XS typemap converts
scalars to C `char *` and `const char *` via an SvPV variant. This means
that any module that uses that conversion logic also has this problem.
So XS authors should also avoid the default typemap for such conversions.
(Again, though, use of the default typemap in this context is regrettably
commonplace.)

Before Perl 5.18 this problem also affected %ENV. 5.18 introduced
an auto-downgrade when setting %ENV similar to what this module does.

# LEXICAL SCOPING

If, for some reason, you _want_ Perl’s unpredictable default behavior,
you can disable this module for a given block via
`no Sys::Binmode`, thus:

    use Sys::Binmode;

    system 'echo', $foo;        # predictable/sane/happy

    {

        # You should probably explain here why you’re doing this.
        no Sys::Binmode;

        system 'echo', $foo;    # nasal demons
    }

# AFFECTED BUILT-INS

- `exec`, `system`, and `readpipe`
- `do` and `require`
- File tests (e.g., `-e`) and the following:
`chdir`, `chmod`, `chown`, `chroot`, `ioctl`,
`link`, `lstat`, `mkdir`, `open`, `opendir`, `readlink`, `rename`,
`rmdir`, `stat`, `symlink`, `sysopen`, `truncate`,
`unlink`, `utime`
- `bind`, `connect`, `setsockopt`, and `send` (last argument)
- `syscall`

## Omissions

- `crypt` already does as Sys::Binmode would make it do.
- `select` (the 4-argument one) has the bug that Sys::Binmode fixes,
but since it’s a performance-sensitive call where upgraded strings are
unlikely, this library doesn’t wrap it.

# KNOWN ISSUES

[autodie](https://metacpan.org/pod/autodie) creates functions named, e.g., `chmod` in the
namespace of the module that `import()`s it. Those functions lack
the compiler “hint” that tells Sys::Binmode to do its work; thus,
[autodie “clobbers” Sys::Binmode](https://github.com/pjf/autodie/issues/113).
`CORE::*` functions will still have Sys::Binmode, but of course they won’t
throw exceptions.

# TODO

- `dbmopen` and the System V IPC functions aren’t covered here.
If you’d like them, ask.
- There’s room for optimization, if that’s gainful.
- Ideally this behavior should be in Perl’s core distribution.
- Even more ideally, Perl should adopt this behavior as _default_.
Maybe someday!

# ACKNOWLEDGEMENTS

Thanks to Leon Timmermans (LEONT) and Paul Evans (PEVANS) for some
debugging and design help.

# LICENSE & COPYRIGHT

Copyright 2021 Gasper Software Consulting. All rights reserved.

This library is licensed under the same license as Perl.