rfc822 — RFC 822 parsing library
#include <rfc822.h>
#include <rfc2047.h>

g++ ... -lrfc822
The rfc822 library provides C++ classes for parsing E-mail headers in the RFC 822 format. This library also includes some functions to help with encoding and decoding 8-bit text, as defined by RFC 2047.
The format used by E-mail headers to encode sender and recipient
information is defined by
RFC 822
(and its successor,
RFC 2822).
The format allows the actual E-mail
address and the sender/recipient name to be expressed together, for example:
John Smith <jsmith@example.com>
The main purposes of the rfc822 library are to:
Parse a text string containing a list of RFC 822-formatted addresses into its logical components: names and E-mail addresses.
Access those individual components.
Allow some limited modifications of the parsed structure, and then convert it back into a text string.
std::string_view header;
rfc822::tokens tokens{header};
for (rfc822::token &t:tokens)
;
rfc822::tokens is a container of tokenized
parts of E-mail addresses. It is constructed from a
std::string_view that contains E-mail
addresses.
The underlying text string must not be destroyed as long as
the rfc822::tokens object is in scope.
struct rfc822::token {
    int type;             // RFC 822 atom
    std::string_view str; // underlying text
};
The type field contains one of the RFC 822
atoms, such as “@” or “;”. The
str field contains the atom's text. It references
a substring of the original string that was passed to the
rfc822::tokens constructor.
str references a substring only for
'\0', '"', and '('
atoms. In all other cases, str is an empty
string. Possible values of type:
'\0'
This is a simple atom: a sequence of non-special characters that is delimited by whitespace or special characters (see below).
'"'
This is a quoted string.
'('
This is an old style comment. A deprecated form of E-mail
addressing uses, for example,
"john@example.com (John Smith)" instead of
"John Smith <john@example.com>".
This old-style notation defined
parenthesized content as arbitrary comments.
The rfc822::token with
type set to '(' is created for the
entire comment, including the parentheses.
The remaining possible values of type
include all the characters in RFC 822 headers that have special
significance.
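The token model described above can be illustrated with a small, self-contained sketch. This is not the library's tokenizer: sketch_token, sketch_tokenize, and the set of special characters are assumptions made for illustration, backslash escapes and nested comments are ignored, and keeping the quotes in a quoted string's text is a choice of this sketch only. It shows how each token carries a type and, for atoms, quoted strings, and comments, a view into the original text.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the token layout described above; this is
// not the library's own definition.
struct sketch_token {
    int type;              // '\0' atom, '"' quoted string, '(' comment, or a special character
    std::string_view str;  // text of atoms, quoted strings and comments; empty otherwise
};

// Illustrative subset of special characters (an assumption of this sketch).
constexpr std::string_view sketch_specials = "<>@,;:";
constexpr std::string_view sketch_whitespace = " \t\r\n";

// Simplified tokenizer. The returned views point into `text`, so the
// underlying string must outlive the returned vector.
std::vector<sketch_token> sketch_tokenize(std::string_view text)
{
    constexpr auto npos = std::string_view::npos;
    std::vector<sketch_token> out;
    std::size_t i = 0;

    while (i < text.size()) {
        const char c = text[i];

        if (sketch_whitespace.find(c) != npos) {   // whitespace only separates tokens
            ++i;
            continue;
        }
        if (c == '"' || c == '(') {
            const char closing = (c == '"') ? '"' : ')';
            std::size_t end = text.find(closing, i + 1);
            // An unterminated string or comment runs to the end of the input.
            end = (end == npos) ? text.size() : end + 1;
            // The view covers the delimiters as well as the content.
            out.push_back({c, text.substr(i, end - i)});
            i = end;
            continue;
        }
        if (sketch_specials.find(c) != npos) {     // special character: empty str
            out.push_back({c, std::string_view{}});
            ++i;
            continue;
        }
        // Simple atom: runs until whitespace, a special, a quote or a comment.
        const std::size_t start = i;
        while (i < text.size() &&
               sketch_whitespace.find(text[i]) == npos &&
               sketch_specials.find(text[i]) == npos &&
               text[i] != '"' && text[i] != '(')
            ++i;
        out.push_back({'\0', text.substr(start, i - start)});
    }
    return out;
}
```

Because the tokens hold views rather than copies, the lifetime rule stated earlier for rfc822::tokens applies to this sketch in the same way.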
rfc822::addresses addresses{tokens};
for (rfc822::address &a:addresses)
;
rfc822::addresses is a container of E-mail
addresses that were parsed from a rfc822::tokens
object.
struct rfc822::address {
    rfc822::tokens name;    // Name portion of an address
    rfc822::tokens address; // E-mail address
};
The rfc822::address class has two fields:
name and
address.
name contains the name portion of an address,
which is a sequence of tokens.
address contains the E-mail address itself,
which is also a sequence of tokens.
For example, the following is a valid E-mail header:
To: recipient-list: tom@example.com, john@example.com;
Typically, all of this, except for "To:",
gets parsed by creating a rfc822::tokens object,
then a rfc822::addresses object.
The "recipient-list:" and the trailing semicolon are a
legacy mailing list specification that is no longer in widespread use, but
must still be accounted for.
The resulting rfc822::addresses object will have four
rfc822::address structures: one for
"recipient-list:";
one for each address; and one for the trailing semicolon.
If address in a
rfc822::address
is an empty container, then this structure represents some non-address
portion of the original header, such as
"recipient-list:" or a
semicolon. Otherwise it contains a tokenized representation of the E-mail
address.
name either contains the tokenized form of a
non-address portion of the original header, or a tokenized form of the
recipient's name.
name will be an empty container if the
recipient name was not provided.
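The roles of name and address can be sketched without the library. In the following illustration sketch_address, sketch_split_address, and sketch_mailboxes are hypothetical names, and plain strings stand in for the token containers; the real parser works on tokens and handles quoting, comments, and groups, which this sketch does not. It shows the two rules described above: an empty address marks a non-address portion of the header, and an empty name means no recipient name was given.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical, simplified stand-in for rfc822::address, using plain
// strings in place of token containers.
struct sketch_address {
    std::string name;     // recipient name, or a non-address portion of the header
    std::string address;  // E-mail address; empty for non-address portions
};

// Split a single "Name <address>" or bare "address" string.
sketch_address sketch_split_address(std::string_view text)
{
    auto trim = [](std::string_view s) {
        const auto first = s.find_first_not_of(" \t");
        if (first == std::string_view::npos)
            return std::string_view{};
        const auto last = s.find_last_not_of(" \t");
        return s.substr(first, last - first + 1);
    };

    const auto open = text.find('<');
    const auto close = text.rfind('>');

    // No angle brackets: the whole text is the address and there is no name.
    if (open == std::string_view::npos || close == std::string_view::npos || close < open)
        return {std::string{}, std::string(trim(text))};

    return {std::string(trim(text.substr(0, open))),
            std::string(trim(text.substr(open + 1, close - open - 1)))};
}

// Collect only real mailboxes, skipping records whose address is empty
// (such as a "recipient-list:" prefix or a trailing semicolon).
std::vector<std::string> sketch_mailboxes(const std::vector<sketch_address> &parsed)
{
    std::vector<std::string> out;
    for (const auto &a : parsed)
        if (!a.address.empty())
            out.push_back(a.address);
    return out;
}
```

Applied to four records modelled on the "recipient-list:" header above, sketch_mailboxes() would keep only the two entries that carry an address.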
For example, for the address
Tom Jones <tjones@example.com>, the
address field contains the tokenized form of
"tjones@example.com",
and name contains the tokenized form of
"Tom Jones".
const auto &[string, error] = rfc2047::encode(U"header", "utf-8", rfc2047::qp_allow_any);
The rfc2047::encode() function template and the
rfc2047::decode() function object provide
additional logic to encode or decode 8-bit content
in 7-bit RFC 822 headers, as specified in RFC 2047.
rfc2047::encode()'s first parameter is a
std::string in the character set specified
by the second parameter. The third parameter is a function that
returns true if the character should be encoded. The following functions
are predefined:
rfc2047::qp_allow_any
All characters are allowed to be unencoded, except a small number of characters that have special meaning in RFC 2047: control characters, eight-bit characters, and several characters that would break the tokenization of the header.
rfc2047::qp_allow_comments
Parentheses and quotes are also allowed to be unencoded.
rfc2047::qp_allow_word
Allow only characters used in base64 encoded MIME entities, and a few other characters.
Instead of a single string of text, an overloaded
rfc2047::encode() function template accepts a
beginning and an ending iterator for a sequence of characters to be
encoded.
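To make the role of the third parameter concrete, here is a minimal sketch of RFC 2047 "Q" encoding driven by a predicate that returns true for bytes that must be encoded. It is not the library's implementation: sketch_q_encode and sketch_must_encode are hypothetical names, the predicate is only loosely modelled on the predefined ones, and the sketch does not split long text into several encoded words as RFC 2047 requires for words longer than 75 characters.

```cpp
#include <functional>
#include <string>
#include <string_view>

// Hypothetical predicate: returns true if the byte must be encoded.
// Control characters, eight-bit characters and the characters that are
// significant inside an encoded word are encoded.
bool sketch_must_encode(unsigned char c)
{
    return c < 0x20 || c >= 0x7F || c == '=' || c == '?' || c == '_';
}

// Encode `text`, whose bytes are already in `charset`, as one RFC 2047
// "Q" encoded word: =?charset?Q?encoded-text?=
// Simplification: the 75-character limit on encoded words is ignored.
std::string sketch_q_encode(std::string_view text, std::string_view charset,
                            const std::function<bool(unsigned char)> &must_encode)
{
    static constexpr char hex[] = "0123456789ABCDEF";

    std::string out = "=?";
    out += charset;
    out += "?Q?";

    for (unsigned char c : text) {
        if (c == ' ') {
            out += '_';                 // "Q" encoding represents a space as '_'
        } else if (must_encode(c)) {
            out += '=';                 // any other encoded byte becomes =XX
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        } else {
            out += static_cast<char>(c);
        }
    }
    out += "?=";
    return out;
}
```

Passing a stricter or looser predicate changes only which bytes appear as =XX sequences; the surrounding encoded-word syntax stays the same.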
rfc2047::decode() parses a string in RFC 2047
format. It is a somewhat complicated template that implements a
callback-based parser. Consult the inline comments for a more
detailed explanation of how to use it.
rfc2047::decode_unicode() does the same but it
decodes to a Unicode string, and ignores the character set and
language of the encoded word (the character set affects the conversion
to a Unicode character stream, and the language is immaterial).
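For orientation, the structure of a single RFC 2047 encoded word can be sketched as follows. This is a deliberately simplified illustration and not the callback-based rfc2047::decode() parser: sketch_decoded and sketch_q_decode are hypothetical names, only the "Q" encoding is handled, an optional language suffix is left as part of the character set name, and no character set conversion is performed.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type: the declared character set and the raw
// decoded bytes, still in that character set.
struct sketch_decoded {
    std::string charset;
    std::string bytes;
};

// Decode one "=?charset?Q?text?=" word. Returns std::nullopt if the
// word is malformed or uses an encoding other than "Q".
std::optional<sketch_decoded> sketch_q_decode(std::string_view word)
{
    constexpr auto npos = std::string_view::npos;

    if (word.size() < 6 || word.substr(0, 2) != "=?" ||
        word.substr(word.size() - 2) != "?=")
        return std::nullopt;
    word = word.substr(2, word.size() - 4);     // strip "=?" and "?="

    const auto q1 = word.find('?');             // end of the character set
    if (q1 == npos)
        return std::nullopt;
    const auto q2 = word.find('?', q1 + 1);     // end of the one-letter encoding
    if (q2 == npos || q2 != q1 + 2)
        return std::nullopt;
    if (word[q1 + 1] != 'Q' && word[q1 + 1] != 'q')
        return std::nullopt;

    auto hexval = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    };

    sketch_decoded result{std::string(word.substr(0, q1)), std::string{}};
    const std::string_view text = word.substr(q2 + 1);

    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '_') {
            result.bytes += ' ';                // '_' stands for a space
        } else if (text[i] == '=') {
            if (i + 2 >= text.size())
                return std::nullopt;            // truncated =XX sequence
            const int hi = hexval(text[i + 1]);
            const int lo = hexval(text[i + 2]);
            if (hi < 0 || lo < 0)
                return std::nullopt;
            result.bytes += static_cast<char>(hi * 16 + lo);
            i += 2;
        } else {
            result.bytes += text[i];
        }
    }
    return result;
}
```

A Unicode-producing decoder would additionally convert the returned bytes from the declared character set, which is the step where the character set matters and the language does not.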
std::u32string ustr;
rfc822::tokens name, address;

address.unicode_address(std::back_inserter(ustr));
name.unicode_name(std::back_inserter(ustr), false);

std::string str;

address.display_address(unicode_default_chset(), std::back_inserter(str));
name.display_name(unicode_default_chset(), std::back_inserter(str), false);

display_header_unicode("To:", "nobody@example.com", std::back_inserter(ustr), [] { } );
display_header("To", "nobody@example.com", unicode_default_chset(), std::back_inserter(str), [] { } );

std::vector<std::u32string> ulines;
rfc2047::wrap_header_unicode("Subject", "Hello world", 80, std::back_inserter(ulines) );

std::vector<std::string> lines;
rfc2047::wrap_header("Subject:", "Hello world", 80, unicode_default_chset(), std::back_inserter(lines) );

rfc822::address &address;
address.encode(unicode_default_chset(), std::back_inserter(str));
The “rfc2047” namespace contains several functions that handle encoding and decoding of 8-bit content in 7-bit RFC 822 headers. These functions implement the specification in RFC 2047 and related standards. Each function writes its output to an output iterator that is passed as one of the parameters. If the output iterator is passed by value, the function returns the value of the output iterator after it has been advanced for each character written to it. If the output iterator is passed by reference, the function returns void, and the output iterator itself is modified.
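The two output iterator conventions can be illustrated in isolation. The sketch below uses two separately named hypothetical function templates, sketch_copy_upper and sketch_copy_upper_ref, because how the library itself distinguishes the two cases is not shown in this text. The uppercase conversion is only a placeholder for "write some characters to the iterator".

```cpp
#include <iterator>
#include <string>
#include <string_view>

// By-value convention: the iterator is copied in, and the advanced
// iterator is returned so that the caller can keep writing.
template<typename OutIter>
OutIter sketch_copy_upper(std::string_view text, OutIter out)
{
    for (char c : text)
        *out++ = (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : c;
    return out;
}

// By-reference convention: the caller's iterator is advanced in place
// and nothing is returned.
template<typename OutIter>
void sketch_copy_upper_ref(std::string_view text, OutIter &out)
{
    out = sketch_copy_upper(text, out);
}
```

With the by-value form a caller chains calls by feeding each returned iterator into the next call; with the by-reference form the same iterator variable is simply reused.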
The following functions are available:
rfc822::tokens::unicode_address,
rfc822::tokens::unicode_name
Methods of the rfc822::tokens class,
used to convert the parsed contents of an RFC 822 address or
name into a sequence of Unicode characters.
unicode_name() uses RFC 2047 to
decode any RFC 2047-encoded words.
unicode_address() uses IDN encoding
to convert any IDN-encoded domain names into Unicode.
rfc822::tokens::display_address,
rfc822::tokens::display_name
Convert a sequence of rfc822::token objects
containing either an IDN-encoded address or an RFC-2047 encoded
name into a sequence of characters in the specified character set.
display_name()'s third parameter is a flag.
A true value strips off quotes or parentheses
from the display name.
display_header_unicode
This function takes the name of a header and its contents, and
converts the contents into a sequence of Unicode characters.
The passed in header name determines how the header gets parsed.
Headers containing addresses are handled by parsing them as
addresses,
then converting the result into a sequence of Unicode characters
using unicode_name() and
unicode_address(). Other headers are parsed
as unstructured headers, using RFC 2047 to decode any RFC
2047-encoded words.
The fourth parameter is an optional callback that gets invoked at every line-breaking opportunity. The callback gets invoked after writing, to the output iterator, the sequence of characters that end in a line-breaking opportunity, and before writing the first character after the potential line-break.
rfc2047::display_header
Take an arbitrary header, and convert it into a sequence of
characters in the specified character set. This is, basically,
the same as display_header_unicode()
but the returned string is in the specified character set.
rfc2047::wrap_header_unicode
This uses display_header_unicode, but then
it wraps the resulting sequence of characters into a sequence of
lines, using the passed in maximum line width. The passed in
output iterator iterates over a sequence of Unicode strings.
rfc2047::wrap_header
This uses wrap_header_unicode, but then
it converts the resulting sequence of Unicode characters into a
sequence of 8-bit strings, using the passed in character set.
The passed in output iterator iterates over a sequence of
8-bit strings.
rfc2047::encode
Encode a sequence of 8-bit strings into a sequence of RFC 2047-encoded words. The passed in output iterator iterates over a sequence of RFC 2047-encoded words.
rfc822::address::encode
This method takes the name and address portion of a
rfc822::address, that was encoded in the
given character set, and encodes them using RFC 2047 and IDN,
as appropriate. The output is written to the passed in
output iterator.
const auto &[str, flags]=rfc822::coresubj("Re: your message");
const auto &[str, flags]=rfc822::coresubj_nouc("Re: your message");
const auto &[str, flags]=rfc822::coresubj_keepblobs("Re: your message");
These functions take the contents of the subject header, and return
the "core" subject header that's used in the specification of the IMAP
THREAD function. These functions are designed to strip all subject line
artifacts that might've been added in the process of forwarding or
replying to a message.
These functions return a tuple of a string and an int
flag value.
Currently, rfc822::coresubj() performs the
following transformations:
Leading and trailing whitespace is removed. Consecutive whitespace characters are collapsed into a single whitespace character.
Reply and forwarding artifacts, such as Re: and (fwd) markers and [blob] markers (and several others), are removed from the subject line.
rfc822::coresubj
This is the original version of this function. It is preserved for binary compatibility with existing programs.
rfc822::coresubj_nouc
The returned string does not get converted to uppercase.
rfc822::coresubj_keepblobs
This is like rfc822::coresubj_nouc(), except
that it does not remove [blob] markers from the returned
subject line.
Note that these functions do NOT do MIME decoding. In order to
implement IMAP THREAD, it is necessary to call something like
rfc2047::decode() before
calling rfc822::coresubj().
The returned flag value is a bitmask:
CORESUBJ_RE
This indicates that the original subject line starts with
Re: .
CORESUBJ_FWD
This indicates that the original subject line contained a
(fwd) marker.
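The general shape of these functions, a cleaned-up string paired with a bitmask, can be sketched as follows. This is not the library's algorithm: sketch_coresubj is a hypothetical name, SKETCH_RE and SKETCH_FWD are made-up values standing in for CORESUBJ_RE and CORESUBJ_FWD (whose actual values are not given here), and only a leading Re: and a trailing (fwd) are recognized. Like rfc822::coresubj_nouc(), the sketch does not convert the result to uppercase.

```cpp
#include <cctype>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical flag values for this sketch only.
constexpr int SKETCH_RE = 1;
constexpr int SKETCH_FWD = 2;

std::pair<std::string, int> sketch_coresubj(std::string_view subject)
{
    // Collapse runs of whitespace into one space and trim both ends.
    std::string s;
    for (unsigned char c : subject) {
        if (std::isspace(c)) {
            if (!s.empty() && s.back() != ' ')
                s += ' ';
        } else {
            s += static_cast<char>(c);
        }
    }
    if (!s.empty() && s.back() == ' ')
        s.pop_back();

    constexpr std::string_view fwd = "(fwd)";
    int flags = 0;
    bool changed = true;

    // Repeat until no more artifacts can be stripped, so that
    // "Re: Re: text" is reduced to "text".
    while (changed) {
        changed = false;

        // Leading "Re:", compared case-insensitively.
        if (s.size() >= 3 &&
            std::tolower(static_cast<unsigned char>(s[0])) == 'r' &&
            std::tolower(static_cast<unsigned char>(s[1])) == 'e' &&
            s[2] == ':') {
            flags |= SKETCH_RE;
            s.erase(0, 3);
            if (!s.empty() && s.front() == ' ')
                s.erase(0, 1);
            changed = true;
        }

        // Trailing "(fwd)".
        if (s.size() >= fwd.size() &&
            std::string_view(s).substr(s.size() - fwd.size()) == fwd) {
            flags |= SKETCH_FWD;
            s.erase(s.size() - fwd.size());
            if (!s.empty() && s.back() == ' ')
                s.pop_back();
            changed = true;
        }
    }
    return {s, flags};
}
```

A caller would test the returned flags with a bitwise AND, in the same way as the flag value returned by the library functions.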