Name

rfc2045 — RFC 2045 (MIME) parsing library

Synopsis

#include <rfc822.h>
#include <rfc2045.h>

g++ ... -lrfc2045 -lrfc822 -lcourier-unicode

Description

The rfc2045 library parses MIME-formatted messages. It is used to:

1) Parse the structure of a MIME-formatted message.

2) Examine the contents of each MIME section.

3) Optionally rewrite and reformat the message.

Creating an rfc2045 structure

#include <rfc2045.h>

rfc2045::entity entity;

std::istreambuf_iterator<char> b{input_stream}, e;

rfc2045::entity::line_iter<false> parser{b, e};

entity.parse(parser);

The rfc2045::entity object represents a MIME object, or entity. It is created from a message that is defined by a beginning iterator and an ending iterator for the message's contents. The iterators are used to create an rfc2045::entity::line_iter template instance, whose template parameter specifies whether the message uses LF (false) or CRLF (true) line endings. If the iterators are passed to the constructor by reference, they must exist until the message is fully parsed; after parse() returns, a successfully parsed message leaves the beginning iterator equal to the ending iterator. If the iterators are passed to the constructor by value, they must be copyable, and they are copied into the parser object.
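The by-reference contract described above can be illustrated without the library. The following is a minimal sketch using only the standard library; line_counter and consumed_everything are hypothetical names that stand in for a parser holding references to the caller's iterators, and they are not part of rfc2045.

```cpp
#include <cstddef>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical stand-in for a parser that keeps references to the caller's
// iterators. It is NOT part of the rfc2045 library.
template<typename Iter>
struct line_counter {
    Iter &b;   // caller's beginning iterator, advanced as input is consumed
    Iter &e;   // caller's ending iterator

    // Consume the whole sequence and return the number of LF characters seen.
    std::size_t run()
    {
        std::size_t n = 0;

        for (; b != e; ++b)
            if (*b == '\n')
                ++n;
        return n;
    }
};

// Count the lines in `text`. Returns true if the caller-owned beginning
// iterator ended up equal to the ending iterator.
inline bool consumed_everything(const std::string &text, std::size_t &lines)
{
    std::istringstream input{text};
    std::istreambuf_iterator<char> b{input}, e;

    // b and e must outlive `counter`, because it only stores references.
    line_counter<std::istreambuf_iterator<char>> counter{b, e};

    lines = counter.run();
    return b == e;
}
```

Because the stand-in stores references, the caller can inspect its own beginning iterator afterwards to check that the whole input was consumed.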

rfc2045::entity_parser<false> parser;

parser.parse(std::istreambuf_iterator<char>{input_stream},
             std::istreambuf_iterator<char>{});

rfc2045::entity e=parser.parsed_entity();

rfc2045::entity_parser is an alternative, push-based approach for creating an entity object. Its parse() method gets invoked repeatedly with a pair of beginning and ending iterator values that incrementally define the contents of the MIME message. Afterwards, parsed_entity() returns the parsed MIME entity object, at which point the entity parser object is no longer usable and can only be destroyed.

Note

rfc2045::entity_parser uses a separate execution thread for creating the new MIME entity object. Each call to parse() copies the entire sequence of characters from the beginning/ending iterator pair into an internal buffer that the background execution thread digests. Use reasonably sized sequences, so that while the main execution thread assembles the next chunk, the background execution thread processes the previous one.
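One way to produce reasonably sized sequences is to read the input in fixed-size blocks. The following is a minimal sketch using only the standard library; feed_in_chunks and its push callback are hypothetical, with push standing in for a call such as parser.parse(begin, end). The block size is left to the caller, since this page does not state a recommended value.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read `input` in blocks of at most `chunk_size` bytes and pass each block to
// push(begin, end). `push` is a hypothetical stand-in for a push-style parse
// call. Returns the number of blocks that were delivered.
template<typename Push>
std::size_t feed_in_chunks(std::istream &input, std::size_t chunk_size,
                           Push &&push)
{
    std::vector<char> buffer(chunk_size);
    std::size_t chunks = 0;

    for (;;)
    {
        input.read(buffer.data(),
                   static_cast<std::streamsize>(buffer.size()));

        // gcount() reports how many bytes the last read() actually produced.
        const auto n = static_cast<std::size_t>(input.gcount());

        if (n == 0)
            break;      // end of input, or a zero-sized buffer

        push(buffer.data(), buffer.data() + n);
        ++chunks;
    }
    return chunks;
}
```

The final block is usually shorter than chunk_size, which is why the callback receives an explicit end pointer rather than assuming a full buffer.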

Note

Although the iterator pair can be anything that meets the definition of a beginning and an ending iterator, several rfc2045::entity methods require the std::streambuf from which the MIME entity was constructed. In the case of a pair of std::istreambuf_iterators, the original input stream's rdbuf() meets this requirement.

Default MIME character set

std::string charset=rfc2045::default_charset;

rfc2045::default_charset gives the default MIME character set, initially set to utf-8. The library uses default_charset when the MIME message does not specify a character set. There's rarely a need to change that.

Structure of a MIME message

rfc2045::entity entity;

for (auto &subentity:entity.subentities)
{
    // ...
    rfc2045::entity *ptr=subentity.get_parent_entity();
}

rfc2045::entity::errors_t code=entity.errors.code;

code=entity.all_errors();

std::vector<std::string> messages=entity.errors.describe();

The rfc2045::entity object has many members, only some of which are publicly documented. An entity may have sub-entities. get_parent_entity() returns a sub-entity's parent entity (or a nullptr in the case of a top-level MIME entity).

Errors that occurred while parsing the MIME entity are collected into an error code, which is a bitmask. all_errors() returns a combined bitmask from the MIME entity and all of its subentities.

See the rfc2045.h header file for a complete list of parsing errors.
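The idea of combining per-entity bitmasks over the whole tree can be sketched as follows. This is an illustration only: the node type and the err_* flag names are hypothetical, and the real flag values are the ones defined in rfc2045.h.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical error flags, for illustration only. The real values are
// defined in rfc2045.h.
enum : std::uint32_t {
    err_bad_header   = 1u << 0,
    err_bad_boundary = 1u << 1,
    err_too_deep     = 1u << 2,
};

// A minimal stand-in for a MIME entity: its own error bits and its children.
struct node {
    std::uint32_t errors = 0;
    std::vector<node> children;
};

// OR together the error bits of `n` and of every entity below it.
inline std::uint32_t all_errors(const node &n)
{
    std::uint32_t mask = n.errors;

    for (const auto &child : n.children)
        mask |= all_errors(child);
    return mask;
}
```

An individual condition is then tested with a bitwise AND against the combined mask, for example (all_errors(top) & err_bad_boundary) != 0.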

Basic MIME information

size_t startpos=entity.startpos,
       startbody=entity.startbody,
       endbody=entity.endbody,

       nlines=entity.nlines,
       nbodylines=entity.nbodylines;

rfc2231::header content_type=entity.content_type;

std::string mime_type=content_type.value;

for (auto &[paramname, paramvalue]:content_type.parameters)
{
    size_t index=paramvalue.index;
    std::string value=paramvalue.value,
        charset=paramvalue.charset,
        language=paramvalue.language;

    value=paramvalue.value_in_charset();
    value=paramvalue.value_in_charset("iso-8859-1");
}

std::string_view charset=entity.content_type_charset();

rfc2045::cte content_transfer_encoding=entity.content_transfer_encoding;

The following rfc2045::entity class members define the position of each MIME entity in its character sequence:

startpos

This is the starting position of the MIME entity's headers. The top level MIME entity's starting position is always 0.

startbody

This is the starting position of the body portion of the MIME entity.

endbody

This is one-past-the-end position of the body portion of the MIME entity.

nlines and nbodylines give the number of lines in the whole MIME entity (headers and body) and in just the MIME entity's body portion, respectively.
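Since endbody is a one-past-the-end position, the three offsets can be used directly as half-open ranges. The sketch below is a standalone illustration: the positions struct and the locate, header_of and body_of helpers are hypothetical, and it assumes a single-part, LF-delimited message whose body starts right after the first empty line.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical holder for the three offsets described above.
struct positions {
    std::size_t startpos;
    std::size_t startbody;
    std::size_t endbody;
};

// Locate the headers and the body of a single-part, LF-delimited message.
// Assumption: the body starts right after the first empty line.
inline positions locate(std::string_view msg)
{
    positions p{0, msg.size(), msg.size()};   // default: headers only

    if (!msg.empty() && msg.front() == '\n')
        p.startbody = 1;                      // no headers at all
    else if (auto blank = msg.find("\n\n"); blank != std::string_view::npos)
        p.startbody = blank + 2;              // skip past the empty line

    return p;
}

// Headers occupy the half-open range [startpos, startbody).
inline std::string_view header_of(std::string_view msg, const positions &p)
{
    return msg.substr(p.startpos, p.startbody - p.startpos);
}

// The body occupies the half-open range [startbody, endbody).
inline std::string_view body_of(std::string_view msg, const positions &p)
{
    return msg.substr(p.startbody, p.endbody - p.startbody);
}
```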

content_type is an object that has the parsed contents of the Content-Type header. The rfc2231::header has two members:

value

The header's value (the part before the semicolon).

parameters

The header's parameters (if any).

parameters is an associative container. The key is the parameter name. The container's value has the following members.

index

This member numbers each parameter in the order of its appearance in the header. MIME parameters are parsed according to the rules in RFC 2231, so a value that was split across multiple parameters gets combined into a single parameter and value. It is unspecified which part's original index represents the combined parameter.

value

This is the value of the parameter. If the parameter's value was split into parts using RFC 2231 this is the reassembled value.

charset

This is the value character set, as specified in an RFC 2231-encoded parameter. charset defaults to utf-8 if unspecified.

language

This is the RFC 2231-encoded value's language. language is an empty string if it's unspecified or if the parameter wasn't encoded using RFC 2231.

value_in_charset()

This method returns the value converted to rfc2045::default_charset.

value_in_charset(charset)

This method returns the value converted to the specified character set.
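The reassembly of a value that RFC 2231 split into numbered parts ("name*0", "name*1", and so on) can be sketched as follows. This is a simplified standalone illustration, not the library's parser: combine_parameters is a hypothetical name, and percent-decoding, charset and language tags, and case-insensitive name matching are deliberately left out.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Combine RFC 2231 continuations ("name*0", "name*1", ...) into one value per
// parameter name. Simplification: percent-encoding and the charset/language
// prefix of extended values are not handled here.
inline std::map<std::string, std::string>
combine_parameters(const std::vector<std::pair<std::string, std::string>> &raw)
{
    // name -> (section number -> fragment). std::map keeps sections ordered.
    std::map<std::string, std::map<unsigned long, std::string>> sections;

    for (const auto &[name, value] : raw)
    {
        const auto star = name.find('*');
        unsigned long section = 0;

        // Parse the decimal section number that follows the first '*'.
        if (star != std::string::npos)
            for (std::size_t i = star + 1;
                 i < name.size() &&
                     std::isdigit(static_cast<unsigned char>(name[i]));
                 ++i)
                section = section * 10 + static_cast<unsigned long>(name[i] - '0');

        // substr(0, npos) yields the whole name when there is no '*'.
        sections[name.substr(0, star)][section] = value;
    }

    std::map<std::string, std::string> combined;

    for (const auto &[name, parts] : sections)
        for (const auto &part : parts)
            combined[name] += part.second;   // append in section order

    return combined;
}
```

The parts are joined by section number rather than by order of appearance, so "filename*1" arriving before "filename*0" still produces the correct value.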

content_type_charset() is a convenient shortcut for returning the MIME entity's charset content type parameter. content_transfer_encoding is one of the following values that reflects the MIME entity's encoding:

  • rfc2045::cte::sevenbit
  • rfc2045::cte::eightbit
  • rfc2045::cte::qp
  • rfc2045::cte::base64

Note

rfc2045::cte::eightbit is also used for the rare binary encoding. An rfc2045::cte::error value indicates an invalid encoding.

std::string content_id=entity.content_id;

std::string content_disposition=entity.content_disposition;

std::string content_description=entity.content_description;

std::string content_base=entity.content_base;

std::string content_location=entity.content_location;

std::string content_md5=entity.content_md5;

std::string content_language=entity.content_language;

These class members provide the contents of the corresponding MIME headers, if they exist. Notably, content_disposition can be used to instantiate an rfc2231::header object in order to parse this header.

Decoding a MIME section

rfc2045::mime_decoder decoder{
    []
    (const char *bytes, size_t bytecnt)
    {
    },
    *input_stream.rdbuf(),
    "utf-8"
};

rfc2045::mime_unicode_decoder decoder{
    []
    (const char32_t *bytes, size_t bytecnt)
    {
    },
    *input_stream.rdbuf()
};

decoder.decode_header=true;
decoder.decode_body=true;
decoder.add_eol=false;
decoder.header_name_lc=true;
decoder.header_name_suppress=false;
decoder.decode_subentities=true;
decoder.headerfilter=[](std::string_view name, std::string_view content)
          -> bool
    {
        return true;
    };
decoder.headerdone=[](std::string_view name)
    {
    };

// ...

decoder.decode<false>(entity);

rfc2045::mime_decoder and rfc2045::mime_unicode_decoder extract the contents of the headers and/or the body portion of a MIME entity. Extraction involves:

  • Decoding RFC 2047-encoded headers, and decoding IDN-encoded domain names. Unfolding headers that are folded across multiple lines.

  • Decoding the MIME entity's body transfer encoding, and optionally converting the decoded content to a specific character set, if the optional third parameter to rfc2045::mime_decoder's constructor is given. If it is omitted, no character set mapping takes place.

The first constructor parameter is a callable object, the output sink. The output sink gets repeatedly invoked from decode() with the contents of the MIME entity's header and/or body, in the original or mapped character set (rfc2045::mime_decoder), or as Unicode characters (rfc2045::mime_unicode_decoder). decode's template parameter must match rfc2045::entity::line_iter template's parameter that was used to create the MIME entity object.
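The output sink pattern, a callable that receives decoded content piece by piece, can be illustrated with a standalone quoted-printable decoder. This is a minimal sketch that uses only the standard library; decode_qp and hex_value are hypothetical helpers, not the library's decoder, and the sketch assumes LF-delimited input and delivers one line per sink call.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Value of a hexadecimal digit, or -1 if `c` is not one.
inline int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Decode LF-delimited quoted-printable text and hand the result to `sink`,
// one decoded line at a time, as sink(const char *, size_t).
template<typename Sink>
void decode_qp(std::string_view in, Sink &&sink)
{
    std::string out;

    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] == '=' && i + 1 < in.size() && in[i + 1] == '\n')
        {
            ++i;        // soft line break: drop "=\n"
            continue;
        }

        if (in[i] == '=' && i + 2 < in.size() &&
            hex_value(in[i + 1]) >= 0 && hex_value(in[i + 2]) >= 0)
        {
            // "=XX" encodes one byte as two hexadecimal digits.
            out.push_back(static_cast<char>(
                hex_value(in[i + 1]) * 16 + hex_value(in[i + 2])));
            i += 2;
        }
        else
            out.push_back(in[i]);   // literal character

        if (out.back() == '\n')
        {
            sink(out.data(), out.size());
            out.clear();
        }
    }

    if (!out.empty())
        sink(out.data(), out.size());   // final line without a trailing LF
}
```

As with the sinks described above, the callable must be prepared to be invoked several times and to accumulate or forward each piece itself.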

Note

rfc2045::mime_decoder and rfc2045::mime_unicode_decoder are actually templates. Their template parameters are deduced from their constructors' parameters.

The following class members are available to be set prior to calling decode():

decode_header (default: true)

Whether the MIME entity's headers should be decoded.

decode_body (default: true)

Whether the MIME entity's body should be decoded.

decode_subentities (default: true)

Whether to decode, recursively, the MIME entity's subentities.

add_eol (default: false)

Whether to include an extra newline after decoding each MIME entity.

header_name_lc (default: true)

Whether to convert the name of each decoded header to lowercase.

header_name_suppress (default: false)

Whether to include only the contents of each header, omitting the header's name (possibly converted to lowercase) and the colon that separates the header's name from its contents.

headerfilter (default: [](std::string_view, std::string_view){ return true; })

This is a callable object that's called before extracting each header. Returning true includes the header in the decoded content given to the output sink. Together with header_name_suppress, this provides a targeted means of extracting the decoded contents of specific headers only.

headerdone (default: [](std::string_view) {})

This is called after extracting each header.
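How header_name_lc, header_name_suppress and headerfilter interact can be sketched without the library. The header_options struct and render_headers function below are hypothetical. The sketch assumes that the filter sees the name after lowercasing and that each header is emitted as "name: content" followed by a newline; the library's actual output format may differ.

```cpp
#include <algorithm>
#include <cctype>
#include <functional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical option block mirroring three of the members described above.
struct header_options {
    bool header_name_lc = true;
    bool header_name_suppress = false;
    std::function<bool (std::string_view, std::string_view)> headerfilter =
        [](std::string_view, std::string_view) { return true; };
};

// Render already-unfolded (name, content) pairs according to `opts`.
// Assumption: the filter receives the name after optional lowercasing.
inline std::string render_headers(
    const std::vector<std::pair<std::string, std::string>> &headers,
    const header_options &opts)
{
    std::string out;

    for (const auto &[name, content] : headers)
    {
        std::string shown = name;

        if (opts.header_name_lc)
            std::transform(shown.begin(), shown.end(), shown.begin(),
                           [](unsigned char c)
                           { return static_cast<char>(std::tolower(c)); });

        if (!opts.headerfilter(shown, content))
            continue;                   // header filtered out

        if (!opts.header_name_suppress)
            out += shown + ": ";        // name and separating colon
        out += content;
        out += '\n';
    }
    return out;
}
```

Setting header_name_suppress and a filter that accepts only "subject" yields just the subject text, which is the targeted extraction described above.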

Rewriting MIME messages

After parsing an rfc2045::entity, it can be rewritten in order to convert 8-bit-encoded data to 7-bit encoding, or to convert 7-bit encoded data to full 8-bit data, if possible.

if (entity.autoconvert_check(rfc2045::convert::standardize))
{
    rfc2045::entity::autoconvert_meta metadata;

    metadata.appid="courier";

    rfc2045::entity::line_iter<false>::autoconvert(
       entity,
       []
       (const char *ptr, size_t l)
       {
          // ...
       },
       input_streambuf,
       metadata);
}

autoconvert_check returns true if the MIME entity must be rewritten in order to comply with the requested format. A false return indicates that the MIME entity already complies. The subsequent call to autoconvert() effects the rewrite (its template parameter must match the one that was used to parse the original MIME entity). autoconvert()'s parameters are: the MIME entity to rewrite, a callable object that gets repeatedly invoked with the contents of the rewritten MIME entity, and a std::streambuf-like object that corresponds to the current, parsed, MIME entity.

autoconvert() can only be called after a prior autoconvert_check(), whose parameter defines what autoconvert() does:

rfc2045::convert::standardize

Do not change the content encoding of the MIME object, only add default values for any missing Content-Type and Content-Transfer-Encoding headers.

rfc2045::convert::sevenbit

Reencode 8bit content with quoted-printable. Also reencode any 7bit content that has excessively long lines.

rfc2045::convert::eightbit

Replace quoted-printable with 8bit unless doing so will produce excessively long lines.

rfc2045::convert::eightbit_always

Always replace quoted-printable with 8bit even if doing so will produce excessively long lines.
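The kind of test that decides whether content needs reencoding for 7-bit transport can be sketched as follows. This is a standalone illustration: needs_reencoding_for_7bit is a hypothetical function, and the default limit of 998 characters (the RFC 5322 line length limit) is an assumption standing in for "excessively long". The library's actual threshold is not stated here and may differ.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// True if LF-delimited `body` cannot be labeled 7bit as is: it contains a
// byte with the high bit set, or a line longer than `max_line` characters.
// Assumption: 998 (the RFC 5322 line limit) stands in for "excessively
// long"; the library's own threshold may differ.
inline bool needs_reencoding_for_7bit(std::string_view body,
                                      std::size_t max_line = 998)
{
    std::size_t line_length = 0;

    for (char c : body)
    {
        if (static_cast<unsigned char>(c) >= 0x80)
            return true;            // 8-bit data

        if (c == '\n')
        {
            line_length = 0;        // a new line starts
            continue;
        }

        if (++line_length > max_line)
            return true;            // line is too long
    }
    return false;
}
```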