CocoonFS format specification

Copyright 2023-2025 SUSE LLC

Licensed under CC BY-SA 4.0

Introduction

CocoonFS is a special purpose filesystem format designed for securely storing small items of highly sensitive data such as, but not limited to, a software TPM’s state and UEFI variables in a confidential Trusted Execution Environment (TEE) setting.

Security properties

The most noteworthy features distinguishing CocoonFS from common existing Full Disk Encryption (FDE) solutions designed primarily for mass storage deployments are:

Regarding cryptographic algorithms, the CocoonFS format supports the full set of block cipher and hash algorithms specified by the TCG Algorithm Registry [TCGALG25], with implementations explicitly being permitted to support a subset thereof only.

All filesystem metadata is encrypted (and authenticated for that matter), even though some might be very well predictable.

Details and rationale follow below.

Random IVs

Commonly used Full Disk Encryption (FDE) techniques operating exclusively at the block layer implement certain trade-offs in order to maintain a reasonable level of efficiency on mass storage devices, but that comes at the cost of sacrificing some cryptographic security properties not so important for data at rest. For a good discussion of the encryption aspect refer to [KHATI16] and [FRUHWIRTH05], but in a nutshell it all boils down to that indistinguishability of encryptions (“IND”), c.f. [GOLDREICH09] chapter 5, cannot be achieved at full within a given fixed sector. That is, due to the fact that a sector’s encryption tweak gets derived exclusively from its location – and is deterministic in particular – it is possible to infer some relations between two plaintexts encrypted successively to the same sector: e.g. whether the two plaintexts share a common prefix (CBC-ESSIV) or some common data (XTS) at block cipher block granularity. Note that this is not much of a concern for data at rest, because an adversary cannot observe two consecutive ciphertexts by definition.

The TEE threat model assumed for CocoonFS deployments is different in that respect though, in that an eavesdropper might be capable of recording any individual storage write request issued from a TEE. Therefore a fresh, random IV is getting generated for each encryption operation. Storing those random IVs for individual blocks would be too much overhead, hence they are associated with logical filesystem entities – either some metadata structure or a file – instead. More specifically, each entity is encrypted with a random IV in Cipher Block Chaining (CBC) mode. This does imply that partial file updates are not possible, and neither are seeks for reading because of the choice of CBC mode. With the kind of files anticipated to get typically stored on a CocoonFS target, i.e. small ones, the potential overhead incurred with full file reads or writes is expected to be bearable though.

Authentication

Authentication serves two vital purposes:

For protecting application state, it is desirable that the units of authentication comprise individual files as a whole at the very least: alternative approaches like the one taken by the Linux kernel’s dm-integrity, which authenticates individual device blocks, would potentially allow an adversary to revert selected parts of a file back to the state from a previous write, i.e. to combine old with new contents.

Likewise, for IND-CCA security it is mandatory to authenticate the respective units of encryption (“messages” in common crypto terminology) as a whole each – that is, it is not sufficient to authenticate individual parts at e.g. device block granularity independently from each other. With the units of encryption corresponding to the logical filesystem entities, i.e. metadata or files, as described in the previous subsection, there are effectively two natural choices for defining the units of authentication left:

Both alternatives have some pros and cons each. The authentication of individual filesystem entities is clearly more efficient, in terms of storage space as well as computational work required (even though the inline storage of authentication tags would inevitably introduce some significant padding waste to align allocations to an integral multiple of the unit of allocation, which is 128B at least). A Merkle tree based scheme is more appealing from a security property point of view, because it yields a single authentication tag binding the state of the CocoonFS instance as a whole. This enables applications to distribute their state across multiple files while still being guaranteed a globally coherent view. Note that in particular, that would allow for moving frequently written data, like a software TPM’s current time value, into a dedicated file, thereby avoiding the need to write out the complete, mostly unchanged state upon each and every update. Furthermore, having a single root authentication digest for the whole CocoonFS image available might perhaps serve as a basis for interesting future research projects in the area of rollback protection protocols: any digest updates could get sent to and recorded at a trusted remote party (with the journal to be introduced later providing a means to reliably recover from lost ACKs due to network failures).

It’s expected that the storage backing CocoonFS deployments will typically be relatively small, i.e. that the height of the Merkle tree will remain within affordable bounds. To get a rough idea on the numbers: five levels with an assumed node size of 1kB and fanout of 16 would cover 128MB worth of data already. Moreover tree node caches can certainly help with mitigating the overhead at the read side, as they can get organized such that especially the nodes at the upper layers will have a good probability of cache residency.

With these considerations, the design choice made for CocoonFS is to accept the additional cost inherent to the Merkle tree approach in favor of achieving better security guarantees.

As a minor technical detail, note that a few filesystem metadata items still need additional inline authentication tags for preserving IND-CCA when e.g. finding the location of the authentication tree or reading the journal during bootstrap, i.e. when opening the filesystem.

Key derivation

Some of the data stored on a CocoonFS instance will have low entropy, which might perhaps enable adversaries to acquire plaintext-ciphertext pairs to conduct a cryptanalysis on. Examples would include e.g. the filesystem metadata structures, but also certain application files’ contents. In order to confine the effects of key wear-out, a unique subkey is derived from a root key for each combination of filesystem entity and cryptographic purpose. The Key Derivation Function (KDF) used for that is the KDFa() specified in [TCGTPM19A].

The initially mentioned algorithm agility support, i.e. the possibility to use any of the algorithms from the TCG Algorithm Registry [TCGALG25] with CocoonFS, introduces a potential risk of downgrade attacks: by overwriting algorithms with weak ones in the CocoonFS header, an adversary might perhaps be able to recover some subkey or even the root key. In order to thwart such attacks, the externally supplied raw root key material, assumed to have full entropy, is not taken as is, but first run through the KDFa() with a fixed hash algorithm, namely SHA-512, with the other algorithms as found in the image header as additional input.

Filesystem model

The filesystem model implemented by CocoonFS is a very limited one: there’s no directory hierarchy and “file names” are simply 32 bit integers, i.e. inode numbers.

It is expected that some inode numbers or ranges thereof get statically assigned to a specific application purpose. For example, when storing a software TPM’s state, it would be natural to reserve the ranges 0x01000000-0x01ffffff for the storage of NV indices and 0x81000000-0x81ffffff for persistent objects, c.f. [TCGTPM19B].

Note that for the anticipated CocoonFS usage scenarios, i.e. the storage of core TEE state, it will likely always be possible to make such static assignments at development time and thus, it is certainly desirable to avoid the overhead of updating directory metadata structures.

Journal

For robustness against service interruptions, e.g. power cuts, crashes and alike, CocoonFS implements a journal.

In fact, it is not so much of a journal in the traditional sense as in “a journal with multiple update records”: the CocoonFS journal’s capacity is limited to tracking a single pending transaction at a time, where that single transaction can comprise an arbitrary number of accumulated filesystem operations.

Unlike it’s the case with other journal implementations, there’s no dedicated, preallocated journal area except for a small (a single IO Block) entry structure at a fixed location, the journal log head, potentially linking to further, dynamically allocated parts describing the pending transaction.

That journal log head is supposed to get written last, after the remainder of the journal has been setup. Once in place, the filesystem update is considered effective and success may get reported back to the issuing application at this point already.

Note that the journalling functionality might provide a convenient basis for implementing a rollback protection protocol robust against network failures.

For example, upon update, the filesystem implementation could

  1. Setup the journal. At this point it is possible to transition to the new state or to revert back.
  2. Send the new authentication root digest to the remote trusted party.
  3. Wait for an ACK reply. In the common case that an ACK is received, apply the journal. If no ACK is being received, retry to reach the remote party until it answers.

For handling any power cuts encountered after the new root digest has been sent, but before a reply has been received back, make the filesystem open code to query the remote trusted party for the latest root authentication digest at filesystem opening time, depending on the answer either apply or cancel the pending journal.

Confidentiality of allocations and block trimming

Ideally it should not be possible for an adversary to infer the allocation status of any blocks at any time, because that would e.g. allow for fingerprinting the TEE’s workload.

The metadata tracking the allocations, i.e. the allocation bitmap, is encrypted, but in the assumed TEE threat model of an active adversary able to eavesdrop on IO requests, it is difficult to specify clear, well-defined security semantics with respect to the confidentiality of the overall allocations state: for example, whenever a given block is read or written, an adversary observing the IO may readily infer it’s allocated at that point in time.

For that reason, CocoonFS does not define any security guarantees regarding the confidentiality of allocations with respect to adversaries able to eavesdrop or even alter IO communication.

As a side note: this relaxation is a prerequisite for the support of a dynamically allocated journal: the blocks used temporarily for the journal can be in any state, therefore their contents cannot get authenticated reliably. More generally, unallocated block’s contents cannot get considered as input for the authentication, only the fact that they are unallocated can, but this implies that a given block’s authentication tag returns to a previous value when deallocated, something which would otherwise have been highly unlikely.

CocoonFS does however provide optional confidentiality of allocations in the “data at rest model”. If enabled, certain additional measures with some computational overhead need to get taken – most notably a reencryption of data copies in the journal in order to prevent the identification of journal blocks by matching them to duplicates.

Implementations may issue trim commands for deallocated blocks to the underlying storage device. Note however that trimming is mutually exclusive with the confidentiality of allocations, because a trimmed block is usually recognizable as such.

Fundamental structures and encodings

The core CocoonFS metadata structures are:

Inodes 0 to 5 (inclusive) are reserved for CocoonFS internal use. The authentication tree, the allocation bitmap and the inode index root have entries in the inode index and are assigned inode numbers 1, 2 and 3 respectively. The journal log has inode number 5 associated with it, but there’s no explicit entry for it in the inode index – the number is used only for key derivation subject purposes.

For completeness in this context: inode number 0 is reserved for a special “no inode” value, inode number 4 is currently not allocated and reserved. Note that the minimum inode index B+-tree node fill-level is such that inodes 1 to 4 will always be found in the leftmost leaf, which is referred to as the inode index entry leaf node. The location of the inode index entry leaf node is referenced from the mutable image header and enables discovering all the other metadata structures at filesystem opening time.

An extent is a non-empty, physically contiguous range on storage. Any inode, except for the special inode index root node inode, stores its data in one or more extents.

The term block is used to denote an extent which is a power of two in length and whose length is implicit from the context. Block boundaries on storage are not necessarily aligned to the block size in the general case, any alignment requirements on extents or blocks are stated explicitly where they apply.

The unit of allocations is an Allocation Block, a power of two multiple of 128B, specified as a configuration parameter in the static image header. The allocation bitmap has one bit entry associated with each possible Allocation Block in the filesystem image: if set, the corresponding Allocation Block is allocated, otherwise it’s free.

Unless otherwise noted, all locations on storage are specified in units of Allocation Blocks. The maximum supported filesystem image size is \(2^{64} - 1\) rounded down to the Allocation Block size.

An IO Block is a power of two multiple of the Allocation Block size, specified as a filesystem parameter in the static image header and defining an upper bound on the supported backing device’s IO granularity. More specifically, it is assumed that a write to an aligned IO Block doesn’t affect the contents of any other IO Block. Implementations must reject to open a filesystem if the backing storage device supports only larger IO sizes than the IO Block size recorded in the static image header.

The filesystem image size must always be a multiple of the IO Block size. Note that this effectively limits the maximum supported image size to \(2^{64} - 1\) rounded down to the IO Block size.

An Authentication Tree Data Block is a power of two multiple of the Allocation Block size, specified as a filesystem parameter in the static image header and defining the unit of authentication, i.e the fan-out factor from the authentication tree leafs’ entries.

All of the authentication tree’s extents’ boundaries must be aligned to the larger of the IO Block and the Authentication Tree Data Block size.

All of the allocation bitmap’s extents’ boundaries must be aligned to the Authentication Tree Data Block size, as is required for bootstrapping the authentication at filesystem opening time: in general an Authentication Tree Data Block’s individual Allocation Blocks’ respective allocation status must be known each for computing its authentication digest, but with the aligned allocation bitmap file extents all overlapping Authentication Tree Data Blocks’ Allocation Blocks are known a priori to be allocated.

Storage location encodings

Encoded extent pointer

An encoded extent pointer is a packed 64 bit encoding format for specifying the location and type of an extent of up to 64 Allocation Blocks.

The type of an extent referenced from an encoded extent pointer is either “direct” or “indirect”. A direct extent contains payload data, an indirect one (the beginning of) an encoded extent list specifying the extents containing the actual payload data. Indirect extents are used if the payload data either exceeds 64 Allocation Blocks or spans multiple extents and can be referenced only from entries in the inode index.

An encoded extent pointer is formed as follows:

The value of all-zeros denotes a special “NIL” value – as the static image header is located at the filesystem image’s beginning, no extent can ever start at position 0.

Note that with a maximum supported filesystem image size of \(2^{64} - 1\) rounded down to the Allocation Block size, and a minimum Allocation Block size of 128B, any Allocation Block index always has its upper 7 bits clear, so the shift in the first step wouldn’t shift non-zero bits out.

Encoded block pointer

An encoded block pointer is a packed encoding format for specifying the location a block, i.e. some extent whose length is a fixed power of two implicit from the context.

An encoded block pointer is formed by shifting the block’s beginning on storage in units of Allocation Blocks represented as a 64 bit integer to the left by 7 bits. The lower 7 bits are reserved for future use.

The value of all-zeros denotes a special “NIL” value – as the static image header is located at the filesystem image’s beginning, no block can ever start at position 0.

Encoded extents list

An encoded extents list specifies the location of one or more extents.

The extents lists is encoded as a sequence of (beginning, length) pairs, one for each extent.

An extent’s beginning is specified in terms of the difference relative to the previous extent’s end, if any, or zero for the first entry, in units of Allocation Blocks and represented in two’s complement modulo \(2^{64}\), encoded in signed LEB128 format. An extent’s length is specified in units of Allocation Blocks and encoded in unsigned LEB128 format. It must not be zero.

The encoded list is terminated by two zero bytes.

Authentication contexts

Any data digested for authentication gets extended by an authentication context for the purpose of the digesting operation. The authentication context uniquely encodes the type and format of the authenticated data. Its exact format depends on the authenticated subject, but it always ends with one of the authentication subject identifiers defined in the table below. An authentication context’s subject identifier uniquely identifies the format and semantics of its remainder, which in turn binds the format and semantics of the authenticated data.

Name Value Authentication subject description
AUTH_SUBJECT_ID_IMAGE_CONTEXT 1 Collection of filesystem configuration parameters describing the layout as well as the locations of certain entities needed for bootstrapping at filesystem opening time. The image context digest gets in turn digested into the authentication tree root digest.
AUTH_SUBJECT_ID_AUTH_TREE_ROOT_NODE 2 The authentication tree root node.
AUTH_SUBJECT_ID_AUTH_TREE_DESCENDANT_NODE 3 Non-root authentication tree node.
AUTH_SUBJECT_ID_AUTH_TREE_DATA_BLOCK 4 Authentication tree data block.
AUTH_SUBJECT_ID_ENCRYPTION_ENTITY_CHAINED_EXTENTS 5 Extent in a sequence of encrypted chained extents.
AUTH_SUBJECT_ID_INODE_INDEX_NODE 6 A node in the inode index tree. Used only for authenticating the contents of the inode index entry leaf node.
AUTH_SUBJECT_ID_JOURNAL_LOG_FIELD 7 A field in the journal log.

Encryption entity formats

There are three different encryption entity formats defined: * one for encrypted blocks, where it is assumed that the length of the encrypted payload is implicit from the context, * one for a sequence of encrypted extents, where the locations of all extents the sequence are determined by some means external to the encrypted entity and * one for a linked list of extents, where only the head extent’s locations is determined by some external means while the tail is to be found by traversing the linked list.

For example, the block encryption format is used for nodes in the inode index B+-tree, the encrypted extents for inode data, and chained encrypted extents for storing inode extents lists as well as the journal log.

Encrypted block

An encrypted block’s payload length is assumed to be known a priori from the context.

The format is stored as follows:

  1. The IV for the CBC block cipher chaining mode.
  2. Randomized padding to align the remainder of the block’s length to a multiple of the block cipher block length. Note that the padding is empty for all possibly supported block cipher algorithms currently defined in the TCG Algorithm Registry.
  3. An integral multiple of block cipher blocks storing the result of encrypting the fixed-length payload in CBC block cipher chaining mode.
  4. An integral multiple of remainder block cipher blocks extending up to the encryption entity block’s end and filled with random data.

As the payload length is assumed to be fixed and implicit from the context, it is unspecified how to pad the encrypted payload to align with the block cipher block length. It is also unspecified how the remainder of the encrypted entity block is filled with random data. Implementations might e.g. simply continue the CBC encryption on a sequence of zeros, or they make invoke a cryptographic random number generator (CSPRNG).

Encrypted extents

The locations of all extents in the sequence is assumed to be determined by some means external to the encrypted entity, the encrypted payload may be of any length, zero included.

The IV is stored at the beginning of the first extent. Randomized padding is inserted in any extent to align the remainder of the extents’ lengths to a multiple of the block cipher block size each – after the IV for the first extent, at the extents’ beginnings for any subsequent extent. Note that the padding is empty for all possibly supported block cipher algorithms currently defined in the TCG Algorithm Registry. The remainders of all extents are concatenated to collectively form the ciphertext. The ciphertext is the result of encrypting the payload, amended by a PKCS#7 padding and an integral multiple of zero-filled block cipher blocks as appropriate, in CBC block cipher chaining mode.

Encrypted chained extents

Only the location of the first extent in a sequence of chained extents is assumed to be determined by some means external to the encrypted entity: the chained extents form a singly linked list with the pointers to the respective next extent being part of the encrypted plaintexts each.

Encrypted chained extents come in two flavors: with and without inline authentication. The variant with inline authentication is used for implementing IND-CCA security for certain filesystem structures at filesystem opening time before the full Merkle tree based authentication has been bootstrapped.

The first extent may store some optional plaintext header at its beginning, like e.g. a magic for the journal log head, followed by the mandatory IV. For the inline authenticated variant an authentication tag is stored in each extent in the chain: after the IV for the first extent, at the extents’ beginnings for any subsequent continuation extent. Randomized padding is then inserted at the current position in each extent in order to align its remainder to an integral multiple of the block cipher block size.

The (aligned) remainder of each extent constitutes its ciphertext. The ciphertext is the result of encrypting the plaintext associated with an extent in CBC mode, with the IV being the output IV from the previous extent’s CBC encryption, if any, or the IV stored inline to the first extent otherwise. The extents’ plaintexts are formed by prepending an encoded extent pointer with its “indirect” bit clear and linking to the next extent in the chain, if any, or set to NIL if not, to a chunk of payload data from the encrypted entity. Chunks of entity payload data are consumed greedily, that is, each but the last extent’s capacity is exhausted in full. A PKCS#7 padding is appended to the last extent’s plaintext, followed by an integral multiple of zero-filled block cipher blocks as appropriate.

For the inline-authenticated variant, the extents’ authentication tags are computed as HMACs over the following data:

  1. The extent’s stored contents, with the authentication tag replaced by all-zeros for the first extent, or by the previous extent’s authentication tag for any subsequent continuation extent.
  2. An additional authentication context constructed as follows:
    1. For a continuation extent only: the IV used for the extent’s CBC encryption, i.e. the IV output from the previous extent’s encryption.
    2. Possibly empty authenticated associated data common to all extents.
    3. A single byte set to 0 for the first extent, or to 1 for any subsequent continuation extent.
    4. An authentication context format version identifier byte of constant 0.
    5. An authentication context subject identifier byte of constant AUTH_SUBJECT_ID_ENCRYPTION_ENTITY_CHAINED_EXTENTS identifying the authenticated subject.

The security strength of authenticating the individual extents with a HMAC is that of the underlying hash’s preimage resistance, i.e. the digest size in bits usually. Note however that because intermediate authentication tags in the chain are public, including the previous one in the digest for the next extent in the chain provides authenticity for the chain as a whole only at the underlying hash’s collision resistance, which is at most half the underlying hash’s digest length. In fact the inclusion of the previous tags is not needed for correctness or soundness at all, but implements an integrity protection measure for the sequence of chained extents as a whole at little additional cost, which is good to have from a robustness perspective, especially for the journal log.

Image header

The filesystem image header is split into two parts: a static and a mutable one. The static image header is located at the beginning of the image, padded to an integral multiple of the IO Block size so that no neighboring writes will ever alter its contents. This is important, as the configuration found in the static image header is needed for determining the filesystem layout and locating the fixed position of the journal log head.

The mutable header is located past the static image header’s padding, and changes to it are tracked through the journal, just as is the case for any update. It contains changing values such as the filesystem image size or the authentication tree root hash.

Static image header

The static image header starts at offset zero, it’s format is:

Length in bytes Description
8 Magic string 'COCOONFS' (without a terminating zero byte).
1 The filesystem format version. Fixed to 0.
20 The set of filesystem image layout parameters, c.f. further below.
1 The salt length.
variable The salt.

After the static image header, some padding is inserted up to the next IO Block alignment boundary. None of the IO Blocks overlapping with the static image header, including that padding, may ever get written to.

The set of filesystem configuration parameters, referred to as the image layout, is encoded as follows:

Encoded length in bytes Name Description
1 allocation_block_size_128b_log2 Size of an Allocation Block, specified as the base-2 logarithm of the size in units of 128B.
1 io_block_allocation_blocks_log2 Size of an IO Block, specified as the base-2 logarithm in units of Allocation Blocks.
1 auth_tree_node_io_blocks_log2 Size of an authentication tree node, specified as the base-2 logarithm in units of IO Blocks.
1 auth_tree_data_block_allocation_blocks_log2 Size of an Authentication Tree Data Block, specified as the base-2 logarithm in units of Allocation Blocks. Must be <= 6.
1 allocation_bitmap_file_block_allocation_blocks_log2 Size of an allocation bitmap file block, specified as the base-2 logarithm in units of Allocation blocks.
1 index_tree_node_allocation_blocks_log2 Size of an inode index B+-tree node, specified as the base-2 logarithm in units of Allocation blocks. Must be <= 64.
2 auth_tree_node_hash_alg TCG algorithm identifier of the hash algorithm to be used for digesting the descendant, non-root nodes of the authentication tree, referred to as the “authentication tree node hash algorithm”. Encoded in big-endian format.
2 auth_tree_data_hmac_hash_alg TCG algorithm identifier of the hash algorithm to be used for digesting Authentication Tree Data Blocks, referred to as the “authentication tree data hash algorithm”. Encoded in big-endian format.
2 auth_tree_root_hmac_hash_alg TCG algorithm identifier of the hash algorithm to be used for digesting the root node of the authentication tree, the “authentication tree root hash algorithm”. Encoded in big-endian format.
2 preauth_cca_protection_hmac_hash_alg TCG algorithm identifier of the hash algorithm to be used for the inline authentication of various metadata items for maintaining IND-CCA during bootstrapping, referred to as the “preauthentication CCA protection hash algorithm”. Encoded in big-endian format.
2 kdf_hash_alg TCG algorithm identifier of the hash algorithm to be used for deriving subkeys by means of the TCG KDFa(), referred to as the “key derivation hash algorithm”.
4 block_cipher_alg TCG algorithm identifier and key size of the block cipher algorithm used for any encryption. Encoded as a pair of two 16 bit integers in big-endian format each.

Mutable image header

The mutable image header is located at the first IO Block aligned boundary following the static image header. It gets updated through the general journalling mechanics, hence it may be in an inconsistent state at filesystem opening time.

The mutable image header’s format is:

Length Description
Digest length produced by auth_tree_root_hmac_hash_alg. The authentication tree root HMAC digest.
Digest length produced by preauth_cca_protection_hmac_hash_alg. HMAC digest over the inode index entry leaf node, used for maintaining IND-CCA security when first decrypting the node at filesystem opening time.
8B Encoded block pointer to the inode index entry leaf node.
8B The filesystem image size in units of Allocation Blocks, encoded as a 64 bit integer in little-endian format.
variable Padding to align the mutable image header’s length to a multiple of the Allocation Block size.

Key derivation

As outlined in the introduction, the root key gets processed once through the TCG KDFa() with SHA-512 in order to thwart downgrade attacks. Furthermore subkeys are derived from that for every combination of filesystem entity and cryptographic purpose in order to confine a potential key wear-out.

For reference in what follows, the input parameters to the KDFa() are specified in [TCGTPM19A] as follows:

label is set to one of the following single-byte constants identifying the cryptographic purpose the derived key is to be used for:

Name Value Description
KEY_PURPOSE_DERIVATION 1 The key will be used for deriving further keys from it, by means of of kdf_hash_alg.
KEY_PURPOSE_AUTH_ROOT 2 The key will be used for forming the authentication tree root HMAC with auth_tree_root_hmac_hash_alg.
KEY_PURPOSE_AUTH_DATA 3 The key will be used for forming an HMAC over an Authentication Tree Data Block with auth_tree_data_hmac_hash_alg.
KEY_PURPOSE_PREAUTH_CCA_PROTECTION_AUTH 4 The key will be used for some inline authentication HMAC formed with preauth_cca_protection_hmac_hash_alg.
KEY_PURPOSE_ENCRYPTION 5 The key will be used for encrypting with block_cipher_alg.

Root key derivation

The root key is derived from the externally supplied key material by invoking KDFa() with its parameters set to:

Subkey derivation

The input parameters to subkey derivation are

Unless otherwise noted, the domain is set to an inode number associated with the to be derived key and subdomain is one of

Name Value Description
INODE_KEY_SUBDOMAIN_DATA 1 The key will be used for processing the inode’s data.
INODE_KEY_SUBDOMAIN_EXTENTS_LIST 2 The key will be used for processing the inode’s extents list.

The subkey is derived from the root key by invoking KDFa() with its parameters set to:

Allocation bitmap

The allocation bitmap tracks the filesystem image’s Allocation Blocks’ allocation status in an associated bit each: a bit is set if the corresponding Allocation Block is allocated. The bitmap is organized as an array of unsigned 64 bit integers, henceforth denoted by “allocation bitmap words”, with any excess bits corresponding to a region beyond the filesystem image size set to 0. The region at the start of the filesystem image storing the image headers, as well as the one where the journal log head is located, are always allocated.

The Allocation Bitmap is stored in inode number 2, and the inode index B+-tree minimum fill level is such that it can always be found through the inode index entry leaf node, in turn referenced from the mutable image header.

In order to support partial allocation bitmap updates, the encryption format deviates from the standard “encrypted extents” one generally used for inodes: the allocation bitmap files’ extents are divided into blocks of size as specified by the allocation_bitmap_file_block_allocation_blocks_log2 filesystem configuration parameter, and each one gets encrypted individually with the common encrypted block format. Even though the maximum possible payload size of such an encrypted block is not necessarily a multiple of the bitmap word size, only an integral number of bitmap words is stored in each block, the maximum possible to be specific. The encryption key used for encrypting the individual file blocks a a subkey derived from the root key with the domain parameter set to 2, i.e. the allocation bitmap file’s associated inode number, a subdomain value of INODE_KEY_SUBDOMAIN_DATA, and a key purpose of KEY_PURPOSE_ENCRYPTION.

Note that this scheme on encrypting the allocation bitmap file block-wise does allow an adversary to determine which parts of the allocations bitmap have changed between two updates, but is in line with the previously discussed relaxation that allocations are kept confidential only for data at reset.

Special constraints apply to the allocation bitmap file’s extents on storage: their boundaries must all be aligned to the Authentication Tree Data Block size and their lengths must all be a multiple of the allocation bitmap file block size. Note that the alignment to the Authentication Tree Data Block size is required for bootstrapping authentication at filesystem opening time: in general an Authentication Tree Data Block’s individual Allocation Blocks’ respective allocation status must be known each for computing its authentication digest, but with the aligned allocation bitmap file extents all overlapping Authentication Tree Data Blocks’ Allocation Blocks are known a priori to be allocated.

Authentication tree

A Merkle tree construction is used for the authentication. All nodes’ sizes are equal and is as specified by the auth_tree_node_io_blocks_log2. Leaf nodes store a number of digests of lengths as produced by auth_tree_data_hmac_hash_alg, as many as fit a node rounded down to the next power of two. Each digest entry in a leaf node authenticates an associated Authentication Tree Data Block, which may comprise one or more Allocation Blocks as specified by the auth_tree_data_block_allocation_blocks_log2 filesystem configuration parameter. Internal nodes store a number of digests as produced by auth_tree_node_hash_alg, as many as fit a node rounded down to the next power of two. Each digest entry in an internal node authenticates an associated child node’s contents. A root HMAC digest is performed over the tree’s root node (alongside some auxiliary information needed for bootstrapping), which can be either a leaf or an internal node. The root HMAC digest fully captures all of the filesystem’s contents and is stored in the mutable image header.

The authentication tree is stored in inode number 1 and inode index B+-tree minimum fill level is such that it that it can always be found through the inode index entry leaf node, in turn referenced from the mutable image header.

All of the authentication tree’s extents’ boundaries must be aligned to the larger of an IO Block and an Authentication Tree Data Block. No leaf entry is associated with any storage region overlapping with some of the authentication tree’s extents – the authenticated data ranges comprise all of the filesystem image with the authentication tree’s extents skipped. An authenticated Allocation Block’s associated index on physical storage is converted to its containing Authentication Tree Data Block’s index in the Authentication Tree Data Block index domain by subtracting from it the accumulated lengths of any authentication tree extents located before it on physical storage and converting the result to units of Authentication Tree Data Blocks, i.e. by shifting it to the right by a distance of auth_tree_data_block_allocation_blocks_log2 bits.

Any leaf node (data) digest entry is implicitly associated with an Authentication Tree Data Block index domain index by means of its position in the tree with respect to tree order.

Authentication Tree Data Block digests

When forming a given Authentication Tree Data Block’s digest, only its allocated Allocation Blocks’ contents are considered. The reason is that in order to support the journalling functionality, any Authentication Tree Data Block’s digest must be reconstructible, but the contents of unallocated Allocation Blocks must be assumed indeterminate – it could have e.g. been used for temporarily storing the dynamically allocated parts of the journal (or even been trimmed to begin with).

If a plain hash was directly used for digesting the Authentication Tree Data Blocks, an adversary could immediately figure the allocation status of each of its constituent Allocation Blocks, simply by attempting to recreate the digest found in the authentication tree by brute-forcing over a small search space. As the allocations state is considered confidential for data at rest, the Authentication Tree Data Blocks are digested with a keyed HMAC instead. It should be stressed at this point that the HMAC serves only as a device to efficiently obfuscate the digests, none of its security properties beyond those provided by the underlying hash are relied upon authentication-wise. In particular, given that the allocations are considered confidential only in the data at rest model, implementations are not required to make any provisions prohibiting their use as an HMACcing oracle. That is, they may produce and expose HMAC digests for any data found on storage without verifying its authenticity first.

Digests for (virtual) Authentication Tree Data Block’s located completely after the filesystem image’s end, as determined by the image size field in the mutable image header, are set to all-zeros. Note that this is relevant for any “excess” entries in the authentication tree’s leaf nodes tail. Digests for Authentication Tree Data Blocks within the filesystem image range are formed as described in what follows.

The key used for producing an HMAC over an Authentication Tree Data Block is a subkey derived from the root key with the domain parameter set to 1, i.e. the authentication tree’s associated inode number, a subdomain value of 0, and a key purpose of KEY_PURPOSE_AUTH_DATA.

The HMAC is formed with a hash algorithm of auth_tree_data_hmac_hash_alg over

  1. the Authentication Tree Data Block’s constituent allocated Allocation Blocks contents, while skipping over the unallocated ones, as well as those allocated to the image header or journal log head,
  2. and an authentication context formed as follows:
    1. A 64 bit allocation bitmap word specifying the allocation status of each of the Authentication Tree Data Block’s constituent Allocation Blocks, encoded in little-endian format.
    2. The Authentication Tree Data Block’s Authentication Tree Data Block’s index in the Authentication Tree Data Block index domain, encoded as a 64 bit integer in little-endian format.
    3. An authentication context format version identifier byte of constant 0.
    4. An authentication context subject identifier byte of constant AUTH_SUBJECT_ID_AUTH_TREE_DATA_BLOCK identifying the authenticated subject.

Authentication tree non-root node digests

The internal, i.e. non-leaf nodes store digests over their child nodes each. Note that each subtree rooted at a child effectively covers a certain associated data range, with that range’s length in units of Authentication Tree Data Blocks being a power of two.

Digests for children covering a data range located completely after the filesystem image’s end, as determined by the image size field in the mutable image header, are set to all-zeros. Digests for child nodes whose an associated data range overlaps with the filesystem image range are formed as described in what follows.

The digest over a child node is produced by computing the (regular) hash with auth_tree_node_hash_alg over

  1. the digests stored in the child node back to back
  2. and an authentication context formed as follows:
    1. The index in the Authentication Tree Data Block index domain of the child’s last entry’s associated data region’s beginning, encoded as a 64 bit integer in little-endian format. Observe that this uniquely fixes the position of the child node in the tree.
    2. An authentication context format version identifier byte of constant 0.
    3. An authentication context subject identifier byte of constant AUTH_SUBJECT_ID_AUTH_TREE_DESCENDANT_NODE identifying the authenticated subject.

Authentication tree root node digest

The authentication tree root digest is created by computing a HMAC with an underlying hash of auth_tree_root_hmac_hash_alg and a a subkey derived from the root key with the domain parameter set to 1, i.e. the authentication tree’s associated inode number, a subdomain value of 0, and a key purpose of KEY_PURPOSE_AUTH_ROOT over

  1. the digests stored in the root node back to back
  2. and an authentication context formed as follows:
    1. The index in the Authentication Tree Data Block index domain of the root node’s last entry’s associated data region’s beginning modulo \(2^{64}\), encoded as a 64 bit integer in little-endian format. Observe that this uniquely fixes the position of the root node in the tree.
    2. The “image context”, a digest over filesystem configuration parameters computed as described below.
    3. An authentication context format version identifier byte of constant 0.
    4. An authentication context subject identifier byte of constant AUTH_SUBJECT_ID_AUTH_TREE_ROOT_NODE identifying the authenticated subject.

The image context digest is produced by forming a HMAC with same hash algorithm and subkey as for the root node above over

  1. The magic 'COCOONFS', without a terminating zero byte.
  2. A single byte filesystem format version identifier of constant 0.
  3. The encoded image layout.
  4. The encoded block pointer to the inode index entry leaf node, as found in the mutable image header.
  5. The filesystem image size in units of Allocation Blocks, as found in the mutable image header.
  6. The authentication tree’s extents, represented in the common extents list format.
  7. The allocation bitmap file’s extents, represented in the common extents list format.
  8. An authentication context format version identifier byte of constant 0.
  9. An authentication context subject identifier byte of constant AUTH_SUBJECT_ID_IMAGE_CONTEXT identifying the authenticated subject.

Note that the image contexts binds

Note that in principle the encoded image context could have been included verbatim in the data the authentication tree root digest is getting computed over. However, for reasons of efficiency, an intermediate digest is performed over the image context first, which gets then in turn considered in the computation of the authentication tree root digest – the image context’s contents are expected to change only infrequently if at all, whereas the root digest needs to get recomputed upon every filesystem update.

Organization of the authentication tree storage

The authentication tree’s nodes are serialized back to back in the tree’s DFS PRE order into the tree’s associated extents. A node’s digests are located back to back at the beginning of its associated storage area, any padding to align the node size up to the length specified by auth_tree_node_io_blocks_log2 is retained on storage.

Note that it is possible to deduce the authentication tree’s dimensions from its extents’ total length and implementations are required to do so at filesystem opening time. The alignment constraints on the extents’ boundaries may result in more nodes being stored than is required to cover all of the filesystem image. As outlined in the previous sections, excess digest entries are to be set to all zeros.

When deducing the authentication tree dimensions from the extents’ total length, implementations must cap the height at \[\textrm{min}\left\{\left\lceil\frac{W}{c}\right\rceil, \left\lceil\frac{W - d - a}{c}\right\rceil + 1\right\}\] with \(W=64\), \(c\) denoting the base-2 logarithm of the number of digest entries in a non-leaf node, \(d\) denoting the base-2 logarithm of the number of digests in a leaf node and \(a\) the base-2 logarithm of the number of Allocation Blocks in an Authentication Tree Data Block, i.e. auth_tree_data_block_allocation_blocks_log2. Imposing this upper limit on the tree height does not restrict the data range coverable by an authentication tree (to below the maximum supported filesystem image size), but ensures that

Inode index

Inodes are identified by positive 32 bit integers. The inode index tracks the allocated inodes and their associated locations on storage each, either by means of a direct encoded extent pointer or by an “indirect” pointer pointing to the head of some chained extents storing the inode’s extents list.

The inode index is organized as a B+-tree, with a node size as specified by the index_tree_node_allocation_blocks_log2 filesystem configuration parameter.

The minimum leaf node fill level is constrained to be >= 4, so that inodes 1-4 are always found in the leftmost leaf node, the inode index entry leaf node. The location of the inode index entry leaf node is specified in the mutable image header. Among those four special inodes is inode 3 allocated to the inode index root node. The index entry for this inode must always specify the location of the index root node in terms of a direct encoded extent pointer, an indirect extents list is not permitted for it.

The index’ individual nodes are encrypted as blocks each, with a a subkey derived from the root key with the domain parameter set to 3, i.e. the inode index’ associated inode number, a subdomain value of INODE_KEY_SUBDOMAIN_DATA, and a key purpose of KEY_PURPOSE_ENCRYPTION.

Inode index leaf node format

Let \(B\) denote a decrypted index node’s maximum possible payload size in units of bytes. The maximum number of entries in a leaf node is then given by \(M_{\textrm{leaf}} = \left\lfloor\frac{B - 12}{12}\right\rfloor\). The minimum leaf node fill level is set to \(m_\textrm{leaf} = \left\lceil\frac{M_\textrm{leaf}}{2}\right\rceil\). \(B\) must be large enough so that the constraint \(m_\textrm{leaf} >= 4\) holds. Note that with a minimum inode index block size of 128B, and a maximum IV length of 32B, the \(m_\textrm{leaf} >= 4\) is automatically fulfilled.

The leaf node format is as follows:

Range in units of bytes Description
\(0\) to \(8\) Encoded block pointer to the next leaf node in tree order, if any, or NIL otherwise.
\(8\) to \(8 + 8\cdot M_\textrm{leaf}\) The inode entries associated encoded extent pointers.
\(8 + 8\cdot M_\textrm{leaf}\) to \(8 + 12\cdot M_\textrm{leaf}\) The inode entries associated keys, i.e. the inode numbers, encoded as 32 bit integers in little endian format.
\(8 + 12\cdot M_\textrm{leaf}\) to \(12 + 12\cdot M_\textrm{leaf}\) The node level, fixed to 1 for leaf nodes, encoded as a 32 bit integer in little-endian format.

For \(i\in\{0\ldots M_\textrm{leaf} - 1\}\), the i’th encoded extent pointer is associated with the i’th key. Unoccupied entries have a key value of 0, i.e. the special “no inode” value, and an encoded extent pointer value of NIL. The unoccupied slots must all be at the tail. The occupied entries must be sorted by the inode number. No leaf node with less than \(m_\textrm{leaf}\) nodes may exist, except for possibly at the tree root.

Inode index internal node format

Let \(B\) denote a decrypted index node’s maximum possible payload size in units of bytes again. The maximum number of entries, i.e. separating keys, in an internal node is then given by \(M_{\textrm{internal}} = \left\lfloor\frac{B - 12}{12}\right\rfloor\). The minimum internal node fill level is set to \(m_\textrm{internal} = \left\lfloor\frac{M_\textrm{internal} - 1}{2}\right\rfloor\).

Implementations might want to preemptively split full nodes or merge pairs of nodes at minimum fill level when walking down a path from the root for insertion or deletion. By coincidence, \(M_\textrm{internal} = M_\textrm{leaf}\) and from the constraint \(m_\textrm{leaf} >= 4\), it follows that \(M_\textrm{internal} = M_\textrm{leaf} >= 7 > 2\), as is required for supporting preemptive node splitting of full nodes. Note that \(m_\textrm{internal}\) has been defined specifically in a way to enable preemptive splitting of full nodes as well as merging nodes at the minimum fill level, even for even values of \(M_\textrm{internal}\).

Range in units of bytes Description
\(0\) to \(8 + 8\cdot M_\textrm{internal}\) Encoded block pointers to the node’s children.
\(8 + 8\cdot M_\textrm{internal}\) to \(8 + 12\cdot M_\textrm{internal}\) The separator keys, encoded as 32 bit integers in little-endian format.
\(8 + 12\cdot M_\textrm{internal}\) to \(12 + 12\cdot M_\textrm{internal}\) The node level, counted 1-based from the leaf upwards, encoded as a 32 bit integer in little-endian format.

For \(i\in\{0\ldots M_\textrm{internal} - 1\}\), the i’th key is the separator key between the \(i\)’th and \((i + 1)\)’th child – inode entries stored under the left child have an inode value all strictly less than the separator key, inode entries stored under the right child have an inode value greater than or equal to the separator key value. Unoccupied entries have a key value of 0, i.e. the special “no inode” value, and the block pointer to the associated right child is NIL. The unoccupied slots must all be at the tail. The occupied entries must be sorted by the inode number. No internal node with less than \(m_\textrm{internal}\) occupied key entries may exist except for possibly at the tree root, which must still have at least two children and a separator key inbetween.

Inode index entry leaf node preauthentication CCA protection digest

The inode index entry leaf node, i.e. the first leaf node in tree order, serves as the entry point when opening the filesystem, and its location is specified in the mutable image header accordingly. However, it needs to get decrypted before the full authentication tree based authentication has become functional, and therefore must first get authenticated by some other means before starting the decryption.

For this purpose, a HMAC over the entry leaf node is stored in the mutable image header. This HMAC is created with an underlying hash algorithm of preauth_cca_protection_hmac_hash_alg and a subkey derived from the root key with the domain parameter set to 3, i.e. the inode index’ associated inode number, a subdomain value of INODE_KEY_SUBDOMAIN_DATA, and a key purpose of KEY_PURPOSE_PREAUTH_CCA_PROTECTION_AUTH, over

  1. the encrypted entry leaf node’s data, of size as determined by index_tree_node_allocation_blocks_log2
  2. and an authentication context formed as follows:
    1. block_cipher_alg encoded as a pair of two 16 bit integers, encoded in big-endian format each.
    2. An authentication context format version identifier byte of constant 0.
    3. An authentication context subject identifier byte of constant AUTH_SUBJECT_ID_INODE_INDEX_NODE identifying the authenticated subject.

Inode extents lists

The inode entries stored in the index tree’s leaf nodes have an associated encoded extent pointer each for specifying an inode’s location on storage. This extent pointer can either be direct or indirect. If direct, it encodes the location and dimensions of an inode’s (single) extent on storage. If indirect, it points to the head extent in a list of encrypted chained extents collectively storing the inode’s encoded extents list.

For encryption in the encrypted chained extents format, a a subkey is derived from the root key with the domain parameter set to the respective inode number, a subdomain value of INODE_KEY_SUBDOMAIN_EXTENTS_LIST, and a key purpose of KEY_PURPOSE_ENCRYPTION.

The authentication tree and allocation bitmap files need to get located at filesystem opening time in order to bootstrap the full tree based authentication. Therefore, the inline authenticated variant of the encrypted chained extents format is used for storing these two files’ extents lists, if any. The hash algorithm used for this is the preauth_cca_protection_hmac_hash_alg, the HMAC key is set to a subkey derived from the root key with the domain parameter set to the inode number, a subdomain value of INODE_KEY_SUBDOMAIN_EXTENTS_LIST, and a key purpose of KEY_PURPOSE_PREAUTH_CCA_PROTECTION_AUTH. The authenticated associated data for the encrypted chained extents’ inline authentication is set to

  1. The inode number as a 32 bit integer, encoded in little-endian format.
  2. A format version identifier byte of constant 0.
  3. An identifier byte of constant 2 identifying the type of authenticated associated data for the encrypted chained extents’ inline authentication.

Regular file encryption

Except for the special case of the allocation bitmap file, any file’s data is stored in encrypted extents, with the set of extents specified by its corresponding entry in the inode index.

The key used for the encryption is a subkey derived from the root key with the domain parameter set to the inode number, a subdomain value of INODE_KEY_SUBDOMAIN_DATA, and a key purpose of KEY_PURPOSE_ENCRYPTION.

Journal

The journal contains all the information needed to apply pending changes at filesystem opening time following a possible service interruption. From a high level, it consists of staging copies of the to be written data at IO block granularity and a log specifying what needs to get written where as well as information about which parts of the the authentication tree need a reconstruction.

The journal log is stored in a list of encrypted chained extents, with the head extent, the journal log head, being located at a fixed position in the filesystem image, which can get determined from the information provided in the static image header. More specifically, the journal log head extent is located at the first IO Block and Authentication Tree Data Block aligned boundary following the mutable image header’s. Its length is set to the least multiple of the larger of the IO block and Authentication Tree Data Block size sufficient for storing the minimum possible encrypted chained extents’ head. The journal log tail extents as well as the data staging copies are stored at arbitrary, otherwise unallocated locations, with the constraint that all extents’ boundaries must be aligned to the IO Block size. Note that the journal log head extent is constrained to be aligned to the Authentication Tree Data Block size only for the convenience of implementations: for the purpose of computing “authentication tree data block digests”, the journal log head extent is to be considered as if unallocated as a special case, and letting it occupy an integral multiple of the Authentication Tree Data Block size alleviates the need to implement range checks for detecting this special case when updating potentially neighboring data.

The journal log’s encrypted chained extents are encrypted with a a subkey is derived from the root key with the domain parameter set to 5, i.e. the inode number allocated to the (virtual) journal log file, a subdomain value of INODE_KEY_SUBDOMAIN_DATA, and a key purpose of KEY_PURPOSE_ENCRYPTION. The constituent extents are inline authenticated with a HMAC with underlying hash algorithm as specified by preauth_cca_protection_hmac_hash_alg, a subkey is derived from the root key with the domain parameter also set to 5, a subdomain value of INODE_KEY_SUBDOMAIN_DATA, and a key purpose of KEY_PURPOSE_PREAUTH_CCA_PROTECTION_AUTH. The authenticated associated data for the encrypted chained extents’ inline authentication is set to

  1. The encoded image layout.
  2. A format version identifier byte of constant 0.
  3. An identifier byte of constant 1 identifying the type of authenticated associated data for the encrypted chained extents’ inline authentication.

The journal is to be considered non-empty and to get applied upon the next filesystem opening following a possible service interruption whenever its head extent begins with a magic of 'CCFSJRNL', without a terminating zero byte, and the head extent’s inline authentication digest successfully verifies its contents. Note that the head extent’s inline authentication digest serves as an integrity protection measure here, in particular it is not an error if the head extent’s authentication fails – in this case the journal is simply considered as having been only partially written and is disregarded.

It is expected that implementations would write all of the journal’s data before the journal log head extent, issuing write barriers as is appropriate for the underlying hardware device inbetween. Similarly, after the journal has been applied, either online or following a service interruption, it is expected that implementations would first invalidate the journal log head extent before proceeding to reusing any of the storage areas occupied by the journal’s remainder, likewise issuing write barriers as needed. In particular failure to authenticate a journal log tail extent when the head authenticated successfully is a fatal error.

The journal log plaintext is organized as sequence of tag-length-value (TLV) encoded fields. The tag and length are encoded as unsigned integers in LEB128 format, the format of the value depends on the field. The fields must be stored in the journal in the order induced by increasing tag values. The defined tag values are:

Name Value Description
JOURNAL_LOG_FIELD_TAG_AUTH_TREE_EXTENTS 1 Mandatory. The authentication tree’s encoded extents list.
JOURNAL_LOG_FIELD_TAG_ALLOC_BITMAP_FILE_EXTENTS 2 Mandatory. The allocation bitmap file’s encoded extents list.
JOURNAL_LOG_FIELD_TAG_ALLOC_BITMAP_FILE_FRAGMENTS_AUTH_DIGESTS 3 Mandatory. Authentication digests of allocation bitmap file fragments needed for reconstructing the authentication tree.
JOURNAL_LOG_FIELD_TAG_APPLY_WRITES_SCRIPT 4 Mandatory. List of data updates to replay.
JOURNAL_LOG_FIELD_TAG_UPDATE_AUTH_DIGESTS_SCRIPT 5 Mandatory. List of authentication tree updates to replay.
JOURNAL_LOG_FIELD_TAG_TRIM_SCRIPT 6 Optional. List of trim operations to issue after the journal has been replayed.
JOURNAL_LOG_FIELD_TAG_JOURNAL_STAGING_COPY_DISGUISE 7 Optional. Encryption parameters used for disguising the journal’s data staging copies.

Allocation bitmap file fragments’ authentication digests

Whenever replaying any authentication tree node entry update from the journal at filesystem opening time, the complete node must get reconstructed in full from scratch: the nodes’ boundaries are aligned to the IO Block size each, but some previous attempt to apply the journal might have failed and left the nodes’ contents in an indeterminate state.

For recreating the data digests at the leaf level, i.e. computing the Authentication Tree Data Blocks’ digests, the allocation status of each constituent Allocation Block must be known. Therefore, those parts of the allocation bitmap file that track the data ranges covered by the affected authentication tree leaf nodes must get read. Remember that the allocation bitmap is encrypted in units of allocation bitmap file blocks, so it is possible to decrypt only those fragments needed for the authentication tree reconstruction during journal replay. However, for maintaining IND-CCA security, the encrypted contents must get authenticated before a decryption is attempted.

The journal log field with tag JOURNAL_LOG_FIELD_TAG_ALLOC_BITMAP_FILE_FRAGMENTS_AUTH_DIGESTS provides all of the Authentication Tree Data Block digests overlapping with any of the allocation bitmap file fragments needed for reconstructing the authentication tree from journal replay. Those digests are over the updated allocation bitmap data, i.e. over the allocation bitmap ciphertext fragments as if the journal had been replayed already.

The field’s value contents is constructed by concatenating

  1. A sequence of records, one for each of the allocation bitmap file’s covered Authentication Tree Data Block, encoded as pairs of location and digest each. The location of the respective Authentication Tree Data Block is encoded as the difference between its beginning on physical storage relative to the previous record’s associated Authentication Tree Data Block’s end, if any, or 0 otherwise, converted to units of Authentication Tree Data Blocks, and encoded as an unsigned integer in LEB128 format. The Authentication Tree Data Block digest, of size as determined by the auth_tree_data_block_allocation_blocks_log2 parameter, is serialized right after that.

    Remember that all of the allocation bitmap file’s extents’ boundaries are constrained to be aligned to the Authentication Tree Data Block size, hence no such block can overlap only partially with the file. It is an error if any of the allocation bitmap file’s encryption blocks is covered only partially by the collective set of records in this journal log field: any must be either not covered at all or in full.

  2. An authentication HMAC with an underlying hash algorithm of preauth_cca_protection_hmac_hash_alg and a subkey is derived from the root key with the domain parameter set to 2, i.e. the allocation bitmap files associated inode number, a subdomain value of INODE_KEY_SUBDOMAIN_DATA, and a key purpose of KEY_PURPOSE_PREAUTH_CCA_PROTECTION_AUTH over

    1. The encoded image layout.
    2. The allocation bitmap file’s encoded extents list.
    3. The journal log field’s records encoded as above.
    4. An authentication context constructed as follows:
      1. A format version identifier byte of constant 0 identifying the inner format.
      2. The value JOURNAL_LOG_FIELD_TAG_ALLOC_BITMAP_FILE_FRAGMENTS_AUTH_DIGESTS represented as a single byte
      3. A format version identifier byte of constant 0 identifying the outer “envelope” format.
      4. The value of AUTH_SUBJECT_ID_JOURNAL_LOG_FIELD encoded as a single byte.

Writes application script

The journal log contains instructions how to apply pending data updates, more specifically which journal staging copies to write to which target location. This information is encoded in a journal log field with a tag value of JOURNAL_LOG_FIELD_TAG_APPLY_WRITES_SCRIPT as a sequence of records, each one consisting of

  1. The beginning of the target location on physical storage, represented as the difference to the previous record’s associated target range’s end, if any, or 0 otherwise, converted to units of IO Blocks, and encoded as an unsigned integer in LEB128 format.
  2. The beginning of the source location on physical storage, i.e. of the journal staging copy, represented as the difference to the previous record’s associated source range’s end, if any, or 0 otherwise, in units of IO Blocks and taken modulo \(2^{64}\), encoded as a signed integer in LEB128 format.
  3. The range’s length, in units of IO Blocks and encoded as an unsigned integer in LEB128 format. It must not be zero.

The sequence is terminated with a termination records of three consecutive zero bytes.

No target region may overlap with any source region, with the exception that the two may be equal for a given record, in which case implementations must skip it.

Authentication tree update script

The journal log field with tag JOURNAL_LOG_FIELD_TAG_UPDATE_AUTH_DIGESTS_SCRIPT contains all information needed to update the authentication tree, namely a list of Authentication Tree Data Blocks whose digests have changed – either because their contents got updated or because some of the constituent Allocation Blocks got deallocated, or both.

The journal’s writes application script is not considered for the purpose of determining which Authentication Tree Data Blocks’s digests have changed, and is queried only for possibly retrieving the updated contents of any data range referenced from the authentication tree update script. In particular, the authentication tree update script must be complete.

The field contains a sequence of records, each encoded as

  1. The beginning of the authenticated data range on physical storage, represented as the difference to the previous record’s associated range’s end, if any, or 0 otherwise, converted to units of Authentication Tree Data Blocks and encoded as an unsigned integer in LEB128 format.
  2. The length of the authenticated range in units of Authentication Tree Data Blocks, encoded as an unsigned integer in LEB128 format. It must not be zero.

The sequence is terminated with a termination records of two consecutive zero bytes.

It should be stressed that implementations must always reconstruct any modified authentication tree node from scratch when replaying the journal: a previous attempt to write to it might have failed and the contents must therefore be assumed to be in an indeterminate state.

Trim script

A journal log may contain an optional field of tag JOURNAL_LOG_FIELD_TAG_TRIM_SCRIPT for specifying a sequence of ranges to issue trim commands on after the journal has been replayed and the journal log head invalidated.

The field contains a sequence of records, each encoded as

  1. The beginning of the to be trimmed range on physical storage, represented as the difference to the previous record’s associated range’s end, if any, or 0 otherwise, converted to units of IO Blocks and encoded as an unsigned integer in LEB128 format.
  2. The length of the to be trimmed range in units of IO Blocks, encoded as an unsigned integer in LEB128 format. It must not be zero.

The sequence is terminated with a termination records of two consecutive zero bytes.

If the field is present, then any of the journal staging copies referenced from records in the writes application script, with the target range not being equal to the journal staging copy range, get implicitly added to it, as do the journal log’s tail extents.

Journal staging copy disguising

For each record in the writes application script, the specified journal staging copy is to get copied over to the target destination during journal replay. This enables adversaries to identify the journal staging copy blocks as such, simply by attempting to find pairs of blocks with identical contents. As the journal staging copies’ backing storage is always tracked as unallocated, this leaks information about the allocations state, which violates the principle that allocations are considered confidential for data at rest.

To restore the confidentiality of allocations for data at rest, the journal staging copies may – at users’ option – get encrypted with a one-time key. Note that this additional layer of encryption is computationally expensive, hence the feature is optional and can get enabled on a per-transaction basis.

If a journal log field with a tag of JOURNAL_LOG_FIELD_TAG_JOURNAL_STAGING_COPY_DISGUISE is present, then journal staging copy disguising is enabled, with the encryption algorithm and key as specified in the encoded field value:

  1. The block cipher algorithm as a TCG algorithm identifier, encoded as a 16 bit integer in big-endian format.
  2. The selected key size in bits, encoded as a 16 bit integer in big-endian format.
  3. The encryption key of length as specified by the key size.
  4. The IV generation encryption key of length as specified by the key size.

The journal staging copy disguising is implemented by encrypting each Allocation Block from the journal staging copy with the specified block cipher in CBC mode and an IV obtained as follows:

  1. Concatenate the Allocation Block’s target location to its journal staging copy location, both represented as Allocation Block indices within the filesystem image and encoded as 64 bit integers in little-endian format.
  2. Truncate or pad the result at the beginning so that its length becomes equal to the block cipher block size.
  3. Encrypt the single block cipher block with the IV generation encryption key.

Opening the filesystem (informative)

In order to illustrate how the various pieces needed for opening the filesystem fit together, especially with respect to bootstrapping the authentication, the opening process is outlined below:

  1. Read the static image header, i.e. the various filesystem configuration parameters and the salt.

  2. Derive the root key from externally supplied key material.

  3. Determine the locations of the journal log head extent as well as of the mutable image header.

  4. Check whether there’s a valid journal at the journal log head extent location, if so apply it.

  5. Read the mutable image header. This yields the authentication tree root digest, the location of the inode index entry leaf node and its preauthentication CCA protection digest.

  6. Read the inode index entry leaf node, authenticate it against its preauthentication CCA protection digest and decrypt it.

  7. Lookup the authentication tree and allocation bitmap file inodes in the decrypted inode index entry leaf node. Remember that the minimum inode index B+-tree minimum leaf node fill level is defined so that inodes 1-4 will always be found in that node.

  8. If the inode index entries for the authentication tree or allocation bitmap files contain indirect references to extents lists, stored in the inline authenticated variant of the encrypted chained extents, then decrypt while verifying against the inline authentication.

  9. The allocation bitmap file’s extents are constrained to all be aligned to the Authentication Tree Data Block size, in particular any Authentication Tree Data Block overlapping with it is a proper subrange and has all of its constituent Allocation Blocks allocated. Therefore, all information needed for digesting the Authentication Tree Data Blocks overlapping with the allocation bitmap file’s extents and authenticating them through the authentication tree is available. Furthermore, the authentication tree root digest binds the location of the allocation bitmap file’s extents on storage.

    Authenticate the allocation bitmap file’s contents through the authentication tree and decrypt it afterwards. Once the allocation bitmap has been read, the information needed for digesting any Authentication Tree Data Block is available.

  10. Note that the authentication tree root digest also binds the location of the inode index entry leaf node. Re-authenticate that node’s encrypted contents, this time through the authentication tree.

  11. The authentication of the filesystem through the authentication tree has been bootstrapped now.

  12. Lookup the inode index root node inode in the inode index entry leaf node. Read it, authenticate it through the authentication tree and decrypt it.

  13. The filesystem is now operational.

Bibliography