"Version 6" UUIDs

UPDATE

A proposal has been created here in the form of an IETF draft. It is newer than the format described on this page and the result of that proposal will likely determine the direction for this idea.

UUIDs: A Common Case

Universally Unique Identifiers are useful in many scenarios. And RFC 4122 describes 5 versions, indicated for use in various cases.

This document describes a proposed sixth version which in the author's opinion addresses a relatively minor but nonetheless significant shortcoming in the UUID specification, specifically when compared to the Version 1 UUIDs from the RFC.

Motivation

RFC 4122 is dated July 2005. Times have changed. Today, CPUs are faster and applications are frequently far more distributed. This results in an increased need to generate globally unique identifiers on different nodes in a distributed system and have these contain properties appropriate for a database primary key.

The existing RFC comes close, but stops short of describing a UUID suitable to this scenario. Specifically, a value useful for this case would have the following properties:

Sorting by its raw bytes results in a sequence equivalent to sorting by its embedded timestamp, thus making it naturally more useful as a primary key and providing improved reference locality and thus insert performance. (I.e. primary keys that are out of sequence are not just useless for sorting, but can also cause unnecessary disk access due to occupying significantly different locations in the index. [example])
The embedded time should be extractable for use as the creation time of the record.
Global uniqueness, which is of course a basic requirement of all types of UUIDs.

Unfortunately, none of the existing 5 UUID versions meet all 3 requirements. The main issue being that the byte sequence for the time field makes meeting the first requirement impossible with the existing specification.

Specification for 'Version 6'

TL;DR: 'Version 6' UUIDs have the date/time encoded from high to low bytes (bit-shifting around the version field in order to preserve its location) and thus sort correctly by time when treated as an opaque bunch of bytes; the clock sequence can be used to avoid duplicates generated at the same time; it's okay to use random data from a good PRNG in place of the MAC address; the rest is the same as RFC 4122.

The description here will be kept primarily to the differences from what is given in the RFC.

Timestamp

The only fields that change layout are the ones related to the timestamp. The version field retains its same position for backward compatibility, so applications can test for the version number in a consistent way. Assuming the timestamp begins as a 60-bit unsigned integer (usually implemented as a 64-bit unsigned int with the top 4 bits unused), the most significant 48 bits of the timestamp appear first (in order from most significant first, toward least significant), followed by the 4-bit version field, followed by the remaining 12 bits from the timestamp.

That gives a layout as follows: (timestamp shown in red)

Bytes 0-7: (each digit shown is hex, 4 bits)
    00 00 00 00  00 00 00 00
    |                | ||  |
     ----------------  ||  |
    timestamp          ||  |
    bits 59-12         ||  |
                 version|  |
                         --
                  timestamp
                  bits 11-0

Bytes 8-15: (same as RFC 4122)
    00 00 00 00  00 00 00 00
    ||  | |                |
    ||  |  ________________
    ||  |       node
    | --
    | clock seq bits 11-0
    2 bits variant, 2 bits
    are 13-12 of clock seq

NOTE: Just to be perfectly clear, even though the bit numbers
are given from high to low, what is being indicated is that
the byte sequence is big-endian, NOT that the bits should somehow
be reversed bit by bit.  I would have used bytes but the version
field is only 4 bits, so it's necessary to describe the fields
in terms of their bit sizes.   Hopefully all that is apparent.

Assuming t is a 64-bit unsigned integer (uint64_t) containing the timestamp as described in the RFC (in hundred-nanosecond intervals since midnight 15 October 1582 UTC), the C code to make the adjustment described above would be:

((t << 4) & 0xFFFFFFFFFFFF0000) | (t & 0x0FFF) | 0x6000

n.b. (the 0x6000 value is of course the version number - 6)

This would then be written in network byte order/big-endian to output the first 8 bytes of the UUID, i.e. the first byte would get the value (t >> 56) the second would be ((t >> 48) & 0xFF), etc.

Clock Sequence

In order to prevent duplicates on modern systems where it is very possible to have two UUIDs created within the same hundred-nanosecond interval, a minor difference is indicated in the behavior of the clock sequence field (see RFC section 4.1.5). Instead of returning an error or blocking (both of which are undesirable and unnecessary) if a duplicate UUID would be returned due to this case, instead the clock sequence should be incremented and used for the would-be duplicate ID - thus making it unique. This is the same action described in the RFC should the clock move backward and it works just as well in this case. In practice, this can be implemented as simple as (C) code like the following, run in a critical section:

if (last_timestamp >= this_timestamp) {
    clock_sequence++;
}

Node Field Value

In section 4.5 of the RFC it describes an alternate method for populating the node field if the use of a MAC address (IEEE 802) is not desired. Statements earlier in the document (e.g. "For systems with no IEEE address, a randomly or pseudo-randomly generated value may be used") seem to indicate that using the MAC address on the system is the more correct or normal case.

So just for clarity I want to specifically point out here that if there is a concern about privacy, just use a random value instead - that is endorsed as a correct way of generating UUIDs in Version 6. As specified in the RFC, random bytes should be generated from a high quality source (i.e. use /dev/urandom instead of something based on a deterministic algorithm [Mersenne Twister].) The multicast bit should be set as indicated in the RFC.

Implementation

A reference implementation has been written in Go. The actual code involved is relatively simple and seems to perform decently, feedback and pull requests welcome.

Hacks

It is somewhat trivial to convert between Version 1 and Version 6 UUIDs, since the information they contain is the same just in different positions. In environments where an implementation of Version 1 exists, converting this output to Version 6 may be the simplest implementation. Here are some examples of how to do this in various languages:

C:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <arpa/inet.h>
#include <uuid/uuid.h>

/* Converts UUID version 1 to version 6 in place. */
void uuidv1tov6(uuid_t u) {

  uint64_t ut;
  unsigned char *up = (unsigned char *)u;

  // load ut with the first 64 bits of the UUID
  ut = ((uint64_t)ntohl(*((uint32_t*)up))) << 32;
  ut |= ((uint64_t)ntohl(*((uint32_t*)&up[4])));

  // dance the bit-shift...
  ut = 
    ((ut >> 32) & 0x0FFF) | // 12 least significant bits
    (0x6000) | // version number
    ((ut >> 28) & 0x0000000FFFFF0000) | // next 20 bits
    ((ut << 20) & 0x000FFFF000000000) | // next 16 bits
    (ut << 52); // 12 most significant bits

  // store back in UUID
  *((uint32_t*)up) = htonl((uint32_t)(ut >> 32));
  *((uint32_t*)&up[4]) = htonl((uint32_t)(ut));

}

int main(int argc, char **argv) {

  uuid_t u;
  char str[37];

  uuid_generate_time(u);

  uuid_unparse(u, str);
  printf("UUIDv1: %s\n", str);

  uuidv1tov6(u);

  uuid_unparse(u, str);
  printf("UUIDv6: %s\n", str);

  return 0;
}

Python 2/3:

import uuid

def uuidv1tov6(u):
  uh = u.hex
  tlo1 = uh[:5]
  tlo2 = uh[5:8]
  tmid = uh[8:12]
  thig = uh[13:16]
  rest = uh[16:]
  uh6 = thig + tmid + tlo1 + '6' + tlo2 + rest
  return uuid.UUID(hex=uh6)

u = uuidv1tov6(uuid.uuid1())

MySQL/MariaDB/Percona:

n.b. Basic idea borrowed from here, adjusted to match definition above.

DELIMITER //
CREATE DEFINER=`root`@`localhost` FUNCTION `uuidv1atov6b`(u1 BINARY(36))
RETURNS BINARY(16) DETERMINISTIC
RETURN UNHEX(CONCAT(
  SUBSTR(u1, 16, 3),
  SUBSTR(u1, 10, 4),
  SUBSTR(u1, 1, 5),
  '6',
  SUBSTR(u1, 6, 3),
  SUBSTR(u1, 20, 4),
  SUBSTR(u1, 25)
  ));
//
DELIMITER ;

DELIMITER //
CREATE DEFINER=`root`@`localhost` FUNCTION `uuidbtoa`(u BINARY(16))
RETURNS BINARY(36) DETERMINISTIC
RETURN CONCAT(
  HEX(SUBSTR(u, 1, 4)),
  "-",
  HEX(SUBSTR(u, 5, 2)),
  "-",
  HEX(SUBSTR(u, 7, 2)),
  '-',
  HEX(SUBSTR(u, 9, 2)),
  "-",
  HEX(SUBSTR(u, 11, 6))
  );
//
DELIMITER ;

-- for use as primary key:
select uuidv1atov6b(uuid());

-- to display:
select uuidbtoa(uuidv1atov6b(uuid()));

UUID "Version 6"