How FerretDB stores BSON in JSONB
In this article, we show how FerretDB stores and translates MongoDB's BSON format into JSONB in PostgreSQL.
At FerretDB, we are converting MongoDB wire protocol queries into SQL, to store information from BSON to PostgreSQL. To achieve this, we created our own mapping called PJSON which translates MongoDB's BSON format into PostgreSQL's JSONB.
Using BSON with JSONB offers us significantly faster operation, greater range of data types compared to regular JSON, and more flexibility in modeling data of any structure.
In this article, we'll show you how FerretDB converts and stores MongoDB data in PostgreSQL.
What is BSON?
BSON holds a collection of field name/value pairs which is called a document. It contains length information allowing serializer/deserializer to utilize it for the performance benefit. Additionally, it preserves the order of the fields. BSON also supports additional data types such as DateTime and binary.
Let's look at an example of how a JSON is encoded in BSON.
Here the hexadecimal \x00
notation represents 0000 0000
in bits and similarly \x12
represents 0001 0010
.
In BSON, \x00
is a byte used as a terminator to indicate the end of a document.
In BSON, field names use cstring
which are UTF-8 characters followed by \x00
.
{ "foo": "bar" }
\x12\x00\x00\x00 | Document length in little-endian int32 (18 bytes document) |
\x02 | string field type \x02 , see the BSON spec |
foo\x00 | Cstring field name |
\x04\x00\x00\x00bar\x00 | String length in little-endian int32 \x04\x00\x00\x00 (4 bytes string) followed by string value and trailing \x00 |
\x00 | Document terminator |
You notice that BSON is binary serialized so it is not human readable. It contains information about the length, and an explicitly defined field type.
At FerretDB, we use PostgreSQL as a database engine and we store JSON data in JSONB data type. In light of this, we need to store BSON equivalent information in JSON format without losing type information.
Introducing PJSON
To embed necessary information from BSON, we introduced special keys prefixed with $
.
PJSON contains $
followed by a character to embed information about types and the order of the fields.
PJSON is designed to be serialized to JSONB.
Let's look at a simple BSON with an ObjectId
field and see how it is represented in PJSON.
{"_id": ObjectId("635202c8f75e487c16adc141")}
\x16\x00\x00\x00\x07_id\x00\x63\x52\x02\xc8\xf7\x5e\x48\x7c\x16\xad\xc1\x41\x00
In PJSON, we store the order of fields in the $k
field, and ObjectId
in the $o
field.
The duplicate fields are not allowed in PJSON.
{
"$k": [
"_id"
],
"_id": {
"$o": "635202c8f75e487c16adc141"
}
}
The $
prefixes apply to types that are not native to JSON.
The current mappings to PJSON are defined below.
Document | {"$k": ["<key 1>", "<key 2>", …], "<key 1>": <value 1>, "<key 2>": <value 2>, …} |
Array | JSON array |
64-bit binary floating point | {"$f": JSON number} |
UTF-8 string | JSON string |
Binary data | {"$b": "<base 64 string>", "s": <subtype number>} // s is binary subtype, the detail is found in the BSON spec |
ObjectId | {"$o": "<ObjectID as 24 character hex string"} |
Boolean | JSON true / false values |
UTC datetime | {"$d": milliseconds since epoch as JSON number} |
Null | JSON null |
Regular expression | {"$r": "<string without terminating 0x0>", "o": "<string without terminating 0x0>"} |
Timestamp | {"$t": "<number as string>"} |
64-bit integer | {"$l": "<number as string>"} |
In addition to $
prefixes, binary data have an additional field s
to indicate the type of binary.
Also, regular expressions have an additional field o
to specify options such as case sensitivity.