Azure Cosmos DB (NoSQL) – How it works (RU, Partition, Index, and Replica)

By Tsuyoshi Matsuzaki on 2016-12-07 • ( 1 Comment )

Azure Cosmos DB includes a schema-less NoSQL database, which also supports the MongoDB wired protocol and tools including mongoose, mongochef, and others.

One of the key questions about Azure Cosmos DB NoSQL is what capabilities exposes for optimization. In this post, I shortly focus on the value proposition and how it works.
You will find that it’s a globally and elastically scaled database, and you can take much more benefits from this reliable database when you see how it works internally.

In this post I show you samples by REST HTTP raw for the purpose of understanding internal procedure. But remember you can use Cosmos DB SDK (Node.js, .NET, Java, etc), which encapsulates so much difficulties (searching partitions, parallelism, etc) and simplifies your programming.

Guarantee – SLA and Reserved Throughput

Note : Azure Cosmos DB NoSQL has 3 pricing models – provisioned throughput, auto-scale throughput, and serverless. Please select an appropriate pricing model depending on your workloads.
This post (this writing) assumes the provisioned throughput, which is suitable for stable or predictable workloads.

Azure Cosmos DB NoSQL has SLA of 99.99% availability, and reserved throughput with less than 10ms on reads and 15ms on writes. These service level is completely transparent. (You don’t need to manage the details, and the database does.)

The performance level (reserved throughput) is determined when the container is provisioned. Let’s see the following example. (First, this is creating database and container.)

Get endpoints from Cosmos DB account

GET https://myaccount01.documents.azure.com/
x-ms-date: Tue, 06 Dec 2016 12:18:29 GMT
authorization: type%3dmas...

HTTP/1.1 200 OK
Content-Type: application/json

{
  "_self": "",
  "id": "myaccount01",
  "_rid": "myaccount01.documents.azure.com",
  "media": "//media/",
  "addresses": "//addresses/",
  "_dbs": "//dbs/",
  "writableLocations": [
    {
      "name": "East US",
      "databaseAccountEndpoint": "https://myaccount01-eastus.documents.azure.com:443/"
    }
  ],
  "readableLocations": [
    {
      "name": "East US",
      "databaseAccountEndpoint": "https://myaccount01-eastus.documents.azure.com:443/"
    }
  ],
  ...

}

Create a database using endpoint

POST https://myaccount01-eastus.documents.azure.com/dbs
x-ms-date: Tue, 06 Dec 2016 12:18:29 GMT
authorization: type%3dmas...
Accept: application/json

{"id":"db01"}

HTTP/1.1 201 Created
Content-Type: application/json

{
  "id": "db01",
  "_rid": "4Gt4AA==",
  "_self": "dbs/4Gt4AA==/",
  "_etag": ""00000900-0000-0000-0000-583698a10000"",
  "_colls": "colls/",
  "_users": "users/",
  "_ts": 1479973022
}

Create a container in database

POST https://myaccount01-eastus.documents.azure.com/dbs/db01/colls
x-ms-offer-throughput: 10000
x-ms-date: Tue, 06 Dec 2016 12:18:29 GMT
authorization: type%3dmas...
Accept: application/json

{
  "id": "test01"
}

HTTP/1.1 201 Created
Content-Type: application/json

{
  "id": "test01",
  "indexingPolicy": {
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
      {
        "path": "/*",
        "indexes": [
          {
            "kind": "Range",
            "dataType": "Number",
            "precision": -1
          },
          {
            "kind": "Hash",
            "dataType": "String",
            "precision": 3
          }
        ]
      }
    ],
    "excludedPaths": [
      
    ]
  },
  "_rid": "6DpzAIPfiQA=",
  "_ts": 1480590914,
  "_self": "dbs/6DpzAA==/colls/6DpzAIPfiQA=/",
  "_etag": ""0000dc00-0000-0000-0000-5840064b0000"",
  "_docs": "docs/",
  "_sprocs": "sprocs/",
  "_triggers": "triggers/",
  "_udfs": "udfs/",
  "_conflicts": "conflicts/"
}

Note : The authorization request header value is the url-encoded string of “type=master&ver=1.0&sig={signature}” format. The steps of getting this value is :
1) Construct your payload string. This string depends on the x-ms-date or Date header value, so it expires soon.
2) Create base64 encoded string of HMAC with SHA256 algorithm using the previous payload.
3) Set this encoded string as signature (sig).
Please see the official document “Access Control in the Cosmos DB API” for details. (If you’re using PHP, please refer my previous post “How to use Azure Storage without SDK” too.)

As you can see (see x-ms-offer-throughput request header), this example is assigning 10,000 request units (RUs) for this container (named “test01”).

Request unit (RU) is the measurement of “how much resource (CPU utilization, memory consumption, etc) is used in the API request processing”. Hence this depends on the number of requests, required read/write (IO) transactions, computation complexity of issued SQL (query) and so forth. (There’s no simple way to calculate RU consumption.)
RU affects to your charge (billing) of Cosmos DB consumption. As you know, you can estimate required RU using Request Units Calculator Site, however you should remember that the calculated result of this site is just a reference for your cost estimation, and it’s not perfect.

In order to measure exact RU, you should perform the actual query with actual volume of data as follows.
You can see how much RU are consumed using x-ms-request-charge response header in CRUD request operations, and the following request is consuming 3.08 RUs in this single query.

POST https://myaccount01-eastus.documents.azure.com/dbs/db01/colls/test01/docs
x-ms-date: Tue, 06 Dec 2016 03:29:35 GMT
authorization: type%3dmas...
Accept: application/json
Content-Type: application/query+json

{"query":"SELECT * FROM root ... "}

HTTP/1.1 200 Ok
Content-Type: application/json
x-ms-request-charge: 3.08
...

{
  "_rid": "GZsc...",
  "Documents": [
    ...

  ],
  "_count": ...
}

The estimation of RU is very important, because if it’s not sufficient, you will waste the cost (money) or fall into performance degradation by the frequent errors.
If you exceeds the reserved throughput per second, the error “Request rate is large” (HTTP Status 429) occurs. In such a case, please check x-ms-retry-after-ms response header and retry after the specified time.
In most cases, you may use Cosmos DB SDK, not raw REST API. SDK does this error tracking and retry internally. (i.e, It is transparent for developers.) With such frequent errors and retries, you might think that “Cosmos DB is very slow”. But this is misunderstanding and this is just because of the lack of RUs (throughput units).
On the other hand, too much RU doesn’t always result into performance improvement. For this mechanism, too much RU won’t make sense.

You should carefully determine how much RU is needed with the trade-off between cost and performance, before production.
RU is charged hourly, then you can also automate changing RU to fit business requirements by CLI commands.

Partitioning

Cosmos DB has the partitioning capability. One container can take multiple physical partitions (i.e, multiple servers), and data (document) is distributed on these partitions by the partition key.

Let’s say that we set /divisionid as the partition key in the container.
The following REST creates the container (named “test01”) with this partition key.

POST https://myaccount01-eastus.documents.azure.com/dbs/db01/colls HTTP/1.1
authorization: type%3dmas...
Accept: application/json

{
  "id": "test01",
  "partitionKey": {
    "paths": [
      "/divisionid"
    ],
    "kind": "Hash"
  }
}

HTTP/1.1 201 Created
Content-Type: application/json

{
  "id": "test01",
  "indexingPolicy": {
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
      {
        "path": "/*",
        "indexes": [
          {
            "kind": "Range",
            "dataType": "Number",
            "precision": -1
          },
          {
            "kind": "Hash",
            "dataType": "String",
            "precision": 3
          }
        ]
      }
    ],
    "excludedPaths": [
      
    ]
  },
  "partitionKey": {
    "paths": [
      "/divisionid"
    ],
    "kind": "Hash"
  },
  "_rid": "GZscAJ56rgA=",
  "_ts": 1480675514,
  "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/",
  "_etag": ""00001202-0000-0000-0000-584150c60000"",
  "_docs": "docs/",
  "_sprocs": "sprocs/",
  "_triggers": "triggers/",
  "_udfs": "udfs/",
  "_conflicts": "conflicts/"
}

The documents with the same key are stored in the same logical partition. In this container { "divisionid" : "div1", "name" : "engineering" } and { "divisionid" : "div1", "revenue" : 3000000 } resides in the same logical partition.
By the term “logical”, it implies that the same “physical” location might be allocated among the different logical partitions. The physical partition is transparently allocated by Azure Cosmos DB to meet with RU requirement.

Let’s see how to route into the physical partition internally. (If you’re using SDK, it’s done by SDK and you don’t need anything to do.)

First you should get the information about partition key range from the endpoint https://{your endpoint domain}.documents.azure.com/dbs/{your container's uri fragment}/pkranges as follows, before using partition in your query. In this example, “dbs/GZscAA==/colls/GZscAJ56rgA=/” (see _self property above) is the URI fragment of the container.

Note : To create multiple physical partitions, please set a large value of RU (such as 10,000) in your container.

GET https://myaccount01-eastus.documents.azure.com/dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges
authorization: type%3dmas...
Accept: application/json

HTTP/1.1 200 Ok
Content-Type: application/json

{
  "_rid": "GZscAJ56rgA=",
  "PartitionKeyRanges": [
    {
      "_rid": "GZscAJ56rgACAAAAAAAAUA==",
      "id": "0",
      "_etag": ""00001402-0000-0000-0000-584150c60000"",
      "minInclusive": "",
      "maxExclusive": "05C1A53DB92960",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgACAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675514
    },
    {
      "_rid": "GZscAJ56rgADAAAAAAAAUA==",
      "id": "1",
      "_etag": ""00001502-0000-0000-0000-584150c60000"",
      "minInclusive": "05C1A53DB92960",
      "maxExclusive": "05C1B53DB92960",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgADAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675514
    },
    {
      "_rid": "GZscAJ56rgAEAAAAAAAAUA==",
      "id": "2",
      "_etag": ""00001602-0000-0000-0000-584150c60000"",
      "minInclusive": "05C1B53DB92960",
      "maxExclusive": "05C1BF5D153D90",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAEAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675514
    },
    {
      "_rid": "GZscAJ56rgAFAAAAAAAAUA==",
      "id": "3",
      "_etag": ""00001702-0000-0000-0000-584150c60000"",
      "minInclusive": "05C1BF5D153D90",
      "maxExclusive": "05C1C53DB92960",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAFAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675514
    },
    {
      "_rid": "GZscAJ56rgAGAAAAAAAAUA==",
      "id": "4",
      "_etag": ""00001802-0000-0000-0000-584150c60000"",
      "minInclusive": "05C1C53DB92960",
      "maxExclusive": "05C1C9CD673378",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAGAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675514
    },
    {
      "_rid": "GZscAJ56rgAHAAAAAAAAUA==",
      "id": "5",
      "_etag": ""00001902-0000-0000-0000-584150c60000"",
      "minInclusive": "05C1C9CD673378",
      "maxExclusive": "05C1CF5D153D90",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAHAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675514
    },
    {
      "_rid": "GZscAJ56rgAIAAAAAAAAUA==",
      "id": "6",
      "_etag": ""00001a02-0000-0000-0000-584150c60000"",
      "minInclusive": "05C1CF5D153D90",
      "maxExclusive": "05C1D1F5E1A3D4",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAIAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAJAAAAAAAAUA==",
      "id": "7",
      "_etag": ""00001b02-0000-0000-0000-584150c60000"",
      "minInclusive": "05C1D1F5E1A3D4",
      "maxExclusive": "05C1D53DB92960",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAJAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAKAAAAAAAAUA==",
      "id": "8",
      "_etag": ""00001c02-0000-0000-0000-584150c60000"",
      "minInclusive": "05C1D53DB92960",
      "maxExclusive": "05C1D7858FADEC",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAKAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgALAAAAAAAAUA==",
      "id": "9",
      "_etag": ""00001d02-0000-0000-0000-584150c60000"",
      "minInclusive": "05C1D7858FADEC",
      "maxExclusive": "05C1D9CD673378",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgALAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAMAAAAAAAAUA==",
      "id": "10",
      "_etag": ""00001e02-0000-0000-0000-584150c70000"",
      "minInclusive": "05C1D9CD673378",
      "maxExclusive": "05C1DD153DB904",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAMAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgANAAAAAAAAUA==",
      "id": "11",
      "_etag": ""00001f02-0000-0000-0000-584150c70000"",
      "minInclusive": "05C1DD153DB904",
      "maxExclusive": "05C1DF5D153D90",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgANAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAOAAAAAAAAUA==",
      "id": "12",
      "_etag": ""00002002-0000-0000-0000-584150c70000"",
      "minInclusive": "05C1DF5D153D90",
      "maxExclusive": "05C1E151F5E18E",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAOAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAPAAAAAAAAUA==",
      "id": "13",
      "_etag": ""00002102-0000-0000-0000-584150c70000"",
      "minInclusive": "05C1E151F5E18E",
      "maxExclusive": "05C1E1F5E1A3D4",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAPAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAQAAAAAAAAUA==",
      "id": "14",
      "_etag": ""00002202-0000-0000-0000-584150c70000"",
      "minInclusive": "05C1E1F5E1A3D4",
      "maxExclusive": "05C1E399CD671A",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAQAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgARAAAAAAAAUA==",
      "id": "15",
      "_etag": ""00002302-0000-0000-0000-584150c70000"",
      "minInclusive": "05C1E399CD671A",
      "maxExclusive": "05C1E53DB92960",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgARAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgASAAAAAAAAUA==",
      "id": "16",
      "_etag": ""00002402-0000-0000-0000-584150c70000"",
      "minInclusive": "05C1E53DB92960",
      "maxExclusive": "05C1E5E1A3EBA6",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgASAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgATAAAAAAAAUA==",
      "id": "17",
      "_etag": ""00002502-0000-0000-0000-584150c70000"",
      "minInclusive": "05C1E5E1A3EBA6",
      "maxExclusive": "05C1E7858FADEC",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgATAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAUAAAAAAAAUA==",
      "id": "18",
      "_etag": ""00002602-0000-0000-0000-584150c70000"",
      "minInclusive": "05C1E7858FADEC",
      "maxExclusive": "05C1E9297B7132",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAUAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAVAAAAAAAAUA==",
      "id": "19",
      "_etag": ""00002702-0000-0000-0000-584150c70000"",
      "minInclusive": "05C1E9297B7132",
      "maxExclusive": "05C1E9CD673378",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAVAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAWAAAAAAAAUA==",
      "id": "20",
      "_etag": ""00002802-0000-0000-0000-584150c70000"",
      "minInclusive": "05C1E9CD673378",
      "maxExclusive": "05C1EB7151F5BE",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAWAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAXAAAAAAAAUA==",
      "id": "21",
      "_etag": ""00002902-0000-0000-0000-584150c70000"",
      "minInclusive": "05C1EB7151F5BE",
      "maxExclusive": "05C1ED153DB904",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAXAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAYAAAAAAAAUA==",
      "id": "22",
      "_etag": ""00002a02-0000-0000-0000-584150c70000"",
      "minInclusive": "05C1ED153DB904",
      "maxExclusive": "05C1EDB9297B4A",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAYAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAZAAAAAAAAUA==",
      "id": "23",
      "_etag": ""00002b02-0000-0000-0000-584150c70000"",
      "minInclusive": "05C1EDB9297B4A",
      "maxExclusive": "05C1EF5D153D90",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAZAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    },
    {
      "_rid": "GZscAJ56rgAaAAAAAAAAUA==",
      "id": "24",
      "_etag": ""00002c02-0000-0000-0000-584150c70000"",
      "minInclusive": "05C1EF5D153D90",
      "maxExclusive": "FF",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/pkranges/GZscAJ56rgAaAAAAAAAAUA==/",
      "throughputFraction": 0.04,
      "_ts": 1480675515
    }
  ],
  "_count": 25
}

As you can see above, here’s 25 partitions. (The number of partitions is transparent and automatically determined by the storage size and RUs of the container. It’s not managed by your own.)
Each minInclusive and maxExclusive properties are hex number and it means the range (min and max) of hash. In Cosmos DB (server side), MurmurHash is used by the partitioning hash algorithm. For example, if the key value is “div200“, the computed hash of partition key is “05C1EDBFC1A70A“. (You can download this program to hash values from my repository “Github – tsmatz/azure-cosmosdb-partition-resolver“.)
That is :

"05C1EDB9297B4A" <= "05C1EDBFC1A70A" < "05C1EF5D153D90"

As a result, this document (which is having the key “div200“) is deployed to the 23rd physical partition (id=”23”).

If you frequently pick up the documents using /divisionid, you can route your query to the appropriate physical partition.
When you search documents using the partition key with “div200“, you send the query only to the 23rd partition as follows, and you don’t need to send to the others. (The following “GZscAJ56rgA=,23” in HTTP request header means the 23rd partition.)

Note : When you send query to the partitioned containers, the following “x-ms-documentdb-query-enablecrosspartition : True” is needed in request header.

POST https://myaccount01-eastus.documents.azure.com/dbs/db01/colls/test01/docs
x-ms-continuation: 
x-ms-documentdb-isquery: True
x-ms-documentdb-query-enablecrosspartition: True
x-ms-documentdb-partitionkeyrangeid: GZscAJ56rgA=,23
authorization: type%3dmas...
Accept: application/json
Content-Type: application/query+json

{"query":"SELECT * FROM root WHERE (root["divisionid"] = "div200") "}

HTTP/1.1 200 Ok
Content-Type: application/json
x-ms-item-count: 3

{
  "_rid": "GZscAJ56rgA=",
  "Documents": [
    {
      "divisionid": "div200",
      "content": "cont200",
      "id": "75233879-4ebf-4c37-ac9d-a3b373ac4441",
      "_rid": "GZscAJ56rgAOAAAAAACADg==",
      "_self": "dbs/GZscAA==/colls/GZscAJ56rgA=/docs/GZscAJ56rgAOAAAAAACADg==/",
      "_etag": ""0001ee0b-0000-0000-0000-584152af0000"",
      "_attachments": "attachments/",
      "_ts": 1480676011
    },
    {
      "divisionid": "div200",
      ...

    },
    {
      "divisionid": "div200",
      ...

    }
  ],

  "_count": 3
}

Note that the partitioned container has not only pros (benefits) but also cons (caveat).
Let’s consider the case if you search the documents without partition key. In this case, you must send the query to all the partitions (25 partitions in this case), and there is a significant overhead. Then if your application is having various kinds of documents, it’s better that you divide these documents into the different containers which might be a single partition container or partitioned container, and set the appropriate key for each partitioned container. (You must care which one should be the partition key.)

If you need to use the partitioned container and need to search without partition key, it’s better to send the query in parallel. When using .NET SDK, you must specify the following MaxDegreeOfParallelism property. (Of course, more high RUs are needed for the parallel execution.)
Otherwise, you fall into the sequential search of so many partitions, and it would take the tremendous time to return the results despite of setting the higher performance level (RU). (Inside, each call meets the performance level, but the overhead of total calls exceeds so much.)
Especially if you’re using .NET SDK, it automatically determines whether the multiple calls are required by the LINQ query expression, but it doesn’t automatically change into the parallel calls. You remember that you must set the parallel calls by your own.

...
using Microsoft.Azure.Documents.Client;
...

var client = new DocumentClient(
  new Uri(
    "https://myaccount01.documents.azure.com:443/"),
    "RQCgm3a...");  // key

var query = client.CreateDocumentQuery<MyDoc>(
  UriFactory.CreateDocumentCollectionUri("db01", "test01"),
  new FeedOptions
  {
    EnableCrossPartitionQuery = true,
    MaxDegreeOfParallelism = 10
  })
  .Where(p => p.UserName == "Tsuyoshi Matsuzaki");
var tsmatz = query.AsEnumerable().FirstOrDefault();
...

Note : For import and update operations, you can also use bulk executor.
Now, let’s consider you update a lot of documents. In this case, it consumes the latency of each round-trip on the client side. With bulk executor, the operation is committed as mini-batch for each partition key range on the server-side.
The bulk executor (bulk-import, bulk-update) has greater write throughput than a multi-threaded application that writes data in parallel.

Note (May 2021) : Azure Cosmos DB now supports gateway and integrated cache, which optimizes the cost for read-heavy workloads using in-memory cache for both items and queries.
Read requests are cached by the dedicated gateway, and repeated query requests that hit the integrated cache won’t use any RU. (But the dedicated compute resources will be hourly charged.)

Without cache

With dedicated gateway and cache

Replication

Another aspect of Cosmos DB scalability is “replication” (replica). Cosmos DB has not only the locally distributed strategies, but also the globally distributed strategies.

Let’s see the Cosmos DB management in Azure Portal.
If you click the global map, you can see the “replicate data globally blade” in the portal. To add the replication region, you just click and save the location in this blade. You can add or remove even if your container is running !

When you set regions as above screenshot, you can read (or query) documents from any 3 regions (“East US”, “Japan East”, and “West Europe”) of the client’s choice, and you can write documents to only the write region (“East US”).
You can get these endpoints from the REST API (or SDK) as follows.

GET https://myaccount01.documents.azure.com/
authorization: type%3dmas...

HTTP/1.1 200 Ok
Content-Type: application/json

{
  "_self": "",
  "id": "myaccount01",
  "_rid": "myaccount01.documents.azure.com",
  "media": "//media/",
  "addresses": "//addresses/",
  "_dbs": "//dbs/",
  "writableLocations": [
    {
      "name": "East US",
      "databaseAccountEndpoint": "https://myaccount01-eastus.documents.azure.com:443/"
    }
  ],
  "readableLocations": [
    {
      "name": "East US",
      "databaseAccountEndpoint": "https://myaccount01-eastus.documents.azure.com:443/"
    },
    {
      "name": "Japan East",
      "databaseAccountEndpoint": "https://myaccount01-japaneast.documents.azure.com:443/"
    },
    {
      "name": "West Europe",
      "databaseAccountEndpoint": "https://myaccount01-westeurope.documents.azure.com:443/"
    }
  ],
  "userReplicationPolicy": {
    "asyncReplication": false,
    "minReplicaSetSize": 3,
    "maxReplicasetSize": 4
  },
  "userConsistencyPolicy": {
    "defaultConsistencyLevel": "Session"
  },
  "systemReplicationPolicy": {
    "minReplicaSetSize": 3,
    "maxReplicasetSize": 4
  },
  "readPolicy": {
    "primaryReadCoefficient": 1,
    "secondaryReadCoefficient": 1
  },
  "queryEngineConfiguration": "{"maxSqlQueryInputLength":30720,...}"
}

These region are prioritized, and if you don’t specify any specific region using SDK, the region of the first priority (in this case, “East US”) is selected.
You can change this priority by drag-and-drop in Azure Portal. (You can also change the write region by the manual failover operation.)

This global multi-region setting affects to the availability and performance.

For the availability perspective, Cosmos DB is able to fail over automatically (transparently) with these regions. Cosmos DB is also having the local replication (4 replicas by default), and these replicas will work on any level of troubles.

For the performance perspective, let’s consider that you’re serving the applications or services in the world-wide. All the partitions might be replicated in all regions, and then your customer can read the data from the nearby region with low network latency.

Note that if you consume 20,000 RUs for each 3 regions, the total 60,000 RUs are needed for the performance level. But, if the read operations are distributed to other regions, eventually each RUs in one region would be reduced. When using replication, you must carefully estimate the costs.

Note : This replication mechanism is not for the database backup. When you overwrite the wrong data because of your mistakes, this data is also replicated soon.
If you want to take backup, Cosmos DB can hold on-demand backup, periodic backup by each 4 hours, or continuous backup with point-in-time restore. For more details, please see “Automatic online backup and restore with Azure Cosmos DB“.

Note (Sep 2018) : Now you can also replicate write operations globally with new Multi-Master capabilities.

Indexing (Query optimization)

Cosmos DB provides the rich query expression by SQL, including joins, string functions (contains, concat, …), spatial functions, and user-defined functions (custom functions), etc. You can also use LINQ in C#, and the query by the expression tree is extracted and executed on the server side.

Cosmos DB automatically indexes the documents, and this helps the high throughput of these query in most cases. But, you remember that not all the query is covered by the default index settings, and the index designing is still important for detailed cases. Here I explain how it works and how to design.

Now let’s see the following rest example. This example is creating the database container with index policy.

After this index policy is provisioned, when you insert the document { "divisionid" : "div001", "divisioninfo" : { "membercount" : 5, "name" : "engineering" }, "divisionloc" : "Japan" }, the 3 indexes of /divisionid, /divisioninfo/membercount, and /divisioninfo/name will be created. When you search documents with these properties, these indexes are used and improves the query performance.

Note that path property is the document path. When you use “?” (question) in path, it means that it’s the exact path. When you use “*” (asterisk), the all paths under the specified path are included.

POST https://myaccount01-eastus.documents.azure.com/dbs/db01/colls
authorization: type%3dmas...
Accept: application/json

{
  "id": "test01",
  "indexingPolicy": {
    "automatic": true,
    "indexingMode": "Consistent",
    "includedPaths": [
      {
        "path": "/divisionid/?",
        "indexes": [
          {
            "dataType": "String",
            "precision": -1,
            "kind": "Hash"
          }
        ]
      },
      {
        "path": "/divisioninfo/*",
        "indexes": [
          {
            "dataType": "String",
            "precision": -1,
            "kind": "Hash"
          },
          {
            "dataType": "Number",
            "precision": -1,
            "kind": "Range"
          }
        ]
      }
    ],
    "excludedPaths": [
      {
        "path": "/*"
      }
    ]
  }
}

HTTP/1.1 201 Created
Content-Type: application/json

{
  "id": "test01",
  "indexingPolicy": {
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
      {
        "path": "/divisionid/?",
        "indexes": [
          {
            "kind": "Hash",
            "dataType": "String",
            "precision": -1
          },
          {
            "kind": "Range",
            "dataType": "Number",
            "precision": -1
          }
        ]
      },
      {
        "path": "/divisioninfo/*",
        "indexes": [
          {
            "kind": "Hash",
            "dataType": "String",
            "precision": -1
          },
          {
            "kind": "Range",
            "dataType": "Number",
            "precision": -1
          }
        ]
      }
    ],
    "excludedPaths": [
      {
        "path": "/*"
      }
    ]
  },
  "_rid": "GZscAKjHAQ0=",
  "_ts": 1480945705,
  "_self": "dbs/GZscAA==/colls/GZscAKjHAQ0=/",
  "_etag": ""0000e300-0000-0000-0000-5845702d0000"",
  "_docs": "docs/",
  "_sprocs": "sprocs/",
  "_triggers": "triggers/",
  "_udfs": "udfs/",
  "_conflicts": "conflicts/"
}

The index kind must be “Hash” (hash index), “Range” (range index), or “Spatial” (spatial index). The hash index works when the equality query (=) is used. When you want to improve performance for the query with comparison (<, >, <=, !=, …) or sorting (order-by), the range index will work well.

Note : In Azure Cosmos DB, all the document is having the id property (which is automatically assigned to all the documents). This id property is unique in one single partition and having the hash index.

In the previous example I defined the index policy manually, but by default, Cosmos DB indexes every path (/*) in the document tree, and the string properties are indexed as the hash index and numeric properties as the range index.
The following is the result of provisioned index policy by default.

{
  "id": "test01",
  "indexingPolicy": {
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
      {
        "path": "/*",
        "indexes": [
          {
            "kind": "Range",
            "dataType": "Number",
            "precision": -1
          },
          {
            "kind": "Hash",
            "dataType": "String",
            "precision": 3
          }
        ]
      }
    ],
    "excludedPaths": [
      
    ]
  },
  "_rid": "GZscAIxwGQ0=",
  "_ts": 1480993718,
  "_self": "dbs/GZscAA==/colls/GZscAIxwGQ0=/",
  "_etag": ""0000e900-0000-0000-0000-58462bbc0000"",
  "_docs": "docs/",
  "_sprocs": "sprocs/",
  "_triggers": "triggers/",
  "_udfs": "udfs/",
  "_conflicts": "conflicts/"
}

When you set “lazy” as the index mode (indexingMode), the index is created asynchronously and it responds soon when the document is updated (create/update/delete). The default is “consistent“.

Note : You can also use etag to avoid conflicts and keep the consistency of data.

The index precision is the byte of precision, and “-1” means the maximum precision.
For example, if you specify “7” bytes as index precision for the number property (8 bytes), you can reduce the index storage, but the query consumes more IOs (inputs and outputs).

As I described before, this understanding is important for your actual applications.
For example, let’s say that you have the date/time property and use this property as the query parameter of timeline search. Cosmos DB is the JavaScript based database, and it’s not having the date/time type in native. In such a case, it’s better to store this value as the epoch time (numeric time format) and set the range index for this numeric property.

Indexing is transparent in most cases, but the designing is still important for detailed cases.

Note (Sep 2019) : Composite indexes were introduced in Azure Cosmos DB. You can use composite indexes in case, such as “ORDER BY” with multiple properties, filtering (query) on multiple properties, or mixed with “ORDER BY” and filtering.

Note (Nov 2019) : Now you can enable an analytical storage in Cosmos DB backend, in addition to a normal transactional storage. A transactional storage (enabled by default) is row-oriented and used for fast read and write operations. On the other hand, an analytical storage is columnar-based (like Parquet format) and used for analytic query and workloads, such as Apache Spark workloads.

Consistency

When the data is updated, this updates will be applied to any replicas. As you know, waiting all these updates affects to the performance of Cosmos DB, and there also exists the trade-off between the consistency and latency.
To meet your business needs, Cosmos DB has the several level of consistency. You can use the following 5 types of consistency level (top to bottom, getting relaxed for consistency), and by default, the “Session” consistency is used in Cosmos DB. (See “Consistency levels in Azure Cosmos DB” for details.)

Strong :
Mostly consistent. The reads are guaranteed to return the most recent committed version of an item.
Bounded staleness :
In this consistency level, you specify how long you can allow the lag of reads by at most “K” number of times of updates or by “T” time interval. (There’s also ordering guarantee for reads.)
If zero, it’s equivalent to strong consistency.
Session :
Consistent in a single client session, but it’s eventual for the outside of session.
In terms of session consistence, you should specify x-ms-session-token request header (which is returned as the response header beforehand) in each HTTP requests, and then Cosmos DB identifies the specific client session and keeps the consistency of this client session. (When using SDK, this is automatically done by SDK internally.)
Consistent prefix :
It guarantees the order. i.e, If data is written by the order : A　-> B -> C, it is guranteed that you can read data as A, AB, or ABC.
Eventual :
There’s no ordering guarantee for reads, but most low of latency.

The latency of strong consistency is hundreds times larger than eventual consistency, but strong and bounded staleness consistency meets the globally distributed application’s needs for replicas.

The consistency level is defined in the scope of database account, but you can apply to the specific requests using x-ms-consistency-level request header. (i.e, You can change the consistency level for each requests.)

Note : When you want to handle some bunch of operations, you can also use server-side JavaScript stored procedures or triggers for the consistent transactional operations.

Change logs :

2017/05 renamed “DocumentDB” with “Cosmos DB”, added about 5 consistency levels (changed from 4 levels)

Categories: Uncategorized

Tagged as: Azure

tsmatz

Professional Development, Data Science

Azure Cosmos DB (NoSQL) – How it works (RU, Partition, Index, and Replica)

Guarantee – SLA and Reserved Throughput

Partitioning

Replication

Indexing (Query optimization)

Consistency

1 reply »

Leave a Reply Cancel reply

Recent Posts

Reinforcement Learning

Language (NLP)

Databricks Hands-On

Azure ML Hands-On

Others (GitHub)

Tags

Follow

Azure Cosmos DB (NoSQL) – How it works (RU, Partition, Index, and Replica)

Guarantee – SLA and Reserved Throughput

Partitioning

Replication

Indexing (Query optimization)

Consistency

Share this:

Related

1 reply »

Leave a Reply Cancel reply

Recent Posts

Reinforcement Learning

Language (NLP)

Databricks Hands-On

Azure ML Hands-On

Others (GitHub)

Tags

Follow