Metrics

For basic metrics, Backbeat gathers, processes, and exposes six data points:

  • Number of operations (ops)
  • Number of completed operations (opsdone)
  • Number of failed operations (opsfail)
  • Number of bytes (bytes)
  • Number of completed bytes (bytesdone)
  • Number of failed bytes (bytesfail)

Design

To collect metrics, a separate Kafka producer and consumer pair (MetricsProducer and MetricsConsumer) produces and consumes entries on its own Kafka topic (“backbeat-metrics” by default).

When a new CRR entry is sent to Kafka, a Kafka entry is also produced to the metrics topic, indicating that ops and bytes should be incremented. On consumption of this metrics entry, Redis keys are generated with the following schema:

Site-level CRR metrics Redis key:

<site-name>:<default-metrics-key>:<ops-or-bytes>:<normalized-timestamp>

Object-level CRR metrics Redis key:

<site-name>:<bucket-name>:<key-name>:<version-id>:<default-metrics-key>:<ops-or-bytes>:<normalized-timestamp>

A normalized timestamp determines the time interval on which to set the data. The default metrics key ends with the type of data point it represents.
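The key schema above can be sketched as follows. This is an illustrative sketch only, not Backbeat's implementation: the function names are hypothetical, and it assumes the normalized timestamp is the epoch time in seconds rounded down to the start of its interval, which the text does not specify.

```python
def normalize_timestamp(epoch_seconds: int, interval_seconds: int = 300) -> int:
    """Round a timestamp down to the start of its interval (assumed behavior)."""
    return epoch_seconds - (epoch_seconds % interval_seconds)


def site_key(site: str, metrics_key: str, data_point: str, timestamp: int) -> str:
    """Build a site-level key following the documented schema."""
    return ":".join([site, metrics_key, data_point, str(timestamp)])


def object_key(site: str, bucket: str, key: str, version_id: str,
               metrics_key: str, data_point: str, timestamp: int) -> str:
    """Build an object-level key following the documented schema."""
    return ":".join([site, bucket, key, version_id,
                     metrics_key, data_point, str(timestamp)])


# "crr-metrics" is a placeholder for <default-metrics-key>, not a real default.
ts = normalize_timestamp(1000)
print(site_key("site1", "crr-metrics", "ops", ts))
```

All timestamps that fall in the same interval map to the same key, so repeated increments within an interval accumulate in one Redis entry.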

When the CRR entry is consumed from Kafka, processed, and the metadata for replication status is updated to a final state (that is, COMPLETED or FAILED), a Kafka entry is sent to the metrics topic. This entry indicates that opsdone and bytesdone should be incremented if replication succeeded, or opsfail and bytesfail if it failed. Again, on consumption of this metrics entry, Redis keys are generated for the respective data points.

Note that a MetricsProducer is initialized and produces to the metrics topic in two places: when the CRR topic BackbeatProducer produces and sends a Kafka entry, and when the CRR topic BackbeatConsumer consumes and processes its Kafka entries. The MetricsConsumer processes these Kafka metrics entries and writes the results to Redis.

A single-location CRR entry produces four keys in total. The data points stored in Redis are saved in intervals (default of 5 minutes) and are available up to an expiry time (default of 15 minutes).
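The four keys for a single-location entry can be traced with a small sketch. The "QUEUED" label and the function name are hypothetical; only the data point names and the COMPLETED and FAILED states come from the text above.

```python
from typing import Dict, List

# Data points incremented at each stage. "QUEUED" is a hypothetical label for
# the moment a CRR entry is first sent to Kafka.
DATA_POINTS_BY_EVENT: Dict[str, List[str]] = {
    "QUEUED": ["ops", "bytes"],
    "COMPLETED": ["opsdone", "bytesdone"],
    "FAILED": ["opsfail", "bytesfail"],
}


def data_points_for_entry(final_status: str) -> List[str]:
    """Return every data point one CRR entry touches over its lifetime."""
    if final_status not in ("COMPLETED", "FAILED"):
        raise ValueError("final_status must be COMPLETED or FAILED")
    return DATA_POINTS_BY_EVENT["QUEUED"] + DATA_POINTS_BY_EVENT[final_status]


print(data_points_for_entry("COMPLETED"))
```

Either outcome yields two data points at queue time and two at completion or failure, which matches the four keys described above.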

An object CRR entry creates one key. An initial key is set when the CRR operation begins, storing the total size of the object to be replicated. Then, for each part of the object that is transferred to the destination, another key is set (or incremented if a key already exists for the current timestamp) to reflect the number of bytes that have completed replication. The data points stored in Redis are saved in intervals (default of 5 minutes) and are available up to an expiry time (default of 24 hours).

Throughput for object CRR entries is available up to an expiry time (default of 15 minutes). Object CRR throughput is the average number of bytes transferred per second within the latest 15 minutes.
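As a rough illustration of this average, the sketch below sums the bytes recorded in each interval of the window and divides by the window length. The function name and the input shape (one byte count per 5-minute interval) are assumptions made for the example.

```python
from typing import Iterable


def average_throughput(bytes_per_interval: Iterable[int],
                       window_seconds: int = 900) -> float:
    """Average bytes per second over the window (default 15 minutes)."""
    return sum(bytes_per_interval) / window_seconds


# Three 5-minute intervals covering the latest 15 minutes (made-up values).
print(average_throughput([300000, 600000, 0]))
```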

A BackbeatServer (default port 8900) and BackbeatAPI expose these metrics stored in Redis by querying based on the prepended Redis keys. Using these data points, we can calculate simple metrics like backlog, number of completions, progress, throughput, etc.
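One way to derive such metrics from the six data points is sketched below. The formulas are assumptions inferred from the descriptions in this document (pending excludes both completed and failed operations, while backlog still counts failed ones), not a statement of how Backbeat computes them internally.

```python
from typing import Dict


def pending_count(counters: Dict[str, int]) -> int:
    """Operations queued but neither completed nor failed (assumed formula)."""
    return max(counters["ops"] - counters["opsdone"] - counters["opsfail"], 0)


def backlog_count(counters: Dict[str, int]) -> int:
    """Operations not yet completed; failed ones still count (assumed formula)."""
    return max(counters["ops"] - counters["opsdone"], 0)


counters = {"ops": 10, "opsdone": 4, "opsfail": 2}  # made-up sample values
print(pending_count(counters), backlog_count(counters))
```

The results are clamped at zero because counters stored in expiring intervals may not line up exactly.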

Metrics API

Routes are organized as follows:

/_/backbeat/api/metrics/<extension-type>/<location-name>/[<metric-type>]/[<bucket>]/[<key>]?[versionId=<version-id>]

Where:

  • <extension-type> currently supports only crr for replication metrics
  • <location-name> represents any current destination replication locations you have defined. To display metrics for all locations, use all
  • <metric-type> is an optional field. If you specify a metric type, Backbeat returns the specified metric. If you omit it, Backbeat returns all available metrics for the given extension and location.
  • <bucket> is an optional field. It carries the name of the bucket in which the object is expected to exist.
  • <key> is an optional field. When getting CRR metrics for a particular object, it contains the object’s key.
  • <version-id> is an optional field. When getting CRR metrics for a particular object, it contains the object’s version ID.
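The route layout above can be assembled programmatically. The following is a minimal sketch: the function name is hypothetical, and the percent-encoding choices (slashes left intact in object keys) are assumptions rather than documented Backbeat behavior.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def metrics_path(location: str, metric_type: Optional[str] = None,
                 bucket: Optional[str] = None, key: Optional[str] = None,
                 version_id: Optional[str] = None, extension: str = "crr") -> str:
    """Build a metrics route path from its required and optional parts."""
    parts = ["/_/backbeat/api/metrics", extension, quote(location, safe="")]
    if metric_type:
        parts.append(metric_type)
    if bucket and key:
        if not metric_type:
            raise ValueError("object-level routes need a metric type")
        parts += [quote(bucket, safe=""), quote(key, safe="/")]
    path = "/".join(parts)
    if version_id:
        path += "?" + urlencode({"versionId": version_id})
    return path


print(metrics_path("all"))
print(metrics_path("site1", "progress", "mybucket", "mykey", version_id="v1"))
```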

Backbeat offers routes for the following services:

All metric routes contain a <location-name> variable.

The site name must match the name specified in env_replication_endpoints under the backbeat replication configurations in env/client_template/group_vars/all.

If the site uses a different cloud backend (for example, AWS or Azure), use that backend’s defined type (such as aws_s3 or azure).

Get All Metrics

Route: GET /_/backbeat/api/metrics/crr/<location-name>

This route gathers all metrics for the requested location name and extension type, returning the requested information in a JSON-formatted object.

Get Pending

Route: GET /_/backbeat/api/metrics/crr/<location-name>/pending

This route returns pending replication in number of objects and number of total bytes. The bytes total represents data only and does not include the size of metadata.

Pending replication represents the objects that have been queued for replication to another site, but for which the replication task has neither completed nor failed.

Response:

"pending":{
  "description":"Number of pending replication operations (count) and bytes (size)",
  "results":{
    "count":0,
    "size":0
  }
}

Get Backlog

Route: GET /_/backbeat/api/metrics/crr/<location-name>/backlog

This route returns the replication backlog in number of objects and number of total bytes for the specified extension type and location name. Replication backlog represents the objects that have been queued for replication to another location, but for which the replication task is not complete. If replication for an object fails, failed object metrics are considered backlog.

Response:

"backlog":{
  "description":"Number of incomplete replication operations (count) and number of incomplete bytes transferred (size)",
  "results":{
    "count":4,
    "size":"6.12"
  }
}

Get Completions

Route: GET /_/backbeat/api/metrics/crr/<location-name>/completions

This route returns the replication completions in number of objects and number of total bytes transferred for the specified extension type and location. Completions are only collected up to an EXPIRY time, which is currently set to 15 minutes.

Response:

"completions":{
  "description":"Number of completed replication operations (count) and number of bytes transferred (size) in the last 900 seconds",
  "results":{
    "count":31,
    "size":"47.04"
  }
}

Get Failures

Route: GET /_/backbeat/api/metrics/crr/<location-name>/failures

This route returns the replication failures in number of objects and number of total bytes for the specified extension type and location. Failures are collected only up to an EXPIRY time, currently set to a default 15 minutes.

Response:

"failures":{
  "description":"Number of failed replication operations (count) and bytes (size) in the last 900 seconds",
  "results":{
    "count":"5",
    "size":"10.12"
  }
}
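The sample responses show count and size sometimes as numbers and sometimes as strings, so a client may want to normalize them. A minimal sketch follows; it assumes the metric fragment is wrapped in a JSON object, and the helper name is hypothetical.

```python
import json
from typing import Dict


def parse_results(body: str, metric: str) -> Dict[str, float]:
    """Extract count and size for one metric, converting strings to floats."""
    results = json.loads(body)[metric]["results"]
    return {"count": float(results["count"]), "size": float(results["size"])}


# Sample body based on the failures response above, wrapped in an object.
body = '{"failures": {"description": "...", "results": {"count": "5", "size": "10.12"}}}'
print(parse_results(body, "failures"))
```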

Get Throughput: Ops/Sec

Route: GET /_/backbeat/api/metrics/crr/<location-name>/throughput

This route returns the current throughput in number of completed operations per second (or number of objects replicating per second) and number of total bytes completing per second for the specified type and location name.

Response:

"throughput":{
  "description":"Current throughput for replication operations in ops/sec (count) and bytes/sec (size)",
  "results":{
    "count":"0.00",
    "size":"0.00"
  }
}

Get Throughput: Bytes/Sec

Route: GET /_/backbeat/api/metrics/crr/<location-name>/throughput/<bucket>/<key>?versionId=<version-id>

This route returns the throughput in number of total bytes completing per second for the specified object.

Response:

{
  "description": "Current throughput for object replication in bytes/sec (throughput)",
  "throughput": "0.00"
}

Get Progress

Route: GET /_/backbeat/api/metrics/crr/<location-name>/progress/<bucket>/<key>?versionId=<version-id>

This route returns replication progress in bytes transferred for the specified object.

Response:

{
  "description": "Number of bytes to be replicated (pending), number of bytes transferred to the destination (completed), and percentage of the object that has completed replication (progress)",
  "pending": 1000000,
  "completed": 3000000,
  "progress": "75%"
}
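The progress value in this sample is consistent with dividing completed bytes by the total of pending and completed bytes. The sketch below reproduces that arithmetic; the function name, the rounding, and the zero-byte handling are assumptions.

```python
def progress_percent(pending: int, completed: int) -> str:
    """Percentage of the object replicated, as completed / (pending + completed)."""
    total = pending + completed
    if total == 0:
        return "0%"  # assumed behavior when no bytes are recorded
    return f"{round(100 * completed / total)}%"


# Values from the sample response above.
print(progress_percent(1000000, 3000000))
```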