Skip to main content

Monitoring Ember Metrics

Overview

This document describes how Ember can be monitored using popular tools Prometheus, Grafana, and Zabbix.

  • Prometheus open-source monitoring software.
  • Grafana is open source time-series data analytics software.

Installation and initial configuration of Prometheus and Grafana are out of scope for this document.

Operations team should monitor vital system metrics like amount of free disk, and memory, system error log, etc.

Execution Server publishes custom metrics that can help monitor system performance and health. Each Ember server publishes its metrics (also known as “performance counters”) into local file in custom binary format (using shared memory).

note

REST API endpoint for Prometheus metrics export is available at https://emberhost:8988/actuator/prometheus

Ember Monitor has endpoint to export these metrics into Prometheus. See Ember Monitor Configuration Guide on how enable this feature.

Full set of metrics can be found at the end of this document. Here we will mention some key metrics related to error and overload.

Each deployment may be different, but consider to monitoring the following important Ember metrics.

Error counting metrics

  • Ember_Errors – general system error counter (non-zero value is reason for concern)
  • Ember_Rejects – rejected orders counter (sharp increase is a bad sign)
  • Ember_KillSwitch – system or operator halted trading
  • Ember_RiskVetos – count risk limit violations (sharp increase is a bad sign)
  • Ember_UnknownOrderEvents – counts flow related to orders unknown to Ember OMS
  • TradeConnector_Status – signals loss of connectivity to Order Entry connector.
  • ImboundSegmentMisses
  • MessageBus_Trasp_OutboundBackPressureDisconnects
  • MessageBus_Trasp_InboundBackPressureDisconnects

Overload metrics

Ember uses efficient in-memory queues to transmit signals between OMS and order destinations like Algorithms and Trade Connectors.

"... When in use, queues are typically always close to full or close to empty due to the differences in pace between consumers and producers. They very rarely operate in a balanced middle ground where the rate of production and consumption is evenly matched." Source

Ember queues

OMS
  • InboundFailedOffers - Signals main Ember Journal Queue back-pressure.
  • TradeConnector.id.OutboundFailedOffers - Number of unsuccessful writes to OMS event queue. Non-zero value indicates OMS overload (message queue overflow).
  • Algorithm.id.OutboundFailedOffers - Number of unsuccessful writes to OMS event queue. Non-zero value indicates OMS overload (message queue overflow).
  • MessageBus.Transp.OutboundBackPressureDisconnects - Counts Message bus client disconnects due to pressure on send.
  • MessageBus.Transp.InboundBackPressureDisconnects - Counts Message bus client disconnects due to pressure on receive (usually accompanied by InboundFailedOffers counter).
  • MessageBus.InboundFailedOffers -
Algorithm or Connector
  • Algorithm.id.InboundFailedOffers - Number of unsuccessful writes to Algorithms’s request Queue. Non-zero value indicates algorithm overload (message queue overflow). For example, Algorithm.MM.InboundFailedOffers counter may indicate MM (MarketMaker) algorithm overload.
  • TradeConnector.id.InboundFailedOffers - Number of unsuccessful writes to Connector’s request Queue. Non-zero value indicates connector overload (message queue overflow).

In addition, if you use FIX Gateway, read-up on overload metrics in FIX Gateway Administrator’s Guide.

Prometheus

Once you enable Prometheus in ES Monitor and added Ember exporter stanza in Prometheus you need to restart both services. After that you can connect to simple Prometheus dashboard usually running on port 9090 to make sure configuration is correct:

Ember metric in Prometheus

Ember metrics are grouped into several categories:

  • Ember process-wide metrics (active orders count, errors, trading status)
  • Per-algorithm metrics (number of errors, custom metrics)
  • Per-trading connector metrics (connection status, number of errors, etc.)
  • Per-FIX Gateway session metrics (optionally per FIX Session metrics).
  • etc.

Here we chart active orders count (metric Ember_ActiveOrders):

Ember Metric in Prometheus

For components that usually run as multiple instances, we use Prometheus maps where component ID is used as a key. For example, on screenshot below we chart the status of trading connectors using TradeConnector_Enabled metric. Our specific ember process has three instances of trading connectors deployed (BITSTAMP, CORRECTION, and SIM). Status Each trading connector instance appears as separate value (1 = connected, 0 = disconnected):

Ember in Prometheus

Grafana

You can use popular charting tool Grafana to create Ember monitoring dashboard and setup alerts.

Ember metrics in Grafana

Grafana dashboards

Adding Prometheus-based metrics is simple. Just select Prometheus as Grafana panel data source and select Ember-related metric.

You may notice that Ember metrics are grouped into Ember, Algorithm, TradeConnector, TradingGateway, MarketDataGateway groups.

Ember ActiveOrders metric in Grafana dashboard

Alerts

Alerts in Grafana can be added for any dashboard. Use Alert tab in dashboard editor (third tab in the screenshot above).

Ember JMX Metrics Agent

As additional option you can export Ember counters via JMX. JMX is widely used protocol to monitor Java applications. This option requires running small service that reads counters and exposes them as JMX Metrics.

java -server -cp "/deltix/ember/lib/*" -Dember.work=/deltix/emberhome -Dember.home=/deltix/emberhome  -Djava.rmi.server.hostname=10.0.0.46 -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=8053 -Dcom.sun.management.jmxremote.rmi.port=8053 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false       -Dzabbix.url="http://10.0.0.161:8077" -Dzabbix.user=Admin -Dzabbix.pass=zabbix deltix.ember.app.EmberStat2JmxAgent

Here we assume that this host has local IP address 10.0.0.46 and Zabbix server is running on 10.0.0.161.

Zabbix

This section describes how to monitor Ember JMX metrics via Zabbix. Connect to your Zabbix server:

Default port is 8077. Default username and password for Zabbix is Admin/zabbix

First, go to Configuration > Hosts and verify that Ember host is configured and JMX is available:

Zabbix screenshot 1

If not, verify that Ember JMX metrics agent is running.

Next, go to back to Monitoring > Dashboard and press ‘Edit Dashboard’ button in upper right corner.

Then click ‘Add Widget’ button to add Ember metrics to dashboard.

Zabbix screenshot 2

Make sure you use ‘Graph’ widget type that uses ‘Simple graph’ sources to see list of Ember metrics available for monitoring. Here we added Ember.ActiveOrders counter that displays number of active orders.

You can add multiple widgets displaying various metrics. Click ‘Save Changes’ in the upper right corner when done.

Your widgets should appear on dashboard and should look like this:

Zabbix screenshot 3

Zabbix charts have a bit outdated look and feel, use Grafana if you want more modern dashboard.

Displaying Zabbix metrics in Grafana

Adding Zabbix-provided metrics is very similar to Prometheus metrics (see chapter above).

  • DataSource: Zabbix
  • Group: QuantServer
  • Server: Ember Server
  • Item: metric that you want to display

Zabbix screenshot 4

Here we also use function rate() to display change rate of selected metric. In Grafana you can select more than one metric per chart.

When done, click on Return to dashboard button in the upper right corner.

Zabbix screenshot 5

Zabbix screenshot 6

Appendix: List of Ember Metrics

Here is a list of standard Ember Metrics as of version 1.6.72 (July 2020):

Ember OMS Counters

MetricDescription
Ember.TotalOrdersTotal number of orders processed
Ember.TotalTradesTotal number of trades (fills) processed/
Ember.ActiveOrdersCurrent number of active orders in OMS cache
Ember.RejectsTotal number of order rejects processed
Ember.RequestsTotal number of trading requests (order submissions, replacements, cancellations) processed
Ember.RequestRateRequest Rate
Ember.EventsTotal number of trading events (order acks/rejects, fills, replacement and cancel acks, etc.) processed
Ember.EventRateEvent rate
Ember.RiskVetosTotal number of risk limit violations
Ember.UnknownOrderEventsTotal number of unknown order events
Ember.ErrorsTotal number of internal exceptions in OMS related to message processing
Ember.OrdersPoolSizeTo prevent frequent memory allocations OMS maintains object pool of order objects. After some time, inactive (completed) orders return to the pool. Persistent growth of this counter may be a sign of memory leak.
Ember.OrderEntriesPoolSizeTo prevent frequent memory allocations OMS maintains object pool of order events objects. After some time, inactive (completed) orders return their events back to the pool. Persistent growth of this counter may be a sign of memory leak.
Ember.KillSwitch1 = Trading Halted, 0 = Trading Resumed (normal operation mode)
Ember.RoleCluster Role of this node: 0 = Bootstrap (Start-up), 1 = Follower 2= Leader, 3 = Candidate
Ember.HeartbeatPublishes Ember’s process clock (in Linux Epoch Time format)

Algorithm Counters

The following table lists standard counters available for each algorithm. Note: algorithms very frequently define custom counters.

MetricDescription
Algorithm.id.OutboundFailedOffersNumber of unsuccessful writes to OMS event queue. Non-zero value indicates OMS overload (message queue overflow)
Algorithm.id.OutboundSucceededOffersNumber of writes to OMS event queue
Algorithm.id.InboundFailedOffersNumber of unsuccessful writes to Algorithm’s request Queue. Non-zero value indicates algorithm overload (message queue overflow)
Algorithm.id.InboundSucceededOffersNumber of writes into Algorithm’s request queue
Algorithm.id.Enabled1=Enabled, 0=Disabled

Where .id. in each metric name identifies algorithm. For example, counter Algorithm.NIAGARA.OutboundFailedOffers provides number of outbound failed offers for algorithm NIAGARA.

Trading Connector Counters

For each trading connector configured in the system (and identified by Id) you will see the following counters.

MetricDescription
TradeConnector.id.OutboundFailedOffersNumber of unsuccessful writes to OMS event queue. Non-zero value indicates OMS overload (message queue overflow)
TradeConnector.id.OutboundSucceededOffersNumber of writes to OMS event queue
TradeConnector.id.Status0 = Disconnected, 1 = Connected
TradeConnector.id.InboundFailedOffersNumber of unsuccessful writes to Connector’s request Queue. Non-zero value indicates connector overload (message queue overflow)
TradeConnector.id.InboundSucceededOffersNumber of writes into Connector’s request queue
TradeConnector.id.Enabled1=Enabled, 0=Disabled

Where .id. in each metric name identifies trade connector. For example, counter TradeConnector.CME.OutboundFailedOffers provides number of outbound failed offers for connector CME.

FIX Gateway: MARKET DATA Gateway Counters

FIX Market Data Gateway counters are also described in FIX Gateway Administrator’s Guide (See overload control section). Each configured gateway is identified by Id.

MetricDescription
MarketDataGateway.id.Transp.BytesReceivedTotal Number of bytes received from all clients of this FIX Gateway (at Transport Layer)
MarketDataGateway.id.Transp.BytesSentTotal Number of bytes sent to clients of this FIX Gateway (at Transport Layer)
MarketDataGateway.id.Transp.MessagesReceivedTotal Count of messages received from all clients of this FIX Gateway (at Transport Layer)
MarketDataGateway.id.Transp.MessagesSentTotal Count of messages sent to clients of this FIX Gateway (at Transport Layer)
MarketDataGateway.id.Transp.OutboundBackPressureDisconnectsTransport layer disconnects a client if it cannot write data into client socket (send buffer overflow). This counter reflects total number of such disconnects.
MarketDataGateway.id.Transp.InboundBackPressureDisconnectsTransport layer disconnects a client if it cannot enqueue inbound FIX message into Session Layer (usually due to overload). This counter reflects total number of such disconnects.
MarketDataGateway.id.Transp.ConnectionsAcceptedTotal number of FIX client connections accepted from server start up.
MarketDataGateway.id.Transp.ConnectionsRejectedTotal number of FIX client connections rejects from server start up. Client connection can be rejected due to invalid password, incorrect source IP, or session comp ID mismatch, etc. For example, attempts to connect on gateway port that already has active FIX session also results in a reject and included in this counter.
MarketDataGateway.id.Transp.ConnectionsActiveCurrent count of active FIX clients (sessions)
MarketDataGateway.id.Transp.DelayedDisconnectsTotal
MarketDataGateway.id.Transp.InboundThrottleIncremented when Transport Layer throttles inbound FIX messages to reduce system load. Throttling logic is activated when inbound queue that connects Transport Layer with Session layer has less than inboundThrottleLimit bytes.
MarketDataGateway.id.Transp.IdleCyclesTotal number of times when Transport Layer finished execution cycle with zero work done (i.e. idling).
MarketDataGateway.id.Transp.ActiveCyclesTotal number of times when Transport Layer finished execution cycle with some work done (e.g. some messages pulled from TimeBase, some messages sent to FIX client, etc.)
MarketDataGateway.id.Transp.ActiveTimeTotal time spent on work on Transport Layer. This counter must be enabled explicitly by the gateway configuration option "measureActiveTime = true". Turned off by default because it calls System.nanoTime() each time when we switch between active and idle states. It is not recommended turning this counter on if you use more than one MDG in same Ember instance.
MarketDataGateway.id.Sess.SendQueueBackPressure
MarketDataGateway.id.Transp.SendQueueSizeMaximum observed size of outbound queue over 1 second interval on the side on consumer (Transport Layer).
MarketDataGateway.id.Server.MDPublishFailureIncremented each time when Session Layer cannot publish Market Data Update to Session Layer (usually due to overload from outbound back pressure).
MarketDataGateway.id.Server.TradePublishFailureIncremented each time when Session Layer cannot publish Market Trade message to Session Layer (usually due to overload from outbound back pressure).
MarketDataGateway.id.Server.OtherMessagePublishFailureIncremented each time when Session Layer cannot publish non-market data message to Session Layer (usually due to overload from outbound back pressure).
MarketDataGateway.id.Server.MDPublishDelayBackPressureThis metric report number of times MDG delayed sending market data to clients due to back pressure (outbound message queue is at least half full). If you observe growth of this metric, then Market Data gateway is unable to send data as fast as it configured (See “minimumUpdateInterval” setting).
MarketDataGateway.id.Server.AppMessagesProcessedTotal number of application-level FIX messages processed by Session Layer.
MarketDataGateway.id.Server.FixMessagesProcessedTotal number of all FIX messages processed by Session Layer.
MarketDataGateway.id.Server.MarketDataUpdatesProcessedTotal number of Market Data feed messages (typically from TimeBase) processed by FIX Market Data Session Layer.
MarketDataGateway.id.Server.Errors
MarketDataGateway.id.Server.StalePollerReset
MarketDataGateway.id.Server.MarketMessageInfoReceived
MarketDataGateway.id.ProxyFailedOffers

Where .id. in each metric name identifies market data gateway.

FIX Gateway: Trade Gateway Counters

MetricDescription
TradeGateway.id.Transp.BytesReceivedTotal Number of bytes received from all clients of this FIX Gateway (at Transport Layer)
TradeGateway.id.Transp.BytesSentTotal Number of bytes sent to clients of this FIX Gateway (at Transport Layer)
TradeGateway.id.Transp.MessagesReceivedTotal Count of messages received from all clients of this FIX Gateway (at Transport Layer)
TradeGateway.id.Transp.MessagesSentTotal Count of messages sent to clients of this FIX Gateway (at Transport Layer)
TradeGateway.id.Transp.OutboundBackPressureDisconnectsTransport layer disconnects a client if it cannot write data into client socket (send buffer overflow). This counter reflects total number of such disconnects.
TradeGateway.id.Transp.InboundBackPressureDisconnectsTransport layer disconnects a client if it cannot enqueue inbound FIX message into Session Layer (usually due to overload). This counter reflects total number of such disconnects.
TradeGateway.id.Transp.ConnectionsAcceptedTotal number of FIX client connections accepted from server start up.
TradeGateway.id.Transp.ConnectionsRejectedTotal number of FIX client connections rejects from server start up. Client connection can be rejected due to invalid password, incorrect source IP, or session comp ID mismatch, etc. For example, attempts to connect on gateway port that already has active FIX session also results in a reject and included in this counter.
TradeGateway.id.Transp.ConnectionsActiveCurrent count of active FIX clients (sessions)
TradeGateway.id.Transp.DelayedDisconnectsTotal
TradeGateway.id.Transp.InboundThrottleIncremented when Transport Layer throttles inbound FIX messages to reduce system load. Throttling logic is activated when inbound queue that connects Transport Layer with Session layer has less than inboundThrottleLimit bytes.
TradeGateway.id.Transp.IdleCyclesTotal number of times when Transport Layer finished execution cycle with zero work done (i.e. idling).
TradeGateway.id.Transp.ActiveCyclesTotal number of times when Transport Layer finished execution cycle with some work done (e.g. some messages pulled from TimeBase, some messages sent to FIX client, etc.)
TradeGateway.id.Sess.SendQueueBackPressure
TradeGateway.id.Transp.SendQueueSizeMaximum observed size of outbound queue over 1 second interval on the side on consumer (Transport Layer).
TradeGateway.id.OutboundFailedOffers
TradeGateway.id.OutboundSucceededOffers
TradeGateway.id.InboundFailedOffers
TradeGateway.id.InboundSucceededOffers

Where .id. in each metric name identifies order entry gateway.

Appendix: Examples

Here is an example of problematic Ember state.

You can see that system is not incrementing total number of orders/trades counter and at the same time InboundFailedOffers keeps raising (messages cannot be written into Ember Journal).

In this particular case system run out of disk space on $EMBER_WORK volume. As result, no new messages can be written into ember journal and hence OMS stopped processing new orders.

Overload1