Matching Engine Order Acknowledgement Latency

May 2023

Introduction

This document analyzes an experiment measuring the order acknowledgement latency of a First-In-First-Out (FIFO) matching engine developed on the Ember stack. Data was gathered from an Ember client's live production order traffic during active trading hours.

Results Overview

The average order acknowledgement latency in PROD stood at approximately 30 microseconds. 99% of the orders were acknowledged in less than 133 microseconds.

Nov 2023 Update: migration of target environment to Xilinx ONLOAD reduced median latency to 16 microseconds and 99% percentile to 65 us.

These measurements were assessed from the moment the operating system received the inbound order network packet until the outgoing order acknowledgement network packet was dispatched (using libpcap for FIX message capture).

April 2024 Update: migration of environment to a modern high end server reduced median latency in PROD to 6 microseconds (and 99% to 12 microseconds).

Environment Variables

The environment for the analysis consisted of the following components:

Hardware - Beeks Platinum server: Single Intel Xeon CPU E5-2695 v4 @ 2.1 GHz (18 physical cores); RAM: 64G; SSD: 480G.
Operating System (OS) - Ubuntu 22.04 with modified kernel (built with CONFIG_NO_HZ parameter).
OS Tuning - OS was tuned according to this Linux tuning guide.
Ember and TimeBase - Running under docker-compose, with host networking mode.

Measurement Details

Order acknowledgement latency is defined as the time elapsed between the receipt of a FIX message containing order-related requests on our end and the subsequent issuance of our initial response.

We gathered results from 8:30 AM to 12:00 PM (UTC-4). During this period, we captured metrics for approximately 111,000 orders from a single market maker.

Our latency measurement tool captured network packets using LIBPCAP API. For each inbound packet containing FIX message with new order request, the tool establish correlation with an outbound packet (containing order ACK FIX message for the same order). The difference in Linux kernel-provided timestamps of these packets gave us order ACK latency of individual order. During the duration of our test we were able to collect measurement 111K production orders. HDR Histogram was used to calculate latency percentiles. A simplified version of this tool is available here.

The diagram below is a visual representation of the latency testing process:

latency test diagram

Detailed Results

The vast majority of the observed network traffic consisted of FIX Order Replace Request messages (35=G), accompanied by a smaller number of Order New Requests (35=D) and Order Cancel Requests (35=F).

Each of these requests typically prompted one or more response Execution Report messages (35=8). For instance, an order could result in an initial acknowledgement (150=0), followed by several Execution Report messages (150=F).

The following table contains the latency histogram, with times shown in microseconds:

Percentile : Microseconds : Event count  
MIN        :         23.0 :           6  
0%      :         30.0 :       57761  
0%      :         55.0 :      100940  
0%      :        133.0 :      110655  
9%      :        310.0 :      111638  
99%     :        538.0 :      111738  
999%    :       2495.0 :      111748  
9999%   :       2575.0 :      111749  
99999%  :       2575.0 :      111749  
999999% :       2575.0 :      111749  
MAX, TOTAL :       2575.0 :      111749

The same results shown in chart format:

percentiles chart

Comparison with AWS-Based Benchmark

Deltix has an internal performance benchmark that measures a similar order acknowledgement latency in an AWS-based configuration. In this setup, the matching engine is fed with 20,000 order requests per second from 10 FIX-based bots that simulate real clients.

Here is a sample result, based on a dataset of 2.2 million orders:

Percentile : Microseconds : Event count
MIN        :         13.0 :         117
0%      :         22.0 :     1166665
0%      :         71.0 :     2028273
0%      :        100.0 :     2225310
9%      :        129.0 :     2244197
99%     :        157.0 :     2246105
999%    :        183.0 :     2246294
9999%   :        797.0 :     2246314
99999%  :        992.0 :     2246316
999999% :        992.0 :     2246316
MAX, TOTAL :        992.0 :     2246316

The AWS environment offers better CPUs (c5d.12xlarge with Intel Xeon Platinum 8275CL CPU @ 3.00GHz) that result in better mean time but lead to more jitter.

The following histogram chart compares the latency results between the production environment at Beeks (blue) and the AWS test environment (red), with lower values indicating better performance:

Beeks vs. AWS Chart

Further Steps

To further reduce order acknowledgement latency, there are several strategies that can be explored.

TCP Bypass

November 2023 Update: Results after setup was upgraded to run under Xilinx ONLOAD driver:

Percentile	Latency (μs)
MIN	13
50.0%	24
90.0%	54
99.0%	247
99.9%	476
99.99%	746
99.999%	746
99.9999%	746
99.99999%	746
99.999999%	746
MAX, TOTAL	746

Upgrade to a Modern, Overclocked CPU

The benchmark results from the AWS-based setup indicate that achieving an average order acknowledgement time as low as 12 microseconds is possible with more advanced CPUs. Considering this, an upgrade of the production environment to a contemporary, overclocked CPU is worth pursuing. This upgrade is currently on our horizon.

March 2024 Update: Production was upgraded to a modern server with Intel® Xeon® w7-2495X running at 4.8Ghz. Updated results:

Percentile	Latency (μs)
MIN	4
50.0%	6
90.0%	8
99.0%	12
99.9%	27
99.99%	90
99.999%	171
99.9999%	247
99.99999%	247
99.999999%	247
MAX, TOTAL	247

Switch to Commercial JVM with Better JIT

Our order ACK latency benchmark shows that switching from OpenJDK to GraalVM Enterprise or Azul Zulu Prime improves latency metrics by about 16%:

Azul Zulu Prime 11 vs Amazon Corretto OpenJDK 11

Custom Architecture

We can effectively bypass Ember's generic OMS by allowing the matching engine to operate independently instead of running within the custom algorithm. This adjustment eradicates two queue hops for the signals under observation, leading to a more streamlined process.

Matching Engine Order Acknowledgement Latency

Introduction​

Results Overview​

Environment Variables​

Measurement Details​

Detailed Results​

Comparison with AWS-Based Benchmark​

Further Steps​

TCP Bypass​

Upgrade to a Modern, Overclocked CPU​

Switch to Commercial JVM with Better JIT​

Custom Architecture​