Tuning CentOS Linux for Ember

This document describes various ways to tune the OS to reduce jitter and optimize Execution Server latency.

note

This document was written in September 2017, but large parts of it still stand. Please refer to the Ubuntu tuning guide for more up-to-date information.

Hardware Considerations

A few comments on choosing server hardware to host the Execution Server:

  • EPAM RTC (former Deltix) has good experience with Dell PowerEdge and Supermicro Hyperspeed server lines, as well as AWS C5 instances.
  • Take the highest CPU/Bus frequency your budget can afford.
  • Take the highest CPU unit / CPU core count your budget can afford.
  • Some unconfirmed evidence suggests that the total amount of RAM should be 16G/64G/256G to avoid page faults.
  • In addition to a normal storage system, you need a fast NVMe disk for the transaction log. In a typical case, this log requires about 3K bytes per trading order and needs to be purged once a day or once a week.
  • Use an NVMe or high IOPS disk for the Ember work directory.
  • A Solarflare network card is recommended.
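
As a rough sizing sketch for the transaction log disk, the figure of about 3K bytes per order can be turned into a capacity estimate. The function name, the reading of "3K" as 3072 bytes, and the order volume are illustrative assumptions, not measured values.

```shell
#!/bin/sh
# Estimate how much the transaction log grows between purges.
# Assumption: "3K bytes" is taken as 3 * 1024 = 3072 bytes per order.
BYTES_PER_ORDER=3072

# journal_size_bytes ORDERS_PER_DAY DAYS_BETWEEN_PURGES
journal_size_bytes() {
    echo $(( $1 * BYTES_PER_ORDER * $2 ))
}

# Example (hypothetical volume): 1,000,000 orders per day, purged weekly.
journal_size_bytes 1000000 7
```

With these assumed inputs the estimate is 21504000000 bytes, or roughly 21.5 GB, so leave headroom above that when choosing the NVMe disk.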

BIOS Tuning

  • Set power usage to MAX PERFORMANCE.
  • Disable C-States.
  • Leave Turbo-boost enabled (has no visible effect).
  • Leave Hyper-Threading enabled.

Overclocking

Overclocking CPUs and memory is a common practice in the HFT domain. We recommend it for clients who understand the risks (system hangs, a voided warranty, etc.). One way of doing it is to order a system from a vendor who specializes in overclocked solutions.

We recommend stress-testing an overclocked machine before deploying the Execution Server on it.

Some recommended stress tests:

OS Tuning

This section describes the tuning for CentOS 7.X, which is an open-source version of Red Hat Enterprise Linux 7.X. Other Linux distributions are not certified but may perform similarly.

We recommend the "Minimal" installation of CentOS (which installs a headless OS without extra components).

Kernel

We experimented with different kernels in September 2017. Back then, CentOS 7.3 shipped kernel version 3.10. When this document was written, the latest version of the Linux kernel was 4.12. We did not notice any performance improvement or degradation from upgrading to the latest kernel in our standard tick-to-order latency benchmark. You may have other reasons to use the latest kernel version.

Real-time Kernel

Preliminary experiments with Real Time kernels were unproductive. Our lab is happy to run further experiments with Real Time kernels together with interested clients.

Kernel Parameters

Recommended kernel parameters:

| Parameter | Description |
|---|---|
| isolcpus | Isolate some cores from the general scheduler (these cores are used by Ember). |
| nohz=off | |
| transparent_hugepage=never | Disable THP. |
| intel_pstate=disable | If the driver is intel_pstate, you can disable it. |
| intel_idle.max_cstate=0 | If the driver is intel_idle, see this. (Run cat /sys/devices/system/cpu/cpuidle/current_driver to check.) |
| processor.max_cstate=0 | Same source as above. |
| mce=ignore_ce | |
| nosoftlockup=0 | Disable checking software lockups on CPUs. |
| audit=0 | |
| idle=poll | Highest performance (at the expense of power and heat). |
| nmi_watchdog=0 | |

To set kernel parameters:

  1. Edit the file called /etc/default/grub.

    sudo vi /etc/default/grub
  2. Find or add the line GRUB_CMDLINE_LINUX_DEFAULT="" and add one or more of the parameters described above (space separated). If something goes wrong, boot in rescue mode and remove the bad changes.

    GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=4-11 nohz=off transparent_hugepage=never intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=0 mce=ignore_ce nosoftlockup=0 audit=0 idle=poll nmi_watchdog=0"
  3. To apply changes, run the following command:

    • If you use BIOS boot mode:

      sudo grub2-mkconfig -o /boot/grub2/grub.cfg
    • If you use UEFI boot mode:

      sudo grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
  4. Reboot.

  5. After reboot, verify that the settings were applied using the command cat /proc/cmdline.
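
The verification step above can be scripted. The following is a minimal sketch, assuming a hypothetical helper named missing_kernel_params; it prints every expected parameter that does not appear on the kernel command line. The parameter list passed to it is only an example taken from the GRUB line above.

```shell
#!/bin/sh
# missing_kernel_params CMDLINE PARAM...
# Prints every expected parameter that is absent from CMDLINE.
missing_kernel_params() {
    cmdline=" $1 "
    shift
    for param in "$@"; do
        case "$cmdline" in
            *" $param "*) ;;          # present as a whole word
            *) echo "$param" ;;       # not found, report it
        esac
    done
}

# Check the running kernel when /proc/cmdline is readable.
if [ -r /proc/cmdline ]; then
    missing_kernel_params "$(cat /proc/cmdline)" \
        isolcpus=4-11 nohz=off transparent_hugepage=never idle=poll
fi
```

No output means all listed parameters were applied. Adjust the list to match the parameters you actually set.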

OS Services

SELinux

Our benchmark showed a 15% latency improvement when SELinux was disabled.

To disable SELinux:

  1. Open the SELinux configuration file:

    sudo vi /etc/sysconfig/selinux
  2. Set SELINUX=disabled.
  3. Save.
  4. Reboot.
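
To confirm the result after the reboot, a small check can compare the configured mode with the runtime mode. This is a sketch: selinux_config_mode is a hypothetical helper name, and the check assumes the standard getenforce utility is installed.

```shell
#!/bin/sh
# selinux_config_mode FILE
# Prints the value of the active (uncommented) SELINUX= line in FILE.
selinux_config_mode() {
    sed -n 's/^SELINUX=//p' "$1" | tail -n 1
}

# Only query the system when the file and the tool are present.
if [ -r /etc/sysconfig/selinux ]; then
    echo "configured: $(selinux_config_mode /etc/sysconfig/selinux)"
fi
if command -v getenforce >/dev/null 2>&1; then
    echo "runtime: $(getenforce)"
fi
```

Both lines should report a disabled state once the change has taken effect.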

Services

Stop or disable the following services:

sudo systemctl stop abrt-ccpp abrtd abrt-oops
sudo systemctl stop alsa-state
sudo systemctl stop anacron
sudo systemctl stop atd
sudo systemctl stop autofs
sudo systemctl stop avahi-daemon
sudo systemctl stop bluetooth
sudo systemctl stop certmonger
sudo systemctl stop cups
sudo systemctl stop firewalld
sudo systemctl stop haldaemon
sudo systemctl stop hidd
sudo systemctl stop ip6tables
sudo systemctl stop iprdump
sudo systemctl stop iprinit
sudo systemctl stop iprupdate
sudo systemctl stop mdmonitor
sudo systemctl stop messagebus
sudo systemctl stop nfs-lock
sudo systemctl stop postfix
sudo systemctl stop restorecond
sudo systemctl stop rhnsd
sudo systemctl stop rhsmcertd
sudo systemctl stop rpcbind
sudo systemctl stop netfilter
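
Note that systemctl stop only lasts until the next boot, while systemctl disable keeps a unit from starting at boot. The sketch below, built around a hypothetical helper named service_commands, prints both commands for each unit so that the list can be reviewed before anything is changed. The three unit names are just a sample from the list above.

```shell
#!/bin/sh
# service_commands UNIT...
# Prints (does not run) the commands that stop each unit now and
# disable it so that it stays off after a reboot.
service_commands() {
    for unit in "$@"; do
        echo "systemctl stop $unit"
        echo "systemctl disable $unit"
    done
}

# Review the output first, then pipe it to "sudo sh" to apply it.
service_commands cups postfix avahi-daemon
```

Units that are not installed on a given machine simply produce an error from systemctl when the commands are applied, which can be ignored.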

Firewall

Run the following commands as root:

iptables -F ; iptables -t nat -F; iptables -t mangle -F
iptables -X ; iptables -t nat -X; iptables -t mangle -X
iptables -t raw -F ; iptables -t raw -X

Cron Jobs

Verify that your system does not have any cron jobs:

crontab -l
crontab -l -u deltix

Swap

Permanently disable swap. To do that, edit /etc/fstab and comment out the swap line (usually the last entry).

sudo vi /etc/fstab

For example:

#/dev/mapper/cl-swap     swap                    swap    defaults        0 0
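
The same edit can be previewed non-interactively. This is a minimal sketch with a hypothetical helper named comment_out_swap; it prints the modified content and leaves the file itself untouched.

```shell
#!/bin/sh
# comment_out_swap FILE
# Prints FILE with every active swap entry commented out.
# The file itself is not modified.
comment_out_swap() {
    sed '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' "$1"
}

# Usage (preview the change as a diff):
#   comment_out_swap /etc/fstab | diff /etc/fstab -
```

The fstab change takes effect at the next boot. To turn swap off in the running system as well, run sudo swapoff -a.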

Tuned Performance Profiles

To improve the performance of your CentOS system, follow these steps:

  1. Install the tuned utility by running the following command:

    sudo yum install -y tuned
  2. Set the network-latency performance profile by running the following command:

    sudo tuned-adm profile network-latency
  3. Ensure that the tuned profile remains active after system reboots by running the following commands:

    sudo service tuned start  
    sudo chkconfig tuned on

    Note: The Chronicle team recommends using the latency-performance profile.

  4. Verify the current CPU frequencies by running the following command:

    sudo turbostat sleep 5
  5. Make sure that:

    • Each CPU is always at max frequency.
    • CPU%c0 is at 100% or close.
    • SMI counters are zero.
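
To check after a reboot that the profile is still in effect, the output of tuned-adm active can be parsed. The sketch below assumes the usual "Current active profile: NAME" output line; active_tuned_profile is a hypothetical helper name.

```shell
#!/bin/sh
# active_tuned_profile
# Reads "tuned-adm active" output on stdin and prints the profile name.
active_tuned_profile() {
    sed -n 's/^Current active profile: //p'
}

# Warn when tuned is installed but a different profile is active.
if command -v tuned-adm >/dev/null 2>&1; then
    profile=$(tuned-adm active 2>/dev/null | active_tuned_profile)
    [ "$profile" = "network-latency" ] ||
        echo "unexpected tuned profile: ${profile:-none}"
fi
```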

File System

To optimize the performance of your file system, consider the following tips:

  • Use ext4 for the journal partition, as it is faster than xfs.
  • When mounting the journal partition, use the barrier=0 and noatime settings to improve performance.
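
As an illustration of the mount settings above, the following sketch prints an /etc/fstab entry for the journal partition. The device /dev/nvme0n1p1 and the mount point /journal are hypothetical placeholders, and journal_fstab_line is an illustrative helper name.

```shell
#!/bin/sh
# journal_fstab_line DEVICE MOUNTPOINT
# Prints an fstab entry: device, mount point, type, options, dump, pass.
journal_fstab_line() {
    printf '%s %s ext4 noatime,barrier=0 0 0\n' "$1" "$2"
}

# Hypothetical device and mount point; substitute your own.
journal_fstab_line /dev/nvme0n1p1 /journal
```

Keep in mind that barrier=0 turns off ext4 write barriers, so a power loss can damage the file system unless the storage has a protected write cache.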

TCP

To optimize TCP performance, adjust the kernel TCP buffers with the following commands:

sysctl -w net.core.rmem_max=2097152
sysctl -w net.core.wmem_max=2097152

By increasing the TCP buffers, you can improve the speed and efficiency of data transmission over TCP connections.
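
Values set with sysctl -w are lost on reboot. One way to keep them, sketched below, is a drop-in file under /etc/sysctl.d. The file name 90-ember-tcp.conf and the helper tcp_buffer_sysctl_conf are assumptions made for this example.

```shell
#!/bin/sh
# tcp_buffer_sysctl_conf BYTES
# Prints sysctl.d-style lines that set both buffer limits to BYTES.
tcp_buffer_sysctl_conf() {
    printf 'net.core.rmem_max = %s\n' "$1"
    printf 'net.core.wmem_max = %s\n' "$1"
}

# Preview the file content.
tcp_buffer_sysctl_conf 2097152

# To persist and load it (hypothetical file name):
#   tcp_buffer_sysctl_conf 2097152 | sudo tee /etc/sysctl.d/90-ember-tcp.conf
#   sudo sysctl --system
```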

Other settings

kworker

To force the kworker thread to collect statistics every hour, run the following command:

sudo sysctl -w vm.stat_interval=3600

The command applies the setting immediately, but the change is lost on reboot. To make it persist, add the line vm.stat_interval=3600 to the /etc/sysctl.conf file.

CPU Affinity

To verify isolated cores (see above), use the command cat /sys/devices/system/cpu/isolated.

To partition CPUs, use the taskset OS command and affinity control in the Ember and TimeBase configuration files.

It's good practice to create a CPU-to-thread layout map. Refer to the CPU Affinity chapter of the Ember User's Guide to learn about CPU affinity configuration inside Ember.
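
A starting point for such a layout map can be generated from the isolated CPU list. The sketch below assumes a hypothetical helper named expand_cpu_list that expands the kernel's list format; the thread names are left as placeholders to fill in by hand.

```shell
#!/bin/sh
# expand_cpu_list LIST
# Expands a kernel-style CPU list such as "4-7,10" into "4 5 6 7 10".
expand_cpu_list() {
    result=""
    for part in $(echo "$1" | tr ',' ' '); do
        case "$part" in
            *-*) first=${part%-*}; last=${part#*-} ;;
            *)   first=$part; last=$part ;;
        esac
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            result="$result $cpu"
            cpu=$((cpu + 1))
        done
    done
    echo "${result# }"
}

# Print one line per isolated CPU as the skeleton of a layout map.
if [ -r /sys/devices/system/cpu/isolated ]; then
    for cpu in $(expand_cpu_list "$(cat /sys/devices/system/cpu/isolated)"); do
        echo "cpu $cpu: <thread name>"
    done
fi
```

A process can then be pinned to one of the listed CPUs with taskset, for example taskset -c 4 followed by the command to run.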

When defining CPU affinity, consider the following:

  • Hyper-Threading: Earlier, we recommended leaving Hyper-Threading enabled. While virtual cores may be useful for running non-mission-critical tasks concurrently, each key signal processing component should receive an entire physical core.
  • NUMA
  • IRQ Balancing

To validate what is really going on with isolated cores, use the perf utility:

sudo perf record -e "sched:sched_switch" -C 4,6,8,10,12
sudo perf report

The output should only show the idle process (swapper).

Verification

sysjitter utility

Use the Solarflare sysjitter utility to check OS jitter after applying the above tune-ups.

Here is a sample of good results. Focus on 90P, 99P, and 999P.

sudo ./sysjitter --cores 4-15  --runtime 15 500 | column -t
core_i: 4 5 6 7 8 9 10 11 12 13 14 15
threshold(ns): 500 500 500 500 500 500 500 500 500 500 500 500
cpu_mhz: 3003 3003 3003 3003 3003 3003 3003 3003 3003 3003 3003 3003
runtime(ns): 14985359730 14985359347 14985359408 14985359323 14985359483 14985359475 14985359477 14985359379 14985359734 14985359347 14985359411 14985361027
runtime(s): 14.985 14.985 14.985 14.985 14.985 14.985 14.985 14.985 14.985 14.985 14.985 14.985
int_n: 15000 15000 15006 15009 15000 15002 15000 15001 15001 15000 15007 15015
int_n_per_sec: 1000.977 1000.977 1001.377 1001.578 1000.977 1001.110 1000.977 1001.044 1001.044 1000.977 1001.444 1001.978
int_min(ns): 832 851 847 856 759 709 862 849 851 868 833 895
int_median(ns): 978 987 981 987 866 968 971 960 978 987 981 989
int_mean(ns): 987 996 995 1000 887 987 985 978 987 998 998 1014
int_90(ns): 1021 1037 1029 1030 940 1038 1032 1030 1022 1037 1030 1031
int_99(ns): 1198 1227 1248 1278 1080 1254 1198 1220 1208 1235 1313 1369
int_999(ns): 2448 2278 2912 2909 2832 2939 2900 2880 2267 2600 3026 6643
int_9999(ns): 4327 2955 3045 3896 3063 3147 6256 3030 3977 4441 4730 19402
int_99999(ns): 5176 4469 3295 4412 5333 6191 7372 4689 6273 6950 6394 20362
int_max(ns): 5176 4469 3295 4412 5333 6191 7372 4689 6273 6950 6394 20362
int_total(ns): 14810884 14951525 14933122 15025664 13319027 14810179 14782973 14675849 14809595 14983760 14983766 15237682
int_total(%): 0.099 0.100 0.100 0.100 0.089 0.099 0.099 0.098 0.099 0.100 0.100 0.102
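
Reading the percentile rows by eye gets tedious with many cores. The sketch below filters sysjitter output for cores that exceed a limit in a chosen row. The helper name sysjitter_outliers and the 5000 ns limit are illustrative assumptions, not thresholds recommended by Solarflare.

```shell
#!/bin/sh
# sysjitter_outliers ROW LIMIT_NS
# Reads sysjitter output on stdin and prints "core value" for every core
# whose value in ROW (for example int_999) exceeds LIMIT_NS.
sysjitter_outliers() {
    awk -v row="$1(ns):" -v limit="$2" '
        $1 == "core_i:" { for (i = 2; i <= NF; i++) core[i] = $i }
        $1 == row       { for (i = 2; i <= NF; i++)
                              if ($i + 0 > limit + 0) print core[i], $i }
    '
}

# Usage:
#   sudo ./sysjitter --cores 4-15 --runtime 15 500 | sysjitter_outliers int_999 5000
```

Applied to the sample above with a 5000 ns limit on int_999, only core 15 (6643 ns) is reported.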

perf-workshop utility

A similar tool, called perf-workshop, was developed in Java by LMAX developer Mark Price.

To use it, run the following commands:

git clone https://github.com/epickrram/perf-workshop.git perf-workshop
cd perf-workshop
./gradlew bundleJar
cd src/main/shell && bash ./run_test.sh BASELINE

Our sample numbers:

== Accumulator Inter-Message Latency (ns) ==
mean 9729
min 272
50.00% 9728
90.00% 9728
99.00% 9728
99.90% 10240
99.99% 10752
99.999% 11264
99.9999% 14336
max 21504
count 2795301

== Accumulator Message Transit Latency (ns) ==
mean 159
min 100
50.00% 160
90.00% 176
99.00% 184
99.90% 232
99.99% 352
99.999% 1216
99.9999% 2944
max 11264
count 2795301

Appendix: Other Configuration Steps

This section describes various measures that may not be required to optimize latency but are beneficial to have on production systems.

Reliable Clock Synchronization

To ensure reliable clock synchronization, make sure that NTP or chrony is running and periodically synchronizing clocks. We use same-source clocks for latency measurements. However, it makes sense to have a high-quality clock source.

Ideally, you should use a local clock source provided by a data center rather than a global service like ntp.gov.

Clock synchronization references:

References