Tuning CentOS Linux for Ember
This document describes various ways to reduce OS jitter and optimize Execution Server latency.
This document was written in September 2017, but large parts of it still stand. Please refer to the Ubuntu tuning guide for more up-to-date information.
Hardware Considerations
A few comments for choosing server hardware to host the Execution Server:
- EPAM RTC (former Deltix) has good experience with Dell PowerEdge and Supermicro Hyperspeed server lines, as well as AWS C5 instances.
- Take the highest CPU/Bus frequency your budget can afford.
- Take the highest CPU unit / CPU core count your budget can afford.
- Some unconfirmed evidence suggests that the total amount of RAM should be 16G/64G/256G to avoid page faults.
- In addition to a normal storage system, you need a fast NVMe disk for the transaction log. In a typical case, this log requires about 3 KB per trading order and needs to be purged once a day or once a week.
- Use an NVMe or high IOPS disk for the Ember work directory.
- A Solarflare network card is recommended.
BIOS Tuning
- Set power usage to MAX PERFORMANCE.
- Disable C-States.
- Leave Turbo-boost enabled (has no visible effect).
- Leave Hyper-Threading enabled.
Overclocking
Overclocking CPUs and memory is a common practice in the HFT domain. We recommend it for clients who understand the risks (system hangs, voided warranty, etc.). One way of doing it is to order a system from a vendor who specializes in overclocked solutions.
We recommend stress-testing an overclocked machine before deploying the Execution Server on it.
Some recommended stress tests:
- Intel BIOS Integrity Test Suite (BITS)
- Prime95 or OCCT
OS Tuning
This section describes the tuning for CentOS 7.X, which is an open-source version of Red Hat Enterprise Linux 7.X. Other Linux distributions are not certified but may perform similarly.
We recommend the "Minimal" installation of CentOS (which installs a headless OS without extra components).
Kernel
We experimented with different kernels in September 2017. Back then, CentOS 7.3 shipped kernel version 3.10. When this document was written, the latest Linux kernel version was 4.12. We did not notice any performance improvement or degradation from upgrading to the latest kernel in our standard tick-to-order latency benchmark. You may have other reasons to use the latest kernel version.
Real-time Kernel
Preliminary experiments with real-time kernels were unproductive. Our lab is happy to run more experiments with real-time kernels together with interested clients.
Kernel Parameters
Recommended kernel parameters:
Parameter | Description |
---|---|
isolcpus | Isolate some cores from the general scheduler (these cores are used by Ember). |
nohz=off | Disable dynamic ticks. |
transparent_hugepage=never | Disable THP. |
intel_pstate=disable | If the CPU frequency scaling driver is intel_pstate, this disables it. |
intel_idle.max_cstate=0 | If the driver is intel_idle, see this. (Run cat /sys/devices/system/cpu/cpuidle/current_driver to check.) |
processor.max_cstate=0 | Same source as above. |
mce=ignore_ce | Disable handling of corrected machine check errors (polling timer and CMCI). |
nosoftlockup=0 | Disable checking software lockups on CPUs. |
audit=0 | Disable the kernel audit subsystem. |
idle=poll | Highest performance (at the expense of power and heat). |
nmi_watchdog=0 | Disable the NMI watchdog. |
To set kernel parameters:
Edit the file called /etc/default/grub.
sudo vi /etc/default/grub
Find or add the line
GRUB_CMDLINE_LINUX_DEFAULT=""
and add one or more of the parameters described above (space separated). If something goes wrong, boot in rescue mode and remove the bad changes. For example:
GRUB_CMDLINE_LINUX_DEFAULT="isolcpus=4-11 nohz=off transparent_hugepage=never intel_pstate=disable intel_idle.max_cstate=0 processor.max_cstate=0 mce=ignore_ce nosoftlockup=0 audit=0 idle=poll nmi_watchdog=0"
To apply changes, run the following command:
If you use BIOS boot mode:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
If you use UEFI boot mode:
sudo grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg
Reboot.
After reboot, verify that the settings were applied using the command:
cat /proc/cmdline
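Reading /proc/cmdline by eye is error-prone when a dozen parameters are involved. The following is a minimal sketch of a helper that reports each expected parameter as present or missing. The function name check_cmdline_params and its file argument are hypothetical, introduced here for illustration only.

```shell
# Hypothetical helper: report which expected kernel parameters are present.
# Usage: check_cmdline_params CMDLINE_FILE PARAM...
# Returns non-zero if at least one parameter is missing.
check_cmdline_params() (
    file="$1"; shift
    # Pad with spaces so that only whole, space-separated tokens match.
    cmdline=" $(cat "$file") "
    missing=0
    for param in "$@"; do
        case "$cmdline" in
            *" $param "*) echo "ok: $param" ;;
            *)            echo "MISSING: $param"; missing=1 ;;
        esac
    done
    return "$missing"
)
```

For example, check_cmdline_params /proc/cmdline isolcpus=4-11 nohz=off idle=poll checks three of the parameters from the sample line above.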
OS Services
SELinux
Our benchmark showed a 15% latency improvement when SELinux was disabled.
To disable SELinux:
- Open the SELinux configuration file:
sudo vi /etc/sysconfig/selinux
- Set SELINUX=disabled.
- Save.
- Reboot.
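The same edit can be scripted, which is convenient when preparing several hosts. This is a sketch only: it assumes GNU sed, and the function name disable_selinux_in is hypothetical. On CentOS 7, /etc/sysconfig/selinux is normally a symlink to /etc/selinux/config, hence the --follow-symlinks option.

```shell
# Hypothetical helper: set SELINUX=disabled in the given config file.
# Usage: disable_selinux_in CONFIG_FILE
disable_selinux_in() (
    file="$1"
    # Rewrite only the SELINUX= line; SELINUXTYPE= is left untouched.
    # --follow-symlinks (GNU sed) edits the symlink target instead of
    # replacing the symlink with a regular file.
    sed -i --follow-symlinks 's/^SELINUX=.*/SELINUX=disabled/' "$file"
)
```

Run it as root against /etc/sysconfig/selinux, then reboot. After the reboot, the getenforce command should print Disabled.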
Services
Stop or disable the following services:
sudo systemctl stop abrt-ccpp abrtd abrt-oops
sudo systemctl stop alsa-state
sudo systemctl stop anacron
sudo systemctl stop atd
sudo systemctl stop autofs
sudo systemctl stop avahi-daemon
sudo systemctl stop bluetooth
sudo systemctl stop certmonger
sudo systemctl stop cups
sudo systemctl stop firewalld
sudo systemctl stop haldaemon
sudo systemctl stop hidd
sudo systemctl stop ip6tables
sudo systemctl stop iprdump
sudo systemctl stop iprinit
sudo systemctl stop iprupdate
sudo systemctl stop mdmonitor
sudo systemctl stop messagebus
sudo systemctl stop nfs-lock
sudo systemctl stop postfix
sudo systemctl stop restorecond
sudo systemctl stop rhnsd
sudo systemctl stop rhsmcertd
sudo systemctl stop rpcbind
sudo systemctl stop netfilter
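Note that systemctl stop only lasts until the next boot, while systemctl disable prevents the service from starting again. The sketch below does both for a list of services and skips the ones that fail, which usually means the unit is not installed. The function name disable_services and the SYSTEMCTL override variable are hypothetical, added here so the loop can be dry-run.

```shell
# Hypothetical helper: stop and disable every service named on the command line.
# SYSTEMCTL can be overridden (for example with "echo") to do a dry run.
disable_services() (
    for svc in "$@"; do
        if "${SYSTEMCTL:-systemctl}" stop "$svc" 2>/dev/null &&
           "${SYSTEMCTL:-systemctl}" disable "$svc" 2>/dev/null; then
            echo "disabled: $svc"
        else
            # Typically the unit is simply not installed on this host.
            echo "skipped: $svc"
        fi
    done
)
```

Run it as root, for example disable_services cups postfix avahi-daemon, and review the skipped entries afterwards.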
Firewall
Run the following commands as root:
iptables -F ; iptables -t nat -F; iptables -t mangle -F
iptables -X ; iptables -t nat -X; iptables -t mangle -X
iptables -t raw -F ; iptables -t raw -X
Cron Jobs
Verify that your system does not have any cron jobs:
crontab -l
crontab -l -u deltix
Swap
Permanently disable swap. To do that, edit /etc/fstab and comment out the swap line (usually the last entry).
sudo vi /etc/fstab
For example:
#/dev/mapper/cl-swap swap swap defaults 0 0
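Commenting out the fstab entry takes effect at the next boot; sudo swapoff -a turns swap off immediately. As a sketch, the fstab edit can be previewed with a small filter that comments out every active swap entry. The function name comment_out_swap is hypothetical, and the function only prints the result, so nothing is changed until you copy the output into place.

```shell
# Hypothetical helper: print FSTAB_FILE with every active swap entry
# commented out. The file itself is not modified.
# Usage: comment_out_swap FSTAB_FILE
comment_out_swap() (
    # Field 3 of an fstab entry is the file system type.
    awk '$1 !~ /^#/ && $3 == "swap" { print "#" $0; next } { print }' "$1"
)
```

After the change is applied and the host rebooted, cat /proc/swaps should list no devices.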
Tuned Performance Profiles
To improve the performance of your CentOS system, follow these steps:
Install the tuned utility by running the following command:
sudo yum install -y tuned
Set the network-latency performance profile by running the following command:
sudo tuned-adm profile network-latency
Ensure that the tuned profile remains active after system reboots by running the following commands:
sudo service tuned start
sudo chkconfig tuned on
Note: The Chronicle team recommends using the latency-performance profile.
Verify the current CPU frequencies by running the following command:
sudo turbostat sleep 5
Make sure that:
- Each CPU is always at max frequency.
- CPU%c0 is at 100% or close.
- SMI counters are zero.
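It is also worth confirming after a reboot that the intended tuned profile is still active. The sketch below parses the output of tuned-adm active, which prints a line of the form "Current active profile: NAME". The function name tuned_profile_is and the TUNED_ADM override variable are hypothetical.

```shell
# Hypothetical helper: succeed only if the active tuned profile equals $1.
# TUNED_ADM can be overridden to exercise the function without tuned installed.
tuned_profile_is() (
    expected="$1"
    active="$("${TUNED_ADM:-tuned-adm}" active 2>/dev/null |
              sed -n 's/^Current active profile: //p')"
    [ "$active" = "$expected" ]
)
```

For example, tuned_profile_is network-latency || echo "unexpected tuned profile" can be added to a host validation script.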
File System
To optimize the performance of your file system, consider the following tips:
- Use an ext4 file system for the journal partition, as it is faster than xfs.
- When mounting the journal partition, use the barrier=0 and noatime settings to improve performance.
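Mount options are easy to lose when a partition is remounted or an fstab entry is rewritten, so it can help to check them on the running system. The following sketch looks up a mount point in /proc/mounts and tests for a single option. The function name mount_has_option and the /journal mount point used in the example are hypothetical.

```shell
# Hypothetical helper: succeed if MOUNT_POINT is mounted with OPTION.
# Usage: mount_has_option MOUNT_POINT OPTION [MOUNTS_FILE]
mount_has_option() (
    mount_point="$1"; option="$2"; mounts="${3:-/proc/mounts}"
    # Field 2 is the mount point, field 4 the comma-separated option list.
    awk -v mp="$mount_point" -v opt="$option" '
        $2 == mp {
            n = split($4, o, ",")
            for (i = 1; i <= n; i++) if (o[i] == opt) found = 1
        }
        END { exit (found ? 0 : 1) }' "$mounts"
)
```

For example, mount_has_option /journal noatime succeeds only when the journal partition is currently mounted with noatime.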
TCP
To optimize TCP performance, adjust the kernel TCP buffers with the following commands:
sysctl -w net.core.rmem_max=2097152
sysctl -w net.core.wmem_max=2097152
These settings raise the maximum receive and send socket buffer sizes to 2 MB, so applications can request larger buffers for their TCP connections, which helps absorb traffic bursts.
Other settings
kworker
To make the kworker thread collect VM statistics once an hour instead of every second, run:
sudo sysctl -w vm.stat_interval=3600
This setting applies immediately. To make it persist after each system reboot, add the line vm.stat_interval=3600 to the /etc/sysctl.conf file.
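A value set with sysctl -w is lost at the next reboot unless it is also written to a sysctl configuration file. The sketch below writes a key to such a file idempotently, replacing an earlier entry for the same key if one exists. The function name persist_sysctl is hypothetical.

```shell
# Hypothetical helper: write "KEY = VALUE" to a sysctl configuration file,
# replacing an existing entry for KEY if there is one.
# Usage: persist_sysctl KEY VALUE CONF_FILE
persist_sysctl() (
    key="$1"; value="$2"; file="$3"
    touch "$file"
    tmp="$(mktemp)"
    # Keep every line whose key (text before "=") differs from KEY.
    awk -v k="$key" '
        { line = $0; sub(/[ \t]*=.*/, "", line); gsub(/^[ \t]+/, "", line) }
        line != k' "$file" > "$tmp"
    echo "$key = $value" >> "$tmp"
    # Overwrite in place so ownership and permissions are preserved.
    cat "$tmp" > "$file"
    rm -f "$tmp"
)
```

Run as root, persist_sysctl vm.stat_interval 3600 /etc/sysctl.conf records the setting, and sysctl -p reloads the file. The same helper works for the net.core.rmem_max and net.core.wmem_max values from the TCP section.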
CPU Affinity
To verify isolated cores (see above), use the command cat /sys/devices/system/cpu/isolated.
To partition CPUs, use the taskset OS command and affinity control in the Ember and TimeBase configuration files.
It's good practice to create a CPU-to-thread layout map. Refer to the CPU Affinity chapter of the Ember User's Guide to learn about CPU affinity configuration inside Ember.
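When drafting a layout map, it helps to expand the kernel's range notation (for example 4-11) into individual CPU numbers. The following sketch does that for the format used by /sys/devices/system/cpu/isolated. The function name expand_cpu_list is hypothetical.

```shell
# Hypothetical helper: expand a kernel CPU list such as "4-7,10"
# into individual CPU numbers separated by spaces.
expand_cpu_list() (
    list="$1"
    out=""
    IFS=','            # split the list on commas (local to this subshell)
    for part in $list; do
        case "$part" in
            *-*)
                # A range: walk from the first CPU to the last, inclusive.
                i="${part%-*}"; end="${part#*-}"
                while [ "$i" -le "$end" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done ;;
            *)  out="$out $part" ;;
        esac
    done
    echo "${out# }"
)
```

For example, expand_cpu_list "$(cat /sys/devices/system/cpu/isolated)" lists every isolated core, and each number can then be assigned to a thread in the map or passed to taskset -c to pin a process.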
When defining CPU affinity, consider the following:
- Hyper-threading: Earlier, we recommended having hyper threading enabled. While virtual cores may be useful for the concurrency of non-mission critical tasks, each key signal processing component should receive an entire physical core.
- NUMA
- IRQ Balancing
To validate what is really going on with isolated cores, use the perf utility:
sudo perf record -e "sched:sched_switch" -C 4,6,8,10,12
sudo perf report
The output should only show the idle process (swapper).
Verification
sysjitter utility
Use the Solarflare sysjitter utility to check OS jitter after applying the above tune-ups.
Here is a sample of good results. Focus on 90P, 99P, and 999P.
sudo ./sysjitter --cores 4-15 --runtime 15 500 | column -t
core_i: 4 5 6 7 8 9 10 11 12 13 14 15
threshold(ns): 500 500 500 500 500 500 500 500 500 500 500 500
cpu_mhz: 3003 3003 3003 3003 3003 3003 3003 3003 3003 3003 3003 3003
runtime(ns): 14985359730 14985359347 14985359408 14985359323 14985359483 14985359475 14985359477 14985359379 14985359734 14985359347 14985359411 14985361027
runtime(s): 14.985 14.985 14.985 14.985 14.985 14.985 14.985 14.985 14.985 14.985 14.985 14.985
int_n: 15000 15000 15006 15009 15000 15002 15000 15001 15001 15000 15007 15015
int_n_per_sec: 1000.977 1000.977 1001.377 1001.578 1000.977 1001.110 1000.977 1001.044 1001.044 1000.977 1001.444 1001.978
int_min(ns): 832 851 847 856 759 709 862 849 851 868 833 895
int_median(ns): 978 987 981 987 866 968 971 960 978 987 981 989
int_mean(ns): 987 996 995 1000 887 987 985 978 987 998 998 1014
int_90(ns): 1021 1037 1029 1030 940 1038 1032 1030 1022 1037 1030 1031
int_99(ns): 1198 1227 1248 1278 1080 1254 1198 1220 1208 1235 1313 1369
int_999(ns): 2448 2278 2912 2909 2832 2939 2900 2880 2267 2600 3026 6643
int_9999(ns): 4327 2955 3045 3896 3063 3147 6256 3030 3977 4441 4730 19402
int_99999(ns): 5176 4469 3295 4412 5333 6191 7372 4689 6273 6950 6394 20362
int_max(ns): 5176 4469 3295 4412 5333 6191 7372 4689 6273 6950 6394 20362
int_total(ns): 14810884 14951525 14933122 15025664 13319027 14810179 14782973 14675849 14809595 14983760 14983766 15237682
int_total(%): 0.099 0.100 0.100 0.100 0.089 0.099 0.099 0.098 0.099 0.100 0.100 0.102
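To compare runs quickly, the worst core for a given percentile can be pulled out of the report. The sketch below reads sysjitter output in the layout shown above from standard input and prints the largest value in one row. The function name sysjitter_row_max is hypothetical and is not part of the sysjitter tool.

```shell
# Hypothetical helper: print the largest per-core value of one sysjitter row.
# Usage: sysjitter ... | sysjitter_row_max int_999
sysjitter_row_max() (
    row="$1"
    # Row labels look like "int_999(ns):", followed by one value per core.
    awk -v key="${row}(ns):" '
        $1 == key {
            max = $2 + 0
            for (i = 3; i <= NF; i++) if ($i + 0 > max) max = $i + 0
            print max
        }'
)
```

Applied to the sample above, the int_999 row yields 6643, which comes from core 15.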
perf-workshop utility
A similar tool, called perf-workshop, was developed in Java by LMAX developer Mark Price.
To use it, run the following commands:
git clone https://github.com/epickrram/perf-workshop.git perf-workshop
cd perf-workshop
./gradlew bundleJar
cd src/main/shell && bash ./run_test.sh BASELINE
Our sample numbers:
== Accumulator Inter-Message Latency (ns) ==
mean 9729
min 272
50.00% 9728
90.00% 9728
99.00% 9728
99.90% 10240
99.99% 10752
99.999% 11264
99.9999% 14336
max 21504
count 2795301
== Accumulator Message Transit Latency (ns) ==
mean 159
min 100
50.00% 160
90.00% 176
99.00% 184
99.90% 232
99.99% 352
99.999% 1216
99.9999% 2944
max 11264
count 2795301
Appendix: Other Configuration Steps
This section describes various measures that may not be required to optimize latency but are beneficial to have on production systems.
Reliable Clock Synchronization
To ensure reliable clock synchronization, make sure that NTP or chrony is running and periodically synchronizing clocks. We use same-source clocks for latency measurements. Nevertheless, it makes sense to have a high-quality clock source.
Ideally, you should use a local clock source provided by a data center rather than a global service like ntp.gov.
References
- Low Latency Performance Tuning of RHEL 7
- RedHat 7 documentation
- Profiles for file system performance
- RealTimeCentOS
- Good introductory video: "When the OS gets in the way" by Mark Price
- Good blog about reducing network latency by Marek Majkowski