Data Warehouse Pipelines

Overview

Ember uses a highly efficient storage system called Ember Journal to store all trading messages, including order requests and order events. Ember can process more than 100,000 orders per second. At this rate, one gigabyte of trading history data can be accumulated every minute.
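A quick back-of-the-envelope check ties these two figures together — the numbers below come straight from the text above; the computed per-order size is an implied average, not a documented message format:

```python
# Sanity check: at 100,000 orders/sec with ~1 GB of trading history
# accumulating per minute, what is the implied average size per order?
orders_per_second = 100_000
bytes_per_minute = 1_000_000_000  # ~1 GB accumulated per minute

orders_per_minute = orders_per_second * 60  # 6,000,000 orders per minute
bytes_per_order = bytes_per_minute / orders_per_minute

print(f"~{bytes_per_order:.0f} bytes per order on average")  # ~167 bytes
```

Note that a single order typically produces several journal messages (the request plus subsequent order events), so the average size of an individual message is smaller still.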

While Ember is running, data warehouse pipelines are responsible for streaming or batching trading history to various destinations suitable for open data analytics and permanent storage. Ember supports the following data warehouses:

  • TimeBase
  • ClickHouse
  • Kafka
  • S3 (queried via AWS Athena)
  • RedShift
  • RDS SQL databases

Data warehouses work in coordination with the Ember Journal Compactor to ensure that the operational storage size remains compact and all trading history is preserved.

Data warehouse conceptual diagram

Ember retains an operational subset of data in memory, stores recent trading data in the journal, and streams all data to data warehouses, where it can be stored indefinitely. From this perspective, neither the Ember API nor the Ember Journal is the optimal place to retrieve information like "show me all trades for today." Instead, Ember delegates this task to data warehouses.

This design offers several advantages:

  • The Ember Journal can be optimized for rapid sequential data insertion.
  • The operational dataset can stay small.
  • Reporting queries don't overwhelm Ember RPC channels.

Different warehouses can be set up to run in parallel.
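As a purely illustrative sketch of this delegation, a report such as "all trades for today" would be issued against a SQL-speaking warehouse (ClickHouse, RedShift, or an RDS database) rather than through Ember's RPC channels. The `trades` table and `timestamp` column below are hypothetical placeholders, not Ember's actual warehouse schema:

```python
from datetime import date, datetime, time, timezone
from typing import Optional

def trades_for_today_sql(table: str = "trades", today: Optional[date] = None) -> str:
    """Build an 'all trades for today' query for a SQL-based warehouse.

    Table and column names are illustrative only.
    """
    today = today or datetime.now(timezone.utc).date()
    start = datetime.combine(today, time.min)  # 00:00:00
    end = datetime.combine(today, time.max)    # 23:59:59.999999
    return (
        f"SELECT * FROM {table} "
        f"WHERE timestamp BETWEEN '{start:%Y-%m-%d %H:%M:%S}' "
        f"AND '{end:%Y-%m-%d %H:%M:%S}'"
    )

print(trades_for_today_sql(today=date(2024, 1, 15)))
# SELECT * FROM trades WHERE timestamp BETWEEN '2024-01-15 00:00:00' AND '2024-01-15 23:59:59'
```

Running the same report against the Ember Journal would force a scan of write-optimized sequential storage; against a warehouse it is an ordinary indexed query.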

Comparing Data Warehouses

Here's a comparison of the various data warehouses:

|                                  | TimeBase       | ClickHouse            | Kafka           | S3                     | RedShift       | RDS SQL        |
|----------------------------------|----------------|-----------------------|-----------------|------------------------|----------------|----------------|
| Max Rate (orders/sec), sustained | 500K+          | 200K                  | TBD             | 15K                    | 250            | 50             |
| Reports Performance              | Very Good      | Adequate              |                 |                        |                |                |
| Query Language                   | QQL (Limited)  | SQL subset, Very Good | KQL             | Athena uses Presto SQL | SQL subset     |                |
| Maintenance Effort               | Medium-High    | High                  | Medium          | Very Low               | Low            | Low            |
| Storage Cost                     | High           | High                  | High            | Low                    | High           | High           |
| GUI Client                       | TimeBase Admin | Tabix                 | KafkaTool, etc. | AWS Athena Console     | Any SQL client | Any SQL client |

For a more detailed description of data warehouse configuration, visit the Ember Configuration Guide.