Pinterest Engineering, November 22, 2023

Improving Efficiency of Goku Time Series Database at Pinterest (Part 1)

Goku is a collection of sub-service components, including GokuS (in-memory storage for the last 24 hours of data), GokuL (SSD- and HDD-based storage for older data), Goku Compactor (time series data aggregation and conversion engine), and Goku Root (smart query routing).
The first aspect of efficiency improvement is reducing recovery time for GokuS and GokuL. Recovery time is the total time for a host or cluster in Goku to come up and start serving time series queries.
The second aspect is improving the query experience by lowering the latency of expensive, high-cardinality queries in Goku.
The third aspect is reducing the overall cost of Goku at Pinterest.
The architecture of GokuS involves ingesting data from Kafka topics and storing it in memory. During recovery, finalized data from EFS is read into memory and logs are replayed.
The recovery process could take almost 90 minutes if multiple hosts were recovering at the same time.
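
The recovery sequence described above (load finalized data, then replay logs) can be illustrated with a minimal Python sketch. It is not Goku's implementation: the function name `recover_shard`, the `(series, timestamp, value)` tuple layout, and the rule that a replayed log entry overwrites a finalized point with the same timestamp are all assumptions made for illustration, and plain lists stand in for the EFS files and the logs.

```python
from collections import defaultdict


def recover_shard(finalized_points, log_entries):
    """Rebuild one shard's in-memory series (hypothetical helper).

    finalized_points: iterable of (series_name, timestamp, value) tuples,
        standing in for finalized data read from EFS.
    log_entries: iterable of the same tuples, standing in for logged writes
        that are replayed in order.
    """
    series = defaultdict(dict)  # series_name -> {timestamp: value}

    # Step 1: load the finalized data into memory.
    for name, ts, value in finalized_points:
        series[name][ts] = value

    # Step 2: replay the logs on top of it. Assumption: a replayed write
    # replaces any existing point with the same timestamp.
    for name, ts, value in log_entries:
        series[name][ts] = value

    # Return points sorted by timestamp so the result is easy to inspect.
    return {name: sorted(points.items()) for name, points in series.items()}
```

In this sketch, recovery time grows with both the volume of finalized data and the length of the log to replay, which gives a rough intuition for why recovery can be slow.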
Inferring health at the cluster level creates a single point of failure. The statsboard client cannot read the latest synthetic metrics stored on a host during recovery.
The ingestor service's push model was replaced with a shard-aware pull model using Goku-side Kafka.
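
To make the idea of a shard-aware pull concrete, here is a small Python sketch in which a host reads only the data for the shards it owns and tracks its own read offsets. It does not use a Kafka client: an in-memory dict stands in for the Kafka partitions, and the one-shard-per-partition mapping, the function name `pull_owned_shards`, and the `max_records` parameter are assumptions made for illustration.

```python
def pull_owned_shards(partitions, owned_shards, offsets, max_records=100):
    """Pull new records only for the shards this host owns (hypothetical).

    partitions: dict shard_id -> list of records (stand-in for partitions).
    owned_shards: set of shard ids assigned to this host.
    offsets: dict shard_id -> next offset to read; updated in place.
    """
    batch = {}
    for shard in sorted(owned_shards):
        log = partitions.get(shard, [])
        start = offsets.get(shard, 0)
        records = log[start:start + max_records]
        if records:
            batch[shard] = records
            # Advance the offset so the next pull resumes where this one ended.
            offsets[shard] = start + len(records)
    return batch
```

Because the host keeps its own offsets, it can resume from the last position after a restart instead of depending on an upstream service to resend data, which is one general advantage of pull over push.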
Logs are written asynchronously to local storage and moved to S3 every 20 minutes to minimize data loss per shard when a host is terminated or replaced.
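
The periodic log shipping described above can be sketched as follows. This is a simplified, synchronous illustration with an explicit clock rather than a real asynchronous uploader: the class `ShardLogShipper`, its methods, and the injected `upload` callable (which stands in for the move to S3) are hypothetical names, and only the 20-minute interval comes from the text.

```python
UPLOAD_INTERVAL_SECONDS = 20 * 60  # the 20-minute cadence mentioned in the text


class ShardLogShipper:
    """Buffers log entries per shard and ships them at a fixed interval."""

    def __init__(self, upload, interval=UPLOAD_INTERVAL_SECONDS):
        self.upload = upload      # hypothetical callable(shard_id, entries)
        self.interval = interval
        self.pending = {}         # shard_id -> entries held in local storage
        self.last_upload = {}     # shard_id -> start of the current window

    def append(self, shard_id, entry, now):
        """Record a log entry locally; `now` is a timestamp in seconds."""
        self.pending.setdefault(shard_id, []).append(entry)
        self.last_upload.setdefault(shard_id, now)

    def tick(self, now):
        """Ship every shard whose interval has elapsed; return their ids."""
        shipped = []
        for shard_id, entries in self.pending.items():
            if entries and now - self.last_upload[shard_id] >= self.interval:
                self.upload(shard_id, list(entries))
                entries.clear()
                self.last_upload[shard_id] = now
                shipped.append(shard_id)
        return shipped
```

Under these assumptions, a host that disappears between two uploads loses at most one interval of locally buffered log entries per shard.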
The data lag per partition per cluster is exported to shared files and used for efficient query routing.
Goku Root now detects whether a shard is ready for queries based on its lag, enabling more efficient query routing.
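
As a rough illustration of lag-based routing, the Python sketch below picks the clusters whose copy of a shard is caught up enough to serve queries. The shape of the lag data (a dict of cluster name to per-shard lag in seconds), the function name `ready_clusters`, and the `max_lag_seconds` threshold are assumptions for illustration; the text does not describe the shared file format or the readiness rule that Goku Root applies.

```python
def ready_clusters(lag_by_cluster, shard_id, max_lag_seconds):
    """Return clusters that can serve `shard_id`, least-lagging first.

    lag_by_cluster: dict cluster_name -> {shard_id: lag_in_seconds},
        standing in for the lag values exported to shared files.
    """
    candidates = [
        (lags[shard_id], cluster)
        for cluster, lags in lag_by_cluster.items()
        # A cluster qualifies only if it reports the shard and its lag is
        # within the (assumed) readiness threshold.
        if shard_id in lags and lags[shard_id] <= max_lag_seconds
    ]
    return [cluster for _, cluster in sorted(candidates)]
```

For example, with a 60-second threshold, a cluster still replaying logs for a shard and reporting a 900-second lag would be skipped, and the query would go to a cluster that has caught up.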