In general, there are many reasons for a process to manage its own memory instead of relying on the operating system. Requesting and freeing memory from the operating system carries a high performance cost, so Okera minimizes the number of calls to the operating system to allocate or free memory. We use the tcmalloc library (http://goog-perftools.sourceforge.net/doc/tcmalloc.html), in addition to internal memory pooling, for efficiency.
Internal memory pooling behavior:
CDAS in its default configuration assumes it is not running on the same VMs as other resource-intensive processes, so returning memory to the OS provides little benefit. Even when other resource-intensive processes are present, there are better ways to tune the process; constantly trading memory back and forth with the OS leads to unpredictability, which is generally undesirable (we prefer VM-level isolation).
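The pooling behavior can be sketched as follows. This is a minimal, illustrative model (not Okera's actual implementation, which uses tcmalloc internally): freed buffers go back to an in-process free list keyed by size, so steady-state operation never calls back into the OS.

```python
from collections import defaultdict

class BufferPool:
    """Illustrative in-process buffer pool (hypothetical design)."""

    def __init__(self):
        # Free lists keyed by buffer size; buffers are recycled within
        # the process rather than returned to the OS.
        self._free = defaultdict(list)
        self.os_allocations = 0  # how often we had to "ask the OS"

    def acquire(self, size):
        free_list = self._free[size]
        if free_list:
            return free_list.pop()   # reuse a pooled buffer
        self.os_allocations += 1     # simulate an expensive OS request
        return bytearray(size)

    def release(self, buf):
        # Return the buffer to the pool instead of freeing it.
        self._free[len(buf)].append(buf)

pool = BufferPool()
for _ in range(1000):                # 1000 tasks, one buffer each
    buf = pool.acquire(64 * 1024)
    pool.release(buf)
print(pool.os_allocations)           # prints 1: only the first acquire hit the "OS"
```

After the first task, every subsequent acquire is served from the pool, which is the effect the text describes: memory is freed back to the worker, not the OS.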
For monitoring purposes, memory usage is not a proxy for service load. CPU usage and network IO are much better metrics to track.
Only the memory usage of the workers is expected to be high. All the other services maintain very little state.
The general calling pattern for a scan is:
- user -> planner, which returns a list of tasks. Since this only hits the planner, memory usage on the planner is minimal. You will see high IO and CPU load, as generating the splits can be compute- and IO-intensive for large datasets.
- For large datasets, the list of tasks will be far larger than the number of workers/number of cores in the compute cluster.
- For example, we may return 1000 tasks but the EMR cluster only runs 64 at a time (based on EMR cluster size, etc.).
- Continuing with this example, the 64 tasks (run from EMR) are distributed randomly across the workers.
- In a 10 node CDAS cluster, each node is expected to see 6-7 concurrent tasks, but you will see some variance in practice due to randomness. It would be unsurprising if the range across the workers were 4-10 concurrent tasks.
- When each task begins running, it starts using memory. When it is done, it frees that memory (to the worker, not OS as described above) and EMR will schedule the next task, which does the same thing.
- In this case, the peak memory required would be
- ~10 (peak number of concurrent tasks) * memory requirement per task
The memory requirement per task is typically low (a few buffers that read data from S3), but it is workload dependent. It is typically in the 10-100MB range per task, so in this case perhaps a peak usage of 1GB. In larger-scale scenarios with concurrent users, we may have 200 concurrent tasks per worker (so 2000 in a 10-node CDAS cluster), putting peak usage closer to 20GB, which we feel is acceptable for typical VMs.
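The back-of-the-envelope arithmetic above can be written out explicitly. All numbers here are the worked example's assumptions (10-100MB per task, 10 or 200 concurrent tasks per worker), not measurements:

```python
def peak_memory_mb(concurrent_tasks_per_worker, mb_per_task):
    # Peak per-worker memory = concurrent tasks * per-task requirement.
    return concurrent_tasks_per_worker * mb_per_task

# Small cluster example: ~10 concurrent tasks at up to 100MB each.
print(peak_memory_mb(10, 100))    # prints 1000 -> ~1GB peak per worker

# Heavier case: 200 concurrent tasks per worker at up to 100MB each.
print(peak_memory_mb(200, 100))   # prints 20000 -> ~20GB peak per worker
```

Because freed task memory stays pooled in the worker, a worker's resident memory settles near this peak rather than dropping back after each task completes, which is expected behavior, not a leak.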