- How do tasks correlate to sessions in the context of Okera?
- Can one client running one query, have multiple connections?
- Since queries run distributed, a client running a single query can have multiple connections. Okera imposes a limit of 255 concurrent sessions. However, when a query runs, not all sessions will be concurrent.
- For example:
- The ODAS cluster has 3 workers
- The EMR cluster has 15 nodes
- The client is Hive
- The user has a table with 10 partitions each containing ~10 MB
- The query selects from 7 partitions
- The planner creates 7 tasks.
- Note the number of tasks is dependent on the size of the data. Large partitions would generate more tasks.
- 7 of the EMR nodes would each open a session to one of the 3 ODAS workers.
- The planner determines the number of tasks based on criteria such as the amount of data, the complexity of the compute, ....The planner log shows the number of tasks.
- A session is an active connection from the client to an ODAS worker. The relationship is exactly 1:1. Each time a client connects, it picks an worker at random, so the number of sessions (active connections) per worker is expected to be roughly even.
- Non-distributed clients like the REST API will generate 1 session. Distributed clients like spark will generate sessions based on its size (~number of client cores), which is why Okera recommends the 10:1 client:ODAS core ratio.
- From a sizing point of view, the critical metric is the max sessions per server, which is the same as the max connections at any time across clients.