2015-01-30

To understand the problem I first have to give you an overview of our communcation mechanism.
We currently have the following setup:
5 PUs running clustered spaces, those spaces are logically alligned like a stack, so there are lower level PUs and higher level PUs:

*Spaces View*
Space E
Space D
Space C
Space B
Space A

Space A is the lowest level PU and is not allowed to reference any upper level PUs like Space B, C, D and E.
Space E is the highest level PU and is allowed to reference all its lower level PUs.

Inside those PUs we have "services" which write SpaceObjects to inform other "services" about some event or something. When written such a SpaceObject we refer to this like "sending a message".

*Spaces with Partitions and Services View*
*Space Name | Partition Count | Services per Partition*
Space E (1 partition) ServiceE1
Space D (2 partitions) ServiceD1
Space C (2 partitions) ServiceC1, ServiceC2, ServiceC3
Space B (2 partitions) ServiceB1, ServiceB2
Space A (2 partitions) ServiceA1, ServiceA2

When a service (E1) in a higher level space (Space E) wants to inform or tell a service (A1) in a lower level space (Space A) something the routing is easy.
Service (E1) just needs to write a `SpaceObject` (which is defined by Service (A1)) into the space where Service (A1) is "hosted". This `SpaceObject` is annotated with `SpaceRouting` and the proxy takes care of writting it into the correct partition.

The other way around when Service (A1) "publishes" some event to Service (E1) is more difficult
Because we decided that the lower level space in which Service (A1) is hosted is not allowed to reference the upper level space in which Service (E1) is hosted, we cannot just write a SpaceObject (from Space A) into the target space (Space E).

Our current solution for this is that we introduced a "Fetcher" component in each the upper level space partition which listens for events of the lower level services. So in the example Space B with two partitions there would one fetcher per partition.

How the Fetcher works:
Service (E1) has a dependency to Service (A1) because Service (A1) publishes some data which Service (E1) subscribed for.
Therefor Service (E1) says to the Fetcher in its partition (this is done for each partition) that the Fetcher should fetch messages from Service (A1) that have its own address as destination.

The Fetcher creates a polling container on space A with the following query:

"destination = ServiceE1 or destination = ServiceE2 or destiantion = ..."

So basically it will fetch all messages from Space A which are destinated for ServiceE1 or E2 or ...

Service (A1) just need to write the message with the correct destination in its local space and **one of** the Fetchers in the upper level space will pick it up.

And the problem here is that the Fetching is quite inefficient because the Fetchers are competing for the Messages often the Fetcher in the wrong partition picks up the SpaceObject from the lower level space and then the SpaceObject is in the wrong partition according the its RoutingKey. This means we need an **additional (network) hop** to write it to the correct partition.

We thought about 3 solutions to this problem:

1. Let the lower level space write the object into the upper level space, but this would introduce a bidirectional runtime dependency
2. Tell the lower level spaces about the partition count of the upper level spaces so that we can calculate the "correct" routing key already in the lower level space and adjust the query in the Fetchers so that they only take the objects which have the correct routing key. `"(prepComputedRoutingkey = ownPartitionId) AND (destination = ServiceE1 OR destination = ServiceE2)"`
3. Adjust the query in the fetcher to something like `"(routingkey % partitionCount = ownPartitionId) AND (destination = ServiceE1 OR destination = ServiceE2)"`. But I dont know if this is even possible and how big the performance hit would be because of not beeing able to use indexes.

Show more