Fault Tolerance
As with most application distribution systems, we provide the ability to fail over to a different server if the current server fails. Because the OpDesk Resource and System Management design doesn’t allow agents to communicate directly with the backend server, our fault tolerance deals mainly with failures to reach the Distribution/Collection Points. In fact, all of the backend server components (IIS, RasmServerService and the database) can be taken offline for short periods of time and the agents won’t even notice. When the agent determines that it can no longer reach the network share for its distribution/collection point, it looks for the next distribution/collection point in the same Region. If no other distribution/collection point in the Region can be located, the agent next tries the current distribution/collection point’s “fail over” distribution/collection point. Finally, if that also fails, the agent writes error messages to its log and goes offline, relying on the cached application metadata and package files it has already downloaded. Microsoft’s Distributed File System also provides several fault tolerance features, which our system takes advantage of automatically.
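The fail-over order described above can be sketched in a few lines of Python. This is only an illustration, not the agent's actual code: the function name choose_failover_point, its parameters, and the is_reachable callback are all hypothetical, and the sketch assumes that "next" means the first reachable share in list order.

```python
from typing import Callable, Optional, Sequence


def choose_failover_point(
    current: str,
    region_points: Sequence[str],
    failover_point: Optional[str],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Return the share to switch to, or None if the agent should go offline.

    All names here are hypothetical; is_reachable stands in for whatever
    check the agent uses to test a network share.
    """
    # Step 1: try the other distribution/collection points in the same Region.
    for share in region_points:
        if share != current and is_reachable(share):
            return share
    # Step 2: try the current point's designated "fail over" point.
    if failover_point is not None and is_reachable(failover_point):
        return failover_point
    # Step 3: nothing is reachable. The caller logs the errors and keeps
    # working from cached metadata and package files.
    return None
```

A caller would treat a None result as the signal to log the failure and go offline. For a quick experiment, is_reachable could be something as simple as os.path.isdir applied to the UNC path, although the real agent may test reachability differently.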
Load Balancing
At agent startup, the agent automatically picks a random distribution/collection point in the same region as its default distribution point. We select the distribution point at random to spread the load of all agents that belong to the same region. Immediately after the agent is first installed, it communicates only with the default distribution/collection point until it gets the list of other distribution/collection point shares in the same region. Administrators can also rely on a physical load balancer to spread the load: list only one distribution/collection point in the default region, then have its UNC path use a DNS name that points to the physical load balancer.
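The startup selection can be illustrated with a short Python sketch. The function name pick_startup_point and its parameters are hypothetical, and the sketch assumes that an empty region list means the agent has not yet received the list of shares, as on first install.

```python
import random
from typing import Optional, Sequence


def pick_startup_point(
    default_point: str,
    region_points: Sequence[str],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick the share an agent uses at startup (hypothetical helper)."""
    # Right after install the agent knows only its default point.
    if not region_points:
        return default_point
    # Otherwise choose uniformly at random so that agents in the same
    # region spread themselves across all of its shares.
    chooser = rng if rng is not None else random.Random()
    return chooser.choice(list(region_points))
```

Passing a seeded random.Random makes the choice repeatable for testing. In the load balancer setup, region_points would hold a single UNC path, so every agent picks the same path and the physical device does the spreading.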