I recently had the opportunity to install a geographically dispersed CCR Exchange 2007 cluster.
Server 2008’s failover clustering can now handle clusters whose nodes sit on separate subnets, so the fact that the only data centres available were connected at Layer 3 wasn’t a problem. I didn’t need to stretch a VLAN across physical sites.
Configuring the networking for the cluster went slightly against the grain for me. Essentially the private networking element has gone for these types of clusters, because all traffic, heartbeat included, has to go over the public network. That said, it was a simple process. I configured the networking using four NICs: three were teamed, and the fourth was on its own and set not to register in DNS, because I didn’t want client traffic coming over the single NIC.
When you set up the cluster you simply enter two IP addresses that the cluster can use, one for each subnet. On failover, the address that isn’t on the active node’s subnet stays offline. Sounds nice, doesn’t it? But wait.
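To make the two-address behaviour concrete, here is a minimal Python sketch of the idea. It is only an illustration, not how the cluster service is implemented, and the subnets and addresses are made up for the example.

```python
import ipaddress

# Hypothetical cluster addresses, one per site subnet (not from a real setup).
CLUSTER_IPS = [
    ipaddress.ip_interface("10.1.0.50/24"),  # site A
    ipaddress.ip_interface("10.2.0.50/24"),  # site B
]


def cluster_ip_states(active_node_ip):
    """Report which cluster address would be online for the active node.

    Only the address whose subnet contains the active node can come online;
    the other one stays offline until the cluster fails over.
    """
    node = ipaddress.ip_address(active_node_ip)
    return {
        str(ip.ip): "online" if node in ip.network else "offline"
        for ip in CLUSTER_IPS
    }


# Active node in site A, then after a failover to site B.
print(cluster_ip_states("10.1.0.11"))
print(cluster_ip_states("10.2.0.11"))
```

The point to take away is that the cluster name resolves to a different address after every cross-site failover, which matters for DNS later on.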
Even though you don’t have to stretch a VLAN anymore for this type of cluster, Exchange 2007 still requires the cluster nodes to be in the same Active Directory site. This means that if you are planning for the disaster of losing a site, you’ll need two DCs in each physical site, all in the same AD site, so that each node will always have a DC available if you lose one of the physical sites. You can’t use DC site coverage for this, as I discovered.
With the cluster set up, I built a combined HUB/CAS server in each physical site. Exchange will load balance mail flow across the Hub Transport servers by itself, but what about CAS connectivity? The Autodiscover service will handle the Outlook web services, such as OAB and Out of Office, but what about Outlook Web Access? On the same subnet you’d use NLB to give users a single resilient point of entry to OWA. That’s no good on separate subnets unless you have a hardware load balancer, which I didn’t. So OWA failover became a manual process using CNAMEs in DNS. Not the nicest of solutions.
Another issue… You can’t put a Public Folder Database on a CCR unless it’s the only Public Folder Database in the Exchange Organisation. So the Public Folders were to sit on the HUB/CAS servers, with content replication between the servers. But if one of those PF servers is lost, getting PF access back is a manual failover process: you need to change the default Public Folder Database for each Mailbox Database in the CCR. That’s the same for any Public Folder failure, though.
So now we have two parts of the failover that require manual intervention. Not nice. I was starting to dislike splitting my cluster across different subnets.
Issue number 3… When cluster failover occurs, the cluster IP changes. Unless all your clients sit in the same AD site as the cluster, the changed DNS record takes time to replicate to them. By default the TTL of cluster DNS names is 20 minutes, so in the worst case your clients could be waiting 15 minutes for AD replication plus 20 minutes for the DNS record to expire on their machines. 35 minutes is a long time, and not really acceptable. You can alleviate this by reducing the TTL of the record; I reduced mine to 3 minutes. Another change you can make is to enable change notification on the AD site links between the cluster’s AD site and the AD site or sites where the clients sit, so the updated record replicates almost immediately. Together these bring the failover time down to about 3 minutes. The last change we made was in Group Policy: we created a GPO that configured Outlook not to complain about connectivity issues for 4 minutes after disconnection from the Exchange server.
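The arithmetic behind those figures can be sketched in a few lines of Python. This is just a back-of-the-envelope model: the function name is mine, and treating replication as roughly zero minutes once change notification is enabled is a simplifying assumption.

```python
def worst_case_wait(replication_minutes, ttl_minutes):
    """Worst-case minutes before a client resolves the new cluster IP.

    The client first waits for the updated DNS record to replicate to its
    own AD site, then for its cached copy of the old record to expire.
    """
    return replication_minutes + ttl_minutes


# Defaults described above: 15 minutes replication, 20 minute TTL.
default_wait = worst_case_wait(replication_minutes=15, ttl_minutes=20)

# After tuning: change notification (assumed ~0 minutes) and a 3 minute TTL.
tuned_wait = worst_case_wait(replication_minutes=0, ttl_minutes=3)

# Hypothetical check against the 4 minute Outlook grace period set by GPO.
OUTLOOK_GRACE_MINUTES = 4
print(default_wait, tuned_wait, tuned_wait <= OUTLOOK_GRACE_MINUTES)
```

With the tuned values, the worst-case wait fits inside the 4 minute window during which Outlook stays quiet.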
This configuration meant that during a failover the majority of clients would not notice a problem unless they were sending emails and saw the messages sitting in their Outbox.
So with the exception of OWA and Public Folders, the system was quite acceptable. Then, just after I had covered off all of the above problems, space became available in our main data centre, and we could now stretch a VLAN between the sites. So I reconfigured the networking and put both nodes in the same subnet. And guess what, most of the problems above went away. The exception is Public Folder failover, but I can’t get these people to use the SharePoint servers available in the organisation, so I’m afraid they’ll just have to live with that :-).