Sunday, 29 November 2009

Upgrading Exchange 2007 Clusters to SP2 - Continued

This follows on from my previous post about delegated installs and upgrades to SP2, see here - http://daiowen.blogspot.com/2009/11/upgrading-exchange-2007-clusters-to-sp2.html

Microsoft has informed us that this will be classed as a bug; they are working on finding the cause before saying whether or not they will fix it.

Geographically Dispersed CCR Cluster

I recently had the opportunity to install a geographically dispersed CCR Exchange 2007 cluster.

Server 2008's failover clustering can now handle cluster nodes on separate subnets, so the fact that the only data centres available to me were connected at Layer 3 wasn't a problem: I didn't need to stretch a VLAN across the physical sites.

Configuring the networking for the cluster went slightly against the grain for me. Essentially, the private network has gone for these types of clusters, because all traffic, heartbeat included, has to go over the public network. That said, it was a simple process. I configured the networking using four NICs: three were teamed, and the fourth was on its own but set not to register its address in DNS, because I didn't want client traffic coming over the single NIC.
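The "not to register in DNS" part is just the equivalent of unticking "Register this connection's addresses in DNS" on that NIC. If you prefer the command line, something like the following netsh command should do it; the connection name "Replication" is a placeholder, and this assumes the standalone NIC doesn't need any DNS servers of its own.

# Stop the standalone NIC registering its address in DNS (and give it no DNS servers)
netsh interface ip set dns name="Replication" source=static addr=none register=none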

When you set up the cluster, you simply enter two IP addresses that it can use, one per subnet. On failover, the address that isn't on the active node's subnet just stays offline; you can see this from the command line, as below.
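A quick way to check is cluster.exe on either node; the resource name below is only an example, yours will be whatever the setup created.

# List all resources with their owner group, node and state - one of the two
# IP Address resources should show Online and the other Offline at any one time
cluster res

# Private properties of an individual IP Address resource, if you're curious
cluster res "IP Address 10.1.2.3" /priv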

Sounds nice, doesn't it? But wait. Even though you no longer have to stretch a VLAN for this type of cluster, Exchange 2007 still requires the cluster nodes to be in the same Active Directory site. This means that if you are planning for the disaster of losing a physical site, you'll need two DCs in each physical site, all within that one AD site, so that each node always has a DC available if you lose one of the physical sites. You can't use DC siteCoverage for this, as I discovered.
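If you want to sanity-check which AD site and DC each node has ended up with, nltest will tell you; domain.suffix below is a placeholder.

# Which AD site does this node believe it is in?
nltest /dsgetsite

# Which DC (and site) has the node located for the domain?
nltest /dsgetdc:domain.suffix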

With the cluster set up, I deployed a combined Hub Transport/CAS server in each physical site. Exchange will load balance mail flow across the Hub Transport servers by itself, but what about CAS connectivity? The Autodiscover service will take care of the web services side of things, such as the OAB and Out of Office, but what about Outlook Web Access? On a single subnet you'd use NLB to give users a single, resilient point of entry to OWA. That's no good across separate subnets unless you have a hardware load balancer, which I didn't. So OWA failover became a manual process using CNAMEs in DNS (sketched below). Not the nicest of solutions.
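The manual failover itself was nothing more than repointing a CNAME at the surviving Hub/CAS server. A rough sketch with dnscmd, where the DNS server, zone, record and host names are all placeholders for my environment:

# Delete the existing owa CNAME and repoint it at the surviving CAS
dnscmd dc01.domain.suffix /RecordDelete domain.suffix owa CNAME /f
dnscmd dc01.domain.suffix /RecordAdd domain.suffix owa CNAME hubcas2.domain.suffix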

Another issue… You can't host a Public Folder database on a CCR cluster unless it's the only Public Folder database in the Exchange organisation. So the Public Folders were to sit on the Hub/CAS servers, with content replication between the two. But in the event of losing one of those PF servers, getting PF access back is a manual process: you need to change the default Public Folder database for each mailbox database on the CCR, as sketched below. To be fair, that's the same for any Public Folder failure.
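The re-pointing is a one-liner per mailbox database in the Exchange Management Shell; a minimal sketch, with made-up server, storage group and database names:

# Point every mailbox database on the CMS at the surviving Public Folder database
Get-MailboxDatabase -Server CMS1 | Set-MailboxDatabase -PublicFolderDatabase "HUBCAS2\Second Storage Group\Public Folder Database"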

So now we had two parts of the failover that required manual intervention. Not nice; I was starting to dislike splitting my cluster across different subnets.

Issue number 3… When a cluster failover occurs, the cluster IP address changes. Unless all of your clients sit in the same AD site, that DNS record change will take time to replicate out to them. By default, the TTL on cluster DNS names is 20 minutes, so in the worst case your clients could be waiting 15 minutes for AD replication plus 20 minutes for the DNS record to expire on their machines. 35 minutes is a long time, and not really acceptable either.

You can alleviate this by reducing the TTL of the record; I reduced mine to 3 minutes. You can also enable change notification on the AD site links between the cluster's AD site and the AD site(s) where the clients sit, which brings the failover time down to about 3 minutes (see the sketch below). The final change we made was in Group Policy: we created a GPO that configures Outlook not to complain about connectivity issues until 4 minutes after losing its connection to the Exchange server.
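For reference, the TTL lives on the CMS's Network Name resource as the private property HostRecordTTL, and change notification is bit 0x1 of the options attribute on the relevant siteLink object. A rough sketch of both changes; every name below is a placeholder for my environment.

# Drop the TTL the Network Name resource registers in DNS to 180 seconds, then
# cycle the resource (in a maintenance window) so it re-registers with the new TTL
cluster res "Network Name (CMS1)" /priv HostRecordTTL=180
cluster res "Network Name (CMS1)" /offline
cluster res "Network Name (CMS1)" /online

# Enable change notification (bit 0x1 of "options") on the site link between the
# cluster's AD site and the clients' AD site. This assumes "options" isn't already
# set on the link; if it is, OR the existing value with 1 instead.
$link = [ADSI]"LDAP://CN=ClusterSite-ClientSite,CN=IP,CN=Inter-Site Transports,CN=Sites,CN=Configuration,DC=domain,DC=suffix"
$link.Put("options", 1)
$link.SetInfo()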

This configuration meant that during a failover, the majority of clients wouldn't notice a problem unless they were sending e-mails and noticed them sitting in their Outbox.

So, with the exception of OWA and Public Folders, the system was quite acceptable. Just after covering off all of the above problems, space became available in our main data centre, which meant we could finally stretch a VLAN between the sites. So I reconfigured the networking and put each node in the same subnet, and guess what: most of the problems above went away. The exception is Public Folder failover, but I can't get these people to use the SharePoint servers available in the organisation, so I'm afraid they'll just have to live with that  :-).

Tuesday, 17 November 2009

Duplicate legacyExchangeDN Properties

Had a case recently that wasn’t immediately obvious to resolve.

We had reports of a user that no one was able to e-mail due to duplicate addressing, but at first look there were no duplicate addresses on the object. We were receiving the following NDRs:

There is a problem with the recipient's e-mail system. More than one user has this e-mail address. The recipient's system administrator will have to fix this. Microsoft Exchange will not try to redeliver this message for you. Please provide the following diagnostic text to your system administrator and then try resending the message after the problem has been resolved.

IMCEAEX-_O=ORGNAME_OU=EXCHANGE+20ADMINISTRATIVE+20GROUP+20+28FYDIBOHF23SPDLT+29_CN=RECIPIENTS_CN=NAME+2ESURNAME@DOMAIN.SUFFIX
#550 5.1.4 RESOLVER.ADR.Ambiguous; ambiguous address ##

Further investigations showed that there was a problem with the way that the user was shown in the Exchange Address Books. It seemed as though the object was being confused with another user with the same name.

Comparing the properties of the two users revealed that their legacyExchangeDN attributes were identical. As a result, the two objects were being confused in the Address Lists and nobody was able to e-mail either of them.
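If you want to check for other clashes proactively, the Exchange Management Shell makes it reasonably easy; a rough sketch that only covers mailboxes (and can take a while in a large organisation):

# Group mailboxes by legacyExchangeDN and show any value shared by more than one
Get-Mailbox -ResultSize Unlimited |
    Group-Object LegacyExchangeDN |
    Where-Object { $_.Count -gt 1 } |
    ForEach-Object { $_.Group | Select-Object Name, LegacyExchangeDN }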

The resolution was to change the container name (the cn portion of the legacyExchangeDN) that represents the user to another unique value; we changed ours to the user's sAMAccountName.

/o=EXCHORG/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=firstname.surname

to

/o=EXCHORG/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=sAMAccountName
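The change itself is just an edit of the legacyExchangeDN attribute in AD (ADSI Edit will do it, or a couple of lines of script). A minimal sketch; the user's DN and the new value below are placeholders.

# Stamp a new, unique legacyExchangeDN onto the duplicate object
$user = [ADSI]"LDAP://CN=Name Surname,OU=Users,DC=domain,DC=suffix"
$user.Put("legacyExchangeDN", "/o=EXCHORG/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=samaccountname")
$user.SetInfo()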

The only problem with renaming this value is that it breaks the ability to reply to older messages unless the senders' Outlook caches are cleared, because the cached entries still resolve to the old legacyExchangeDN.

As to how this happened, we believe it’s because we have multiple installations of the Quest Migration tools running against the same AD domain, and they happened to be migrating a user with the same name and populated the property with the same value.

Friday, 13 November 2009

Upgrading Exchange 2007 Clusters to SP2

In the Exchange organisation I look after at work, we have quite a few Exchange clusters: SCR and SCC clusters across multiple sites, run by different subordinate administrators.

With the release of SP2 for Exchange 2007, we set about testing it and getting it rolled out. Unfortunately, our test lab doesn't include any clusters (something we'll have to address now), but I digress.

We installed SP2 without issue on the Exchange servers we manage ourselves; again, no clusters.

When it came time for the local admins to install SP2, they hit a problem on their Exchange clusters. Following the steps described in this TechNet article - http://technet.microsoft.com/en-us/library/bb676320.aspx - their attempts failed with the following error…

You must be a member of the 'Exchange Organization Administrators' or 'Enterprise Administrators' group to continue.

On inspection of ExchangeSetup.log, the prerequisites check had failed with the following error:

[ERROR] The operation could not be performed because object '<server>' could not be found on domain controller '<domaincontroller>.<domain>'.

The install works fine with Exchange Organization Administrator permissions, but it's not ideal to go around each cluster and do the upgrades ourselves; we have quite a few, and we don't want the blame for any subsequent failures.
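For anyone following the same TechNet procedure, the cluster part of the upgrade boils down to roughly the commands below; node and CMS names are placeholders, the article covers the exact order and the steps I've glossed over, and it's the prerequisites check during setup that falls over when run with delegated permissions only.

# On the passive node: upgrade the Exchange binaries to SP2
.\Setup.com /mode:Upgrade

# Move the clustered mailbox server onto the upgraded node, then upgrade the CMS
Move-ClusteredMailboxServer -Identity CMS1 -TargetMachine NODE2 -MoveComment "SP2 upgrade"
.\Setup.com /UpgradeCMS

# Finally, repeat Setup.com /mode:Upgrade on the remaining (now passive) node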

We logged a call with Microsoft over a week ago now and have been troubleshooting with them; they can reproduce our problem in their labs. Until they come back with a fix, it looks like we'll have to upgrade the clusters ourselves.

I’ll post an update as soon as / if Microsoft come back to us with a solution.