Wednesday, February 11, 2009

Why some applications malfunction when one of the Domain Controllers is down?

Why some applications malfunction when one of the Domain Controllers is down?
- Or -
How to switch to disaster recovery site without booting my clients

Every Sysadmin installs at least two Domain Controllers on his domain for redundancy and
fault tolerance. But what actually happens when one of the DC's is down?

If you do a simple and disconnect one of your DCs from the network, you'll see that about
half of the workstations and member server who hasn't booted since the DC is down are
experiencing problems such as sluggishness, performance issues and some of the
applications simply stop working. The reason for that is the way Netlogon works.

Netlogon is the process which is responsible, among other tasks, to detect Active Directory
environment and the closest DC. The detection process is called DC Locator.

It is implemented in the NetAPI.DLL in a function named dsGetDCName and invoked by the
Netlogon service when the service starts. The DC Locator process sends a request to all
Domain Controllers in the domain and waits for them to respond. Once responded,
Netlogon caches the Domain Controller who was first to respond and saves its details
in the cache. From that moment, every call made by any application for the dsGetDCName
function returns this DC.

The DC Locator process does not re-check the availability of the cached DC periodically..
Therefore, if this DC is gone for any reason, workstations and member servers who have already
cached this DC remain with the faulty cache until the workstation is rebooted. As a result,
any application that needs to access the DC (and call the dsGetDCName for it) receives the
faulted DC and is expected to have problems when trying to connect to it.

In the last years, fault tolerance became an essential requirement in many organizations.
Many enterprises implement expensive disaster recovery sites, buy expensive clusters and
replicate data to at least one additional location.

When the disaster does happen and the main site is going down, this limitation will cause
you lots of trouble until you reboot your entire organization.

No comments: