Hi everyone,
We have quite a few customers who have multiple sites connected by OpenVPN. The VPN side of things is working just fine. However, in some cases we have a domain controller located at the VPN server site and need .local domain queries to be forwarded to the DC.
At the remote client site, I add a domain override to the DNS Resolver on the satellite pfSense (CE or Plus, it varies from site to site).
E.g. site ABC uses a WireGuard VPN connecting 2 satellite sites (both running Netgate 2100 v23.09.1) to head office (Netgate 8200 v23.09.1), where there is a domain controller running on 192.168.10.3.
The following domain override is set on both satellite Netgate 2100s:
DOMAIN: abc.local
LOOKUP SERVER IP ADDR: 192.168.10.3
This usually works fine. NSLOOKUP queries for the abc.local domain sent to the pfSense resolve OK, but at seemingly random times the domain forward stops working and the resolver replies with NXDOMAIN or other generic failures.
Sorry I'm not able to reproduce this exact error at the time of writing this post.
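In case it helps with capturing the failure when it does happen, here is a minimal sketch of a probe that could be left running on a machine at a satellite site. It is not part of any pfSense tooling: the function names (build_query, parse_rcode, probe, watch) are made up for illustration, and it assumes Python 3 is available on whatever host runs it. It sends a plain UDP A query to the resolver and logs the response code with a timestamp, so an NXDOMAIN or SERVFAIL for the override domain can be lined up against the resolver logs.

```python
import socket
import struct
import time

# Common DNS response codes (RFC 1035); anything else is shown as RCODE<n>.
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, query_id=0x1234, qtype=1):
    """Build a minimal DNS query packet (A record by default, recursion desired)."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN


def parse_rcode(response):
    """Return (rcode_name, answer_count) taken from the DNS response header."""
    if len(response) < 12:
        return "SHORT", 0
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    rcode = flags & 0x000F
    return RCODES.get(rcode, f"RCODE{rcode}"), ancount


def probe(server, name, timeout=3.0):
    """Send one UDP query to server:53 and return (rcode_name, answer_count)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            data, _ = sock.recvfrom(512)
        except OSError as exc:  # includes timeouts
            return f"ERROR({exc.__class__.__name__})", 0
    return parse_rcode(data)


def watch(server, names, interval=60.0, rounds=None):
    """Probe every name each interval; rounds=None keeps going until stopped."""
    done = 0
    while rounds is None or done < rounds:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for name in names:
            rcode, answers = probe(server, name)
            print(f"{stamp} {server} {name} rcode={rcode} answers={answers}")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
```

Calling something like watch("10.88.116.1", ["abc-dcserver.abc.local", "cisco.com"]) with the addresses from this post would log both an override lookup and a normal recursive lookup side by side, which should show whether only the forwarded domain is affected at the moment it breaks.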
During this time, recursive lookups still work OK on the local resolvers:
cisco.com
Server: abc-pm.abc.local
Address: 10.88.116.1
Non-authoritative answer:
Name: cisco.com
Addresses: 2001:420:1101:1::185
72.163.4.185
In almost all cases, restarting the DNS Resolver service on the client pfSense gets it all back up and running again:
[23.09.1-RELEASE][root@abc-ch.abc.local]/root: nslookup
abc-dcserver
Server: 127.0.0.1
Address: 127.0.0.1#53
Name: abc-dcserver.abc.local
Address: 192.168.10.3
My initial workaround was simply to put the DC's IP address as the primary DNS entry in the DHCP scope and the local pfSense as secondary. This actually worked, but with a horrible side effect: Windows PCs would often report "No internet" on the NCSI (Network Connectivity Status Indicator). That doesn't usually impact functionality, until you try to re-activate Office software (for example) and it won't even try while the indicator says there is no internet.
This problem with the NCSI goes away instantly when the primary DNS is on the same subnet as the client.
If anyone knows how to fix THAT problem it would be wonderful too, though it is out of scope for this forum.
There do not appear to be any errors in the DNS Resolver logs.
I have seen other people have similar problems to this but I have not managed to find a working solution.
Since it seems to be happening randomly, I am thinking of setting up a cron job to restart the unbound service periodically, but I am not very keen on this idea. It's only hiding the problem rather than fixing it.
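If a restart workaround does end up being used as a stopgap, one option is to restart only when a lookup of the override domain actually fails, which also leaves a timestamp for each failure. Below is a rough sketch of that idea, not a tested pfSense script. The names lookup_ok, restart_unbound and check_and_restart are hypothetical, and it assumes that Python 3 and the `host` utility are available on the firewall and that `pfSsh.php playback svc restart unbound` is the right restart command for the installed version, all of which would need checking first.

```python
import subprocess
import time

# Assumption: this is the service restart command on the firewall; verify it
# against the installed pfSense version before relying on it.
RESTART_CMD = ["pfSsh.php", "playback", "svc", "restart", "unbound"]


def lookup_ok(name="abc-dcserver.abc.local", server="127.0.0.1", timeout=5):
    """Return True if `host` resolves name through the given resolver."""
    try:
        result = subprocess.run(
            ["host", "-W", str(timeout), name, server],
            capture_output=True,
            timeout=timeout + 2,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def restart_unbound():
    """Log the time and run the (assumed) restart command."""
    print(time.strftime("%Y-%m-%d %H:%M:%S"), "override lookup failed, restarting unbound")
    subprocess.run(RESTART_CMD, check=False)


def check_and_restart(check, restart, attempts=3, delay=5.0, sleep=time.sleep):
    """Call restart() only if check() fails on every attempt.

    Returns True if a restart was triggered, False otherwise.
    """
    for attempt in range(attempts):
        if check():
            return False
        if attempt < attempts - 1:
            sleep(delay)
    restart()
    return True
```

A cron entry could then run check_and_restart(lookup_ok, restart_unbound) every few minutes. Requiring several failed attempts before restarting is a deliberate choice so that a single dropped packet over the VPN does not bounce the resolver. This still only masks the underlying problem, but the logged timestamps might help narrow down what triggers it.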
We are also using the Service Watchdog package to restart unbound if it crashes, FWIW.
This issue has persisted on various devices over many months. We have some devices still using ISC DHCP and some now using Kea, but I doubt this is relevant.
Has anyone encountered and resolved this issue?
Thanks in advance for your time.