pfSense Forum

pfSense English Support => Routing and Multi WAN => Topic started by: dayer on September 17, 2017, 10:03:23 am

Title: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: dayer on September 17, 2017, 10:03:23 am
Hi. Sorry if this message belongs in the CARP section. I think the situation is related to both CARP and Multi-WAN.
pfSense 2.3.4-p1

I have this scenario:
Code: [Select]

                            |---(WAN1)----------|
            |---Pfsense1====                    |
            |      |        |---(WAN2)--|       |
            |      |                    |       |
PC --(LAN)--| (Sync)                 GW2     GW1
            |      |                    |       |
            |      |        |---(WAN2)--|       |
            |---Pfsense2====                    |
                            |---(WAN1)----------|



In this scenario all traffic from the PC to the outside correctly goes through GW1. However, if I ping an Internet address from the PC with Pfsense1 as master and then temporarily disable CARP on Pfsense1, Pfsense2 becomes the new master and the ping breaks. With a TCP connection such as SSH, the result is the same.
I've been monitoring the traffic with tcpdump and I've noticed that Pfsense2 tries to send this traffic through GW2 instead of GW1, and even uses the WAN1 VIP address as the source.
If I restart the ping, or close and reopen the SSH connection, the new traffic goes through GW1 as expected. But I can't find the reason why the existing traffic is forwarded through the system default gateway (GW2) instead of following the policy routing (GW1 only).

Thanks
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: dayer on September 18, 2017, 08:13:58 am
I've also tested:
Code: [Select]
/root: netstat -r
Routing tables

Internet:
Destination        Gateway            Flags      Netif Expire
8.8.4.4            192.168.1.1        UGHS   lagg0_vl
[...]
and a ping from the firewall to the outside can't find a route:
Code: [Select]
/root: ping -S 127.0.0.1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) from 127.0.0.1: 56 data bytes
ping: sendto: No route to host

However, a ping sourced from the WAN1 IP works:
Code: [Select]
/root: ping -S 192.168.1.111 8.8.8.8
PING 8.8.8.8 (8.8.8.8) from 192.168.1.111: 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=57 time=3.357 ms

I think this could be related to this thread (https://forum.pfsense.org/index.php?topic=102053.0) and bug #5476 (https://redmine.pfsense.org/issues/5476)  :(
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: dayer on September 22, 2017, 08:25:42 pm
I've simulated and simplified that situation with pfSense 2.4.0-RC using three virtual machines in VirtualBox, with the host computer doing NAT (see the attached file).
However, the problem is the same.

Settings:

Reproduce:

From my point of view, when a pfSense unit becomes the new master while traffic is flowing, it tries to route that traffic only according to the routing table (or static routes) and ignores the policy routing.

Is this the normal behavior in pfSense? Do I need any special rules? Could it be a bug?

PS: I've fixed the thread subject.
PS2: I've also posted the first scenario on reddit (https://www.reddit.com/r/PFSENSE/comments/71jdqk/multiwan_with_ha_established_connections_through/) and another person there sees the same behaviour.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: dayer on September 27, 2017, 07:10:27 pm
I'm attaching some information that I'm also using to explain the problem in this mailing list thread (http://lists.pfsense.org/pipermail/list/2017-September/thread.html#11203):
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: luckman212 on September 29, 2017, 10:25:43 pm
Interesting problem. Sorry I don't have a solution right now, but I'll be following this thread.

If, in your "Reproduce" section above, instead of stopping the ping and restarting it in step#6 - you go to Diag>States>Reset States and kill all states, does the ping start succeeding again?
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: dayer on October 01, 2017, 08:18:05 am
Thank you for your interest :)

I've followed your suggestion for step #6:
And in the successful cases, if Pfsense1 recovers the master role, the ping keeps succeeding.
I think the problem is related to how the routes are kept in established states on the LAN interface.

Could it be considered a bug?
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: luckman212 on October 01, 2017, 08:48:49 am
Ok, at least you're a little closer to identifying the culprit. Unfortunately I don't have much experience with CARP, so I will have to let someone else respond as to whether this is a config issue or a bug, and if it's a bug, whether the bug lies in pfSense or FreeBSD itself. If it's the latter, we'll have to file it upstream.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: dayer on October 01, 2017, 02:06:38 pm
One more thing.
I've repeated the last tests with SSH instead of ping, but I haven't managed to recover the SSH session.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: Derelict on October 01, 2017, 03:02:12 pm
Quote
Could it be considered a bug?

Probably a configuration issue or an issue in your virtual environment. I have done countless tests like you are doing. Up to and including failing over during live video streams, etc and state sync works great.

I just pinged through from behind my 2.4.0-RC HA Multi-WAN VM pair last night when I updated it to 2.4.1-DEV. Worked great failing over and back.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: dayer on October 01, 2017, 05:16:59 pm
Thank you very much for your tests.

Regarding the failover during live video streams: were those using the default gateway, or a secondary gateway selected by a firewall rule or a failover gateway group? This difference is important because I've only found the problem when traffic is using a non-default gateway while I change the master unit.

I've done several tests, with virtual and physical machines, with and without LACP, but the behavior has always been the same  :-\
I attached my config in reply #3 a few days ago to clear up any doubt about the settings.

Do you know of a more detailed guide for this purpose than Multi-WAN (http://Multi-WAN) + Configuring pfSense Hardware Redundancy (CARP) (https://doc.pfsense.org/index.php/Configuring_pfSense_Hardware_Redundancy_(CARP)) from PFSenseDocs, or High Availability » Multi-WAN with HA from The pfSense Book?
Could you share your tests configuration?
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: Derelict on October 01, 2017, 05:42:23 pm
Policy routing all LAN pings out WAN2. WAN is the default gateway. Pinging 8.8.8.8:

States on Primary/MASTER:


LAN    icmp    172.25.236.227:29353 -> 8.8.8.8:29353    0:0    139 / 139    11 KiB / 11 KiB    
WAN2    icmp    172.25.227.17:38069 (172.25.236.227:29353) -> 8.8.8.8:38069    0:0    139 / 139    11 KiB / 11 KiB


States on Secondary/BACKUP:


LAN    icmp    172.25.236.227:29353 -> 8.8.8.8:29353    0:0    0 / 0    0 B / 0 B    
WAN2    icmp    172.25.227.17:38069 (172.25.236.227:29353) -> 8.8.8.8:38069    0:0    0 / 0    0 B / 0 B


Persistent CARP maintenance mode on Primary:

States on secondary start seeing the traffic.


LAN    icmp    172.25.236.227:29353 -> 8.8.8.8:29353    0:0    23 / 23    2 KiB / 2 KiB    
WAN2    icmp    172.25.227.17:38069 (172.25.236.227:29353) -> 8.8.8.8:38069    0:0    23 / 23    2 KiB / 2 KiB


Client dropped three pings then continued. That's about right for a failover event.

States still exist on primary but are not seeing any traffic:


LAN    icmp    172.25.236.227:29353 -> 8.8.8.8:29353    0:0    239 / 239    20 KiB / 20 KiB    
WAN2    icmp    172.25.227.17:38069 (172.25.236.227:29353) -> 8.8.8.8:38069    0:0    239 / 239    20 KiB / 20 KiB


Leave Persistent CARP maintenance mode on Primary. States on primary are seeing the traffic again (239 now 262):


LAN    icmp    172.25.236.227:29353 -> 8.8.8.8:29353    0:0    262 / 262    21 KiB / 21 KiB    
WAN2    icmp    172.25.227.17:38069 (172.25.236.227:29353) -> 8.8.8.8:38069    0:0    262 / 262    21 KiB / 21 KiB


Same results using temporary enabling and disabling of CARP on the primary.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: Derelict on October 01, 2017, 06:36:59 pm
And FWIW, here's a TCP session. Policy routed all outbound SSH out WAN2:

States on Primary:


LAN    tcp    172.25.236.227:46380 -> 192.168.223.6:22    ESTABLISHED:ESTABLISHED    75 / 81    7 KiB / 10 KiB    
WAN2    tcp    172.25.227.17:21325 (172.25.236.227:46380) -> 192.168.223.6:22    ESTABLISHED:ESTABLISHED    75 / 81    7 KiB / 10 KiB


States on Secondary:


LAN    tcp    172.25.236.227:46380 -> 192.168.223.6:22    ESTABLISHED:ESTABLISHED    0 / 0    0 B / 0 B    
WAN2    tcp    172.25.227.17:21325 (172.25.236.227:46380) -> 192.168.223.6:22    ESTABLISHED:ESTABLISHED    0 / 0    0 B / 0 B


I failed back and forth a couple times. At no point did the ssh session drop. Noticed a couple delays in output but TCP did its thing and no data was lost in the session.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: dayer on October 02, 2017, 09:51:18 am
Thank you Derelict :)

I've been simulating your tests.


WAN is the default gateway:
Code: [Select]
WAN1 (default)      WAN1 xxx.xxx.xxx.xxx     8.8.8.8     WAN1 Gateway
WAN2                WAN2 192.168.1.1         8.8.4.4     WAN2 Gateway

Policy routing all LAN traffic to outside go through WAN2:
Code: [Select]
States      Protocol    Source  Port    Destination     Port    Gateway     Queue   Schedule    Description
5 /3.80 MiB IPv4*       *       *       internals       *       *           none                Internal traffic to default gateway
0 /254 KiB  IPv4*       *       *       *               *       WAN2        none                The rest to WAN2

NAT for internal traffic to outside:
Code: [Select]
WAN1    internals * * * xxx.xxx.xxx.xxy     * NAT from internal networks
WAN2    internals * * * 192.168.1.100       * NAT from internal networks
internals is an alias with internal networks.

Pinging 208.123.73.69

States on Primary/MASTER:

Code: [Select]
LAN     icmp    172.16.103.2:10618 -> 208.123.73.69:10618                           0:0     69 / 69     6 KiB / 6 KiB
WAN2    icmp    192.168.1.100:57632 (172.16.103.2:10618) -> 208.123.73.69:57632     0:0     69 / 69     6 KiB / 6 KiB

States on Secondary/BACKUP:
Code: [Select]
LAN icmp 172.16.103.2:10618 -> 208.123.73.69:10618                           0:0     0 / 0 0 B / 0 B
WAN2 icmp 192.168.1.100:57632 (172.16.103.2:10618) -> 208.123.73.69:57632     0:0     0 / 0 0 B / 0 B

Persistent CARP maintenance mode on Primary:

States on secondary start seeing the traffic but something goes wrong. The number of packets observed matching the state from the destination side is zero.

Code: [Select]
LAN icmp 172.16.103.2:10618 -> 208.123.73.69:10618                         0:0 58 / 0 5 KiB / 0 B
WAN2 icmp 192.168.1.100:57632 (172.16.103.2:10618) -> 208.123.73.69:57632 0:0 58 / 0 5 KiB / 0 B

Client doesn't get ping responses.

States still exist on primary but are not seeing any traffic (which is reasonable):
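
The "58 / 0" pattern above (packets leaving, no replies counted) can be spotted automatically. Here is a minimal Python sketch, assuming the state lines are pasted as plain text in the layout shown in this thread and that the first "N / M" pair on a line is the packet counter. The function name and the `min_sent` threshold are made up for illustration; this is not a pfSense tool.

```python
import re

# Assumption: the first "N / M" pair on a state line is the packet counter
# (packets from the source side / packets from the destination side).
PACKETS_RE = re.compile(r"(\d+) / (\d+)")


def one_way_states(state_lines, min_sent=5):
    """Return the lines whose state shows outbound packets but zero replies."""
    stuck = []
    for line in state_lines:
        match = PACKETS_RE.search(line)
        if not match:
            continue
        sent, received = int(match.group(1)), int(match.group(2))
        # min_sent is a hypothetical threshold to ignore brand-new states.
        if sent >= min_sent and received == 0:
            stuck.append(line)
    return stuck


secondary_after_failover = [
    "LAN icmp 172.16.103.2:10618 -> 208.123.73.69:10618 0:0 58 / 0 5 KiB / 0 B",
    "WAN2 icmp 192.168.1.100:57632 (172.16.103.2:10618) -> 208.123.73.69:57632 0:0 58 / 0 5 KiB / 0 B",
]
for line in one_way_states(secondary_after_failover):
    print("no replies:", line)
```

Idle synced states (0 / 0) and healthy ones (69 / 69) are not reported; only states that keep sending without any answer are.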

Code: [Select]
LAN icmp 172.16.103.2:10618 -> 208.123.73.69:10618                         0:0 368 / 368 30 KiB / 30 KiB
WAN2 icmp 192.168.1.100:57632 (172.16.103.2:10618) -> 208.123.73.69:57632 0:0 368 / 368 30 KiB / 30 KiB

Leave Persistent CARP maintenance mode on Primary. States on primary are seeing the traffic again (368 now 372):

Code: [Select]
LAN icmp 172.16.103.2:10618 -> 208.123.73.69:10618                         0:0 372 / 372 31 KiB / 31 KiB
WAN2 icmp 192.168.1.100:57632 (172.16.103.2:10618) -> 208.123.73.69:57632 0:0 372 / 372 31 KiB / 31 KiB



SSH to aaa.bbb.ccc.ddd (I've replaced the public IP for security reasons):

States on Primary:

Code: [Select]
LAN tcp 172.16.103.2:43290 -> aaa.bbb.ccc.ddd:22                        ESTABLISHED:ESTABLISHED 138 / 126 12 KiB / 20 KiB
WAN2 tcp 192.168.1.100:54741 (172.16.103.2:43290) -> aaa.bbb.ccc.ddd:22  ESTABLISHED:ESTABLISHED 138 / 126 12 KiB / 20 KiB

States on Secondary:
Code: [Select]
LAN tcp 172.16.103.2:43290 -> aaa.bbb.ccc.ddd:22                        ESTABLISHED:ESTABLISHED 0 / 0 0 B / 0 B
WAN2 tcp 192.168.1.100:54741 (172.16.103.2:43290) -> aaa.bbb.ccc.ddd:22  ESTABLISHED:ESTABLISHED 0 / 0 0 B / 0 B

Persistent CARP maintenance mode on Primary:

States on secondary start seeing the traffic, but something appears wrong:
Code: [Select]
LAN tcp 172.16.103.2:43290 -> aaa.bbb.ccc.ddd:22                        ESTABLISHED:ESTABLISHED 13 / 3 2 KiB / 708 B
WAN2 tcp 192.168.1.100:54741 (172.16.103.2:43290) -> aaa.bbb.ccc.ddd:22  ESTABLISHED:ESTABLISHED 13 / 3 2 KiB / 708 B


With tcpdump on WAN1, I see that the Secondary firewall is routing through WAN1 while using the WAN2 VIP address as the NAT source:

Code: [Select]
14:09:03.973890 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [.], ack 1150910396, win 593, options [nop,nop,TS val 3297622232 ecr 1263162549], length 0
14:09:07.569891 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [.], ack 1, win 593, options [nop,nop,TS val 3297625828 ecr 1263163448,nop,nop,sack 1 {4294967113:1}], length 0
14:09:08.810847 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], seq 0:52, ack 1, win 593, options [nop,nop,TS val 3297627069 ecr 1263163448], length 52
14:09:08.978668 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], seq 52:104, ack 1, win 593, options [nop,nop,TS val 3297627237 ecr 1263163448], length 52
14:09:09.035602 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], seq 52:104, ack 1, win 593, options [nop,nop,TS val 3297627294 ecr 1263163448], length 52
14:09:09.122214 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], seq 104:156, ack 1, win 593, options [nop,nop,TS val 3297627380 ecr 1263163448], length 52
14:09:09.180582 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], seq 0:156, ack 1, win 593, options [nop,nop,TS val 3297627439 ecr 1263163448], length 156
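
A capture like the one above can be checked mechanically for packets that leave one WAN carrying the other WAN's address. This is only a sketch in Python: it assumes WAN2 is 192.168.1.0/24 (the gateway and VIP are from this thread, the /24 mask is a guess), the tcpdump lines are abbreviated, and the function name is made up for illustration.

```python
import ipaddress
import re

# Assumption: WAN2 is 192.168.1.0/24. The gateway (192.168.1.1) and the VIP
# (192.168.1.100) appear in the thread; the /24 mask is a guess.
OTHER_WAN_SUBNETS = {"WAN2": ipaddress.ip_network("192.168.1.0/24")}

# Matches the "IP <src-ip>.<src-port> > " part of a tcpdump line.
SRC_RE = re.compile(r"IP (\d+\.\d+\.\d+\.\d+)\.\d+ > ")


def misrouted(lines, other_wans=OTHER_WAN_SUBNETS):
    """Return (source_ip, wan_name) for packets sourced from another WAN's subnet."""
    hits = []
    for line in lines:
        match = SRC_RE.search(line)
        if not match:
            continue
        src = ipaddress.ip_address(match.group(1))
        for name, net in other_wans.items():
            if src in net:
                hits.append((str(src), name))
    return hits


# Abbreviated lines from a capture taken on WAN1.
capture_on_wan1 = [
    "14:09:03.973890 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [.], length 0",
    "14:09:08.810847 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], length 52",
]
print(misrouted(capture_on_wan1))
```

Any hit means a packet seen on WAN1 was already translated to the WAN2 VIP, which is the symptom described here.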

States still exist on primary but are not seeing any traffic (which is reasonable):

Code: [Select]
LAN tcp 172.16.103.2:43290 -> aaa.bbb.ccc.ddd:22                        ESTABLISHED:ESTABLISHED 232 / 220 17 KiB / 34 KiB
WAN2 tcp 192.168.1.100:54741 (172.16.103.2:43290) -> aaa.bbb.ccc.ddd:22  ESTABLISHED:ESTABLISHED 232 / 220 17 KiB / 34 KiB

Leave Persistent CARP maintenance mode on Primary. States on primary are seeing the traffic again (232 now 294) and the SSH terminal replies again:
Code: [Select]
LAN tcp 172.16.103.2:43290 -> aaa.bbb.ccc.ddd:22                        ESTABLISHED:ESTABLISHED 294 / 291 22 KiB / 80 KiB
WAN2 tcp 192.168.1.100:54741 (172.16.103.2:43290) -> aaa.bbb.ccc.ddd:22  ESTABLISHED:ESTABLISHED 294 / 291 22 KiB / 80 KiB

I'm going to check our differences.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: dayer on October 07, 2017, 06:39:07 pm
@Derelict, which hypervisor did you test with? VirtualBox? VMware? Could you share your two XML config files?

I've tested that situation with VirtualBox (before and after a factory reset, to redo and check the settings) and also with two identical physical machines. No success.
We're considering using pfSense, and being sure this functionality works is important to us.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: Derelict on October 07, 2017, 08:44:06 pm
XenServer. No, I do not have the configurations from those tests any more.

The real key is how you are testing. Note that TRex was generating about 350K states there.

The hypervisor used will be of no consequence to what gets policy routed where.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: dayer on October 07, 2017, 09:31:11 pm
Thank you, Derelict.

I can't find why a state established on pfSense1 isn't routed successfully when pfSense2 is the master.
I'm talking about this example (rules here (https://forum.pfsense.org/index.php?topic=136739.msg751527#msg751527)):

Pinging 208.123.73.69

States on Primary/MASTER:

Code: [Select]
LAN     icmp    172.16.103.2:10618 -> 208.123.73.69:10618                           0:0     69 / 69     6 KiB / 6 KiB
WAN2    icmp    192.168.1.100:57632 (172.16.103.2:10618) -> 208.123.73.69:57632     0:0     69 / 69     6 KiB / 6 KiB

States on Secondary/BACKUP:
Code: [Select]
LAN icmp 172.16.103.2:10618 -> 208.123.73.69:10618                           0:0     0 / 0 0 B / 0 B
WAN2 icmp 192.168.1.100:57632 (172.16.103.2:10618) -> 208.123.73.69:57632     0:0     0 / 0 0 B / 0 B

Persistent CARP maintenance mode on Primary:

States on secondary start seeing the traffic but something goes wrong. The number of packets observed matching the state from the destination side is zero.

Code: [Select]
LAN icmp 172.16.103.2:10618 -> 208.123.73.69:10618                         0:0 58 / 0 5 KiB / 0 B
WAN2 icmp 192.168.1.100:57632 (172.16.103.2:10618) -> 208.123.73.69:57632 0:0 58 / 0 5 KiB / 0 B

It's as if the policy routing is ignored in this situation and the firewall tries to route the established traffic through the default gateway (while still applying the NAT of the other WAN).
I've also tried floating rules, without success.
Is there some command or log to check what decisions pfSense makes for a given state or packet?
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: Derelict on October 07, 2017, 10:59:12 pm
Except policy routing is not ignored in that scenario.

It looks like it is not policy routing that is your problem, since you see the traffic going out those states.

It looks like your problem is whatever is upstream is refusing to accept the CARP VIP (and MAC address) moving from primary to secondary.

Policy routing only affects outbound traffic. It can't do anything about problems with the reply traffic.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: dayer on October 08, 2017, 12:25:55 pm
Thank you Derelict.
I understand your point of view. However, if...
Quote
It looks like your problem is whatever is upstream is refusing to accept the CARP VIP (and MAC address) moving from primary to secondary.
I can't understand why I only see this behavior when the gateway for LAN traffic to the outside is different from the default gateway. If the gateway for LAN traffic to the outside is the same as the default gateway, everything goes well.

That is:


Rules for LAN:
Code: [Select]
States      Protocol    Source  Port    Destination     Port    Gateway     Queue   Schedule    Description     Actions
1 /427 B    IPv4 *      *       *       LAN net         *       *           none
1 /1.04 MiB IPv4 *      *       *       *               *       GW1         none

Gateways (default gateway = gateway for LAN to outside):
Code: [Select]
Name            Interface   Gateway         Monitor IP
GW1 (default)   WAN1        192.168.1.1     192.168.1.1
GW2             WAN2        192.168.56.1    192.168.56.1

I tried with SSH and it works.

States related to xx.xxx.xxx.xxx in pfsense1 (master):
Code: [Select]
LAN     tcp     192.168.2.1:60626 -> xx.xxx.xxx.xxx:22522                           ESTABLISHED:ESTABLISHED     146 / 130   11 KiB / 20 KiB
WAN1    tcp     192.168.1.20:62445 (192.168.2.1:60626) -> xx.xxx.xxx.xxx:22522      ESTABLISHED:ESTABLISHED     146 / 130   11 KiB / 20 KiB

States related to xx.xxx.xxx.xxx in pfsense2 (backup):
Code: [Select]
LAN     tcp     192.168.2.1:60626 -> xx.xxx.xxx.xxx:22522                           ESTABLISHED:ESTABLISHED     0 / 0       0 B / 0 B
WAN1    tcp     192.168.1.20:62445 (192.168.2.1:60626) -> xx.xxx.xxx.xxx:22522      ESTABLISHED:ESTABLISHED     0 / 0       0 B / 0 B

Enter Persistent CARP Maintenance Mode

States related to xx.xxx.xxx.xxx in pfsense1 (backup):
Code: [Select]
LAN     tcp     192.168.2.1:60626 -> xx.xxx.xxx.xxx:22522                           ESTABLISHED:ESTABLISHED     339 / 321   21 KiB / 48 KiB
WAN1    tcp     192.168.1.20:62445 (192.168.2.1:60626) -> xx.xxx.xxx.xxx:22522      ESTABLISHED:ESTABLISHED     339 / 321   21 KiB / 48 KiB

States related to xx.xxx.xxx.xxx in pfsense2 (master):
Code: [Select]
LAN     tcp     192.168.2.1:60626 -> xx.xxx.xxx.xxx:22522                           ESTABLISHED:ESTABLISHED     111 / 111   6 KiB / 16 KiB
WAN1    tcp     192.168.1.20:62445 (192.168.2.1:60626) -> xx.xxx.xxx.xxx:22522      ESTABLISHED:ESTABLISHED     111 / 111   6 KiB / 16 KiB

But if the default gateway is not the gateway for LAN to outside:
Code: [Select]
Name            Interface   Gateway         Monitor IP
GW1             WAN1        192.168.1.1     192.168.1.1
GW2 (default)   WAN2        192.168.56.1    192.168.56.1

Then everything works until I enter CARP Maintenance Mode (pfsense1 backup, pfsense2 master), and the states related to xx.xxx.xxx.xxx in pfsense2 (master) are:
Code: [Select]
LAN     tcp     192.168.2.1:60632 -> xx.xxx.xxx.xxx:22522                           ESTABLISHED:ESTABLISHED     31 / 5      5 KiB / 1 KiB
WAN1    tcp     192.168.1.20:49862 (192.168.2.1:60632) -> xx.xxx.xxx.xxx:22522      ESTABLISHED:ESTABLISHED     31 / 5      5 KiB / 1 KiB
and the SSH client appears frozen until I leave Persistent CARP Maintenance Mode and pfsense1 recovers the master role.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: Derelict on October 08, 2017, 12:30:03 pm
I don't know. You have something screwed up in your outbound NAT, it looks like. I built this and it works fine. Using multiple versions of pfSense. Countless people are doing exactly the same thing.

I suggest you reinstall and start over as simply as possible. Adding nothing but what is necessary to test this concept.

You are chasing a red herring with the "not default gateway" thing. It doesn't exist. It is something else you have done.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: ZsZs on November 10, 2017, 09:47:39 am
Hi Dayer and Derelict,

Apparently I am in the same situation as Dayer. The network layout is the same.
I am using 2.3.4 and I made a clean install as follows (apologies for the detailed list, but there might be an obvious mistake or a missing part):
- the two VMs running on the same ESXi 6.0U3 host (for testing purpose)
- set up WAN1 and WAN2 in different subnet:
  - WAN1 public /26 (default GW)
  - WAN2 internal /24 (behind a cable modem)
- set up WAN1 and WAN2 with following monitoring IP addresses:
  - WAN1: 8.8.8.8
  - WAN2: 208.67.220.220
- add two DNS servers to each WAN
- Configure DNS resolver to forwarder mode
- install Open-vm-tools package (no other packages have been installed)
- Set up HA for syncing state and configs
- Set up CARP IPs (WAN1-VIP, WAN2-VIP, LAN-VIP) with appropriate netmask
- change Outbound NAT to Manual
  - remove auto-created SYNC interface related outbound NAT rules
  - change NAT address to WAN1-VIP on rules with interface WAN1
  - change NAT address to WAN2-VIP on rules with interface WAN2
- create WAN1first gateway group with WAN1GW Tier1, WAN2GW: Tier2
- create WAN2first gateway group with WAN1GW Tier2, WAN2GW: Tier1
- create FW rule with Policy routing for ssh traffic in LAN:
Code: [Select]
Protocol    Src Prt Dst Prt Gateway     Queue
IPv4 TCP    *   *   *   22  WAN2first   none

I've tried following policy routing scenarios by simply:
- changing GW in the above rule
- disabling the above rule
- toggling default GW
Code: [Select]
defGW   policy route   SSH session after failover
WAN1    disabled       OK
WAN1    GW:WAN1GW      OK
WAN1    GW:WAN2GW      Freezes
WAN1    GW:WAN1first   OK
WAN1    GW:WAN2first   Freezes

WAN2    disabled       OK
WAN2    GW:WAN1GW      Freezes
WAN2    GW:WAN2GW      OK
WAN2    GW:WAN1first   Freezes
WAN2    GW:WAN2first   OK

I had the same issue: when the policy routing rule points to a gateway (or gateway group) other than the default, the open session freezes after the HA failover to the secondary node.
I can open a new ssh session via the new master, but when I move the VIP back to the primary node, that new session freezes and the previously opened session starts responding again.
I also saw in tcpdump that when the ssh session freezes, the traffic leaves the firewall on the wrong WAN interface (the default one) with the other WAN interface's source IP address.
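
The result table above can be checked against the hypothesis "the session freezes exactly when the policy-route gateway differs from the default gateway". A small Python sketch, assuming that a gateway group behaves like its Tier 1 member while both WANs are up (that mapping is an assumption, not something verified in the tests):

```python
# Rows copied from the result table:
# (default gateway, policy-route gateway or None if the rule is disabled, session OK?)
RESULTS = [
    ("WAN1", None, True),
    ("WAN1", "WAN1GW", True),
    ("WAN1", "WAN2GW", False),
    ("WAN1", "WAN1first", True),
    ("WAN1", "WAN2first", False),
    ("WAN2", None, True),
    ("WAN2", "WAN1GW", False),
    ("WAN2", "WAN2GW", True),
    ("WAN2", "WAN1first", False),
    ("WAN2", "WAN2first", True),
]

# Assumption: with both WANs up, a gateway group uses its Tier 1 member.
EFFECTIVE_WAN = {
    "WAN1GW": "WAN1",
    "WAN2GW": "WAN2",
    "WAN1first": "WAN1",
    "WAN2first": "WAN2",
}


def predicted_ok(default_gw, policy_gw):
    """Hypothesis: the session survives only if it already uses the default WAN."""
    if policy_gw is None:
        return True
    return EFFECTIVE_WAN[policy_gw] == default_gw


mismatches = [row for row in RESULTS if predicted_ok(row[0], row[1]) != row[2]]
print("rows that contradict the hypothesis:", mismatches)
```

Every row of the table agrees with the hypothesis, which fits the description above but does not by itself say where the cause lies.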

I appreciate any hints you might have.

Regards,
Zsolt


edit: typos, some clarification added

Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: dayer on November 10, 2017, 07:19:19 pm
Hi ZsZs,

I've read your description carefully and I think it could be the same problem, although I'll wait for Derelict's point of view.
I know this part is the key:
Quote
I also saw in tcpdump, that when the ssh session freezes, the traffic leaves the firewall on wrong WAN interface with the other WAN interface's source IP address.
However, according to the last Derelict suggestion, I must repeat the test with the simplest scenario.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: ZsZs on November 13, 2017, 08:07:16 am
Hi again,

I've just spotted that 2.3.5 is out, so I've done a fresh install again and tried to simplify the setup as much as possible to reproduce this issue.
I omitted a few irrelevant steps (DNS config, setting up GW groups, etc.) from the procedure described in my previous post, which resulted in this:
- the two VMs running on the same ESXi 6.0U3 host (for testing purpose)
- set up WAN1 and WAN2 in different subnet:
  - WAN1 public /26 (default GW)
  - WAN2 internal /24 (behind a cable modem)
- set up WAN1 and WAN2 with following monitoring IP addresses:
  - WAN1: 8.8.8.8
  - WAN2: 208.67.220.220
- Set up HA for syncing state and configs (with relevant FW rules)
- Set up CARP IPs (WAN1-VIP, WAN2-VIP, LAN-VIP) with appropriate netmask
- change Outbound NAT Mode to Manual
  - remove auto-created SYNC interface related rules
  - change NAT address to WAN1-VIP on rules with interface WAN1
  - change NAT address to WAN2-VIP on rules with interface WAN2
  - the actual outbound NAT rules are (description removed)
Code: [Select]
Intf  Source           SPrt Dst DPrt NAT Address    NATPrt  Static Port
WAN1  127.0.0.0/8      *    *   500  WAN1 address   *       KeepSrcStatic
WAN1  127.0.0.0/8      *    *   *    WAN1 address   *       RandomizeSrcPort
WAN2  127.0.0.0/8      *    *   500  WAN2 address   *       KeepSrcStatic
WAN2  127.0.0.0/8      *    *   *    WAN2 address   *       RandomizeSrcPort
WAN1  192.168.25.0/24  *    *   500  WAN1 address   *       KeepSrcStatic
WAN1  192.168.25.0/24  *    *   *    213.XX.YY.8    *       RandomizeSrcPort
WAN2  192.168.25.0/24  *    *   500  WAN2 address   *       KeepSrcStatic
WAN2  192.168.25.0/24  *    *   *    192.168.0.10   *       RandomizeSrcPort
- create FW rule in LAN with routing ssh traffic via WAN2:
Code: [Select]
Protocol    Src Prt Dst Prt Gateway     Queue
IPv4 TCP    *   *   *   22  WAN2        none

The result is the same as described in my previous post.
Code: [Select]
defGW   policy route   SSH session after fail-over
WAN1    disabled       OK
WAN1    GW:WAN1GW      OK
WAN1    GW:WAN2GW      Freezes

WAN2    disabled       OK
WAN2    GW:WAN1GW      Freezes
WAN2    GW:WAN2GW      OK

I've tested with opening an ssh session to an external host.
In case the FW rule directs the outgoing LAN ssh traffic via a gateway other than the default gateway, then after the HA fails over to the secondary node the open ssh session freezes.
I can open a new ssh session via the new master (secondary node), but moving the VIP back to the primary node this newly opened ssh session freezes and the previously opened session starts responding again.
I still see in tcpdump, that when the ssh session freezes, the traffic leaves the firewall on default GW's WAN interface with the other WAN interface's VIP address.

We might be missing something obvious, but I haven't found many related posts, tutorials or howtos that involve HA, multi-WAN and policy-based routing together.

edit: remove duplicated part
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: Derelict on November 13, 2017, 10:13:23 am
Quote
WAN2  192.168.25.0/24  *    *   *    192.168.0.10   *       RandomizeSrcPort
What is in front of that? If stateful, it will need a new state too.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: ZsZs on November 13, 2017, 02:33:19 pm
Hi Derelict,
Thanks for your reply, but I am afraid I do not understand your question.
192.168.25.0/24 is my LAN subnet.
I've opened the ssh session from this subnet to an external host.
The quoted NAT rule is supposed to do the outbound NAT on WAN2 for traffic arriving from my LAN.

I'm attaching the actual config.

HTH,
Zsolt
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: Derelict on November 13, 2017, 04:03:03 pm
Outbound NAT has nothing to do with routing traffic.

It determines what NAT happens when traffic is already routed that way.
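
As a rough illustration of that ordering (the egress interface is decided first, then the outbound NAT rules for that interface are consulted), here is a simplified Python model. It uses the LAN rows of the manual outbound NAT table quoted earlier in the thread and assumes first-match-wins evaluation; the function name is made up and this is not how pf is implemented internally.

```python
import ipaddress

# LAN rows of the manual outbound NAT table quoted earlier in the thread:
# (interface, source network, destination port or None for any, NAT address)
NAT_RULES = [
    ("WAN1", "192.168.25.0/24", 500, "WAN1 address"),
    ("WAN1", "192.168.25.0/24", None, "213.XX.YY.8"),
    ("WAN2", "192.168.25.0/24", 500, "WAN2 address"),
    ("WAN2", "192.168.25.0/24", None, "192.168.0.10"),
]


def nat_address(egress_if, src_ip, dst_port, rules=NAT_RULES):
    """Return the NAT address of the first rule matching on the egress interface.

    The egress interface is an input here: routing (default route or
    policy routing) has already chosen it before NAT is looked up.
    """
    src = ipaddress.ip_address(src_ip)
    for iface, network, port, nat in rules:
        if iface != egress_if:
            continue
        if src not in ipaddress.ip_network(network):
            continue
        if port is not None and port != dst_port:
            continue
        return nat
    return None


# An ssh connection from the LAN host, depending on where routing sent it:
print(nat_address("WAN2", "192.168.25.199", 22))
print(nat_address("WAN1", "192.168.25.199", 22))
```

In this model the NAT address follows from the interface, never the other way round, so a packet seen on one WAN with the other WAN's VIP cannot come from a fresh NAT lookup on that interface.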

Every time I test this it works fine. Not sure what you guys are doing wrong.

I was probably misreading your problem description.

I have already shown how to see the states, the sync of the same, and the fact that traffic goes over the synced state after failover.

Anything that doesn't do that properly is likely an issue at layer 2 having to do with the CARP MAC address changing switch ports.

This stuff works and works well.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: ZsZs on November 14, 2017, 05:43:54 am
Ok, let me try to describe it with the actual states after each test step.

I have a fresh CARP-HA setup with Dual-WAN connection.
WAN1-CARP: 213.ss.tt.8
WAN2-CARP: 192.168.0.10
WANs are connected to different ISPs.
Outbound NAT on both WAN interfaces: NAT with CARP IP instead of interface ip.
External host with sshd: 212.xx.yy.193

Test case 1
- default GW: WAN1
- no policy routing is in place

- initial CARP master: primary node

Step 1
Open an ssh connection from a PC on LAN to an external host.
states (filter: external host's ip, interfaces: all)
primary
Code: [Select]
Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   16 / 20   4 KiB / 4 KiB
WAN1 tcp  213.ss.tt.8:55665 (192.168.25.199:45762) -> 212.xx.yy.193:22    ESTABLISHED:ESTABLISHED   16 / 20   4 KiB / 4 KiB
secondary
Code: [Select]
Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   0 / 0     0 B / 0 B
WAN1 tcp  213.ss.tt.8:55665 (192.168.25.199:45762) -> 212.xx.yy.193:22    ESTABLISHED:ESTABLISHED   0 / 0     0 B / 0 B

Step 2
Enter persistent CARP maintenance mode (new master: secondary node):
states (filter: external host's ip, interfaces: all) packet counter increasing on secondary
primary
Code: [Select]
Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   31 / 32   5 KiB / 5 KiB
WAN1 tcp  213.ss.tt.8:55665 (192.168.25.199:45762) -> 212.xx.yy.193:22    ESTABLISHED:ESTABLISHED   31 / 32   5 KiB / 5 KiB
secondary
Code: [Select]
Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   90 / 65   6 KiB / 8 KiB
WAN1 tcp  213.ss.tt.8:55665 (192.168.25.199:45762) -> 212.xx.yy.193:22    ESTABLISHED:ESTABLISHED   90 / 65   6 KiB / 8 KiB
Ssh connection remains responsive.

Step 3
Exit persistent CARP maintenance mode (new master: primary node):
states (filter: external host's ip, interfaces: all) packet counter increasing on primary
primary
Code: [Select]
Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   51 / 53   6 KiB / 9 KiB
WAN1 tcp  213.ss.tt.8:55665 (192.168.25.199:45762) -> 212.xx.yy.193:22    ESTABLISHED:ESTABLISHED   51 / 53   6 KiB / 9 KiB
secondary
Code: [Select]
Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   90 / 65   6 KiB / 8 KiB
WAN1 tcp  213.ss.tt.8:55665 (192.168.25.199:45762) -> 212.xx.yy.193:22    ESTABLISHED:ESTABLISHED   90 / 65   6 KiB / 8 KiB
Ssh connection remains responsive. All good so far.
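
The counter comparison used in the steps above can be sketched in a few lines of Python. This is only an illustration: it assumes state rows in exactly the layout pasted above (a "packets out / packets in" pair after the state column), and the names `packet_counters` and `nodes_forwarding` are hypothetical helpers, not part of pfSense.

```python
import re

# First "N / M" pair in a row is the Packets column; the Bytes column
# ("6 KiB / 9 KiB") does not match because of the unit between number and slash.
COUNTERS = re.compile(r"\s(\d+) / (\d+)\s")

def packet_counters(state_row):
    """Return the two packet counters of one state row as integers."""
    match = COUNTERS.search(state_row)
    if match is None:
        raise ValueError("no packet counters found in row")
    return int(match.group(1)), int(match.group(2))

def nodes_forwarding(before, after):
    """Return the nodes whose packet counters grew between two snapshots.

    before/after map a node name to the same state row taken at two points in time.
    """
    grew = []
    for node in sorted(before):
        old = packet_counters(before[node])
        new = packet_counters(after[node])
        if new[0] > old[0] or new[1] > old[1]:
            grew.append(node)
    return grew

# LAN rows from Step 2 and Step 3 of test case 1.
step2 = {
    "primary": "LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED   31 / 32   5 KiB / 5 KiB",
    "secondary": "LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED   90 / 65   6 KiB / 8 KiB",
}
step3 = {
    "primary": "LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED   51 / 53   6 KiB / 9 KiB",
    "secondary": "LAN  tcp  192.168.25.199:45762 -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED   90 / 65   6 KiB / 8 KiB",
}
print(nodes_forwarding(step2, step3))
```

With the Step 2 and Step 3 rows, only the primary's counters grow, which matches the primary being master again after leaving maintenance mode.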


Test case 2
- default GW: WAN1
- initial CARP master: primary node
- Create a LAN FW rule to policy route ssh traffic via WAN2.

Code: [Select]
Protocol    Src Prt Dst Prt Gateway     Queue
IPv4 TCP    *   *   *   22  WAN2GW      none

Step 1
Open an ssh connection from a PC on LAN to an external host.
states (filter: external host's ip, interfaces: all) packet counter increasing on primary
primary
Code: [Select]
Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
LAN  tcp  192.168.25.199:45774 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   19 / 23   4 KiB / 4 KiB
WAN2 tcp  192.168.0.10:42115 (192.168.25.199:45774) -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED   19 / 23   4 KiB / 4 KiB
secondary
Code: [Select]
Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
LAN  tcp  192.168.25.199:45774 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   0 / 0    0 B / 0 B
WAN2 tcp  192.168.0.10:42115 (192.168.25.199:45774) -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED   0 / 0    0 B / 0 B

Step 2
Enter persistent CARP maintenance mode (new master: secondary node)
states (filter: external host's ip, interfaces: all) packet counter increasing on secondary very slowly (on TCP retransmissions only?)
primary
Code: [Select]
Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
LAN  tcp  192.168.25.199:45774 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   67 / 71   7 KiB / 12 KiB
WAN2 tcp  192.168.0.10:42115 (192.168.25.199:45774) -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED   67 / 71   7 KiB / 12 KiB
secondary
Code: [Select]
Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
LAN  tcp  192.168.25.199:45774 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   8 / 8    500 B / 1 KiB
WAN2 tcp  192.168.0.10:42115 (192.168.25.199:45774) -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED   8 / 8    500 B / 1 KiB
Ssh connection freezes.

Ssh traffic arrives on WAN2 correctly:
Code: [Select]
11:27:42.463760 IP 212.xx.yy.193.22 > 192.168.0.10.42115: Flags [P.], seq 8300:8408, ack 3422, win 312, options [nop,nop,TS val 4076858258 ecr 2258791998], length 108
But the reply leaves the firewall on the WRONG interface (WAN1, the default GW's interface) instead of WAN2, and on top of this with the WRONG SRC IP address for that interface (WAN2-CARP instead of WAN1-CARP):
Code: [Select]
11:27:42.464087 IP 192.168.0.10.42115 > 212.xx.yy.193.22: Flags [.], ack 1, win 342, options [nop,nop,TS val 2258792365 ecr 4076858258,nop,nop,sack 1 {4294967189:1}], length 0
I see two problems here:
1. Routing is wrong: the ssh traffic should be handled according to the policy routing.
2. The outbound NAT is also wrong: since the traffic leaves on WAN1 (for whatever reason), WAN1's outbound NAT rule should be applied, as you said: "It determines what NAT happens when traffic is already routed that way."
If the routing were ok, that would solve the problem, but the wrong outbound NAT is also an issue and might help to determine where the wrong routing (ignoring the policy routing) originates.
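
One way to read the two observations together is the toy model below. It is only a sketch under stated assumptions: that the translated source address is stored in the synced state and existing states are not re-evaluated against the outbound NAT rules, and that the policy-routing target is not available to the new master for that state. The `State` class, its fields, and `egress` are hypothetical names for illustration, not pf's real data structures.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class State:
    nat_source: str            # translated source chosen when the connection was created
    route_to: Optional[str]    # policy-routing interface; None if that information is missing

def egress(state: State, default_interface: str) -> Tuple[str, str]:
    """Return (interface, source address) for a packet that matches an existing state.

    Assumption: the NAT translation comes from the state itself, and the interface
    falls back to the routing table's default when no policy route is attached.
    """
    interface = state.route_to if state.route_to is not None else default_interface
    return interface, state.nat_source

# State as created on the primary (policy route to WAN2 attached).
on_primary = State(nat_source="192.168.0.10:42115", route_to="WAN2")
# Hypothetical view of the same state on the new master, without the policy route.
on_secondary = State(nat_source="192.168.0.10:42115", route_to=None)

print(egress(on_primary, default_interface="WAN1"))
print(egress(on_secondary, default_interface="WAN1"))
```

Under these assumptions the second call yields WAN1 together with the WAN2-CARP source address, which is the combination seen in the capture. Whether this is what actually happens inside pf on this version is not confirmed here.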

Step 3
Exit persistent CARP maintenance mode (new master: primary node):
states (filter: external host's ip, interfaces: all) packet counter increasing on primary
primary
Code: [Select]
Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
LAN  tcp  192.168.25.199:45774 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   75 / 80   7 KiB / 19 KiB
WAN2 tcp  192.168.0.10:42115 (192.168.25.199:45774) -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED   75 / 80   7 KiB / 19 KiB
secondary
Code: [Select]
Intf Prot Src (Orig Src) -> Dst (Orig Dest)                               State                    Packets   Bytes
LAN  tcp  192.168.25.199:45774 -> 212.xx.yy.193:22                        ESTABLISHED:ESTABLISHED   9 / 9    564 B / 1 KiB
WAN2 tcp  192.168.0.10:42115 (192.168.25.199:45774) -> 212.xx.yy.193:22   ESTABLISHED:ESTABLISHED   9 / 9    564 B / 1 KiB
Ssh connection works again.

My conclusions:
It seems that the new master node does not route the failed-over traffic according to the policy routing (defined in the LAN FW rule), but according to the routing table.
I can open a new ssh session via the new master node and it will be routed according to the policy routing (defined in the LAN FW rule); however, it will freeze once this new master becomes backup again.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: Derelict on November 14, 2017, 11:21:23 am
Are you 100% certain all of your interfaces match exactly on both nodes? In Status > Interfaces, the interface name, the wan/lan/optX name, and the physical interface name should match exactly on both nodes.
Title: Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
Post by: ZsZs on November 14, 2017, 01:51:49 pm
Thank you for your reply, I really appreciate it.

I've double/triple checked, and the pfSense/OS interface names are as follows on both nodes:
WAN: vmx0 (WAN1)
LAN: vmx2 (LAN)
OPT1: vmx1 (WAN2)
OPT2: vmx3 (SYNC)
OPT3: vmx4 (DMZ) not used yet

edit: LAN and WAN2 description swapped.
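
The node-by-node comparison requested above can be expressed as a small check. This is a minimal sketch: the two dictionaries are typed in by hand from each node's Status > Interfaces page (nothing is read from pfSense), and `assignment_mismatches` is a hypothetical helper name.

```python
def assignment_mismatches(primary, secondary):
    """Return {pfSense name: (primary value, secondary value)} for every difference.

    Each value is a (physical interface, description) tuple; a name missing on
    one node shows up with None on that side.
    """
    mismatches = {}
    for name in sorted(primary.keys() | secondary.keys()):
        a = primary.get(name)
        b = secondary.get(name)
        if a != b:
            mismatches[name] = (a, b)
    return mismatches

# Assignments as listed in the post above, reported as identical on both nodes.
node1 = {
    "WAN": ("vmx0", "WAN1"),
    "LAN": ("vmx2", "LAN"),
    "OPT1": ("vmx1", "WAN2"),
    "OPT2": ("vmx3", "SYNC"),
    "OPT3": ("vmx4", "DMZ"),
}
node2 = dict(node1)

print(assignment_mismatches(node1, node2))
```

An empty result means the name, physical interface, and description line up on both nodes; any entry in the result points at an assignment to fix before testing failover again.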