
Topic: Multi-WAN, High Availability, policy routing. Failover breaks connections


Offline dayer

  • Newbie
  • *
  • Posts: 20
  • Karma: +0/-0
    • View Profile
Hi. Sorry if this belongs in the CARP section; I think the problem involves both CARP and Multi-WAN.
pfSense 2.3.4-p1

I have this scenario:
Code: [Select]

                            |---(WAN1)----------|
            |---Pfsense1====                    |
            |      |        |---(WAN2)--|       |
            |      |                    |       |
PC --(LAN)--| (Sync)                 GW2     GW1
            |      |                    |       |
            |      |        |---(WAN2)--|       |
            |---Pfsense2====                    |
                            |---(WAN1)----------|


  • The gateway for the PC is the LAN VIP.
  • The default gateway for Pfsense is GW2
  • Traffic from LAN net uses GW1 thanks to policy routing:
    • Action: pass
    • Interface: LAN
    • Address Family: IPv4
    • Protocol: Any
    • Source: LAN net
    • Destination: ! LAN net
    • Gateway: GW1
  • NAT, Outbound settings:
    • On WAN1, from LAN net to any: WAN1 VIP address
    • On WAN2, from LAN net to any: WAN2 VIP address

With this setup, all traffic from the PC to the outside correctly goes through GW1. However, if I ping an Internet address from the PC while Pfsense1 is master and then temporarily disable CARP on Pfsense1, Pfsense2 becomes the new master and the ping breaks. The same happens with a TCP connection such as SSH.
Monitoring the traffic with tcpdump, I found that Pfsense2 tries to send this traffic through GW2 instead of GW1, and even uses the WAN1 VIP address as the source.
If I restart the ping, or close and reopen the SSH connection, the new traffic goes through GW1 as expected. But I can't work out why the existing traffic is forwarded via the system default gateway (GW2) instead of following the policy routing rule (GW1 only).
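
The mismatch seen in tcpdump can be written down as a simple check: a packet leaving a WAN should carry that WAN's own VIP as its source. The following is only a minimal Python sketch of that check. The addresses are placeholders from the documentation ranges (the real VIPs are not given in this post), and `check_egress` is a hypothetical helper name, not part of pfSense.

```python
# Outbound NAT as configured above: traffic leaving each WAN is translated
# to that WAN's CARP VIP. The addresses are placeholders, not the real ones.
NAT_VIP = {
    "WAN1": "192.0.2.10",
    "WAN2": "198.51.100.10",
}


def check_egress(interface: str, source_ip: str) -> str:
    """Classify one packet observed leaving the firewall on `interface`."""
    if source_ip == NAT_VIP[interface]:
        return "ok"
    # Check whether the source address is the VIP of a different WAN.
    for other, vip in NAT_VIP.items():
        if vip == source_ip:
            return f"mismatch: {other} VIP seen on {interface}"
    return "mismatch: untranslated or unknown source"


# Before failover (Pfsense1 master): LAN traffic leaves via WAN1 with the WAN1 VIP.
print(check_egress("WAN1", NAT_VIP["WAN1"]))
# After failover (Pfsense2 master): the same flow leaves via WAN2 with the WAN1 VIP.
print(check_egress("WAN2", NAT_VIP["WAN1"]))
```

The second call models the broken case: the source address still comes from the synced NAT state, while the egress interface no longer matches it.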

Thanks
« Last Edit: September 22, 2017, 08:53:06 pm by dayer »

Offline dayer

Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
« Reply #1 on: September 18, 2017, 08:13:58 am »
I've also tested:
  • with a gateway configured but no system default gateway (so that only policy routing is used)
Code: [Select]
/root: netstat -r
Routing tables

Internet:
Destination        Gateway            Flags      Netif Expire
8.8.4.4            192.168.1.1        UGHS   lagg0_vl
[...]
  • adding a floating rule at the top with:
    • action: pass
    • quick: checked
    • interface: select all
    • direction: out
    • address family: IPv4
    • Protocol: Any
    • Source: This Firewall (self)
    • Destination: any
    • Gateway: GW1 - 192.168.1.1
With this, a ping from the firewall itself to the outside finds no route:
Code: [Select]
/root: ping -S 127.0.0.1 8.8.8.8
PING 8.8.8.8 (8.8.8.8) from 127.0.0.1: 56 data bytes
ping: sendto: No route to host

However, a ping sourced from the WAN1 IP works:
Code: [Select]
/root: ping -S 192.168.1.111 8.8.8.8
PING 8.8.8.8 (8.8.8.8) from 192.168.1.111: 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=57 time=3.357 ms

I think this could be related to some thread and to bug #5476  :(
« Last Edit: September 22, 2017, 08:27:51 pm by dayer »

Offline dayer

Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
« Reply #2 on: September 22, 2017, 08:25:42 pm »
I've reproduced a simplified version of the situation with pfSense 2.4.0-RC, using three virtual machines in VirtualBox with the host computer doing NAT (see the attached file).
The problem is the same, however.

Settings:
  • Pfsense1 as master. Pfsense2 as backup.
  • WAN1 as default gateway (GW1), in the system routing table
  • WAN2 as gateway (GW2) for traffic from the LAN to the outside (with policy routing)
  • Outbound NAT from LAN is set on WAN1 (HA IP WAN1) and on WAN2 (HA IP WAN2).

Reproduce:
  • I start a continuous ping from the PC to 8.8.8.8. The traffic flows through GW2, as expected.
  • Disable CARP on Pfsense1. Pfsense2 is now the master unit.
  • The states for this ping are also present on Pfsense2.
  • The ping begins to fail. There's no response.
  • With tcpdump on Pfsense2 I see:
    • The packets from the PC arrive at the LAN interface correctly.
    • The packets try to leave the firewall through WAN1 (why don't they continue through WAN2?), using the WAN2 HA IP as source (because the NAT information is kept in the states)
  • If:
    • I stop the ping and relaunch it, the pings go through WAN2 fine.
    • I re-enable CARP on Pfsense1 and it becomes the master unit again, the ping packets go through WAN2 again.

From my point of view, when a pfSense unit becomes the new master while traffic is flowing, it routes that traffic only according to the routing table (or static routes) and ignores the policy routing.

Is this the normal behavior in pfSense? Do I need any special rules? Could it be a bug?

PS: I've fixed the thread subject.
PS2: I've also posted the first scenario on reddit, and another person there sees the same behaviour.
« Last Edit: September 22, 2017, 08:53:48 pm by dayer »

Offline dayer

Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
« Reply #3 on: September 27, 2017, 07:10:27 pm »
I'm attaching some information that I'm also using to explain the problem in a mailing list thread:
  • Configuration files from a scenario with pfSense 2.4.0-RC on two virtual machines
  • Screenshots of an example

Offline luckman212

  • Hero Member
  • *****
  • Posts: 725
  • Karma: +58/-0
    • View Profile
    • @luckman212 - github
Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
« Reply #4 on: September 29, 2017, 10:25:43 pm »
Interesting problem. Sorry I don't have a solution right now, but I'll be following this thread.

If, in your "Reproduce" section above, instead of stopping the ping and restarting it in step#6 - you go to Diag>States>Reset States and kill all states, does the ping start succeeding again?

Offline dayer

Thank you for your interest :)

I've followed your suggestion for step #6:
  • If I kill all states, the ping starts succeeding again :)
  • If I kill all states related to the destination IP, the ping starts succeeding again :)
  • If I kill the states related to the destination IP on WAN2, the ping keeps failing :(
  • If I kill the states related to the destination IP on LAN, the ping starts succeeding again :)
In the successful cases, if Pfsense1 recovers the master role, the ping keeps succeeding.
I think the problem is related to the routing information kept in the established states on the LAN interface.
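
The reasoning behind that conclusion can be sketched in a few lines of Python. This is only an illustration of the deduction, assuming the flow has exactly two states (one on LAN, one on WAN2); `decisive_interfaces` is a hypothetical helper name.

```python
# Each experiment: (interfaces whose states for the destination were killed,
#                   whether the ping recovered afterwards)
experiments = [
    ({"LAN", "WAN2"}, True),   # kill all states for the destination IP
    ({"WAN2"}, False),         # kill only the WAN2 state
    ({"LAN"}, True),           # kill only the LAN state
]


def decisive_interfaces(results):
    """Interfaces killed in every successful test and in no failed one."""
    successes = [killed for killed, recovered in results if recovered]
    failures = [killed for killed, recovered in results if not recovered]
    if not successes:
        return set()
    candidates = set.intersection(*successes)
    for killed in failures:
        candidates -= killed
    return candidates


print(decisive_interfaces(experiments))
```

With the results listed above, only the LAN state remains as a candidate, which matches the observation that removing the WAN2 state alone does not help.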

Could it be considered a bug?

Offline luckman212

Ok, at least you're a little closer to identifying the culprit. Unfortunately I don't have much experience with CARP, so I will have to let someone else respond as to whether this is a config issue or a bug, and if it's a bug, whether the bug lies in pfSense or in FreeBSD itself. If it's the latter, we'll have to file it upstream.

Offline dayer

One more thing.
I've repeated the last tests with SSH instead of ping, but I haven't managed to recover the SSH session.

Offline Derelict

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 9088
  • Karma: +1037/-306
    • View Profile
Quote
Could it be considered a bug?

Probably a configuration issue or an issue in your virtual environment. I have done countless tests like the ones you are doing, up to and including failing over during live video streams, etc., and state sync works great.

I just pinged through from behind my 2.4.0-RC HA Multi-WAN VM pair last night when I updated it to 2.4.1-DEV. Worked great failing over and back.
Las Vegas, Nevada, USA
Use this diagram to describe your issue.
The pfSense Book is now available for just $24.70!
Do Not PM For Help! NO_WAN_EGRESSTM

Offline dayer

Thank you very much for your tests.

Regarding the failover during live video streams: were those using the default gateway, or a secondary gateway selected by a firewall rule or a failover gateway group? The difference matters because I've only seen the problem when traffic uses a non-default gateway while the master unit changes.

I've done several tests, on virtual and physical machines, with and without LACP, but the behavior has always been the same  :-\
I attached my config in reply #3 a few days ago to clear up any doubt about the settings.

Do you know of a more detailed guide for this purpose than Multi-WAN + Configuring pfSense Hardware Redundancy (CARP) from PFSenseDocs or High Availability » Multi-WAN with HA from The pfSense Book?
Could you share your test configuration?

Offline Derelict

Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
« Reply #10 on: October 01, 2017, 05:42:23 pm »
Policy routing all LAN pings out WAN2. WAN is the default gateway. Pinging 8.8.8.8:

States on Primary/MASTER:


LAN    icmp    172.25.236.227:29353 -> 8.8.8.8:29353    0:0    139 / 139    11 KiB / 11 KiB    
WAN2    icmp    172.25.227.17:38069 (172.25.236.227:29353) -> 8.8.8.8:38069    0:0    139 / 139    11 KiB / 11 KiB


States on Secondary/BACKUP:


LAN    icmp    172.25.236.227:29353 -> 8.8.8.8:29353    0:0    0 / 0    0 B / 0 B    
WAN2    icmp    172.25.227.17:38069 (172.25.236.227:29353) -> 8.8.8.8:38069    0:0    0 / 0    0 B / 0 B


Persistent CARP maintenance mode on Primary:

States on secondary start seeing the traffic.


LAN    icmp    172.25.236.227:29353 -> 8.8.8.8:29353    0:0    23 / 23    2 KiB / 2 KiB    
WAN2    icmp    172.25.227.17:38069 (172.25.236.227:29353) -> 8.8.8.8:38069    0:0    23 / 23    2 KiB / 2 KiB


Client dropped three pings then continued. That's about right for a failover event.

States still exist on primary but are not seeing any traffic:


LAN    icmp    172.25.236.227:29353 -> 8.8.8.8:29353    0:0    239 / 239    20 KiB / 20 KiB    
WAN2    icmp    172.25.227.17:38069 (172.25.236.227:29353) -> 8.8.8.8:38069    0:0    239 / 239    20 KiB / 20 KiB


Leave Persistent CARP maintenance mode on Primary. States on primary are seeing the traffic again (239 now 262):


LAN    icmp    172.25.236.227:29353 -> 8.8.8.8:29353    0:0    262 / 262    21 KiB / 21 KiB    
WAN2    icmp    172.25.227.17:38069 (172.25.236.227:29353) -> 8.8.8.8:38069    0:0    262 / 262    21 KiB / 21 KiB


Same results using temporary enabling and disabling of CARP on the primary.
« Last Edit: October 01, 2017, 05:55:39 pm by Derelict »

Offline Derelict

Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
« Reply #11 on: October 01, 2017, 06:36:59 pm »
And FWIW, here's a TCP session. Policy routed all outbound SSH out WAN2:

States on Primary:


LAN    tcp    172.25.236.227:46380 -> 192.168.223.6:22    ESTABLISHED:ESTABLISHED    75 / 81    7 KiB / 10 KiB    
WAN2    tcp    172.25.227.17:21325 (172.25.236.227:46380) -> 192.168.223.6:22    ESTABLISHED:ESTABLISHED    75 / 81    7 KiB / 10 KiB


States on Secondary:


LAN    tcp    172.25.236.227:46380 -> 192.168.223.6:22    ESTABLISHED:ESTABLISHED    0 / 0    0 B / 0 B    
WAN2    tcp    172.25.227.17:21325 (172.25.236.227:46380) -> 192.168.223.6:22    ESTABLISHED:ESTABLISHED    0 / 0    0 B / 0 B


I failed back and forth a couple times. At no point did the ssh session drop. Noticed a couple delays in output but TCP did its thing and no data was lost in the session.
« Last Edit: October 01, 2017, 06:43:48 pm by Derelict »

Offline dayer

Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
« Reply #12 on: October 02, 2017, 09:51:18 am »
Thank you Derelict :)

I've been reproducing your tests.


WAN is the default gateway:
Code: [Select]
WAN1 (default)      WAN1 xxx.xxx.xxx.xxx     8.8.8.8     WAN1 Gateway
WAN2                WAN2 192.168.1.1         8.8.4.4     WAN2 Gateway

Policy routing: all LAN traffic to the outside goes through WAN2:
Code: [Select]
States      Protocol    Source  Port    Destination     Port    Gateway     Queue   Schedule    Description
5 /3.80 MiB IPv4*       *       *       internals       *       *           none                Internal traffic to default gateway
0 /254 KiB  IPv4*       *       *       *               *       WAN2        none                The rest to WAN2

NAT for internal traffic to outside:
Code: [Select]
WAN1    internals * * * xxx.xxx.xxx.xxy     * NAT from internal networks
WAN2    internals * * * 192.168.1.100       * NAT from internal networks
internals is an alias with internal networks.

Pinging 208.123.73.69

States on Primary/MASTER:

Code: [Select]
LAN     icmp    172.16.103.2:10618 -> 208.123.73.69:10618                           0:0     69 / 69     6 KiB / 6 KiB
WAN2    icmp    192.168.1.100:57632 (172.16.103.2:10618) -> 208.123.73.69:57632     0:0     69 / 69     6 KiB / 6 KiB

States on Secondary/BACKUP:
Code: [Select]
LAN icmp 172.16.103.2:10618 -> 208.123.73.69:10618                           0:0     0 / 0 0 B / 0 B
WAN2 icmp 192.168.1.100:57632 (172.16.103.2:10618) -> 208.123.73.69:57632     0:0     0 / 0 0 B / 0 B

Persistent CARP maintenance mode on Primary:

States on secondary start seeing the traffic, but something goes wrong: the number of packets matching the state from the destination side is zero.

Code: [Select]
LAN icmp 172.16.103.2:10618 -> 208.123.73.69:10618                         0:0 58 / 0 5 KiB / 0 B
WAN2 icmp 192.168.1.100:57632 (172.16.103.2:10618) -> 208.123.73.69:57632 0:0 58 / 0 5 KiB / 0 B

Client doesn't get ping responses.
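
The same symptom can be picked out of a pasted state table automatically. This is a rough Python sketch, assuming rows in the layout shown in this thread (ending in packet counts followed by byte counts with units); `parse_state` and `one_way` are hypothetical helper names.

```python
def parse_state(line: str) -> dict:
    """Parse one state row as pasted in this thread.

    Assumes the row ends with:
    '<state> <pkts from source> / <pkts from destination> <n> <unit> / <n> <unit>'
    """
    tokens = line.split()
    return {
        "interface": tokens[0],
        "protocol": tokens[1],
        "state": tokens[-9],
        "from_source": int(tokens[-8]),
        "from_destination": int(tokens[-6]),
    }


def one_way(lines):
    """Return the states that only see packets from the source side."""
    states = [parse_state(line) for line in lines if line.strip()]
    return [s for s in states
            if s["from_source"] > 0 and s["from_destination"] == 0]


rows = [
    "LAN icmp 172.16.103.2:10618 -> 208.123.73.69:10618 0:0 58 / 0 5 KiB / 0 B",
    "WAN2 icmp 192.168.1.100:57632 (172.16.103.2:10618) -> 208.123.73.69:57632 0:0 58 / 0 5 KiB / 0 B",
]
for s in one_way(rows):
    print(s["interface"], s["from_source"], s["from_destination"])
```

Both rows from the secondary are flagged here. A TCP state that still sees a few packets from the destination would not be flagged by this strict test, so it is best read as a quick filter for the ICMP case.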

States still exist on primary but are not seeing any traffic (as expected):

Code: [Select]
LAN icmp 172.16.103.2:10618 -> 208.123.73.69:10618                         0:0 368 / 368 30 KiB / 30 KiB
WAN2 icmp 192.168.1.100:57632 (172.16.103.2:10618) -> 208.123.73.69:57632 0:0 368 / 368 30 KiB / 30 KiB

Leave Persistent CARP maintenance mode on Primary. States on primary are seeing the traffic again (368 now 372):

Code: [Select]
LAN icmp 172.16.103.2:10618 -> 208.123.73.69:10618                         0:0 372 / 372 31 KiB / 31 KiB
WAN2 icmp 192.168.1.100:57632 (172.16.103.2:10618) -> 208.123.73.69:57632 0:0 372 / 372 31 KiB / 31 KiB



SSH to aaa.bbb.ccc.ddd (I've replaced the public IP for security reasons):

States on Primary:

Code: [Select]
LAN tcp 172.16.103.2:43290 -> aaa.bbb.ccc.ddd:22                        ESTABLISHED:ESTABLISHED 138 / 126 12 KiB / 20 KiB
WAN2 tcp 192.168.1.100:54741 (172.16.103.2:43290) -> aaa.bbb.ccc.ddd:22  ESTABLISHED:ESTABLISHED 138 / 126 12 KiB / 20 KiB

States on Secondary:
Code: [Select]
LAN tcp 172.16.103.2:43290 -> aaa.bbb.ccc.ddd:22                        ESTABLISHED:ESTABLISHED 0 / 0 0 B / 0 B
WAN2 tcp 192.168.1.100:54741 (172.16.103.2:43290) -> aaa.bbb.ccc.ddd:22  ESTABLISHED:ESTABLISHED 0 / 0 0 B / 0 B

Persistent CARP maintenance mode on Primary:

States on secondary start seeing the traffic, but something appears wrong:
Code: [Select]
LAN tcp 172.16.103.2:43290 -> aaa.bbb.ccc.ddd:22                        ESTABLISHED:ESTABLISHED 13 / 3 2 KiB / 708 B
WAN2 tcp 192.168.1.100:54741 (172.16.103.2:43290) -> aaa.bbb.ccc.ddd:22  ESTABLISHED:ESTABLISHED 13 / 3 2 KiB / 708 B


With tcpdump on WAN1, I see the Secondary firewall routing through WAN1 while using the WAN2 VIP address as the NAT source:

Code: [Select]
14:09:03.973890 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [.], ack 1150910396, win 593, options [nop,nop,TS val 3297622232 ecr 1263162549], length 0
14:09:07.569891 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [.], ack 1, win 593, options [nop,nop,TS val 3297625828 ecr 1263163448,nop,nop,sack 1 {4294967113:1}], length 0
14:09:08.810847 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], seq 0:52, ack 1, win 593, options [nop,nop,TS val 3297627069 ecr 1263163448], length 52
14:09:08.978668 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], seq 52:104, ack 1, win 593, options [nop,nop,TS val 3297627237 ecr 1263163448], length 52
14:09:09.035602 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], seq 52:104, ack 1, win 593, options [nop,nop,TS val 3297627294 ecr 1263163448], length 52
14:09:09.122214 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], seq 104:156, ack 1, win 593, options [nop,nop,TS val 3297627380 ecr 1263163448], length 52
14:09:09.180582 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], seq 0:156, ack 1, win 593, options [nop,nop,TS val 3297627439 ecr 1263163448], length 156
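
One way to read this capture is to look for repeated sequence ranges, which suggest the client is resending data that was never acknowledged. The Python sketch below is an interpretation aid only; the lines are copied from the capture with the TCP options trimmed, and `analyse` is a hypothetical helper name.

```python
import re

# Data-carrying segments from the capture above (TCP options trimmed).
CAPTURE = """\
14:09:08.810847 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], seq 0:52, ack 1, win 593, length 52
14:09:08.978668 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], seq 52:104, ack 1, win 593, length 52
14:09:09.035602 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], seq 52:104, ack 1, win 593, length 52
14:09:09.122214 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], seq 104:156, ack 1, win 593, length 52
14:09:09.180582 IP 192.168.1.100.54741 > aaa.bbb.ccc.ddd.22: Flags [P.], seq 0:156, ack 1, win 593, length 156
"""

SEQ_RE = re.compile(r"seq (\d+):(\d+), ack (\d+)")


def analyse(capture: str):
    """Return (retransmitted seq ranges, set of ack values seen)."""
    highest_end = 0
    retransmitted = []
    acks = set()
    for line in capture.splitlines():
        match = SEQ_RE.search(line)
        if not match:
            continue  # lines without a seq range are skipped
        start, end, ack = (int(g) for g in match.groups())
        acks.add(ack)
        # A segment starting below the highest byte already sent is a resend.
        if start < highest_end:
            retransmitted.append((start, end))
        highest_end = max(highest_end, end)
    return retransmitted, acks


retransmitted, acks = analyse(CAPTURE)
print(retransmitted, acks)
```

In these five segments the range 52:104 is sent twice and 0:156 resends everything, while the ack value does not advance. That fits the picture of replies not getting back to the client, although the capture alone does not show where they are lost.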

States still exist on primary but are not seeing any traffic (as expected):

Code: [Select]
LAN tcp 172.16.103.2:43290 -> aaa.bbb.ccc.ddd:22                        ESTABLISHED:ESTABLISHED 232 / 220 17 KiB / 34 KiB
WAN2 tcp 192.168.1.100:54741 (172.16.103.2:43290) -> aaa.bbb.ccc.ddd:22  ESTABLISHED:ESTABLISHED 232 / 220 17 KiB / 34 KiB

Leave Persistent CARP maintenance mode on Primary. States on primary are seeing the traffic again (232 now 294) and the SSH terminal replies again:
Code: [Select]
LAN tcp 172.16.103.2:43290 -> aaa.bbb.ccc.ddd:22                        ESTABLISHED:ESTABLISHED 294 / 291 22 KiB / 80 KiB
WAN2 tcp 192.168.1.100:54741 (172.16.103.2:43290) -> aaa.bbb.ccc.ddd:22  ESTABLISHED:ESTABLISHED 294 / 291 22 KiB / 80 KiB

I'm going to check our differences.
« Last Edit: October 02, 2017, 09:57:49 am by dayer »

Offline dayer

Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
« Reply #13 on: October 07, 2017, 06:39:07 pm »
@Derelict, which hypervisor did you test with? VirtualBox? VMware? Could you share your two XML config files?

I've tested this with VirtualBox (before and after a factory reset, to redo and check the settings) and also with two identical physical machines. No success.
We're considering using pfSense, and being sure this functionality works is important to us.

Offline Derelict

Re: Multi-WAN, High Availability, policy routing. Failover breaks connections
« Reply #14 on: October 07, 2017, 08:44:06 pm »
XenServer. No, I do not have the configurations from those tests any more.

The real key is how you are testing. Note that TRex was generating about 350K states there.

The hypervisor used will be of no consequence to what gets policy routed where.