
Author Topic: 2.4.1: pfSense lockup with CARP on bridge interface  (Read 432 times)


Offline bkraptor

  • Jr. Member
  • **
  • Posts: 84
  • Karma: +0/-0
2.4.1: pfSense lockup with CARP on bridge interface
« on: October 29, 2017, 08:07:51 am »
After successfully upgrading one SG-4860 and one APU2C4 from 2.3.4-p1 to 2.4.1, both showed the same behavior: after a random number of minutes the routers would become unreachable over the network. Both systems did this after every one of many reboots, so I don't believe the problem is related to the hardware platform.

Logging in over console showed the routers had not crashed and I could run various commands. Weird behaviors observed:

1. Pinging a known reachable IP in the same subnet would time out (i.e. no output). When I tried killing the ping command with CTRL+C I would only get ^C printed on screen, but the program did not end. Since this was run over console and I had no IP reachability, there was no other way to attempt to kill the ping command.

2. Plugging in an external keyboard and pressing CTRL+ALT+DEL resulted in the shutdown sequence starting, but the system would hang and never actually reboot:
Code:
pfSense is now shutting down ...

ath0_wlan3: ieee80211_new_state_locked: pending RUN -> SCAN transition lost
ath0_wlan2: ieee80211_new_state_locked: pending RUN -> SCAN transition lost
ath0_wlan2: ieee80211_new_state_locked: pending RUN -> SCAN transition lost
ath0: ath_txrx_stop_locked: didn't finish after 100 iterations

3. After a fresh boot the system would appear healthy for a few minutes, then randomly drop from the network.

Running top over console showed this weird output:
Code:
last pid: 76962;  load averages:  0.51,  0.32,  0.21    up 0+00:17:32  23:24:22
67 processes:  2 running, 62 sleeping, 3 lock <------------------------------------------ ???
CPU:  0.0% user,  1.6% nice,  3.3% system,  0.0% interrupt, 95.1% idle
Mem: 88M Active, 46M Inact, 339M Wired, 26M Buf, 7415M Free
Swap: 16G Total, 16G Free

Running ps showed several processes in "waiting for lock" state, including an ifconfig process:
Code:
[2.4.1-RELEASE][admin@pfSense.localdomain]/root: ps waux
USER         PID  %CPU %MEM    VSZ   RSS TT  STAT STARTED     TIME COMMAND
root          11 400.0  0.0      0    64  -  RL   23:07   53:10.84 [idle]
unbound    90940   0.4  0.3  70880 24192  -  Ss   23:07    0:03.17 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root           0   0.0  0.0      0   592  -  DLs  23:07    0:00.01 [kernel]
root           1   0.0  0.0   5024   872  -  ILs  23:07    0:00.02 /sbin/init --
root           2   0.0  0.0      0    16  -  DL   23:07    0:00.00 [crypto]
root           3   0.0  0.0      0    16  -  DL   23:07    0:00.00 [crypto returns]
root           4   0.0  0.0      0    32  -  DL   23:07    0:00.15 [cam]
root           5   0.0  0.0      0    16  -  DL   23:07    0:00.00 [soaiod1]
root           6   0.0  0.0      0    16  -  DL   23:07    0:00.00 [soaiod2]
root           7   0.0  0.0      0    16  -  DL   23:07    0:00.00 [soaiod3]
root           8   0.0  0.0      0    16  -  DL   23:07    0:00.00 [soaiod4]
root           9   0.0  0.0      0    16  -  DL   23:07    0:00.00 [sctp_iterator]
root          10   0.0  0.0      0    16  -  DL   23:07    0:00.00 [audit]
root          12   0.0  0.0      0   720  -  WL   23:07    0:02.70 [intr]
root          13   0.0  0.0      0    64  -  DL   23:07    0:00.00 [ng_queue]
root          14   0.0  0.0      0    48  -  DL   23:07    0:00.08 [geom]
root          15   0.0  0.0      0    80  -  DL   23:07    0:00.06 [usb]
root          16   0.0  0.0      0    16  -  DL   23:07    0:00.47 [pf purge]
root          17   0.0  0.0      0    16  -  DL   23:07    0:00.48 [rand_harvestq]
root          18   0.0  0.0      0    48  -  DL   23:07    0:00.05 [pagedaemon]
root          19   0.0  0.0      0    16  -  DL   23:07    0:00.00 [vmdaemon]
root          20   0.0  0.0      0    16  -  DL   23:07    0:00.00 [pagezero]
root          21   0.0  0.0      0    16  -  DL   23:07    0:00.01 [bufspacedaemon]
root          22   0.0  0.0      0    32  -  DL   23:07    0:00.06 [bufdaemon]
root          23   0.0  0.0      0    16  -  DL   23:07    0:00.01 [vnlru]
root          24   0.0  0.0      0    16  -  DL   23:07    0:00.18 [syncer]
root          58   0.0  0.0      0    16  -  DL   23:07    0:00.02 [md0]
root         297   0.0  0.4 282680 29232  -  Ss   23:07    0:00.09 php-fpm: master process (/usr/local/lib/php-fpm.conf) (php-fpm)
root         368   0.0  0.1  19440  4472  -  INs  23:07    0:00.02 /usr/local/sbin/check_reload_status
root         370   0.0  0.1  19440  4276  -  IN   23:07    0:00.00 check_reload_status: Monitoring daemon of check_reload_status
root         383   0.0  0.1   9556  4968  -  Is   23:07    0:00.02 /sbin/devd -q -f /etc/pfSense-devd.conf
root        5229   0.0  0.1  35660  6900  -  Is   23:07    0:00.00 nginx: master process /usr/local/sbin/nginx -c /var/etc/nginx-we
root        5512   0.0  0.1  35660  7372  -  I    23:07    0:00.00 nginx: worker process (nginx)
root        5550   0.0  0.1  37708  8056  -  L    23:07    0:00.34 nginx: worker process (nginx) <------------------------------------------- ???
root        6080   0.0  0.0  12496  2352  -  Is   23:07    0:00.02 /usr/sbin/cron -s
root        6445   0.0  0.1  24612 12448  -  Is   23:08    0:00.19 /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf -p /var/run/ntpd.p
root        7559   0.0  0.0  10580  2304  -  Is   23:08    0:00.00 /usr/local/sbin/sshlockout_pf 15
messagebus 10712   0.0  0.0  21528  3176  -  Is   23:08    0:00.00 /usr/local/bin/dbus-daemon --system
dhcpd      11863   0.0  0.1  16652  7896  -  Is   23:08    0:00.07 /usr/local/sbin/dhcpd -user dhcpd -group _dhcp -chroot /var/dhcp
root       12585   0.0  0.1  53492  6928  -  Is   23:07    0:00.00 /usr/sbin/sshd
root       12685   0.0  0.0  12628  2216  -  Is   23:07    0:00.01 /usr/local/sbin/sshlockout_pf 15
root       14777   0.0  0.0   8224  2004  -  Is   23:08    0:00.00 /usr/local/bin/minicron 240 /var/run/ping_hosts.pid /usr/local/b
root       14907   0.0  0.0  10528  2296  -  Is   23:07    0:00.00 dhclient: igb4 [priv] (dhclient)
root       14937   0.0  0.0   8224  2020  -  I    23:08    0:00.00 minicron: helper /usr/local/bin/ping_hosts.sh  (minicron)
root       15517   0.0  0.0   8224  2004  -  Is   23:08    0:00.00 /usr/local/bin/minicron 3600 /var/run/expire_accounts.pid /usr/l
root       15694   0.0  0.0   8224  2004  -  Is   23:08    0:00.00 /usr/local/bin/minicron 86400 /var/run/update_alias_url_data.pid
root       16272   0.0  0.0   8224  2020  -  I    23:08    0:00.00 minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.expireaccoun
root       16523   0.0  0.0   8224  2020  -  I    23:08    0:00.00 minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.update_alias
root       18669   0.0  0.0  15076  2504  -  Is   23:08    0:00.42 /usr/local/bin/dpinger -S -r 0 -i GW_1_IPv4 -B 10.123
root       19021   0.0  0.0  13028  2472  -  Is   23:08    0:00.19 /usr/local/bin/dpinger -S -r 0 -i GW_2_IPv4 -B 10.12
root       19307   0.0  0.0  15076  2516  -  Is   23:08    0:00.35 /usr/local/bin/dpinger -S -r 0 -i GW_2_IPv6 -B fd00:
_dhcp      19506   0.0  0.0  10528  2404  -  Is   23:07    0:00.01 dhclient: igb4 (dhclient)
root       19722   0.0  0.0  13028  2472  -  Is   23:08    0:00.22 /usr/local/bin/dpinger -S -r 0 -i GW_3_IPv4 -B 10.12
root       19875   0.0  0.0  13028  2480  -  Is   23:08    0:00.23 /usr/local/bin/dpinger -S -r 0 -i GW_3_IPv6 -B fd00:
root       20054   0.0  0.0  13028  2472  -  Is   23:08    0:00.20 /usr/local/bin/dpinger -S -r 0 -i GW_WAN_1_IPv4 -B <PUBLIC IP>
root       20330   0.0  0.0  13028  2472  -  Is   23:08    0:00.21 /usr/local/bin/dpinger -S -r 0 -i GW_WAN_1_IPv4_Google_DNS -B 10
messagebus 23664   0.0  0.0  21528  3196  -  Is   23:08    0:00.00 /usr/local/bin/dbus-daemon --system
root       41132   0.0  0.1  20352  5712  -  Ss   23:07    0:00.73 /usr/local/sbin/openvpn --config /var/etc/openvpn/server3.conf
root       43454   0.0  0.0   7548  3516  -  Ss   23:08    0:00.01 /usr/sbin/watchdogd -t 128
root       45654   0.0  0.1  20352  5988  -  Ss   23:07    0:00.02 /usr/local/sbin/openvpn --config /var/etc/openvpn/server2.conf
root       48389   0.0  0.0  12696  2304  -  Ss   23:07    0:00.05 /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
root       48511   0.0  0.1  19648  5580  -  Ls   23:08    0:00.03 /usr/local/sbin/miniupnpd -f /var/etc/miniupnpd.conf -P /var/run <-------- ???
root       48605   0.0  0.1  20352  5984  -  Ss   23:07    0:00.02 /usr/local/sbin/openvpn --config /var/etc/openvpn/server4.conf
root       48758   0.0  0.0  10368  2092  -  Ss   23:08    0:00.28 /usr/sbin/powerd -b hadp -a hadp -n hadp
root       55992   0.0  0.0  13084  2724  -  I    23:12    0:00.01 /bin/sh /usr/local/bin/ping_hosts.sh
root       56169   0.0  0.0  13084  2724  -  I    23:12    0:00.00 /bin/sh /usr/local/bin/ping_hosts.sh
root       56277   0.0  0.0  12788  2676  -  Is   23:08    0:00.01 /usr/local/sbin/filterdns -p /var/run/filterdns.pid -i 300 -c /v
root       56369   0.0  0.0  16988  2856  -  L    23:12    0:00.02 ifconfig <---------------------------------------------------------------- ???
root       56416   0.0  0.0  14728  2460  -  I    23:12    0:00.00 grep carp: BACKUP vhid
root       56745   0.0  0.0  12532  2212  -  I    23:12    0:00.00 wc -l
root       74116   0.0  0.1  20352  5880  -  Ss   23:07    0:00.40 /usr/local/sbin/openvpn --config /var/etc/openvpn/client1.conf
avahi      85807   0.0  0.0  29960  3852  -  I    23:08    0:00.13 avahi-daemon: running [pfSense.local] (avahi-daemon)
root       93089   0.0  0.5 284728 38648  -  I    23:11    0:00.46 php-fpm: pool nginx (php-fpm)
root       93155   0.0  0.0  10484  2532  -  Ss   23:08    0:00.79 /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -P /var/run
root       94590   0.0  0.0   8520  2732  -  Is   23:08    0:00.00 bgpd: parent (bgpd)
_bgpd      94858   0.0  0.0   8520  2716  -  I    23:08    0:00.00 bgpd: route decision engine (bgpd)
_bgpd      95194   0.0  0.0   8520  2792  -  I    23:08    0:00.00 bgpd: session engine (bgpd)
root       99398   0.0  0.0   6172  1928  -  IN   23:20    0:00.00 sleep 60
root        7594   0.0  0.0  13084  2556 u1  I    23:08    0:00.00 /bin/sh /etc/rc.initial
root       10266   0.0  0.0  13392  3624 u1  S    23:08    0:00.17 /bin/tcsh
root       36372   0.0  0.0  13084  2640 u1- IN   23:08    0:00.60 /bin/sh /var/db/rrd/updaterrd.sh
root       83521   0.0  0.0  39432  2848 u1  Is   23:08    0:00.01 login [pam] (login)
root       99594   0.0  0.0  21104  2744 u1  R+   23:20    0:00.03 ps waux
root       82157   0.0  0.0  10388  2132 v0  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv0
root       82332   0.0  0.0  10388  2132 v1  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv1
root       82429   0.0  0.0  10388  2132 v2  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv2
root       82538   0.0  0.0  10388  2132 v3  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv3
root       82728   0.0  0.0  10388  2132 v4  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv4
root       83058   0.0  0.0  10388  2132 v5  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv5
root       83404   0.0  0.0  10388  2132 v6  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv6
root       83409   0.0  0.0  10388  2132 v7  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv7
[2.4.1-RELEASE][admin@pfSense.localdomain]/root:
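
For anyone who wants to pull just the stuck processes out of a listing like the one above, here is a minimal sketch. It assumes the FreeBSD `ps waux` column layout shown (STAT is the 8th field). It relies on ps(1) documenting a leading `L` in STAT as "waiting to acquire a lock"; an `L` in a later position (as in `DL` or `RL`) only means the process has pages locked in core, so it is deliberately not matched. The function name `lock_waiters` is made up for this example and is not part of pfSense.

```sh
#!/bin/sh
# lock_waiters: read `ps waux`-style output on stdin and print PID, STAT and
# command for every process whose STAT starts with "L" (waiting on a lock).
# Hypothetical helper; assumes STAT is field 8 and COMMAND starts at field 11.
lock_waiters() {
    awk 'NR > 1 && $8 ~ /^L/ {
        cmd = $11
        for (i = 12; i <= NF; i++) cmd = cmd " " $i
        print $2, $8, cmd
    }'
}

# Typical use on the console:
#   ps waux | lock_waiters
```

Run against the listing above, this should report the nginx worker, miniupnpd and ifconfig, which matches the "3 lock" count in the top output.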


The main features I use:
  • bridges that contain:
    • VLAN tagged interfaces (no PPPoE)
    • wireless interfaces with multiple (3) virtual SSIDs
    • CARP running on bridge interfaces
  • NAT in various flavors
  • OpenVPN
  • OpenBGPd
  • no explicit shaping/policing/queueing configured
« Last Edit: November 05, 2017, 11:23:36 am by bkraptor »

Offline bkraptor

  • Jr. Member
  • **
  • Posts: 84
  • Karma: +0/-0
Re: 2.3.4-p1 to 2.4.1: system lockup after a few minutes (SG-4860 and APU2C4)
« Reply #1 on: November 02, 2017, 08:37:02 pm »
Want to bump this thread as I had another attempt at updating both boxes. I thought this was related to the pfBlockerNG issue that everyone seems to be having, which should have been fixed with pfSense 2.4.1 and the latest pfBlockerNG (that I had installed, but not activated). What I observed: the APU2C4 box was upgraded, but left in a CARP backup state for 24h. It did not show any signs of locking up for the whole duration. The moment it became CARP master it only took ~5 minutes to get it to lock up. Same thing then happened for the SG-4860.

I believe this issue is different from the pfBlockerNG issue, as the processes in this case get stuck in the L state, compared to the D state for the pfBlockerNG issue.
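
Since the lockup only showed up once a box became CARP master, it helps to log the CARP state alongside any other diagnostics. A minimal sketch, assuming the FreeBSD `ifconfig` line format `carp: MASTER vhid N ...` (the same text pfSense itself greps for in the ps listing in the first post). `carp_count` is a name invented for this example.

```sh
#!/bin/sh
# carp_count STATE: read `ifconfig` output on stdin and print how many CARP
# vhids are in STATE (MASTER, BACKUP or INIT). Hypothetical helper.
carp_count() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -c "carp: $1 vhid" || true
}

# Typical use on the console:
#   ifconfig | carp_count MASTER
```

Note that once the lockup has started, `ifconfig` itself blocks (it is one of the processes stuck in the L state), so this is only useful for recording the state before the hang.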

BlueKobold

  • Guest
Re: 2.3.4-p1 to 2.4.1: system lockup after a few minutes (SG-4860 and APU2C4)
« Reply #2 on: November 04, 2017, 12:50:54 am »
Quote
The main features I use:

    bridges that contain:
        VLAN tagged interfaces (no PPPoE)
        wireless interfaces with multiple (3) virtual SSIDs
        CARP running on bridge interfaces
    NAT in various flavors
    OpenVPN
    OpenBGPd
    no explicit shaping/policing/queueing configured
If I read the forum correctly, version 2.4.0 had some problems with VLAN labels (names that were too long).
Version 2.4.1 has some serious problems when using VLANs on a PPPoE connection.
In the early 2.4.2 builds these problems are gone, but that does not necessarily mean that version is as stable as the others.

Across the whole forum there are many reports of problems updating or upgrading to a 2.4.x version, but many users
who did a fresh, full install and then restored their config.xml file had everything working afterwards. In
your situation I would try installing the 2.4.0 ADI image on the SG-4860 and the CE edition on the APU2C4.
Before that I would also check the firmware on both boxes and update it if needed. Then restore the
config.xml file.



Offline bkraptor

  • Jr. Member
  • **
  • Posts: 84
  • Karma: +0/-0
Re: 2.3.4-p1 to 2.4.1: system lockup after a few minutes (SG-4860 and APU2C4)
« Reply #3 on: November 05, 2017, 11:02:45 am »
Thanks for the tip, but I think this is a basic 2.4 bug. I can easily replicate the issue on a freshly installed pfSense VM by sending traffic via a CARP IP on a bridge interface.

https://redmine.pfsense.org/issues/8056

Offline wwwdrich

  • Newbie
  • *
  • Posts: 4
  • Karma: +2/-0
Re: 2.4.1: pfSense lockup with CARP on bridge interface
« Reply #4 on: November 05, 2017, 09:29:07 pm »
I hate to make a "me too" post, but I'm seeing the same thing on 2.4.1 with a clean install and a config.xml loaded from my old system. The only way I have found to keep the firewall up is to turn all of my CARP VIPs into IP Aliases and turn off my secondary firewall.

When I get the hang, hitting ctrl-t on the console gives me variations on:
Code:
load: 7.22  cmd: ifconfig 88749 [*carp_if] 11.76r 0.00u 0.00s 0% 2704k
so it looks like it is spinning in carp_if.
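
For reference, the bracketed field in the ctrl-t (SIGINFO) line is what the process is currently blocked on. As far as I understand the FreeBSD output, a leading `*` means it is a lock name rather than an ordinary sleep channel. A rough sketch for pulling the command, PID and that field out of a captured console log, assuming the line format shown above (`siginfo_wchan` is a made-up name):

```sh
#!/bin/sh
# siginfo_wchan: read console text on stdin and, for each ctrl-t status line,
# print "command pid wait-field". Lines that do not match are dropped.
# Hypothetical helper; assumes the "load: ... cmd: NAME PID [FIELD] ..." layout.
siginfo_wchan() {
    sed -n 's/^load: .* cmd: \([^ ]*\) \([0-9][0-9]*\) \[\([^]]*\)].*$/\1 \2 \3/p'
}

# Typical use on a saved console capture:
#   siginfo_wchan < console.log
```

Collecting a few of these over time shows whether every sample is blocked on the same name (here carp_if) or whether it varies.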


Offline webwiz

  • Newbie
  • *
  • Posts: 4
  • Karma: +0/-0
Re: 2.4.1: pfSense lockup with CARP on bridge interface
« Reply #5 on: November 10, 2017, 08:37:50 am »
Another me too, as well.

All our pfSense firewalls that are using Bridged interfaces and CARP will freeze as soon as traffic starts passing across the Bridge Interface.

Had to reinstall 2.3 to get firewalls working again.

Offline bkraptor

  • Jr. Member
  • **
  • Posts: 84
  • Karma: +0/-0
Re: 2.4.1: pfSense lockup with CARP on bridge interface
« Reply #6 on: November 15, 2017, 04:07:07 pm »
Hoping this gets some traction, but so far no activity on the linked bug report...