Topic: apinger doesn't recover opt wan when connection returns: still an issue?

arion_p
I am using the 2.0.1 stable release in a dual-WAN configuration and am experiencing the symptoms described in bug #742 and here: http://forum.pfsense.org/index.php/topic,32010.0.html

In the dashboard, the gateway shows as offline, with RTT ~30 ms and 100% loss. It seemed strange to me that an RTT is shown at all while apinger thinks the gateway is down.
I first thought that RTT is zero when the gateway is really down, but I was wrong: the dashboard simply shows the last measured RTT, so ignore that point.

Also in my logs I have:
May 2 11:16:36   php: : Gateways status could not be determined, considering all as up/active.
May 2 11:16:36   php: : MONITOR: GW_WIND is down, removing from routing group
May 2 11:16:36   php: : MONITOR: GW_WIND is down, removing from routing group
May 2 11:16:36   php: : MONITOR: GW_WIND is down, removing from routing group
May 2 11:16:36   php: : Message sent to *****@*****.gr OK
May 2 11:15:58   php: : MONITOR: GW_WIND is down, removing from routing group
May 2 11:15:53   check_reload_status: Reloading filter
May 2 11:15:43   apinger: ALARM: GW_WIND(62.169.255.45) *** GW_WINDdown ***


Edit2: I have this issue in both a regular PC installation and an embedded installation on an ALIX board.

I was under the impression that this was resolved in 2.0, is it not?

Thanks in advance.
« Last Edit: May 04, 2012, 04:28:33 am by arion_p »

cmb (Chris Buechler, Administrator)
This hasn't been an issue since long before the 2.0 release.

arion_p
Thanks for confirming.
This is odd, however, because I see exactly the behaviour described in #742. I can easily reproduce it by temporarily dropping the connection of the 2nd WAN router (ADSL).

The 2 WAN connections are set up for load balancing. Both have static IPs. I have tried using both a local subnet and a public IP between pfSense and the ADSL routers; it made no difference.

Anyone have any suggestions on how to troubleshoot the issue? Any info I could provide that would shed some light?
I have some limited experience with Linux and almost none with FreeBSD, but I have tried everything I could think of without any success.

Any help would be greatly appreciated.
« Last Edit: May 07, 2012, 02:10:44 am by arion_p »

arion_p
I did some more debugging. Here's a packet capture from WAN2 (taken with pfSense's own packet capture, so some packets may have been dropped) while apinger fails to ping the gateway even though the link is up:

12:30:55.017351 IP (tos 0x0, ttl 64, id 35835, offset 0, flags [none], proto ICMP (1), length 64)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 7969, seq 8193, length 44
12:30:56.017341 IP (tos 0x0, ttl 64, id 57869, offset 0, flags [none], proto ICMP (1), length 64)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 7969, seq 8449, length 44
12:30:57.017374 IP (tos 0x0, ttl 64, id 43995, offset 0, flags [none], proto ICMP (1), length 64)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 7969, seq 8705, length 44
12:30:58.017486 IP (tos 0x0, ttl 64, id 3231, offset 0, flags [none], proto ICMP (1), length 64)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 7969, seq 8961, length 44
12:30:59.017906 IP (tos 0x0, ttl 64, id 55248, offset 0, flags [none], proto ICMP (1), length 64)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 7969, seq 9217, length 44
12:31:00.017994 IP (tos 0x0, ttl 64, id 42689, offset 0, flags [none], proto ICMP (1), length 64)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 7969, seq 9473, length 44
12:31:01.018099 IP (tos 0x0, ttl 64, id 43292, offset 0, flags [none], proto ICMP (1), length 64)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 7969, seq 9729, length 44


Then without changing anything, I did a:

ping -S 10.0.2.103 -c 4  62.169.255.45

and here is the packet capture (note that apinger ICMP requests are also included):

12:41:25.119149 IP (tos 0x0, ttl 64, id 16190, offset 0, flags [none], proto ICMP (1), length 64)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 7969, seq 38403, length 44
12:41:26.120109 IP (tos 0x0, ttl 64, id 1185, offset 0, flags [none], proto ICMP (1), length 64)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 7969, seq 38659, length 44
12:41:26.751510 IP (tos 0x0, ttl 64, id 15068, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 15242, seq 0, length 64
12:41:26.781472 IP (tos 0x0, ttl 126, id 15068, offset 0, flags [none], proto ICMP (1), length 84)
    62.169.255.45 > 10.0.2.103: ICMP echo reply, id 15242, seq 0, length 64
12:41:27.121100 IP (tos 0x0, ttl 64, id 29878, offset 0, flags [none], proto ICMP (1), length 64)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 7969, seq 38915, length 44
12:41:27.752088 IP (tos 0x0, ttl 64, id 7106, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 15242, seq 1, length 64
12:41:27.780434 IP (tos 0x0, ttl 126, id 7106, offset 0, flags [none], proto ICMP (1), length 84)
    62.169.255.45 > 10.0.2.103: ICMP echo reply, id 15242, seq 1, length 64
12:41:28.122462 IP (tos 0x0, ttl 64, id 64529, offset 0, flags [none], proto ICMP (1), length 64)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 7969, seq 39171, length 44
12:41:28.753091 IP (tos 0x0, ttl 64, id 6296, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 15242, seq 2, length 64
12:41:28.781626 IP (tos 0x0, ttl 126, id 6296, offset 0, flags [none], proto ICMP (1), length 84)
    62.169.255.45 > 10.0.2.103: ICMP echo reply, id 15242, seq 2, length 64
12:41:29.123152 IP (tos 0x0, ttl 64, id 29476, offset 0, flags [none], proto ICMP (1), length 64)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 7969, seq 39427, length 44
12:41:29.754032 IP (tos 0x0, ttl 64, id 46050, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 15242, seq 3, length 64
12:41:29.781492 IP (tos 0x0, ttl 126, id 46050, offset 0, flags [none], proto ICMP (1), length 84)
    62.169.255.45 > 10.0.2.103: ICMP echo reply, id 15242, seq 3, length 64
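
To make a capture like the one above easier to read, here is a minimal Python sketch that tallies echo requests and replies per ICMP id. It is only an illustration: the `CAPTURE` sample is a handful of lines copied from the capture above, and the names `ICMP_LINE` and `tally_by_id` are made up for this example. Which id belongs to which program is an inference: id 7969 continues the sequence from the first capture (so presumably apinger), while id 15242 starts at seq 0 (so presumably the manual ping).

```python
import re
from collections import defaultdict

# A few lines copied from the second capture above. Only the second line of
# each two-line tcpdump record is needed for this tally.
CAPTURE = """\
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 7969, seq 38403, length 44
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 7969, seq 38659, length 44
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 15242, seq 0, length 64
    62.169.255.45 > 10.0.2.103: ICMP echo reply, id 15242, seq 0, length 64
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 7969, seq 38915, length 44
    10.0.2.103 > 62.169.255.45: ICMP echo request, id 15242, seq 1, length 64
    62.169.255.45 > 10.0.2.103: ICMP echo reply, id 15242, seq 1, length 64
"""

# Matches the ICMP part of a tcpdump line: type, id and sequence number.
ICMP_LINE = re.compile(r"ICMP echo (request|reply), id (\d+), seq (\d+)")


def tally_by_id(text):
    """Count echo requests and echo replies for each ICMP id in the text."""
    counts = defaultdict(lambda: {"request": 0, "reply": 0})
    for match in ICMP_LINE.finditer(text):
        kind, icmp_id, _seq = match.groups()
        counts[int(icmp_id)][kind] += 1
    return dict(counts)


summary = tally_by_id(CAPTURE)
for icmp_id, count in sorted(summary.items()):
    print(icmp_id, count)
```

On this sample, id 7969 has requests but no replies, while every request with id 15242 is answered.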


Notice that ping gets a response, while apinger does not. If I kill apinger and then restart it, it works fine until the line drops again.
The only differences I can see are:
  • the packet/data length - it shouldn't matter, because apinger works after a restart
  • the sequence #: in ping it starts from 0, while in apinger it continues from where it left off in the previous try. If I restart apinger, the sequence # restarts at 0. Could this be the issue?
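
A side note on the sequence numbers, as a small sketch rather than a diagnosis: the values shown for id 7969 step by 256 per probe (8193, 8449, 8705, ...), which is what a counter incrementing by 1 looks like if its two bytes are read in swapped order. Assuming that is what is happening (the thread does not confirm how apinger fills in this field), swapping the bytes back gives a plain counter. The helper name `swap16` is made up for this example.

```python
def swap16(value):
    """Swap the two bytes of a 16-bit integer."""
    return ((value & 0xFF) << 8) | (value >> 8)


# Sequence numbers for id 7969 as printed in the two captures above.
first_capture = [8193, 8449, 8705, 8961, 9217, 9473, 9729]
second_capture = [38403, 38659, 38915, 39171, 39427]

print([swap16(s) for s in first_capture])   # [288, 289, 290, 291, 292, 293, 294]
print([swap16(s) for s in second_capture])  # [918, 919, 920, 921, 922]

# 12:30:55 -> 12:41:25 is 630 seconds, and the swapped counter also
# advances by 630 between the first packets of the two captures.
print(swap16(38403) - swap16(8193))         # 630
```

Under that assumption, the counter advances by one per second across both captures, so the large values would simply reflect how long the process has been running. That alone does not say whether the sequence number is related to the missing replies.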

Next I want to put a sniffer directly on the LAN segment between pfSense and the ADSL router. (I've done this before but do not remember whether the ADSL router actually replied - I think it did, but I have to confirm.)

heper
I've had a similar issue before; changing the DSL router solved it for me.

In my case it was an old Cisco 800 series DSL router that caused the problem. It was replaced by a cheap D-Link.

arion_p
I've experienced this with both a Gennet Oxygen router and a Pirelli router. At least for the Pirelli one, replacing it is not an option: the router is provided and controlled by the ISP.