

Messages - kieranc

1
General Questions / Re: Found a bugfix, how to get it added to the wiki?
« on: February 14, 2018, 02:07:54 am »
Thanks, I appreciate it. If I get time to dig into it further, I'll do so.

2
General Questions / Re: Found a bugfix, how to get it added to the wiki?
« on: February 13, 2018, 01:01:51 pm »
Each of the crashes had an identical backtrace:

Code:
db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100057 td 0xfffff80003d15560
rn_match() at rn_match+0x11d/frame 0xfffffe00f0fd2650
fib4_lookup_nh_basic() at fib4_lookup_nh_basic+0x84/frame 0xfffffe00f0fd26b0
ip_findroute() at ip_findroute+0x31/frame 0xfffffe00f0fd26e0
ip_tryforward() at ip_tryforward+0x1f7/frame 0xfffffe00f0fd2750
ip_input() at ip_input+0x3c5/frame 0xfffffe00f0fd27b0
netisr_dispatch_src() at netisr_dispatch_src+0xa0/frame 0xfffffe00f0fd2800
ether_demux() at ether_demux+0x16d/frame 0xfffffe00f0fd2830
ether_nh_input() at ether_nh_input+0x310/frame 0xfffffe00f0fd2890
netisr_dispatch_src() at netisr_dispatch_src+0xa0/frame 0xfffffe00f0fd28e0
ether_input() at ether_input+0x26/frame 0xfffffe00f0fd2900
igb_rxeof() at igb_rxeof+0x6f4/frame 0xfffffe00f0fd2990
igb_msix_que() at igb_msix_que+0x109/frame 0xfffffe00f0fd29e0
intr_event_execute_handlers() at intr_event_execute_handlers+0xec/frame 0xfffffe00f0fd2a20
ithread_loop() at ithread_loop+0xd6/frame 0xfffffe00f0fd2a70
fork_exit() at fork_exit+0x85/frame 0xfffffe00f0fd2ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00f0fd2ab0

That code path is fairly deep in routing and packet processing, but not in an area we typically see issues. Doesn't look like mbuf exhaustion, no trace of ALTQ, and it doesn't match any of the previous queue-related panics that we have seen.

It's possible there is some new FreeBSD issue that is only affected by the specific combination of hardware you have, or it could be that hardware just can't handle the load of multiple queues. Given what you said the hardware is, I am more inclined to blame the hardware.
Fair enough. So if we accept that it's a hardware issue and there's a workaround that mitigates it, does it qualify for the wiki? I'd just like other people having the same problem to be able to find the workaround. I wasted more than a few hours searching for it.
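
A rough way to check that several crash reports share a backtrace is to compare only the function names, since the frame addresses differ between crashes (compare the frame pointers in the two panics posted on February 11). The sketch below is only an illustration: `call_chain` and `same_crash` are hypothetical helper names, and the line format is assumed from the single backtrace quoted above.

```python
import re

# Assumed ddb backtrace line format, taken from the trace quoted above, e.g.:
#   rn_match() at rn_match+0x11d/frame 0xfffffe00f0fd2650
FRAME_RE = re.compile(r"^\s*(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+\s*$")


def call_chain(backtrace: str) -> list[str]:
    """Return the function names in a backtrace, innermost frame first."""
    chain = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            chain.append(match.group(1))
    return chain


def same_crash(first: str, second: str) -> bool:
    """True when both backtraces pass through the same non-empty function sequence."""
    chain = call_chain(first)
    return bool(chain) and chain == call_chain(second)
```

Offsets and frame addresses are deliberately ignored, so two panics that die in `rn_match()` via the same callers compare as equal even when they happened on different CPUs.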

3
General Questions / Re: Found a bugfix, how to get it added to the wiki?
« on: February 12, 2018, 04:42:46 pm »
The bug didn't reappear, but it's possible your hardware has a different bug or problem.

Putting "try this, it might fix it but we don't know why" on the wiki is definitely a bad thing. We need to know why the hardware is crashing with more than one queue.

In most cases it's as simple as posting the full crash report that shows up after the panic and reboot. The backtrace will likely have better information, and the message buffer may have some clues as well.
Well, I sent the full crash logs via the GUI, so they're available somewhere. Sadly they seem to have been deleted automatically from my box.

What I'm failing to understand is the difference between this and all the other tweaks on the troubleshooting page. Are they all better understood than this one?

4
General Questions / Re: Found a bugfix, how to get it added to the wiki?
« on: February 12, 2018, 02:35:07 pm »
It was on the wiki page as a crutch to help with mbuf issues. Those have been fixed. It wasn't really a fix, just a workaround. The instances linked in the book as being related to queues have been fixed as well.
Maybe the bug has reappeared, or the fixes aren't working any more? A regression?

If you still have a crash with igb that is helped by reducing the queues, it's better to find out why than to just reduce the queue count. Dig deeper in the crash dumps/back trace and see where the crash is happening. It may not even really be from igb, but reducing the queue counts may hide the problem.

We could add that back into the wiki but without any specific guidance as to when/why someone might want to try it, I'm hesitant to do so. The fact that it panics isn't enough, we need to know more about what is actually causing that panic.
I don't really have the knowledge, time or inclination to dig deeper; I just want the box to work. It's a cheap Chinese Qotom machine which came with pfSense preinstalled (yes, I've wiped it), so it could easily be something to do with their implementation.

It was rebooting every day, I tried some stuff, looked on the wiki, didn't find anything useful, tried a bunch more stuff, found an old cached version of the wiki that included this information, bang, no more reboots.

I appreciate you both taking the time to respond, but 'we don't know why this fixes it' feels like a poor reason not to include it on a troubleshooting page.

5
General Questions / Re: Found a bugfix, how to get it added to the wiki?
« on: February 11, 2018, 01:38:33 pm »
As you please, I don't see the harm of it being in the wiki as "something to try if it's being weird", and I don't know why it was removed.
My box has now been up for over 2 weeks, after rebooting daily before I applied the tweak. So while I don't know how it works, I can confirm that it resolved my issue, and I think it has value.

6
General Questions / Re: Found a bugfix, how to get it added to the wiki?
« on: February 11, 2018, 09:25:51 am »
My panics looked like this:

Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address  = 0xb0
fault code    = supervisor read data, page not present
instruction pointer  = 0x20:0xffffffff80d0e74d
stack pointer          = 0x28:0xfffffe00f0fd2624
frame pointer          = 0x28:0xfffffe00f0fd2650
code segment    = base 0x0, limit 0xfffff, type 0x1b
      = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags  = interrupt enabled, resume, IOPL = 0
current process    = 12 (irq273: igb2:que 1)

Code:
Fatal trap 9: general protection fault while in kernel mode
cpuid = 2; apic id = 02
instruction pointer  = 0x20:0xffffffff80d0e762
stack pointer          = 0x28:0xfffffe00f0f88624
frame pointer          = 0x28:0xfffffe00f0f88650
code segment    = base 0x0, limit 0xfffff, type 0x1b
      = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags  = interrupt enabled, resume, IOPL = 0
current process    = 0 (igb2 taskq)

It's still in the pfSense book.
This is entirely possible, but it's not on the wiki and that's why you didn't find it yet :-)


7
General Questions / Found a bugfix, how to get it added to the wiki?
« on: February 03, 2018, 01:45:36 pm »
I was having daily kernel panics on a box with a 4-port igb NIC, and I found something that fixed it:

Code:
hw.igb.num_queues=1

It used to be on the "Tuning and Troubleshooting Network Cards" page of the wiki until 2015, when it was removed, but apparently it's still useful. How can I get it put back on there?

8
Traffic Shaping / Re: CoDel - How to use
« on: July 27, 2015, 03:20:51 am »
Given that the 2.2.4 'correct' settings seem to give worse results than the 'incorrect' 2.2.3 ones (for me at least), it seems that we need the ability to tune both interval and target in order to make codel useful for everyone.
I'm guessing it would be complicated to add an extra field to the traffic shaper setup page, but since the queue limit field is currently being reused to set the target, could we add some logic to use it to set both target and interval? If the field contains a single integer, use it as the target; if it contains something else (t5i100? 5:100?), use it as target and interval.
It's a bit beyond my skill level but it seems like it should be possible in theory, or is there a better way to do it?

edit: can we just set the interval and derive the target from that, if it's easier?
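
To make the suggestion concrete, here is a minimal sketch of the parsing logic in Python. It is not pfSense code: the function name `parse_queue_limit_field`, the `CodelParams` type and both combined formats ("5:100" and "t5i100") are hypothetical, taken only from the examples floated above.

```python
import re
from typing import NamedTuple, Optional


class CodelParams(NamedTuple):
    target: int              # taken from the field
    interval: Optional[int]  # None means "leave the default interval alone"


def parse_queue_limit_field(text: str) -> CodelParams:
    """Interpret the (hypothetical) overloaded 'queue limit' field.

    "5"      -> target 5, interval unchanged
    "5:100"  -> target 5, interval 100
    "t5i100" -> target 5, interval 100
    """
    text = text.strip()
    # A bare integer keeps today's behaviour: it only sets the target.
    if re.fullmatch(r"\d+", text):
        return CodelParams(int(text), None)
    # Otherwise accept either of the two combined spellings.
    match = re.fullmatch(r"(\d+):(\d+)", text) or re.fullmatch(r"t(\d+)i(\d+)", text)
    if match:
        return CodelParams(int(match.group(1)), int(match.group(2)))
    raise ValueError(f"unrecognised queue limit value: {text!r}")
```

A bare integer still behaves as it does now, so existing configurations would not change meaning; anything that matches neither form is rejected rather than guessed at.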

9
Traffic Shaping / Re: CoDel - How to use
« on: July 24, 2015, 06:41:32 am »
Well, this is fun. It seems to actually perform worse with the 'correct' values in place.
With 50/5 I was seeing mostly <200ms response times with upstream saturated, and a 'B' on the dslreports bufferbloat test.
With 5/100 I'm seeing mostly <300ms response times, with more between 200 and 300ms than before, and a 'C' on the dslreports bufferbloat test.

I think you may have another problem/misconfiguration. You should be seeing MUUUUCH better than 200ms. My ADSL connection goes from 600ms without any traffic-shaping, to 50ms with CoDel on upstream during a fully-saturating, single-stream upload test. My idle ping to first hop is ~10ms.

but lol.... I have been laughing that the fixed parameter values would actually cause a performance decrease...

You're absolutely right, my problem is my ISP and their crappy excuse for a router, which I can't easily replace because it also handles the phones.
My connection will easily hit 2000ms+ if someone is uploading, <200ms is a massive improvement.

I'm also laughing a little at the results; based on your previous tests it's not a huge surprise, but an explanation would be nice!

You might test enabling net.inet.tcp.inflight.enable=1 in the System->Advanced->System Tunables tab.

Quote
TCP bandwidth delay product limiting can be enabled by setting the net.inet.tcp.inflight.enable sysctl(8) variable to 1. This instructs the system to attempt to calculate the bandwidth delay product for each connection and limit the amount of data queued to the network to just the amount required to maintain optimum throughput.

This feature is useful when serving data over modems, Gigabit Ethernet, high speed WAN links, or any other link with a high bandwidth delay product, especially when also using window scaling or when a large send window has been configured. When enabling this option, also set net.inet.tcp.inflight.debug to 0 to disable debugging. For production use, setting net.inet.tcp.inflight.min to at least 6144 may be beneficial. Setting high minimums may effectively disable bandwidth limiting, depending on the link. The limiting feature reduces the amount of data built up in intermediate route and switch packet queues and reduces the amount of data built up in the local host's interface queue. With fewer queued packets, interactive connections, especially over slow modems, will operate with lower Round Trip Times. This feature only affects server side data transmission such as uploading. It has no effect on data reception or downloading.

Adjusting net.inet.tcp.inflight.stab is not recommended. This parameter defaults to 20, representing 2 maximal packets added to the bandwidth delay product window calculation. The additional window is required to stabilize the algorithm and improve responsiveness to changing conditions, but it can also result in higher ping(8) times over slow links, though still much lower than without the inflight algorithm. In such cases, try reducing this parameter to 15, 10, or 5 and reducing net.inet.tcp.inflight.min to a value such as 3500 to get the desired effect. Reducing these parameters should be done as a last resort only.

https://www.freebsd.org/doc/handbook/configtuning-kernel-limits.html



Seems like exactly the type of thing we would be interested in.
Just for fun, I enabled it and disabled the traffic shaper. During the upload portion of a dslreports test, my ping hit 2700ms :)
With inflight and codel enabled it seems to be behaving fine, possibly slightly better than without, but I'll have to do more testing.
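
For reference, the bandwidth delay product mentioned in the quoted handbook text is just link speed multiplied by round-trip time. A quick worked example in Python follows; the 600 kbit/s and 50 ms figures are illustrative numbers only, not measurements from my line, and `bandwidth_delay_product` is a made-up helper name.

```python
def bandwidth_delay_product(bits_per_second: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to keep a link of this speed and RTT full."""
    return bits_per_second * rtt_seconds / 8


# Illustrative values: a 600 kbit/s uplink with a 50 ms round trip.
bdp = bandwidth_delay_product(600_000, 0.050)
print(f"{bdp:.0f} bytes")  # 3750 bytes
```

Anything queued beyond that amount only adds latency, which is the build-up the inflight limiter is described as trying to avoid.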

10
Traffic Shaping / Re: CoDel - How to use
« on: July 22, 2015, 03:48:47 pm »
Well, this is fun. It seems to actually perform worse with the 'correct' values in place.
With 50/5 I was seeing mostly <200ms response times with upstream saturated, and a 'B' on the dslreports bufferbloat test.
With 5/100 I'm seeing mostly <300ms response times, with more between 200 and 300ms than before, and a 'C' on the dslreports bufferbloat test.

I think you may have another problem/misconfiguration. You should be seeing MUUUUCH better than 200ms. My ADSL connection goes from 600ms without any traffic-shaping, to 50ms with CoDel on upstream during a fully-saturating, single-stream upload test. My idle ping to first hop is ~10ms.

but lol.... I have been laughing that the fixed parameter values would actually cause a performance decrease...

You're absolutely right, my problem is my ISP and their crappy excuse for a router, which I can't easily replace because it also handles the phones.
My connection will easily hit 2000ms+ if someone is uploading, <200ms is a massive improvement.

I'm also laughing a little at the results; based on your previous tests it's not a huge surprise, but an explanation would be nice!

11
Traffic Shaping / Re: CoDel - How to use
« on: July 22, 2015, 01:08:06 pm »
Well, this is fun. It seems to actually perform worse with the 'correct' values in place.
With 50/5 I was seeing mostly <200ms response times with upstream saturated, and a 'B' on the dslreports bufferbloat test.
With 5/100 I'm seeing mostly <300ms response times, with more between 200 and 300ms than before, and a 'C' on the dslreports bufferbloat test.

12
Traffic Shaping / Re: CoDel - How to use
« on: July 22, 2015, 11:40:13 am »
Yeah, it is pretty confusing but I'll take CoDel however I can get it. :)
Ermal ported it himself, iirc. Ahead of the curve, that guy! :)


I still dunno how to view or set codel's parameters when it is a sub-discipline though. Default or gtfo, I suppose...
I've just had a tinker and I can't find anything, but that certainly doesn't mean it's not there.
I've rarely used BSD, is there some /proc type interface where the information comes from that can be queried directly?

13
Traffic Shaping / Re: CoDel - How to use
« on: July 22, 2015, 09:49:09 am »
If qlimit is 0, it defaults to 50, and codel gets the (initial?) target value from qlimit.
Is qlimit the queue length, or something else entirely?

qlimit is the queue length, which becomes useless when codel is active, since codel dynamically controls queue length (AQM).
So when using codel, the 'queue limit' setting seems to change the target instead... handy, but not very obvious.
Thanks!
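
Written out, the behaviour described above looks like the few lines below. This only models the claim as stated in this thread (a qlimit of 0 falls back to 50, and codel takes its target from qlimit); it is not taken from the ALTQ source, and `codel_target_from_qlimit` is a hypothetical name.

```python
DEFAULT_QLIMIT = 50  # fallback reported in the thread for a qlimit of 0


def codel_target_from_qlimit(qlimit: int) -> int:
    """Model of the described behaviour: codel's target is whatever qlimit ends up as."""
    effective_qlimit = qlimit if qlimit != 0 else DEFAULT_QLIMIT
    return effective_qlimit
```

That matches what I saw in the GUI: leaving the field empty gave a target of 50, and setting it to 25 or 5 gave a target of 25 or 5.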

14
Traffic Shaping / Re: CoDel - How to use
« on: July 22, 2015, 09:34:28 am »
If qlimit is 0, it defaults to 50, and codel gets the (initial?) target value from qlimit.
Is qlimit the queue length, or something else entirely?

15
Traffic Shaping / Re: CoDel - How to use
« on: July 22, 2015, 04:09:13 am »
So this is from the latest nightly:

Code:
[2.2.4-DEVELOPMENT][admin@pfSense.localdomain]/root: pfctl -vs queue
altq on em0 codel( target 50 interval 100) bandwidth 600Kb tbrsize 1500

Interval successfully changed, now we just have to figure out where the target of 50 is coming from....

Edit: I just set the 'queue limit' to 25 in the GUI and my target is now 25.... Victory?


Edit2: From 2.2.4 19/07/2015 nightly, with queue limit set to 5:
Code:
[2.2.4-DEVELOPMENT][admin@pfSense.localdomain]/root: pfctl -vs queue
altq on em1 codel( target 5 interval 100) bandwidth 6Mb tbrsize 6000
  [ pkts:         85  bytes:       9938  dropped pkts:      0 bytes:      0 ]
  [ qlength:   0/ 50 ]

So it wasn't anything I did yesterday that fixed it, but it does seem to be fixed/workable in 2.2.4
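
If anyone wants to check their own box without reading the output by eye, something like this Python sketch pulls the target and interval out of saved `pfctl -vs queue` text. The regular expression is an assumption based only on the two output samples in this post, and `codel_settings` is a made-up name; other pfSense versions may print the line differently.

```python
import re

# Assumed line format, copied from the samples above:
#   altq on em1 codel( target 5 interval 100) bandwidth 6Mb tbrsize 6000
CODEL_RE = re.compile(
    r"altq on (?P<iface>\S+) codel\(\s*target (?P<target>\d+) "
    r"interval (?P<interval>\d+)\s*\)"
)


def codel_settings(pfctl_output: str) -> dict[str, tuple[int, int]]:
    """Map interface name -> (target, interval) for every codel altq line found."""
    settings = {}
    for match in CODEL_RE.finditer(pfctl_output):
        settings[match.group("iface")] = (
            int(match.group("target")),
            int(match.group("interval")),
        )
    return settings
```

Fed the second sample above, it returns `{'em1': (5, 100)}`; lines such as the `pkts` and `qlength` counters are simply skipped.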
