Pound -> Varnish/Nginx/Apache - Time out problem

From: Niels Wahlgreen <none_at_niels--wahlgreen.dk.lh.bsd-dk.dk>
Date: Tue, 2 Oct 2012 18:26:24 +0000
To: "bsd-dk_at_bsd-dk.dk" <none_at_bsd-dk--bsd-dk.dk.lh.bsd-dk.dk>

Hej alle på listen.

Jette/Nerdgirl og jeg har et problem som vi har brugt en del tid på at finde hoved og hale i, så nu poster jeg det her og håber at nogen af jer har lidt input til hvordan vi kommer videre.

Vi har et setup, som indtil i går bestod af en server med FreeBSD og Pound, som modtager en masse HTTPS forespørgsler, som den så fordeler videre til tre forskellige backend-typer:

1. Pound -> Varnish -> 4 Apache backend servere

2. Pound -> Nginx som leverer en masse statiske billeder

3. Pound -> Apache som kører en lille applikation som kun har meget begrænset belastning

Indtil i går kørte Pound på sin egen server (commodus), forbandt til serveren som kørte Varnish/Nginx/Apache (vi kalder den varnish-serveren), som derfra enten leverede indhold eller hentede det fra de 4 apache backend servere.

Så skete der det, at varnish-serveren stod af med et defekt bundkort og vi flyttede alle dens funktioner så de nu kører på samme server som Pound.

Nu har vi så det problem, at efter sammenlægningen af disse funktioner på serveren, går sitet stort set i stå når der er belastning på.

Pound skriver følgende i loggen:
Oct 2 20:13:56 commodus pound: (8012a1980) error read from 2.111.17.254: Operation timed out
Oct 2 20:13:57 commodus pound: (8012a2940) error read from 188.177.158.126: Operation timed out
Oct 2 20:13:57 commodus pound: (801209e80) error read from 90.184.87.246: Operation timed out
Oct 2 20:13:57 commodus pound: (80129d1c0) connect_nb: error after getsockopt: Connection refused
Oct 2 20:13:57 commodus pound: (8012a1600) error read from 90.184.87.246: Operation timed out
Oct 2 20:13:57 commodus pound: (801208600) error read from 90.184.87.246: Operation timed out
Oct 2 20:13:57 commodus pound: (8012a02c0) error read from 90.184.87.246: Operation timed out
Oct 2 20:13:58 commodus pound: (80129f840) error read from 31.25.22.148: Operation timed out
Oct 2 20:14:00 commodus pound: (8012a2b00) error read from 87.104.188.67: Operation timed out
Oct 2 20:14:01 commodus pound: (80129dfc0) error read from 83.89.31.35: Operation timed out
Oct 2 20:14:01 commodus pound: (80129d8c0) error read from 83.89.31.35: Operation timed out
Oct 2 20:14:01 commodus pound: (801206300) error read from 83.89.31.35: Operation timed out
Oct 2 20:14:01 commodus pound: (801207d40) error read from 83.89.31.35: Operation timed out
Oct 2 20:14:01 commodus pound: (801208ec0) error read from 83.89.31.35: Operation timed out
Oct 2 20:14:08 commodus pound: (80129f4c0) error read from 212.242.48.91: Operation timed out
Oct 2 20:14:09 commodus pound: (80129ea40) error read from 80.62.116.43: Operation timed out
Oct 2 20:14:09 commodus pound: (801208600) error read from 62.44.135.186: Operation timed out
Oct 2 20:14:09 commodus pound: (8012a10c0) error read from 80.162.59.189: Operation timed out
Oct 2 20:14:10 commodus pound: (801208d00) error flush headers to 80.71.135.10: Connection reset by peer
Oct 2 20:14:10 commodus pound: (8012087c0) error read from 80.71.135.10: Operation timed out
Oct 2 20:14:11 commodus pound: (8012a3200) error read from 90.184.16.125: Operation timed out
Oct 2 20:14:15 commodus pound: (80129ec00) error read from 62.199.160.110: Operation timed out
Oct 2 20:14:17 commodus pound: (801207f00) error read from 87.48.28.78: Operation timed out
Oct 2 20:14:19 commodus pound: (8012a2e80) error read from 212.10.83.79: Operation timed out
Oct 2 20:14:20 commodus pound: (8012064c0) e500 for 80.161.116.25 response error read from 10.0.0.49:80/POST /element.php HTTP/1.1: Operation timed out (33.543 secs)
Oct 2 20:14:20 commodus pound: (801208d00) error read from 89.233.31.194: Operation timed out
Oct 2 20:14:20 commodus pound: (8012a3c80) error read from 83.94.247.69: Operation timed out
Oct 2 20:14:20 commodus pound: (80129e180) error read from 89.150.77.252: Operation timed out
Oct 2 20:14:21 commodus pound: (801206f40) error read from 80.197.98.250: Operation timed out

Efter hvad jeg kan se, betyder det at den mister forbindelsen til browserne før den når at svare. Serveren har 99.7% idle CPU, 14G ledig ram ud af 16G og ikke voldsomt meget netværkstrafik, men et par tusinde åbne connections hele tiden.

Når belastningen aftager, virker alt som det skal.

Varnish svarer lynhurtigt når først den modtager forespørgslerne, så efter hvad jeg kan tænke mig frem til, må problemet ligge i netværksstacken et sted, som skaber en kø og forhindrer afviklingen af forbindelserne. Højst sandsynligt fordi der nu, i stedet for kun at være forbindelser fra brugerne til serveren og nogen få forbindelser videre, nu er en masse flere connections til både pound, nginx, apache, varnish og videre fra varnish til backend serverne. Det virker som om vi har ramt en flaskehals på netværkstrafikken.

Jeg har givet serveren følgende parametre i /boot/loader.conf:
kern.ipc.somaxconn=4096
kern.threads.max_threads_per_proc=4096
kern.maxvnodes=120000
kern.ipc.maxsockbuf=262144
kern.ipc.nmbclusters=131072
net.inet.ip.rtminexpire=5
net.inet.ip.rtexpire=5
net.inet.tcp.ecn.enable=1
net.inet.tcp.msl=10000

Jeg har indsat en dmesg nederst i mailen. Serveren er en HP DL165 med så mange cores som vi kunne komme i nærheden af. Det hele har kørt glimrende indtil i går, hvor vi blev nødt til at flytte funktionerne fra den gamle til den nye server.

Enhver form for input vil være meget kærkomment, hvordan opsætter I jeres servere som modtager mange connections? Kan netkortet være et problem, det er det som er standard embedded i en ProLiant?

Med venlig hilsen

Niels

****
Copyright (c) 1992-2011 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 8.2-RELEASE #0: Thu Feb 17 02:41:51 UTC 2011
    root_at_mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: AMD Opteron(tm) Processor 6172 (2100.05-MHz K8-class CPU)
  Origin = "AuthenticAMD" Id = 0x100f91 Family = 10 Model = 9 Stepping = 1
  Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0x802009<SSE3,MON,CX16,POPCNT>
  AMD Features=0xee500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM,3DNow!+,3DNow!>
  AMD Features2=0x837ff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,SKINIT,WDT,<b19>>
  TSC: P-state invariant
real memory = 17179869184 (16384 MB)
avail memory = 16524075008 (15758 MB)
ACPI APIC Table: <HP ProLiant>
FreeBSD/SMP: Multiprocessor System Detected: 24 CPUs
FreeBSD/SMP: 2 package(s) x 12 core(s)
cpu0 (BSP): APIC ID: 16
cpu1 (AP): APIC ID: 17
cpu2 (AP): APIC ID: 18
cpu3 (AP): APIC ID: 19
cpu4 (AP): APIC ID: 20
cpu5 (AP): APIC ID: 21
cpu6 (AP): APIC ID: 22
cpu7 (AP): APIC ID: 23
cpu8 (AP): APIC ID: 24
cpu9 (AP): APIC ID: 25
cpu10 (AP): APIC ID: 26
cpu11 (AP): APIC ID: 27
cpu12 (AP): APIC ID: 32
cpu13 (AP): APIC ID: 33
cpu14 (AP): APIC ID: 34
cpu15 (AP): APIC ID: 35
cpu16 (AP): APIC ID: 36
cpu17 (AP): APIC ID: 37
cpu18 (AP): APIC ID: 38
cpu19 (AP): APIC ID: 39
cpu20 (AP): APIC ID: 40
cpu21 (AP): APIC ID: 41
cpu22 (AP): APIC ID: 42
cpu23 (AP): APIC ID: 43
ACPI Warning: Optional field Pm2ControlBlock has zero address or length: 0x0000000000000000/0x1 (20101013/tbfadt-655)
ioapic0 <Version 2.1> irqs 0-23 on motherboard
ioapic1 <Version 2.1> irqs 24-55 on motherboard
kbd1 at kbdmux0
acpi0: <HP ProLiant> on motherboard
acpi0: [ITHREAD]
ACPI Error: [PCI0] Namespace lookup failure, AE_NOT_FOUND (20101013/dswload-772)
ACPI Exception: AE_NOT_FOUND, During name lookup/catalog (20101013/psloop-326)
ACPI Error: Method parse/execution failed [\\] (Node 0xffffffff80b2e160), AE_NOT_FOUND (20101013/psparse-633)
acpi0: Power Button (fixed)
acpi0: reservation of ffffff00, 4000 (3) failed
acpi0: reservation of fec00000, 1000 (3) failed
acpi0: reservation of fee00000, 1000 (3) failed
acpi0: reservation of ffb80000, 80000 (3) failed
acpi0: reservation of fec10000, 20 (3) failed
acpi0: reservation of 0, a0000 (3) failed
acpi0: reservation of 100000, bff00000 (3) failed
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
cpu2: <ACPI CPU> on acpi0
cpu3: <ACPI CPU> on acpi0
cpu4: <ACPI CPU> on acpi0
cpu5: <ACPI CPU> on acpi0
cpu6: <ACPI CPU> on acpi0
cpu7: <ACPI CPU> on acpi0
cpu8: <ACPI CPU> on acpi0
cpu9: <ACPI CPU> on acpi0
cpu10: <ACPI CPU> on acpi0
cpu11: <ACPI CPU> on acpi0
cpu12: <ACPI CPU> on acpi0
cpu13: <ACPI CPU> on acpi0
cpu14: <ACPI CPU> on acpi0
cpu15: <ACPI CPU> on acpi0
cpu16: <ACPI CPU> on acpi0
cpu17: <ACPI CPU> on acpi0
cpu18: <ACPI CPU> on acpi0
cpu19: <ACPI CPU> on acpi0
cpu20: <ACPI CPU> on acpi0
cpu21: <ACPI CPU> on acpi0
cpu22: <ACPI CPU> on acpi0
cpu23: <ACPI CPU> on acpi0
acpi_hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 900
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> irq 52 at device 4.0 on pci0
pci5: <ACPI PCI bus> on pcib1
igb0: <Intel(R) PRO/1000 Network Connection version - 2.0.7> port 0xe400-0xe41f mem 0xfeb60000-0xfeb7ffff,0xfeb40000-0xfeb5ffff,0xfeb98000-0xfeb9bfff irq 44 at device 0.0 on pci5
igb0: Using MSIX interrupts with 9 vectors
igb0: [ITHREAD]
igb0: [ITHREAD]
igb0: [ITHREAD]
igb0: [ITHREAD]
igb0: [ITHREAD]
igb0: [ITHREAD]
igb0: [ITHREAD]
igb0: [ITHREAD]
igb0: [ITHREAD]
igb0: Ethernet address: 78:e3:b5:11:4d:44
igb1: <Intel(R) PRO/1000 Network Connection version - 2.0.7> port 0xe800-0xe81f mem 0xfebe0000-0xfebfffff,0xfebc0000-0xfebdffff,0xfeb9c000-0xfeb9ffff irq 45 at device 0.1 on pci5
igb1: Using MSIX interrupts with 9 vectors
igb1: [ITHREAD]
igb1: [ITHREAD]
igb1: [ITHREAD]
igb1: [ITHREAD]
igb1: [ITHREAD]
igb1: [ITHREAD]
igb1: [ITHREAD]
igb1: [ITHREAD]
igb1: [ITHREAD]
igb1: Ethernet address: 78:e3:b5:11:4d:45
pcib2: <ACPI PCI-PCI bridge> irq 53 at device 6.0 on pci0
pci4: <ACPI PCI bus> on pcib2
igb2: <Intel(R) PRO/1000 Network Connection version - 2.0.7> port 0xd400-0xd41f mem 0xfea60000-0xfea7ffff,0xfea40000-0xfea5ffff,0xfea98000-0xfea9bfff irq 51 at device 0.0 on pci4
igb2: Using MSIX interrupts with 9 vectors
igb2: [ITHREAD]
igb2: [ITHREAD]
igb2: [ITHREAD]
igb2: [ITHREAD]
igb2: [ITHREAD]
igb2: [ITHREAD]
igb2: [ITHREAD]
igb2: [ITHREAD]
igb2: [ITHREAD]
igb2: Ethernet address: 78:e3:b5:11:6d:be
igb3: <Intel(R) PRO/1000 Network Connection version - 2.0.7> port 0xd800-0xd81f mem 0xfeae0000-0xfeafffff,0xfeac0000-0xfeadffff,0xfea9c000-0xfea9ffff irq 48 at device 0.1 on pci4
igb3: Using MSIX interrupts with 9 vectors
igb3: [ITHREAD]
igb3: [ITHREAD]
igb3: [ITHREAD]
igb3: [ITHREAD]
igb3: [ITHREAD]
igb3: [ITHREAD]
igb3: [ITHREAD]
igb3: [ITHREAD]
igb3: [ITHREAD]
igb3: Ethernet address: 78:e3:b5:11:6d:bf
pcib3: <ACPI PCI-PCI bridge> irq 54 at device 10.0 on pci0
pci3: <ACPI PCI bus> on pcib3
vgapci0: <VGA-compatible display> mem 0xfc000000-0xfcffffff,0xfdfec000-0xfdfeffff,0xfe000000-0xfe7fffff irq 47 at device 0.0 on pci3
pcib4: <ACPI PCI-PCI bridge> irq 54 at device 11.0 on pci0
pci2: <ACPI PCI bus> on pcib4
ciss0: <HP Smart Array P410> port 0xc800-0xc8ff mem 0xfdc00000-0xfddfffff,0xfde7f000-0xfde7ffff irq 32 at device 0.0 on pci2
ciss0: PERFORMANT Transport
ciss0: [ITHREAD]
atapci0: <ATI IXP700/800 SATA300 controller> port 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f mem 0xfe9fe400-0xfe9fe7ff irq 22 at device 17.0 on pci0
atapci0: [ITHREAD]
atapci0: AHCI v1.10 controller with 4 3Gbps ports, PM supported
ata2: <ATA channel 0> on atapci0
ata2: [ITHREAD]
ohci0: <OHCI (generic) USB controller> mem 0xfe9fa000-0xfe9fafff irq 16 at device 18.0 on pci0
ohci0: [ITHREAD]
usbus0: <OHCI (generic) USB controller> on ohci0
ohci1: <OHCI (generic) USB controller> mem 0xfe9fb000-0xfe9fbfff irq 16 at device 18.1 on pci0
ohci1: [ITHREAD]
usbus1: <OHCI (generic) USB controller> on ohci1
ehci0: <EHCI (generic) USB 2.0 controller> mem 0xfe9fe800-0xfe9fe8ff irq 17 at device 18.2 on pci0
ehci0: [ITHREAD]
usbus2: EHCI version 1.0
usbus2: <EHCI (generic) USB 2.0 controller> on ehci0
ohci2: <OHCI (generic) USB controller> mem 0xfe9fc000-0xfe9fcfff irq 18 at device 19.0 on pci0
ohci2: [ITHREAD]
usbus3: <OHCI (generic) USB controller> on ohci2
ohci3: <OHCI (generic) USB controller> mem 0xfe9fd000-0xfe9fdfff irq 18 at device 19.1 on pci0
ohci3: [ITHREAD]
usbus4: <OHCI (generic) USB controller> on ohci3
ehci1: <EHCI (generic) USB 2.0 controller> mem 0xfe9fec00-0xfe9fecff irq 19 at device 19.2 on pci0
ehci1: [ITHREAD]
usbus5: EHCI version 1.0
usbus5: <EHCI (generic) USB 2.0 controller> on ehci1
pci0: <serial bus, SMBus> at device 20.0 (no driver attached)
atapci1: <ATI IXP700/800 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xff00-0xff0f at device 20.1 on pci0
ata0: <ATA channel 0> on atapci1
ata0: [ITHREAD]
ata1: <ATA channel 1> on atapci1
ata1: [ITHREAD]
isab0: <PCI-ISA bridge> at device 20.3 on pci0
isa0: <ISA bus> on isab0
pcib5: <ACPI PCI-PCI bridge> at device 20.4 on pci0
pci1: <ACPI PCI bus> on pcib5
ohci4: <OHCI (generic) USB controller> mem 0xfe9ff000-0xfe9fffff irq 18 at device 20.5 on pci0
ohci4: [ITHREAD]
usbus6: <OHCI (generic) USB controller> on ohci4
acpi_button0: <Power Button> on acpi0
atrtc0: <AT realtime clock> port 0x70-0x71 irq 8 on acpi0
uart0: <16550 or compatible> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
uart0: [FILTER]
acpi_hpet1: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
device_attach: acpi_hpet1 attach returned 12
orm0: <ISA Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xcbfff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
atkbd0: [ITHREAD]
ppc0: cannot reserve I/O port range
acpi_throttle0: <ACPI CPU Throttling> on cpu0
hwpstate0: <Cool`n'Quiet 2.0> on cpu0
Timecounters tick every 1.000 msec
usbus0: 12Mbps Full Speed USB v1.0
usbus1: 12Mbps Full Speed USB v1.0
usbus2: 480Mbps High Speed USB v2.0
usbus3: 12Mbps Full Speed USB v1.0
usbus4: 12Mbps Full Speed USB v1.0
usbus5: 480Mbps High Speed USB v2.0
usbus6: 12Mbps Full Speed USB v1.0
ugen0.1: <ATI> at usbus0
uhub0: <ATI OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus0
ugen1.1: <ATI> at usbus1
uhub1: <ATI OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus1
ugen2.1: <ATI> at usbus2
uhub2: <ATI EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus2
ugen3.1: <ATI> at usbus3
uhub3: <ATI OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus3
ugen4.1: <ATI> at usbus4
uhub4: <ATI OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus4
ugen5.1: <ATI> at usbus5
uhub5: <ATI EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus5
ugen6.1: <ATI> at usbus6
uhub6: <ATI OHCI root HUB, class 9/0, rev 1.00/1.00, addr 1> on usbus6
uhub6: 2 ports with 2 removable, self powered
uhub0: 3 ports with 3 removable, self powered
uhub1: 3 ports with 3 removable, self powered
uhub3: 3 ports with 3 removable, self powered
uhub4: 3 ports with 3 removable, self powered
uhub2: 6 ports with 6 removable, self powered
uhub5: 6 ports with 6 removable, self powered
da0 at ciss0 bus 0 scbus0 target 0 lun 0
da0: <COMPAQ RAID 1(1VOLUME OK> Fixed Direct Access SCSI-5 device
da0: 135.168MB/s transfers
da0: Command Queueing enabled
da0: 139979MB (286677120 512 byte sectors: 255H 32S/T 35132C)
SMP: AP CPU #2 Launched!
SMP: AP CPU #4 Launched!
SMP: AP CPU #1 Launched!
SMP: AP CPU #3 Launched!
SMP: AP CPU #5 Launched!
SMP: AP CPU #6 Launched!
SMP: AP CPU #8 Launched!
SMP: AP CPU #9 Launched!
SMP: AP CPU #11 Launched!
SMP: AP CPU #10 Launched!
SMP: AP CPU #7 Launched!
SMP: AP CPU #22 Launched!
SMP: AP CPU #19 Launched!
SMP: AP CPU #21 Launched!
SMP: AP CPU #20 Launched!
SMP: AP CPU #23 Launched!
SMP: AP CPU #18 Launched!
SMP: AP CPU #13 Launched!
SMP: AP CPU #12 Launched!
SMP: AP CPU #15 Launched!
SMP: AP CPU #17 Launched!
SMP: AP CPU #14 Launched!
SMP: AP CPU #16 Launched!
Trying to mount root from ufs:/dev/da0s1a
WARNING: / was not properly dismounted
WARNING: /usr was not properly dismounted
WARNING: /var was not properly dismounted
ipfw2 (+ipv6) initialized, divert loadable, nat loadable, rule-based forwarding disabled, default to deny, logging disabled
igb2: link state changed to UP
igb3: link state changed to UP
Received on Tue 02 Oct 2012 - 20:42:50 CEST

This archive was generated by hypermail 2.2.0 : Wed 27 Mar 2013 - 10:40:16 CET