CentOS 6 64-bit guest problem with multiple CPUs on ESXi 3.5 when attempting to enable VMI

I had CentOS 5.5 Linux with a 64-bit kernel:

 

2.6.18-274.3.1.el5 #1 SMP Tue Sep 6 20:13:52 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

 

running on VMware ESXi 3.5 with 4 CPUs, guest OS type set to Red Hat Enterprise Linux 5 (64-bit), with two e1000 Ethernet NICs using this driver:

 

# modinfo e1000
filename:       /lib/modules/2.6.18-274.3.1.el5/kernel/drivers/net/e1000/e1000.ko
version:        7.3.21-k4-3-NAPI
license:        GPL
description:    Intel(R) PRO/1000 Network Driver
author:         Intel Corporation, <linux.nics@intel.com>
srcversion:     ED6A6823DBD3577BF429CF3

 

A multi-threaded application receiving about 2 MB/s over the Cisco Systems VPN Client version 4.8.02 (0030) ran uninterrupted for months at a time, without any problems, on this 4-CPU VM under ESXi 3.5.

I then installed CentOS 6, also with 4 CPUs, with this kernel:

 

Linux centos6 2.6.32-71.29.1.el6.x86_64 #1 SMP Mon Jun 27 19:49:27 BST 2011 x86_64 x86_64 x86_64 GNU/Linux

 

I recompiled the Cisco VPN client version 4.8.02 (0030) for it and immediately started getting kernel panics related to the e1000 NIC.

I then installed the updated e1000 driver (e1000-8.0.30.tar.gz) from the Intel website, but the problems persisted.
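
For reference, the out-of-tree Intel driver was built in the usual way (a rough sketch from memory, assuming the standard src/ layout of Intel's e1000 tarballs):

# tar zxf e1000-8.0.30.tar.gz
# cd e1000-8.0.30/src
# make install
# rmmod e1000 ; modprobe e1000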

Next I replaced the e1000 NIC with vmxnet, built from open-vm-tools-8.6.0-425873.tar:

 

# modinfo vmxnet
filename:       /lib/modules/2.6.32-71.29.1.el6.x86_64/kernel/drivers/net/vmxnet.ko
supported:      external
version:        2.0.9.0
license:        GPL v2
description:    VMware Virtual Ethernet driver
author:         VMware, Inc.
srcversion:     6E7C0312C2FB464593B6250
alias:          pci:v00001022d00002000sv*sd*bc*sc*i*
alias:          pci:v000015ADd00000720sv*sd*bc*sc*i*
depends:
vermagic:       2.6.32-71.29.1.el6.x86_64 SMP mod_unload modversions
parm:           debug:int
parm:           disable_lro:int

 

In the VM's .vmx configuration I have:

 

ethernet0.present = "true"
ethernet0.virtualDev = "vmxnet"
ethernet0.networkName = "xxx"
ethernet0.addressType = "generated"
ethernet0.generatedAddress = "00:0c:29:f8:a1:1b"
ethernet0.wakeOnPcktRcv = "false"
ethernet0.checkMACAddress = "false"
ethernet0.features = "15"

 

With eth0 dropping its connection without any apparent warning in /var/log/messages, and kernel panics relating to interrupts appearing after about 10 minutes of intensive receive traffic over the VPN connection, I configured the CentOS kernel with notsc and divider=10 (as per the VMware documents regarding Red Hat guests), assigned 2 CPUs to the VM in ESXi, set the hyperthreading mode to "internal", and assigned affinity to 2 CPUs from the vSphere client. After that the NIC would work for many hours under heavy traffic load, until eventually eth0 would drop out again.
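
For reference, notsc and divider=10 were simply appended to the kernel line in /boot/grub/grub.conf, roughly as below (the root device and initrd name here are placeholders, not my actual values):

title CentOS (2.6.32-71.29.1.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-71.29.1.el6.x86_64 ro root=/dev/VolGroup00/LogVol00 notsc divider=10
        initrd /initramfs-2.6.32-71.29.1.el6.x86_64.img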

I noticed that eth1, which transmits on the local network at about 500 kB/s (much less than is received), remained in a working state, whereas eth0, which carries much more load, would become unresponsive; it was possible to bring it back up with ifdown eth0 / ifup eth0.
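
As a stop-gap, that ifdown/ifup workaround can be scripted; a crude watchdog sketch is below (the gateway address 172.31.252.1 is hypothetical, substitute something eth0 should always be able to reach):

#!/bin/sh
# Crude workaround, not a fix: bounce eth0 when it stops answering pings.
GW=172.31.252.1
while true; do
    if ! ping -c 3 -W 2 -I eth0 "$GW" > /dev/null 2>&1; then
        logger "eth0 watchdog: no reply from $GW, restarting eth0"
        ifdown eth0
        ifup eth0
    fi
    sleep 30
done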

I also disabled IPv6 in case it was a source of problems.
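
For completeness, IPv6 was turned off with the usual sysctl knobs (one common way of doing it on CentOS 6; a sketch, not necessarily exactly what I used):

# cat >> /etc/sysctl.conf << 'EOF'
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
EOF
# sysctl -p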

The IRQs of the two NICs seem to be kept mostly on separate CPUs:

 

# more /proc/interrupts
           CPU0       CPU1
  0:        246          0   IO-APIC-edge      timer
  1:        260        412   IO-APIC-edge      i8042
  8:          0          0   IO-APIC-edge      rtc0
  9:          0          0   IO-APIC-fasteoi   acpi
 12:        118          2   IO-APIC-edge      i8042
 14:         13         62   IO-APIC-edge      ata_piix
 15:          0          0   IO-APIC-edge      ata_piix
 17:       3077       9291   IO-APIC-fasteoi   ioc0
 18:      16650   11315642   IO-APIC-fasteoi   vmxnet ether   <--- eth0 giving problems/flaky
 19:    1114584        151   IO-APIC-fasteoi   vmxnet ether   <--- eth1 working ok
NMI:          0          0   Non-maskable interrupts
LOC:    3269152    2548397   Local timer interrupts
SPU:          0          0   Spurious interrupts
PMI:          0          0   Performance monitoring interrupts
PND:          0          0   Performance pending work
RES:    2802190     304539   Rescheduling interrupts
CAL:       5820        164   Function call interrupts
TLB:        824       1168   TLB shootdowns
TRM:          0          0   Thermal event interrupts
THR:          0          0   Threshold APIC interrupts
MCE:          0          0   Machine check exceptions
MCP:         14         14   Machine check polls
ERR:          0
MIS:          0
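
If one wanted to pin the two vmxnet IRQs to separate CPUs explicitly rather than rely on irqbalance, the smp_affinity interface could be used, roughly as below (IRQ numbers 18 and 19 taken from the listing above; the CPU masks are only an example):

# echo 2 > /proc/irq/18/smp_affinity   # eth0 (vmxnet) -> CPU1
# echo 1 > /proc/irq/19/smp_affinity   # eth1 (vmxnet) -> CPU0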

 

When eth0 drops out it is still present in ifconfig as if nothing had happened, and there aren't any errors reported on eth0; the errors on the VPN interface are considered normal:

 

# ifconfig
cipsec0   Link encap:Ethernet  HWaddr 00:0B:FC:F8:01:8F
          inet addr:10.10.10.22  Mask:255.255.255.255
          UP RUNNING NOARP  MTU:1356  Metric:1
          RX packets:9530289 errors:0 dropped:1001 overruns:0 frame:0
          TX packets:6480096 errors:0 dropped:1 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5544402306 (5.1 GiB)  TX bytes:436765130 (416.5 MiB)

 

eth0      Link encap:Ethernet  HWaddr 00:0C:29:F8:A1:11
          inet addr:172.31.252.35  Bcast:172.31.252.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9539518 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6481380 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:6368508134 (5.9 GiB)  TX bytes:932631137 (889.4 MiB)
          Interrupt:18 Base address:0x1400

 

eth1      Link encap:Ethernet  HWaddr 00:0C:29:F8:A1:1B
          inet addr:10.50.10.45  Bcast:10.50.10.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:664238 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1207966 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:58520766 (55.8 MiB)  TX bytes:1285092300 (1.1 GiB)
          Interrupt:19 Base address:0x1440

 

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:40553386 errors:0 dropped:0 overruns:0 frame:0
          TX packets:40553386 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2138146677 (1.9 GiB)  TX bytes:2138146677 (1.9 GiB)

 

I tried setting txqueuelen on eth0/eth1 to 30000, but the problem persists.
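
For reference, the queue length was raised like this:

# ifconfig eth0 txqueuelen 30000
# ifconfig eth1 txqueuelen 30000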

Is it possible that the problem is related to large receive offload (LRO), as described here:

 

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1027511

 

ethtool -K eth0 lro off

Cannot set large receive offload settings: Operation not supported
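
Since ethtool refuses to change LRO here, and the modinfo output above shows the vmxnet module exposes a disable_lro parameter, an alternative I could try is setting it at module load time, roughly as below (untested on my side, so treat it as a sketch):

# echo "options vmxnet disable_lro=1" > /etc/modprobe.d/vmxnet.conf
# ifdown eth0 ; ifdown eth1
# rmmod vmxnet ; modprobe vmxnet
# ifup eth0 ; ifup eth1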

 

Is it possible there is CPU contention due to preemption among processes, such that the vmxnet driver gets starved or similar? Both kernels appear to have this configuration:

 

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y
CONFIG_PREEMPT_NOTIFIERS=y

 

I recompiled the 2.6.32 kernel without preemption, but that did not solve the problem. I also get sporadic segmentation faults when compiling the kernel with 4 CPUs, but not with a single CPU, so it is not just an Ethernet I/O issue; it seems more related to multiple-CPU scheduling.
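
(The non-preemptive rebuild amounted to flipping the preemption model in .config, i.e. roughly:)

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set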

 

This problem seems to occur only with multiple CPUs inside VMware. It could be because recent kernels use the TSC for multi-CPU scheduling, which is perhaps not emulated with sufficiently low latency by VMware under kernel 2.6.32.

The 2.6.32 kernel runs quite well on VMware Workstation 7.1.5 when the CPUs are presented as cores, i.e. multiple cores within a single socket rather than multiple sockets, so I assume multi-core scheduling is less troublesome with virtual cores than with virtual sockets. It seems ESX 3.5 does not support virtual cores; are there any patches available from VMware to enable multiple cores on a single virtual CPU?
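
For comparison, on Workstation 7.x the cores-per-socket presentation is, as far as I understand it, expressed in the .vmx roughly like this (ESX 3.5 apparently does not honour it, which is what I am asking about):

numvcpus = "4"
cpuid.coresPerSocket = "4"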

 

The reason for trying CentOS 6 was to use VMI paravirtualization. I was aware it was only available for kernels after 2.6.21, but I had not realised it is only available for 32-bit guests. (Having enabled VMI on a 64-bit guest should simply have been silently denied by VMware, and CentOS 6 64-bit ships with VMI disabled in the default kernel configuration anyway, so it shouldn't have given rise to any instability? However, I should repeat the tests with VMI disabled on the 64-bit guest to be sure.) See:

 

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1003644

 

So I then installed the CentOS 6 32-bit version. Since it ships with the kernel option CONFIG_VMI switched off, I enabled it and recompiled kernel 2.6.32-71.29.1. The kernel detects the VMI ROM on VMware Workstation 7.1.5 configured with 4 CPU sockets (not cores) and guest type "CentOS", running on an Intel Core 2 host, and also on ESX Server 3i 3.5.0 build-207095 with an Intel Xeon 3040 host CPU:

 

Oct 23 21:21:13 localhost kernel: VMI: Found VMware, Inc. Hypervisor OPROM, API version 3.0, ROM version 1.0

 

but it freezes on VMware ESX Server 3.5 Update 4 build 153875 right at the very start, just after printing "probing for edd", on an AMD Opteron 280 physical host.
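
For reference, enabling VMI was just the usual rebuild routine starting from the shipped 32-bit config (a rough sketch from memory):

# cp /boot/config-$(uname -r) .config
# make menuconfig      # Processor type and features -> Paravirtualized guest support -> VMI Guest support (CONFIG_VMI=y)
# make -j4 bzImage modules
# make modules_install install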

 

Update: I eventually managed to boot the VMI-enabled CentOS 6 kernel by selecting the AMD-specific kernel CPU type. I enclose the kernel .config with CONFIG_MK8=y (AMD Opteron/K8), which seems to allow the VMI kernel to boot on the AMD host under ESX Server 3.5 Update 4 build 153875, according to the enclosed dmesg log. I would appreciate any information on experiences of using VMI in ESX.

 

Update: what ultimately seemed to be the problem was an incorrectly patched VPN kernel module, which leaked buffers on the main eth0 interface, eventually leading to "no buffer space" errors and an unresponsive eth0. With the correct patch, the VPN connection and the system have been stable for 10 days and counting.

