I have an issue with 20 ProLiant DL380p Gen8 servers. These servers are part of an HP Vertica cluster. All servers are running in power performance mode.
OS: CentOS release 6.6 (Final), kernel 2.6.32-504.8.1.el6.x86_64
There are a lot of messages like these in /var/log/mcelog and dmesg:
CPU27: Package temperature above threshold, cpu clock throttled (total events = 86472)
CPU15: Package temperature above threshold, cpu clock throttled (total events = 86471)
CPU14: Package temperature above threshold, cpu clock throttled (total events = 86472)
CPU28: Package temperature above threshold, cpu clock throttled (total events = 86472)
CPU24: Package temperature above threshold, cpu clock throttled (total events = 86472)
CPU 25 THERMAL EVENT TSC d944f35ad5d65e
TIME 1447757250 Tue Nov 17 11:47:30 2015
Processor heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
mcelog: Too many trigger children running already
STATUS 880003c3 MCGSTATUS 0
MCGCAP 1000819 APICID 23 SOCKETID 1
CPUID Vendor Intel Family 6 Model 62
Hardware event. This is not a software error.
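To see whether the throttle counters differ between CPUs or sockets, the kernel lines above can be tallied with a small script. This is only a minimal sketch, assuming the message format is exactly as shown in the dmesg output above; the function name `summarize_throttle_events` and the sample text are made up for illustration.

```python
import re

# Matches kernel lines such as:
# CPU27: Package temperature above threshold, cpu clock throttled (total events = 86472)
THROTTLE_RE = re.compile(
    r"CPU(\d+): Package temperature above threshold, "
    r"cpu clock throttled \(total events = (\d+)\)"
)


def summarize_throttle_events(text):
    """Return {cpu_number: highest 'total events' value seen for that CPU}."""
    counts = {}
    for match in THROTTLE_RE.finditer(text):
        cpu, total = int(match.group(1)), int(match.group(2))
        # The kernel counter only grows, so keep the largest value per CPU.
        counts[cpu] = max(total, counts.get(cpu, 0))
    return counts


# Illustrative input copied from the log excerpt above (not a full dmesg dump).
SAMPLE = """\
CPU27: Package temperature above threshold, cpu clock throttled (total events = 86472)
CPU15: Package temperature above threshold, cpu clock throttled (total events = 86471)
CPU14: Package temperature above threshold, cpu clock throttled (total events = 86472)
"""

for cpu, total in sorted(summarize_throttle_events(SAMPLE).items()):
    print("CPU%d: %d throttle events" % (cpu, total))
```

In practice the text would come from the saved output of `dmesg` rather than from the sample string. Running it twice a few minutes apart would show how fast the counters are actually growing.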
The Integrated Management Log shows no errors. The temperature in the server room is normal.
Server Firmware Version Info:
HP ProLiant System ROM 08/02/2014
HP Smart Array P420i Controller 6.00 Embedded
iLO 2.10 Jan 15 2015
Intelligent Provisioning 1.61.45
Power Management Controller Firmware 3.3
Power Management Controller Firmware Bootloader 2.7
SAS Programmable Logic Device Version 0x0C
Server Platform Services (SPS) Firmware 2.1.7.E7.4
System Programmable Logic Device Version 0x32
It looks like an mcelog bug, according to https://bugzilla.redhat.com/show_bug.cgi?id=924570
Is anyone else facing the same issue? Can you recommend how to resolve it?