I have several DL320e v2's here.
They have the B120i disabled and the drives in AHCI mode.
OS is Linux / Debian. Drives are Seagate 600 PRO SSD's, the 200 GB version.
When I have drive write caching enabled (hdparm -W1 or the BIOS setting) and NCQ enabled for the drives, I periodically get these errors in the logs:
Mar 22 14:25:19 rack3 kernel: [ 905.893359] ata1: hard resetting link
Mar 22 14:25:20 rack3 kernel: [ 906.211921] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Mar 22 14:25:20 rack3 kernel: [ 906.212372] ata1.00: configured for UDMA/133
Mar 22 14:25:20 rack3 kernel: [ 906.212377] ata1: EH complete
Mar 22 14:25:20 rack3 kernel: [ 906.228293] ata1: hard resetting link
Mar 22 14:25:20 rack3 kernel: [ 906.547636] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Mar 22 14:25:20 rack3 kernel: [ 906.548075] ata1.00: configured for UDMA/133
Mar 22 14:25:20 rack3 kernel: [ 906.548078] ata1: EH complete
Mar 22 14:25:20 rack3 kernel: [ 906.577595] ata1: hard resetting link
Mar 22 14:25:20 rack3 kernel: [ 906.895341] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Mar 22 14:25:20 rack3 kernel: [ 906.895782] ata1.00: configured for UDMA/133
Mar 22 14:25:20 rack3 kernel: [ 906.895786] ata1: EH complete
Mar 22 14:25:20 rack3 kernel: [ 906.914616] ata1: limiting SATA link speed to 3.0 Gbps
Mar 22 14:25:20 rack3 kernel: [ 906.915488] ata1: hard resetting link
Mar 22 14:25:21 rack3 kernel: [ 907.235052] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Mar 22 14:25:21 rack3 kernel: [ 907.235509] ata1.00: configured for UDMA/133
Mar 22 14:25:21 rack3 kernel: [ 907.235514] ata1: EH complete
tl;dr version - hard resets on the SATA connections under heavy workloads.
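To compare machines or settings, it can help to count these events instead of eyeballing the log. Below is a minimal sketch that tallies "hard resetting link" messages per ATA port and records any link-speed downgrades. The function name `summarize_ata_events` is made up for illustration, and the patterns are based only on the message wording quoted above, so other kernel versions may need different patterns.

```python
import re
from collections import Counter

# Patterns follow the kernel messages quoted in this post.
# "ata1.00: ..." lines are deliberately not matched: a port name must be
# followed directly by a colon.
RESET_RE = re.compile(r"\b(ata\d+): hard resetting link")
LIMIT_RE = re.compile(r"\b(ata\d+): limiting SATA link speed to ([\d.]+) Gbps")


def summarize_ata_events(log_text):
    """Return (Counter of resets per port, list of (port, speed) downgrades)."""
    resets = Counter(RESET_RE.findall(log_text))
    downgrades = LIMIT_RE.findall(log_text)
    return resets, downgrades
```

For example, feeding it the text of the kernel log (on a default Debian install that is probably /var/log/kern.log, but that path is an assumption) gives one count per port, which makes before/after comparisons less subjective.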
Googling for this suggests that on generic hardware the most likely cause is poor SATA cables or bugs in the SATA controller. Suggested work-arounds include disabling NCQ. However, these machines have the factory-installed mini-SAS cable going from the mainboard to the SAS/SATA backplane, and it happens in all of the DL320e v2's I have here to test (more than 2).
It turns out I can work around the bug by disabling write caching (hdparm -W0) or disabling NCQ (set /sys/block/sda/device/queue_depth to 1). But that is not acceptable because drive performance takes a 2-3x hit. (Write caching increases effective write speeds by a factor of about 2.5x, and disabling NCQ is about a 50% hit.)
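For anyone wanting to flip the NCQ workaround on and off between test runs, here is a rough sketch of the sysfs part. The helper names and the `sysfs_root` parameter are made up for illustration (the parameter is only there so the code can be tried against a scratch directory); the only thing taken from the setup above is the `/sys/block/<device>/device/queue_depth` path. Writing to the real file needs root, and the write-cache side (hdparm -W0) is not covered here.

```python
from pathlib import Path


def queue_depth_path(device, sysfs_root="/sys"):
    """Build the sysfs path, e.g. /sys/block/sda/device/queue_depth."""
    return Path(sysfs_root) / "block" / device / "device" / "queue_depth"


def get_queue_depth(device, sysfs_root="/sys"):
    """Read the current queue depth for a block device."""
    return int(queue_depth_path(device, sysfs_root).read_text().strip())


def set_queue_depth(device, depth, sysfs_root="/sys"):
    """Write a new queue depth and return the previous value.

    A depth of 1 effectively disables NCQ, as described above.
    """
    if depth < 1:
        raise ValueError("queue depth must be at least 1")
    previous = get_queue_depth(device, sysfs_root)
    queue_depth_path(device, sysfs_root).write_text(f"{depth}\n")
    return previous
```

Returning the previous value makes it easy to restore the original depth after a test run, e.g. `old = set_queue_depth("sda", 1)` followed later by `set_queue_depth("sda", old)`.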
I've used at least 4 unique SSD drives of the same model to reproduce this error. Reproduction is fairly easy, just run a 100 GB "bonnie++" run several times and it'll pop up at least once, usually about a half dozen times.
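The reproduction can be scripted roughly as follows. This is only a sketch: the function names are hypothetical, the bonnie++ options used are just `-d` (test directory), `-s` (file size in MiB) and `-u` (user to run as), and it assumes `dmesg` is readable and that the kernel ring buffer does not wrap during the test (if it does, the per-run counts will be off).

```python
import subprocess


def bonnie_command(directory, size_mib, user):
    """Build a bonnie++ command line; size is given in MiB."""
    return ["bonnie++", "-d", directory, "-s", str(size_mib), "-u", user]


def count_resets(log_text):
    """Count link resets using the message text quoted in this post."""
    return log_text.count("hard resetting link")


def read_kernel_log():
    """Return the current kernel ring buffer as text."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout


def reproduce(runs, command, run=subprocess.run, read_log=read_kernel_log):
    """Run the benchmark `runs` times; return the number of new resets per run."""
    new_resets = []
    seen = count_resets(read_log())
    for _ in range(runs):
        run(command, check=True)
        total = count_resets(read_log())
        new_resets.append(total - seen)
        seen = total
    return new_resets
```

A call such as `reproduce(5, bonnie_command("/mnt/test", 102400, "root"))` would approximate five 100 GB runs; the directory and user here are placeholders, not values from my setup.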
Now the smoking gun is that if I drop in an LSI 9207-8i PCI-e 3.0 SAS/SATA controller in the left-hand expansion slot, using the same internal cabling inside the DL320e's, the errors go away completely and I get about 5% more throughput to boot. The 9207 is the "IT"/JBOD mode version, so it's not running any RAID stuff either.
To me this indicates the on-board SATA controller when running in AHCI mode has bugs or at least an incompatibility with the particular drives I'm using.
The DL320e's are flashed to the latest BIOS and ILO versions.
Any ideas?
I will be contacting HP support next week because the situation is unsatisfactory, but I wonder if anyone else has any good ideas. I sort of suspect they will tell me to buy their HP-branded $900 SSD drives. Not going to happen at 2-3x the cost.