RAID on the screenshot server

Disclaimer: All information on this page is provided AS IS and without any warranty. I have tried to be accurate, but this page may still contain serious errors. Fiddling with hard drives can have disastrous effects, so please be careful and always keep backups.

It's a good idea to run a RAID (Redundant Array of Independent Disks) on the central server for performance and data integrity. The server at browsershots.org has a hardware RAID-1 with an 8006-2LP controller from 3ware. Here's a loose collection of tricks for RAID maintenance.

smartmontools

$ sudo apt-get install smartmontools
$ sudo mknod /dev/twe0 u 164 0

Configuration lines in /etc/smartd.conf:

# Monitor 2 ATA disks connected to a 3ware 6/7/8000 controller which uses
# the 3w-xxxx driver. Start long self-tests on Saturday and Sunday.
/dev/twe0 -d 3ware,0 -a -s L/../../6/01
/dev/twe0 -d 3ware,1 -a -s L/../../7/01
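
The -s argument in the lines above is a pattern of the form T/MM/DD/d/HH: test type, month, day of month, day of week (1 is Monday, 7 is Sunday) and hour. As a rough sketch, a small shell function can spell out what a simple schedule means. The name describe_schedule is made up for this page, and the function only understands literal values, not the full regular expressions that smartd accepts.

```shell
# Hypothetical helper: describe a simple smartd "-s" schedule
# such as L/../../6/01. Only literal fields are understood.
describe_schedule() {
    type=${1%%/*}          # first field: test type
    rest=${1#*/*/*/}       # drop the T/MM/DD/ prefix
    weekday=${rest%%/*}    # fourth field: day of week
    hour=${rest#*/}        # fifth field: hour of day

    case $type in
        L) type_name="long" ;;
        S) type_name="short" ;;
        C) type_name="conveyance" ;;
        O) type_name="offline" ;;
        *) type_name="unknown" ;;
    esac

    case $weekday in
        1) day_name="Monday" ;;
        2) day_name="Tuesday" ;;
        3) day_name="Wednesday" ;;
        4) day_name="Thursday" ;;
        5) day_name="Friday" ;;
        6) day_name="Saturday" ;;
        7) day_name="Sunday" ;;
        *) day_name="weekday pattern $weekday" ;;
    esac

    echo "$type_name self-test on $day_name at hour $hour"
}
```

For example, describe_schedule L/../../6/01 describes the first disk's schedule, and describe_schedule L/../../7/01 the second.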

tw_cli

tw_cli is a proprietary command-line tool for 3ware RAID controllers. It can be downloaded from http://www.3ware.com/support/download.asp.

Sometimes smartd reports errors from the weekly self-test:

$ sudo smartctl -q errorsonly -l selftest -d 3ware,1 /dev/twe0
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       40%      8771         10826523
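
To pick out such failures in a script, the log can be filtered with awk. This is only a sketch: the function name failed_selftest_lbas is made up, and it assumes that failed entries carry a status beginning with "Completed: " (as in the read failure above) and that the LBA is the last column.

```shell
# Hypothetical helper: read a smartctl self-test log on stdin and
# print the LBA_of_first_error column for every failed entry.
# Assumes failed entries contain "Completed: " in the status,
# while successful ones say "Completed without error".
failed_selftest_lbas() {
    awk '/^#/ && /Completed: / { print $NF }'
}
```

It could be used like this: sudo smartctl -l selftest -d 3ware,1 /dev/twe0 | failed_selftest_lbas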

In this case, it helps to start a verify on the RAID unit:

$ sudo ./tw_cli /c0/u0 start verify

The verify and repair procedure can take several hours. You can monitor its progress in the %Cmpl column of the unit listing:

$ sudo ./tw_cli /c0 show

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-1    INITIALIZING   79     -       152.669   ON     -        -        

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     152.67 GB   320173056     L41LB68G            
p1     OK               u0     152.67 GB   320173056     L308BKBH            
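
If you only want the status and the completion percentage, the unit row can be cut out of that listing. A minimal sketch, assuming the column layout shown above (unit name in the first column, Status in the third, %Cmpl in the fourth); the function name unit_status is made up.

```shell
# Hypothetical helper: read the controller listing on stdin and
# print "<Status> <%Cmpl>" for the unit named in the first argument.
unit_status() {
    awk -v unit="$1" '$1 == unit { print $3, $4 }'
}
```

It could be used like this: sudo ./tw_cli /c0 show | unit_status u0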

After the verify, the unit and one of its drives may be reported as degraded:

$ sudo ./tw_cli /c0 show

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-1    DEGRADED       -      -       152.669   ON     -        -        

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     152.67 GB   320173056     L41LB68G            
p1     DEGRADED         u0     152.67 GB   320173056     L308BKBH            
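
A script can spot this state by listing every port whose status is not OK. Again just a sketch under the assumption that port rows start with "p" followed by a number and carry the status in the second column, as in the listing above; the name ports_not_ok is made up.

```shell
# Hypothetical helper: read the controller listing on stdin and
# print the name of every port whose Status column is not "OK"
# (for example DEGRADED or NOT-PRESENT).
ports_not_ok() {
    awk '$1 ~ /^p[0-9]+$/ && $2 != "OK" { print $1 }'
}
```

It could be used like this: sudo ./tw_cli /c0 show | ports_not_ok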

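When that happens, I remove the degraded drive from the unit, rescan the controller so that the drive is detected again, and then rebuild the unit onto it: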

$ sudo ./tw_cli maint remove c0 p1
Exporting port /c0/p1 ... Done.

$ sudo ./tw_cli info c0

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-1    DEGRADED       -      -       152.669   ON     -        -        

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     152.67 GB   320173056     L41LB68G            
p1     NOT-PRESENT      -      -           -             -

$ sudo ./tw_cli maint rescan 
Rescanning controller /c0 for units and drives ...Done.
Found the following unit(s): [none].
Found the following drive(s): [/c0/p1].

$ sudo ./tw_cli maint rebuild c0 u0 p1 
Sending rebuild start request to /c0/u0 on 1 disk(s) [1] ... Done.

$ sudo ./tw_cli info c0

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-1    REBUILDING     12     -       152.669   ON     -        -        

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     152.67 GB   320173056     L41LB68G            
p1     DEGRADED         u0     152.67 GB   320173056     L308BKBH            

The rebuild makes disk access much slower. In my case, the web server became unresponsive because database access was too slow. It may help to stop the web server until the rebuild has finished. I also think the rebuild went faster when most other disk access was paused. In the end, the status should be OK again:

$ sudo ./tw_cli info c0

Unit  UnitType  Status         %Cmpl  Stripe  Size(GB)  Cache  AVerify  IgnECC
------------------------------------------------------------------------------
u0    RAID-1    OK             -      -       152.669   ON     -        -        

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     152.67 GB   320173056     L41LB68G            
p1     OK               u0     152.67 GB   320173056     L308BKBH            
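
Instead of checking by hand, a small loop can wait until the unit reports OK, for example before starting the web server again. This is an untested sketch: wait_for_unit_ok, SHOW_CMD and POLL_SECONDS are names made up for this page, and the parsing assumes the column layout shown above.

```shell
# Hypothetical settings: the command that prints the controller
# listing, and the number of seconds to sleep between checks.
SHOW_CMD=${SHOW_CMD:-"sudo ./tw_cli /c0 show"}
POLL_SECONDS=${POLL_SECONDS:-300}

# Poll until the unit named in the first argument has status OK.
# Returns 2 if the unit does not appear in the listing at all.
# Note: this loops forever if the unit never reaches OK.
wait_for_unit_ok() {
    unit=$1
    while :; do
        # SHOW_CMD is left unquoted on purpose so it splits into words.
        status=$($SHOW_CMD | awk -v unit="$unit" '$1 == unit { print $3 }')
        if [ -z "$status" ]; then
            echo "unit $unit not found in listing" >&2
            return 2
        fi
        [ "$status" = "OK" ] && return 0
        echo "$unit is $status, checking again in $POLL_SECONDS seconds"
        sleep "$POLL_SECONDS"
    done
}
```

For example, wait_for_unit_ok u0 returns once the rebuild has finished, so the command that restarts the web server can follow it in a script.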

The maint and info commands above use an old syntax that will be removed in a future release of tw_cli. I think the following are the equivalents in the new syntax, but I'm only guessing from the documentation and haven't tried them:

$ sudo ./tw_cli /c0/p1 export
$ sudo ./tw_cli /c0 show
$ sudo ./tw_cli /c0 rescan
$ sudo ./tw_cli /c0/u0 start rebuild disk=p1
$ sudo ./tw_cli /c0/u0 pause rebuild
$ sudo ./tw_cli /c0/u0 resume rebuild

After a successful rebuild, you should run a long self-test on the drive. This also takes several hours, but it affects server performance much less than the rebuild does. The two lines in /etc/smartd.conf near the top of this page schedule the same test for every weekend. The commands below start the test, show how much of it remains, and print the self-test log:

$ sudo smartctl --test=long -d 3ware,1 /dev/twe0
$ sudo smartctl --all -d 3ware,1 /dev/twe0 | grep -B1 remaining
$ sudo smartctl --log=selftest -d 3ware,1 /dev/twe0
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8910         -
# 2  Extended offline    Aborted by host               40%      8902         -
# 3  Extended offline    Completed: read failure       40%      8771         10826523
...
#19  Extended offline    Completed without error       00%      6103         -
#20  Short offline       Completed without error       00%         0         -
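
To check the result from a script, it is enough to look at entry 1, which is the most recent test. A minimal sketch, assuming the log format shown above; the function name latest_selftest_passed is made up.

```shell
# Hypothetical helper: read a smartctl self-test log on stdin.
# Succeeds (exit status 0) only if entry "# 1", the most recent
# test, is present and says "Completed without error".
latest_selftest_passed() {
    awk '/^# 1 / { found = 1; ok = ($0 ~ /Completed without error/) }
         END { exit !(found && ok) }'
}
```

It could be used like this: sudo smartctl --log=selftest -d 3ware,1 /dev/twe0 | latest_selftest_passed && echo "last self-test passed"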