Category: PostgreSQL Help
Any RAID striping recommendations?

Okay, we are setting up a system with about 70% random reads and 30% writes; it may even be an 80/20 mix. So you could say it is primarily an OLTP environment with some characteristics of DSS. (For example, most user queries return, say, 200 records.)

Our database RAID is 4 drives going through 1 card. I know I'd rather have 4 cards, but unfortunately we have to use only 1. Initially, I set up my FreeBSD system with:

OS block size = 16K
database block size = 16K
RAID5 stripe size = 16K

Reasoning for the large 16K size was to help balance between many small concurrent transactions, and those large queries that can return 50-1000 records at once.
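
In case it's useful, here's roughly how those sizes get set on my end. This is just a sketch: the device name is a placeholder, and the exact header that defines BLCKSZ depends on the PostgreSQL version.

# UFS block/fragment size is fixed at newfs time; 16K blocks with the
# usual 8:1 block:fragment ratio (device name is a placeholder)
newfs -b 16384 -f 2048 /dev/da0s1e

# PostgreSQL's block size (BLCKSZ) is a compile-time #define in the source
# tree, so a 16K database block means rebuilding postgres with BLCKSZ 16384.

# The RAID5 stripe size itself is set in the controller's setup utility.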

Questions:

1. Do you believe 16K is too big? Would it be better to drop down to 8K?

2. Would increasing the stripe size of the RAID to a multiple of the database block size have any advantages? I believe it would if we had 4 cards, but since we have only 1, I don't think it buys us anything.

3. We can configure our RAID as RAID10 (0/1), but it defaults to a 64K stripe size (32K per RAID1 array within the RAID10), so I figured that was too large a stripe, especially if we want a smaller database block size.

So right now I am leaning towards an 8K block size with an 8K RAID5 stripe.

Here is a really great in-depth article on tuning UFS filesystems. If anybody knows engineering and fine tuning, it is the Germans!

http://www.physik.rwth-aachen.de/docs/html/AQ0R3FTE/CHXXXXXX.HTM



5.4.3.3 Choosing the Correct Stripe Width
The performance benefit of striping depends on the size of the stripe width and the characteristics of the I/O load. Stripes of data are allocated alternately and evenly to the subdisks of a striped plex. A striped plex consists of a number of equal-sized subdisks located on different disks.

The number of blocks in a stripe determines the stripe width. LSM uses a default stripe width of 64 KB (or 128 sectors), which works well in most environments.

Use the volstat command to determine the number of data transfer splits. For volumes that receive only small I/O transfers, you may not want to use striping because disk access time is important. Striping is beneficial for large data transfers.

To improve performance of large sequential data transfers, use a stripe width that will divide each individual data transfer and distribute the blocks equally across the disks.

To improve the performance of multiple simultaneous small data transfers, make the stripe width the same size as the data transfer. However, an excessively small stripe width can result in poor system performance.

If you are striping mirrored volumes, ensure that the stripe width is the same for each plex.
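
To relate that to our 4-drive array, here's a quick back-of-the-envelope sh calculation (the offsets and sizes are just examples) showing which drive a given byte offset lands on with a 64K stripe unit, and why a big sequential transfer spreads across all the spindles while a small random read touches only one:

# which drive (0-3) serves a given byte offset, with a 64K stripe unit
# across 4 drives -- purely illustrative arithmetic
unit=65536
drives=4
for offset in 0 65536 131072 196608 262144; do
    echo "offset $offset -> drive $(( (offset / unit) % drives ))"
done
# a 256K sequential read (offsets 0..262143) touches all 4 drives in parallel;
# an 8K random read fits inside a single 64K stripe unit, so it touches just 1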

Reasoning for the large 16K size was to help balance between many small concurrent transactions, and those large queries that can return 50-1000 records at once

Um, the size of the result set that is returned will have no bearing on the stripe width you need. The db engine has to read an entire index (assuming that your query provides a means to use an index) to return the results, and it has to read the entire data file to pick out those records that it should return.

I think you'll want to look at the size of your files, not worry about the record set size. Then again, I'm not a RAID expert...
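
If it helps, a rough way to eyeball the actual file sizes (the data directory path and database name below are just placeholders for a typical install):

# how big are the table and index files on disk?
du -k /usr/local/pgsql/data/base/*/* | sort -n | tail

# or ask the backend itself: relpages (as of the last VACUUM/ANALYZE) is the
# on-disk size in BLCKSZ-sized pages
psql mydb -c "SELECT relname, relpages FROM pg_class ORDER BY relpages DESC LIMIT 10;"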

Thanks for the reply rodk. :)

Well, once again keeping with my bad habit of mooting my own questions, it seems that the 64K stripe size is optimal for our particular card, and not only that, we can use RAID10 at that stripe size.

A lot of the recent reading I have been doing says that most cards seem to do well at a 64KB stripe size.

I am leaving everything at the defaults (except for FreeBSD 4.5, which defaults to 16K block sizes):

RAID10 stripe = 64K
postgres block size = 8K
FreeBSD block size changed from 16K to 8K
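
For anyone following along, a couple of quick sanity checks on what actually got built. The device name and data directory below are placeholders, and pg_controldata may not be present on every PostgreSQL release:

# confirm the UFS block/fragment size on the data filesystem
dumpfs /dev/da0s1e | grep size

# confirm the block size PostgreSQL was compiled with
pg_controldata /usr/local/pgsql/data | grep -i "block size"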

I am curious. Why are you choosing RAID 1/0 over 5?

Ted,

I'm curious about the other specs on your system. (Sounds like a kicking system ;)). Can you tell us the overall specs?

Any high-availability system that combines PostgreSQL and FreeBSD piques my interest, since I consider these to be two of the best things ever to come out of open source. Let us know how it goes.

And I think your RAID choice is right, to a point. Yes, RAID 10 is apparently the most "correct" choice for databases, since it optimizes for reads AND writes. However, you are setting things up so your whole system depends on a single point of failure: the SCSI RAID card itself.

I am very interested in the possibility of using Vinum (http://www.freebsd.org/doc/en_US.ISO8859-1/articles/vinum/index.html) (FreeBSD software RAID) for failover/HA systems, since it lets you stack multiple drives in parallel without needing a single device to connect them to the rest of the server. Have you considered this? I know historically software RAID is supposed to be inferior, but there are cases for it, and FreeBSD is an especially good case because it is so stable and uses CPU resources so efficiently. Also, Vinum could act as a "meta-RAID", binding together multiple RAID cards into one or more logical partitions, so the system could potentially keep working even after a full RAID card crash.

Vinum looks to me like a very elegant solution to many high-availability problems, but I wonder whether it's still in active development. The http://www.vinumvm.org/ page shows some documents not updated for more than a year. Anyone else here dealt with Vinum?
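
For anyone who hasn't looked at it yet, a Vinum mirror is just a short description file handed to its create command. A rough sketch, with made-up drive names, devices, and sizes; check vinum(8) and the Handbook for the exact syntax on your release:

# write a two-way mirror description (illustrative only)
cat > /tmp/vinum.mirror.conf <<'EOF'
drive d1 device /dev/da1s1h
drive d2 device /dev/da2s1h
volume mirror
  plex org concat
    sd length 512m drive d1
  plex org concat
    sd length 512m drive d2
EOF
# then hand the file to vinum's create command (see the Handbook for the
# exact invocation)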

rodk,

We chose RAID10 over RAID5 firstly because of the fault tolerance. A drive in a RAID10 can go down and there will be no loss of performance. When a drive goes down in a RAID5, rebuilding can be tricky, and performance will be really bad until the new drive comes back online. A lot of manufacturers have gotten around this, though, with hot-swap/hot-spare technology.

Secondly, the rebuild time on the RAID10 is about twice as fast: our testing so far puts the RAID10 rebuild at about 2 hours, versus around 4 for the RAID5.

Another reason for choosing the RAID10 was that it is faster than RAID5, and it is recommended as the RAID of choice by most database vendors. RAID5 has a history of being especially poor in write performance, since every small write turns into a read-modify-write of both the data block and the parity block.
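
To put rough numbers on that last point, here's the classic small-write penalty arithmetic (simplified, and ignoring controller write-back caching):

# back-end I/Os needed per small random write:
#   RAID5:  read old data + read old parity + write new data + write new parity = 4
#   RAID10: write the block to both halves of the mirror = 2
writes_per_sec=100
echo "RAID5  back-end I/Os: $(( writes_per_sec * 4 ))"   # 400
echo "RAID10 back-end I/Os: $(( writes_per_sec * 2 ))"   # 200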

Here's a pretty good article from Oracle on choosing a RAID and a stripe size. They recommend 32K or 64K striping for OLTP.

http://www.oracle.com/oramag/oracle/00-May/index.html?o30avail.html

rycamor,

You are correct in that our 1 card is a single point of failure. I would like to get around this, but unfortunately we have some cost constraints for this particular machine. It's a modest machine, but if it can prove itself, I believe I can get the people with the money to spring for some faster hardware, especially some SCSI cards and drives.

Our server specs:

Dual P3 933MHz
512 MB memory (will be upgraded to 1 GB)
133 MHz bus
Duplidisk IDE RAID1, 40GB
Adaptec ATA100 RAID10, 80GB (SCSI is always a better choice on FreeBSD, though)

Software:
FreeBSD
postgres
Apache
PHP

You make a good point about Vinum. For the above server, we are going to use a remote backup server. (Not a failover server, but one that holds backup copies of the database). It would be nice to throw together a couple of disks into a RAID1 just to be safe.









