Hello All,

I am seeing some really slow performance with large files on Linux. I write a lot of data points from a light sensor. The stream is about 53 Mb/s and I need to keep this rate for 7 minutes, for a total of about 22 Gb. I can sustain 53 Mb/s pretty well until the file grows to over 1 Gb or so, then things hit the wall and the writes to the filesystem can't keep up. The writes go from 20 ms in duration to 500 ms. I assume the filesystem/operating system is caching writes. Do you have any suggestions on how to speed up these writes: filesystem options, kernel options, other strategies, etc.?

Things I have tried:

- I have tried this on an ext3 filesystem as well as an XFS filesystem, with the same result.
- I have also tried spooling over several files (a la multiple volumes), but I see no difference in performance. In fact, I think this actually hinders performance a bit.
- I keep my own giant memory buffer where all the data is stored, and it is then written to disk in a background thread. This helps, but I run out of space in the buffer before I finish taking data.

Thanks,

-- Brad
==> On Wed, 16 May 2007 14:53:16 +0000, brad noyes <maitre@ccs.neu.edu> said:

brad> Hello All,
brad> I am seeing some really slow performance regarding large files on linux. I
brad> write a lot of data points from a light sensor. The stream is about 53 Mb/s and
brad> i need to keep this rate for 7 minutes, that's a total of about 22Gb. I
brad> can sustain 53Mb/s pretty well until the file grows to over 1Gb or so, then
brad> things hit the wall and the writes to the filesystem can't keep up. The writes
brad> go from 20ms in duration to 500ms. I assume the filesystem/operating system
brad> is caching writes. Do you have any suggestions on how to speed up performance
brad> on these writes, filesystem options, kernel options, other strategies, etc?

Of course. Your data set is larger than the page cache, so when you hit the low watermark, it starts write-back. You can deal with this a few different ways, and I'll throw out the easiest ways first:

1) Get more memory
2) Get a faster disk

If those are not options, then you can tweak your application by using AIO and O_DIRECT. This will allow you to drive your disk queue depths a bit further and avoid the page cache. Check the man pages for io_setup, io_submit, and io_getevents to get started.

brad> Things I have tried:
brad> - I have tried this on a ext3 file system as well as an xfs filesystem
brad>   with the same result.

You may not want to use a journalled file system. If you must, though, with ext3 you could try running with the data=writeback option.

brad> - I have also tried spooling over several files (a la multiple volumes)
brad>   but i see no difference in performance. In fact, i think this actually
brad>   hinders performance a bit.

I'm not sure I fully understand what you mean. Are you saying you write to separate physical volumes, and that you don't see any performance increase from doing so?

brad> - I keep my own giant memory buffer where all the data is stored and
brad>   then it is written to disk in a background thread. This helps, but
brad>   i run out of space in the buffer before i finish taking data.

Right, this is exactly what happens in the OS. ;)

Speaking of which, you don't mention which kernel you are using. Could you please provide that information? There are a few vm tunables that you could try tweaking, but I really don't think they will help if your data set is larger than memory. We can explore that option, though, if you like.

For now, my suggestion is to try using AIO with the open flag O_DIRECT. This will require you to align your data on 512 byte boundaries (and the size of the I/Os has to be a multiple of 512 as well). If you need any help converting your app, feel free to contact me off-list.

-Jeff

p.s. In your head, is Mb Megabit or Megabyte?
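For readers following the thread, a minimal sketch of the synchronous O_DIRECT-with-aligned-buffers approach Jeff describes might look like the C below. The file name, chunk size, and loop count are placeholders rather than anything posted here, and error handling is kept to a minimum.

/* Minimal sketch, not from this thread: synchronous writes with O_DIRECT.
 * Both the buffer address and the I/O size must be multiples of the
 * 512-byte sector size, hence posix_memalign(). Build: gcc -O2 -o direct direct.c */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (16 * 1024 * 1024)    /* 16 MiB, a multiple of 512 */

int main(void)
{
    void *buf;
    ssize_t n;
    int fd, i;

    /* O_DIRECT bypasses the page cache; misaligned I/O fails with EINVAL. */
    fd = open("data.out", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (posix_memalign(&buf, 512, CHUNK) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0, CHUNK);          /* stand-in for sensor data */

    for (i = 0; i < 64; i++) {      /* 64 x 16 MiB = 1 GiB of writes */
        n = write(fd, buf, CHUNK);
        if (n != CHUNK) { perror("write"); return 1; }
    }

    free(buf);
    close(fd);
    return 0;
}

If either the buffer address or the I/O size is not a multiple of 512, an O_DIRECT write will typically fail with EINVAL instead of transferring any data.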
-----Original Message-----
From: Jeff Moyer
p.s. In your head, is Mb Megabit or Megabyte?
I always thought that lower-case b meant bit and upper-case B meant byte, but the rest of the world doesn't seem to care what I think.

-- Rich
On Wed, May 16, 2007 at 11:38:09AM -0400, Jeff Moyer wrote:
==> On Wed, 16 May 2007 14:53:16 +0000, brad noyes <maitre@ccs.neu.edu> said:
brad> Hello All,
brad> I am seeing some really slow performance regarding large files on linux. I
brad> write a lot of data points from a light sensor. The stream is about 53 Mb/s and
brad> i need to keep this rate for 7 minutes, that's a total of about 22Gb. I
brad> can sustain 53Mb/s pretty well until the file grows to over 1Gb or so, then
brad> things hit the wall and the writes to the filesystem can't keep up. The writes
brad> go from 20ms in duration to 500ms. I assume the filesystem/operating system
brad> is caching writes. Do you have any suggestions on how to speed up performance
brad> on these writes, filesystem options, kernel options, other strategies, etc?
Of course. Your data set is larger than the page cache, so when you hit the low watermark, it starts write-back. You can deal with this a few different ways, and I'll throw out the easiest ways first: 1) Get more memory 2) Get a faster disk
Ha :). I have 12GB of memory. Which actually brings me to another question: how do I alter the per-process memory limit? I can only allocate a memory buffer that is 3GB. I'd like to make use of the other 8GB left in the machine. If I can double my buffer size, I think I could sustain the 53MB/s for the 7 minutes that I need.
If those are not options, then you can tweak your application by using AIO and O_DIRECT. This will allow you to drive your disk queue depths a bit further and avoid the page cache. Check the man pages for io_setup, io_submit, and io_getevents to get started.
I'll check out these options and man pages.
brad> Things I have tried:
brad> - I have tried this on a ext3 file system as well as an xfs filesystem
brad>   with the same result.
You may not want to use a journalled file system. If you must, though, with ext3 you could try running with the data=writeback option.
Yup, I'll check this option out.
brad> - I have also tried spooling over several files (a la multiple volumes)
brad>   but i see no difference in performance. In fact, i think this actually
brad>   hinders performance a bit.
I'm not sure I fully understand what you mean. Are you saying you write to separate physical volumes,
Not physical volumes, but different files. By the end of the data acquisition I will end up with the files data.01, data.02, data.03, etc. Each file is 1GB in size, or whatever I set the limit to be. The reason I did this is that I thought that as the file grows larger there are several layers of indirection in the inode to get to the actual data blocks on disk, and perhaps that might hinder performance.
and that you don't see any performance increase from doing so?
Correct. I don't see any improvement, at least no measurable performance improvement at the kind of rates I'm dealing with.
brad> - I keep my own giant memory buffer where all the data is stored and
brad>   then it is written to disk in a background thread. This helps, but
brad>   i run out of space in the buffer before i finish taking data.
Right, this is exactly what happens in the OS. ;) Speaking of which, you don't mention which kernel you are using. Could you please provide that information? There are a few vm tunables that you could try tweaking, but I really don't think they will help if your data set is larger than memory. We can explore that option, though, if you like.
I'm using the 2.6.20 kernel from the Ubuntu source tree. I recompiled it to get the large memory support, up to 64GB. I was looking for some tunable vm options in sysctl, but I didn't see much that made sense to me. If nothing else helps, perhaps I will ask about the vm options.
p.s. In your head, is Mb Megabit or Megabyte?
The latter. Jamie already pointed this typo out to me :). Perhaps this time around my unit abbreviations are correct.

Thanks for your input. I'll keep the list posted.

-- Brad
==> On Wed, 16 May 2007 16:26:53 +0000, brad noyes <maitre@ccs.neu.edu> said:

brad> Ha :). I have 12GB of memory. Which actually brings me to another
brad> question. How do i alter the per-process memory limit? I can only
brad> allocate a memory buffer that is 3GB. I'd like to make use of the
brad> other 8GB left in the machine. If i can double my buffer size i think
brad> i could sustain the 53MB/s for 7 minutes that i need.

I'm guessing that you're using a 32 bit system, is that right? 32 bit systems have the 3/1 memory split, unless you're using Ingo's 4/4 split patch (which Ubuntu doesn't carry, I think).

brad> i'm using the 2.6.20 kernel from the ubuntu source tree. I recompiled
brad> it to get the large memory support, up to 64GB.

OK, yes, 32 bit system.

brad> I was looking for some tunable vm options in sysctl, but i didn't see
brad> much that made sense to me. If nothing else helps perhaps i will ask
brad> about the vm options.

Look under /proc/sys/vm. Documentation for these variables might be in Documentation/filesystems/proc.txt (it's not always up-to-date). But, as I said, I don't think this is the right avenue to explore. You can get more predictable results by using AIO+O_DIRECT (or maybe even O_SYNC as another mentioned).

-Jeff
==> On Wed, 16 May 2007 12:59:54 -0400, Jeff Moyer <jmoyer@redhat.com> said:

Jeff> Look under /proc/sys/vm. Documentation for these variables might be
Jeff> in Documentation/filesystems/proc.txt (it's not always up-to-date).
Jeff> But, as I said, I don't think this is the right avenue to explore.
Jeff> You can get more predictable results by using AIO+O_DIRECT (or maybe
Jeff> even O_SYNC as another mentioned).

One other thing worth mentioning is that you should be doing I/O in large block sizes. What size are you currently using for your write buffers?

-Jeff
On Wed, May 16, 2007 at 02:40:38PM -0400, Jeff Moyer wrote:
==> On Wed, 16 May 2007 12:59:54 -0400, Jeff Moyer <jmoyer@redhat.com> said:
Jeff> Look under /proc/sys/vm. Documentation for these variables might be
Jeff> in Documentation/filesystems/proc.txt (it's not always up-to-date).
Jeff> But, as I said, I don't think this is the right avenue to explore.
Jeff> You can get more predictable results by using AIO+O_DIRECT (or maybe
Jeff> even O_SYNC as another mentioned).
One other thing worth mentioning is that you should be doing I/O in large block sizes. What size are you currently using for your write buffers?
I'm writing in 16777216-byte chunks. That happens to be evenly divisible by 512 for the O_DIRECT flag. However, every time I try to use that flag the file gets created, but nothing gets written. I've been looking online for an example.

I don't know if this means anything, but I ran:

hdparm -T /dev/sdb1
 Timing cached reads:   1369 MB in  2.00 seconds = 698.14 MB/sec

hdparm -T --direct /dev/sdb1
 Timing O_DIRECT cached reads:   136 MB in  2.00 seconds = 66.54 MB/sec

It really seems that 53 MB/s shouldn't be hard. I have fairly heavy hardware, SCSI-320 in a RAID1 configuration.

-- Brad
brad noyes wrote:
On Wed, May 16, 2007 at 02:40:38PM -0400, Jeff Moyer wrote:
==> On Wed, 16 May 2007 12:59:54 -0400, Jeff Moyer <jmoyer@redhat.com> said:
Jeff> Look under /proc/sys/vm. Documentation for these variables might be
Jeff> in Documentation/filesystems/proc.txt (it's not always up-to-date).
Jeff> But, as I said, I don't think this is the right avenue to explore.
Jeff> You can get more predictable results by using AIO+O_DIRECT (or maybe
Jeff> even O_SYNC as another mentioned).
One other thing worth mentioning is that you should be doing I/O in large block sizes. What size are you currently using for your write buffers?
I'm writing in 16777216-byte chunks. That happens to be evenly divisible by 512 for the O_DIRECT flag. However, every time I try to use that flag the file gets created, but nothing gets written. I've been looking online for an example.

I don't know if this means anything, but I ran:

hdparm -T /dev/sdb1
 Timing cached reads:   1369 MB in  2.00 seconds = 698.14 MB/sec

hdparm -T --direct /dev/sdb1
 Timing O_DIRECT cached reads:   136 MB in  2.00 seconds = 66.54 MB/sec

It really seems that 53 MB/s shouldn't be hard. I have fairly heavy hardware, SCSI-320 in a RAID1 configuration.
-- Brad
That is a CACHED read; your uncached write will be slower. Run a bonnie++ benchmark on your disks. You might want to switch to RAID0 or RAID1+0. And use a 64-bit kernel to be able to allocate more than 3GB of RAM.

--
Karl Hiramoto
http://karl.hiramoto.org/
==> On Wed, 16 May 2007 19:19:45 +0000, brad noyes <maitre@ccs.neu.edu> said:

brad> i'm writing in 16777216 byte chunks. That happens to be evenly divisible
brad> by 512 for the O_DIRECT flag. However every time i try to use that flag
brad> the file gets created, but nothing gets written. I've been looking online
brad> for an example.

Are you aligning your buffers on a 512 byte boundary? Man posix_memalign.

brad> I don't know if this means anything, but i ran
brad> hdparm -T /dev/sdb1
brad>  Timing cached reads:   1369 MB in  2.00 seconds = 698.14 MB/sec
brad> hdparm -T --direct /dev/sdb1
brad>  Timing O_DIRECT cached reads:   136 MB in  2.00 seconds = 66.54 MB/sec
brad> It really seems that 53MB/s shouldn't be hard. I have fairly heavy
brad> hardware, scsi320 in a raid1 configuration.

So long as nothing else is going on at the time. =)

-Jeff
"brad" == brad noyes <maitre@ccs.neu.edu> writes:
brad> I am seeing some really slow performance regarding large files
brad> on linux. I write a lot of data points from a light sensor. The
brad> stream is about 53 Mb/s and i need to keep this rate for 7
brad> minutes, that's a total of about 22Gb. I can sustain 53Mb/s
brad> pretty well until the file grows to over 1Gb or so, then things
brad> hit the wall and the writes to the filesystem can't keep up. The
brad> writes go from 20ms in duration to 500ms. I assume the
brad> filesystem/operating system is caching writes. Do you have any
brad> suggestions on how to speed up performance on these writes,
brad> filesystem options, kernel options, other strategies, etc?

You've already had a good bunch of suggestions, but I've got some questions on your hardware.

- cpu?
- memory - 12gb I know
- disk(s)
- RAID setup at all?

One way to get more performance would be to add another disk or two and to stripe your data between them. This assumes you have enough PCI bus bandwidth available as well.

You don't say how you're capturing the light sensor data, but it's obviously not over a serial port or some other slow device. Network? So if you've got 53 Mbytes/second coming into the system, and another 53 Mbytes/second writing out to disk, then you're starting to get close to the 132 Mbytes/sec bandwidth of the PCI bus.

Finding a motherboard with two or more PCI busses would help. Or something with PCI-E busses. It all depends on your budget and the data acquisition tool you're using.

John
On Wed, May 16, 2007 at 03:19:17PM -0400, John Stoffel wrote:
"brad" == brad noyes <maitre@ccs.neu.edu> writes:
brad> I am seeing some really slow performance regarding large files
brad> on linux. I write a lot of data points from a light sensor. The
brad> stream is about 53 Mb/s and i need to keep this rate for 7
brad> minutes, that's a total of about 22Gb. I can sustain 53Mb/s
brad> pretty well until the file grows to over 1Gb or so, then things
brad> hit the wall and the writes to the filesystem can't keep up. The
brad> writes go from 20ms in duration to 500ms. I assume the
brad> filesystem/operating system is caching writes. Do you have any
brad> suggestions on how to speed up performance on these writes,
brad> filesystem options, kernel options, other strategies, etc?
You've already had a good bunch of suggestions, but I've got some questions on your hardware.
I don't mind answering more questions :).
- cpu?

Dual Xeons (not sure if they are hyperthreaded or dual core):
  cpu MHz    : 3067.044
  cache size : 512 KB

- memory - 12gb I know
- disk(s)

SCSI hard drives; I believe they are SCSI-320. I have tried them in a RAID1 as well as standalone.

- RAID setup at all?

I tried using RAID1. I'm afraid to try RAID0 because the data is pretty vital, but I may try it.
One way to get more performance would be to add another disk or two and to stripe your data between them. This assumes you have enough PCI bus bandwidth available as well. You don't say how you're capturing the light sensor data, but it's obviously not over a serial port or some other slow device. Network? So if you've got 53 Mbytes/second coming into the system, and another 53 Mbytes/second writing out to disk, then you're starting to get close to the 132 Mbytes/sec bandwidth of the PCI bus.
Good point. It's not network, but it is still on the PCI bus. It doesn't seem that bandwidth is really the problem, because it works great for the first minute. Once the file grows past 1GB the writes are extremely slow. Like I said above, I might try the RAID0 configuration.
Finding a motherboard with two or more PCI busses would help. Or something with PCI-E busses. It all depends on your budget and the data acquisition tool you're using.
If I were to redesign this system I would really like to use PXI, which was meant for this sort of thing. I'm really retrofitting my design to do something different than intended.

Thanks,

-- Brad
--- brad noyes <maitre@ccs.neu.edu> wrote:
Do you have any suggestions on how to speed up performance on these writes, filesystem options, kernel options, other strategies, etc?
How are you doing the writes, using fwrite(3), write(2), or mmap(2)? I've seen dramatic speedups reading large files using memory-mapped I/O.
- I have also tried spooling over several files (a la multiple volumes) but i see no difference in performance. In fact, i think this actually hinders performance a bit.
Do you mean multiple physical spindles or drives here? (I'd expect a slow down using multiple partitions on the same physical drive.)
- I keep my own giant memory buffer where all the data is stored and then it is written to disk in a background thread. This helps, but i run out of space in the buffer before i finish taking data.
With memory-mapped I/O you could just maintain the buffer and have the kernel page it out to disk for you as needed; no need to shuffle stuff around with a background thread.

--Andre
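A rough sketch of the memory-mapped approach Andre suggests is below; the file name and sizes are illustrative only, and a full 22GB capture on a 32-bit kernel could not be mapped in one piece, so in practice it would be mapped and unmapped in windows.

/* Sketch only: stream data through a memory-mapped file and let the
 * kernel's writeback push dirty pages to disk. Sizes are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t total = 1UL << 30;          /* 1 GiB for illustration */
    const size_t chunk = 16 * 1024 * 1024;   /* per-buffer copy size */
    size_t off = 0;
    char *map, *sample;
    int fd;

    fd = open("data.mmap", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* The file must be extended to its final size before it is mapped. */
    if (ftruncate(fd, total) < 0) { perror("ftruncate"); return 1; }

    map = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    sample = malloc(chunk);                  /* stand-in for acquired data */
    if (!sample) return 1;
    memset(sample, 0, chunk);

    /* Copy incoming sensor buffers straight into the mapping. */
    while (off + chunk <= total) {
        memcpy(map + off, sample, chunk);
        off += chunk;
    }

    msync(map, total, MS_SYNC);              /* flush remaining dirty pages */
    munmap(map, total);
    free(sample);
    close(fd);
    return 0;
}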
All,

Here's a summary of what I have tried so far. I have tried using the O_DIRECT and O_SYNC flags with open(2), and I have been using a non-journalling filesystem, ext2. None of these options have yielded any better performance at the rates I'm looking for. I have not tried AIO.

I don't think I can use a non-32-bit kernel on the Xeons. Correct me if I'm wrong (it's happened before).

As a temporary solution I created a memory filesystem:

$> mount -t tmpfs tmpfs -o size=6G /data/memory

The writes are fast, however I need another process to progressively move data out of the memory filesystem, since I don't have enough memory to hold all 22GB of data I need.

For those interested, I have attached a small program which will use different options to open and profile the fwrite calls. To compile, simply do:

$> sh ./testwrite.c
$> ./testwrite -h
$> ./testwrite -l 16 file.out

I just wanted to keep you all updated. Thanks for all your help. I'll keep trying various ideas. I wish I could make it to the meeting to buy you all pizza for your help. Perhaps next month.

Thanks,

-- Brad
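For the "progressively move data out of the memory filesystem" part, a small mover loop could look roughly like the sketch below. The paths, the ".done" naming convention for finished chunks, and the one-second poll are assumptions for illustration, not details from Brad's setup.

/* Sketch only: drain completed capture chunks from a tmpfs staging area
 * to the real disk so the RAM-backed filesystem never fills up. */
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define SRC "/data/memory"      /* the tmpfs mount from the message above */
#define DST "/data/disk"        /* destination on the real filesystem (assumed) */
#define BUFSZ (1 << 20)         /* 1 MiB copy buffer */

static int copy_file(const char *from, const char *to)
{
    static char buf[BUFSZ];
    int in = open(from, O_RDONLY);
    int out = open(to, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    ssize_t n = 0;

    if (in < 0 || out < 0) {
        if (in >= 0) close(in);
        if (out >= 0) close(out);
        return -1;
    }
    while ((n = read(in, buf, BUFSZ)) > 0)
        if (write(out, buf, n) != n) { n = -1; break; }
    close(in);
    close(out);
    return n < 0 ? -1 : 0;
}

int main(void)
{
    for (;;) {                              /* runs alongside the acquisition */
        DIR *d = opendir(SRC);
        struct dirent *e;

        if (!d) return 1;
        while ((e = readdir(d)) != NULL) {
            char from[512], to[512];

            /* Only move chunks the writer has finished and renamed to *.done */
            if (!strstr(e->d_name, ".done"))
                continue;
            snprintf(from, sizeof(from), "%s/%s", SRC, e->d_name);
            snprintf(to, sizeof(to), "%s/%s", DST, e->d_name);
            if (copy_file(from, to) == 0)
                unlink(from);               /* free the tmpfs space */
        }
        closedir(d);
        sleep(1);
    }
}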
==> On Wed, 16 May 2007 22:19:14 +0000, brad noyes <maitre@ccs.neu.edu> said:

brad> here's a summary of what i have tried so far. I have tried using the O_DIRECT
brad> and O_SYNC flags with open(2) and i have been using a non-journalling
brad> filesystem, ext2. None of these options have yielded any better performance at
brad> the rates i'm looking for.
brad> I have not tried AIO.

AIO will allow you to drive your disk subsystem better than synchronous writes; I really think this would solve your problem. For an example of an AIO app, you can google for aio-stress.c. Or, send me some code and I'll help get you started.

-Jeff
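A bare-bones example of the Linux AIO interface Jeff points at (the io_setup/io_submit/io_getevents calls provided by libaio) might look like the following sketch, which submits a single O_DIRECT write and waits for its completion. The file name and sizes are placeholders.

/* Sketch only: one asynchronous O_DIRECT write via libaio (build with -laio).
 * A real capture loop would keep several writes in flight at once. */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (16 * 1024 * 1024)    /* multiple of 512, as O_DIRECT requires */

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    void *buf;
    int fd;

    fd = open("data.aio", O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (posix_memalign(&buf, 512, CHUNK) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0, CHUNK);          /* stand-in for sensor data */

    if (io_setup(32, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    io_prep_pwrite(&cb, fd, buf, CHUNK, 0);   /* CHUNK bytes at offset 0 */
    if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }

    /* Block until the write completes; ev.res holds bytes written or -errno. */
    if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) {
        fprintf(stderr, "io_getevents failed\n");
        return 1;
    }
    printf("completed: %ld bytes\n", (long)ev.res);

    io_destroy(ctx);
    free(buf);
    close(fd);
    return 0;
}

Keeping several such requests in flight at once is what lets the disk queue stay busy while the application continues acquiring data, which is the queue-depth benefit Jeff mentions.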
participants (6):

- Andre Lehovich
- brad noyes
- Jeff Moyer
- John Stoffel
- Karl Hiramoto
- Klein, Richard