If you want a simple mirror, you can't get simpler than mdadm with xfs on top of it.
Now, is that your best option? I'm not sure. It *sounds* like your application is going to become I/O bound really quickly.

To John's point, I'd like to understand what you're doing better.

One option you might want to investigate is something like a HighPoint NVMe card with a bunch of fast NVMe drives on it. I have such a setup with four NVMe drives in a RAID0 (mdadm), and it'll do 800k IOPS.

Tim.

On Fri, Jan 22, 2021 at 5:01 PM John Stoffel via WLUG <wlug@lists.wlug.org> wrote:

Michael> In Linux I can do software RAID using LVM or with MD (i.e.,
Michael> mdadm) as a basis.  I'm currently thinking of a simple mirror
Michael> of two conventional SATA disks.

Michael> The machine is being built to do a compute-workload involving
Michael> examination of many small(ish?) files, and will not be a
Michael> desktop or a recreational/gaming machine.  I don't know
Michael> anything about the files or how they'll be examined.

Will the entire working set be local to the machine, or will the files
be stored on a NAS and copied to the system for local processing?

Since it's a dedicated compute box, if I knew I could rebuild it
easily and I needed maximum local disk performance, I'd be tempted to
set up a RAID0 stripe across the two disks.  But... with SATA SSDs, I'd
probably just go with a RAID1 mirror using mdadm, then set up LVM on
top of the /dev/md0 device to create my local filesystems.
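A minimal sketch of that layering (mirror, then LVM, then a filesystem), assuming two blank disks at /dev/sda and /dev/sdb; the device names, volume group name, sizes, and mount point here are all illustrative:

```shell
# Create the RAID1 mirror from the two disks.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb

# Put LVM on top of the md device.
pvcreate /dev/md0
vgcreate vg_data /dev/md0
lvcreate -L 100G -n lv_work vg_data

# Filesystem of choice on the logical volume, then mount it.
mkfs.xfs /dev/vg_data/lv_work
mount /dev/vg_data/lv_work /srv/work

# Record the array so it assembles at boot.
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
```

Leaving some free space in the volume group is handy later for snapshots or growing the filesystem.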

Michael> How would YOU setup a simple mirror in whatever Linux you
Michael> use?  Why do you prefer your selection?

I prefer mdadm on the bottom, then lvm on top, then ext4/xfs on top of
that.

But... in this case, the workload will impact the design.  One thing
to keep in mind is that directories with lots and lots of files will
have scaling problems past a certain point.  The old NNTP news spool
servers used to set up a directory structure with three, four, or more
directory layers so they didn't end up with too many files in any one
directory.
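That news-spool trick can be sketched as a small shell function; a hypothetical example that shards files into a two-level tree keyed on the first hex characters of an md5 of the filename (the "spool" prefix and two-character fan-out are arbitrary choices):

```shell
# Map a filename to a sharded path like spool/ab/cd/<name>, so no
# single directory accumulates millions of entries.
shard_path() {
    name=$1
    # First four hex characters of the md5 of the name.
    h=$(printf '%s' "$name" | md5sum | cut -c1-4)
    d1=$(printf '%s' "$h" | cut -c1-2)
    d2=$(printf '%s' "$h" | cut -c3-4)
    printf 'spool/%s/%s/%s\n' "$d1" "$d2" "$name"
}

# Usage: create the directory and move the file into its shard.
# p=$(shard_path article-123.txt)
# mkdir -p "$(dirname "$p")" && mv article-123.txt "$p"
```

Hashing spreads files evenly regardless of naming patterns, which sequential names (article-1, article-2, ...) would otherwise defeat.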

Now, if all the data is local, will you be doing backups?  Can you do
backups of just intermediate results?  Does it all need to be backed
up?

Again, if there will be lots of IO, going with SSDs will be best;
otherwise each spinning disk will limit you to 100-120 IOPS.  SSDs
handle it so much better.

But of course... under sustained load, some SSDs will slow down to a
crawl as they hit their internal cache limits and have to start doing
garbage collection while still writing.
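One way to check for that behaviour on a candidate drive is a long, time-based random-write run with fio; a hypothetical job file (the path, size, and runtime are placeholders to adjust for your setup):

```ini
; Sustained random-write test: watch whether IOPS sag as the
; drive's cache fills and garbage collection kicks in.
[global]
ioengine=libaio
direct=1
bs=4k
time_based=1
runtime=600

[sustained-randwrite]
rw=randwrite
iodepth=32
size=8g
filename=/mnt/test/fio.dat
```

If the reported IOPS for the last few minutes are far below the first minute, you've found the drive's steady-state limit, which is the number that matters for a long-running workload.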

But in general, your question didn't give us enough info to give you a
good answer.  Maybe try it top down?  I.e.:

    I've got an application which needs to process hundreds (thousands?
    millions?) of small files which are downloaded, parsed with a compute
    light/heavy process, and intermediate/final results then saved for
    further processing.

Describe the problem you're trying to solve, without assuming a
design.

John
_______________________________________________
WLUG mailing list -- wlug@lists.wlug.org
To unsubscribe send an email to wlug-leave@lists.wlug.org
Create Account: https://wlug.mailman3.com/accounts/signup/
Change Settings: https://wlug.mailman3.com/postorius/lists/wlug.lists.wlug.org/
Web Forum/Archive: https://wlug.mailman3.com/hyperkitty/list/wlug@lists.wlug.org/message/S7MPWBIUJIFSFQ6AWXDWJAC6GEGWVO52/


--
I am leery of the allegiances of any politician who refers to their constituents as "consumers".