Lustre solves the problem by getting out of the way.  The metadata server simply tells the client which nodes hold the pieces of the file.  From what I've read, split brain can't happen with Lustre because the moment any of the metadata controllers goes offline, the whole thing stops.
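For example, you can ask Lustre directly which object storage targets hold a file's stripes with the lfs tool (the mount point and file name here are just placeholders):

```shell
# Show the stripe layout of a file: stripe count, stripe size,
# and which OSTs hold the pieces.  Path is a made-up example.
lfs getstripe /mnt/lustre/somefile
```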

With ZFS, if you don't start with a plan, you're definitely going to have a bad day.

The current design for a node is a 1U HP DL360 G7 server with either 144GB or 288GB of RAM.
Internal storage is handled by the Smart Array controller.  For the next one I build, I think I'll buy a 9200-8i and route the cables so that all the internal and external storage is JBOD.
My current HBA card du jour for external storage is the LSI 9200-8e.
As for external JBOD enclosures, there are LOTS of choices.  Generally I've stuck with the Promise J610s.  It's essentially the "expansion" cabinet for a "smart" Promise array.  Its beauty is in its abject stupidity.

As for drives, currently all my stuff is running 4TB Seagate SAS drives.

For "mainline" nodes, I configure them as raid10-style pools (striped mirrors).  This gets me ~29TB of storage per pool.
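Creating such a pool looks roughly like this; the pool name and device names are placeholders, not my actual layout:

```shell
# "raid10"-style pool: data is striped across mirrored pairs.
# Device names below are hypothetical examples.
zpool create tank \
  mirror /dev/sda /dev/sdb \
  mirror /dev/sdc /dev/sdd \
  mirror /dev/sde /dev/sdf \
  mirror /dev/sdg /dev/sdh

# Verify the vdev layout.
zpool status tank
```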

For "backup" nodes, I configure them as raidz2 pools.  All my backup nodes are 32-disk boxes, so the pools are 116TB.
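One way a raidz2 box like that might be carved up, purely as a sketch (I'm not showing the actual vdev widths, and device names are made up):

```shell
# raidz2 pool built from two vdevs; each vdev survives the loss
# of any two disks.  Device names are hypothetical.
zpool create backup \
  raidz2 /dev/sd[a-p] \
  raidz2 /dev/sd[q-z] /dev/sda[a-f]
```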

Obviously this design is a balance between cost and performance.

If I need more slots I'll use a DL380 instead of a DL360, and I'll stick in an Intel X540-T2 10GbE card.

It's fun to imagine what a filesystem would look like if you scaled up the disks and the interconnects.


On Tue, Mar 28, 2017 at 8:45 PM, John Stoffel <> wrote:

Tim> My first experiment was simply setting up a distributed volume,
Tim> which needs each of the "bricks", as they call them, to be
Tim> reliable members.


Tim> However, you can configure it to mirror and stripe across the
Tim> "bricks" so they don't have to be.  In fact you can set up
Tim> striped mirrors, etc.

That would be safer for sure.  I can just see a network problem taking
down a bunch of bricks and leading to a split-brain situation if
you're not careful.
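A distribute-replicate gluster volume of the sort Tim describes is created roughly like this (hostnames and brick paths are made up):

```shell
# Distribute-replicate volume: consecutive bricks are paired into
# replica-2 mirrors, and files are distributed across the pairs.
# Hostnames and paths are hypothetical.
gluster volume create gv0 replica 2 \
  node1:/bricks/b1 node2:/bricks/b1 \
  node1:/bricks/b2 node2:/bricks/b2
gluster volume start gv0
```

Keeping each replica pair on different hosts is what guards against a single node failure taking out both copies.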

Tim> I've now got a number of these storage nodes as I call them.  It
Tim> generally consists of a 1U box, stuffed full of ram, connected to
Tim> a JBOD chassis full of disks.  I then use ZFS to create storage
Tim> pools.

So I used to like ZFS and think it was "da bomb", but after using it
for several years, and especially after using the ZFS appliances from
Sun/Oracle, I'm not really enamored of the entire design any more.

I think they have some great ideas in terms of checksumming all the
files and metadata.  But the layering (or lack thereof) of
disks/devices inside zpools just drives me nuts.  It's really
inflexible and can get you in trouble if you're not careful.

Tim> The problem with this is that your scale is limited to the size
Tim> of a single "node", and while you can play games with autofs,
Tim> it's not a cohesive filesystem.

Can you give some more details of your setup?

Tim> This solves that problem.  My only complaint is that it's fuse based.

Yeah, not really high performance at the end of the day.

Tim> Lustre is a kernel-based filesystem that'll do the same thing as
Tim> gluster.   However, it doesn't have any of the redundancy stuff;
Tim> it simply assumes your underlying storage is reliable.

This is the key/kicker for all these systems.

I'm waiting for someone to come up with an opensource sharding system
where you use erasure codes for low-level block storage, so that you
have a lot of the RAID6 advantages, but even more reliability AND
speed.  But it's a hard, hard problem space to get right.

Thanks for sharing!

Wlug mailing list

I am leery of the allegiances of any politician who refers to their constituents as "consumers".