Filesystems
Jan 13, 2006
This is the outline to a seminar that I gave at CSH on filesystems. The impetus was simply my lack of knowledge of how filesystems work and the misconceptions (particularly about journaling) that most people have. Synopsis: use ReiserFS, but don’t believe the Gentoo kiddies who say you’ll get a massive performance boost; you won’t. ReiserFS is slightly faster than EXT3, a little more space efficient, and gets beyond the 2GB file size limit. I run EXT2 because it’s a touch faster than all the journaling filesystems and I know that my systems aren’t going to go down uncleanly. Most people, however, aren’t as meticulous as I am, so a journaling filesystem is a better fit.
- Filesystem writes
- Non-journaling (EXT2, FAT)
- Update meta-data (size)
- Update indirect mapping blocks
- Write data
- Journaling modes
- Journal
- Logs data and metadata changes
- Slowest, most secure
- Requires each write to be made twice
- Written to journal
- Then written to disk
- Log line only deleted once the transaction is done
- Ordered (default for EXT3)
- Only logs changes to metadata
- Writes data updates before metadata is changed
- If metadata wasn’t updated, the change never happened
- Writeback (default for ReiserFS and XFS)
- Only logs changes to metadata
- Uses regular write process
- Causes problems if metadata is written but data isn’t
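The difference between the journal and ordered modes above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: `ToyDisk`, `journal_write`, `ordered_write` and `replay` are hypothetical names, a dictionary stands in for the disk, and no real filesystem is this simple.

```python
class ToyDisk:
    """In-memory stand-in for a disk plus its journal area (hypothetical model)."""

    def __init__(self):
        self.blocks = {}   # block name -> contents at the final on-disk location
        self.journal = []  # transactions logged but not yet confirmed complete


def journal_write(disk, updates, crash_before_apply=False):
    """Full journal mode: data and metadata are both written twice."""
    disk.journal.append(dict(updates))  # 1. write everything to the journal
    if crash_before_apply:
        return                          # simulated power loss
    disk.blocks.update(updates)         # 2. write to the real location
    disk.journal.pop()                  # 3. delete the log entry: transaction done


def ordered_write(disk, data, metadata, crash_after_data=False):
    """Ordered mode: data goes to disk first, only metadata is journaled."""
    disk.blocks.update(data)            # data is written before metadata changes
    if crash_after_data:
        return                          # metadata untouched: the change never happened
    disk.journal.append(dict(metadata))
    disk.blocks.update(metadata)
    disk.journal.pop()


def replay(disk):
    """Recovery at mount time: re-apply whatever is still sitting in the journal."""
    while disk.journal:
        disk.blocks.update(disk.journal.pop(0))


disk = ToyDisk()
journal_write(disk, {"inode 7": "size=5", "block 120": "hello"}, crash_before_apply=True)
replay(disk)          # the interrupted transaction is completed from the journal
print(disk.blocks)
```

Writeback mode would look like `ordered_write` with the two halves free to land in either order, which is exactly how metadata can end up pointing at data that was never written.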
- EXT2 - non-journaling Linux
- Maximum file size of 2GB
- Supports maximum filesystem size of 4TB
- Maximum file name size of 255 characters
- Supports regular files, directories, devices and symlinks
- Files created within a directory inherit attributes (permission, owner, group)
- Allows user defined block sizes
- Typically 1024, 2048 or 4096 bytes
- Larger blocks speed up I/O; fewer requests issued, less head movement
- Fast symlinks
- Target name stored in inode, not data area
- Max size of link is 60 characters
- Superblocks track clean/not clean state of filesystem
- Marked not clean when mounted read/write
- Marked clean on unmount or remounted read only
- Uses mount counter; after X mounts, ignores state and forces fsck
- Includes support for secure deletion; random data written to deleted blocks
- Physical structure - disk divided into a number of block groups (see Fig. 1)
- Each block group contains redundant filesystem info and the data for that group (see Fig. 2, reference 3)
- Super block - total number of blocks in filesystem, check counter, etc.
- Group descriptor - pointers to block bitmap, inode bitmap, and inode table
- Block bitmap - tells which blocks are in use
- Inode bitmap - tells which inodes are in use
- Inode table - each file has an inode; stores file attributes
- Change time (ctime) - last change to the inode itself, not the creation time
- Modify time (mtime)
- Permissions
- Owner
- Type (regular file, directory, device, symlink)
- Where file is stored on disk (see Fig. 3)
- Fifteen pointers in each inode
- First twelve point to actual data blocks
- Thirteenth is indirect pointer - points to a block of pointers
- Fourteenth is doubly indirect - points to a block of pointers which point to blocks of pointers
- Fifteenth is triply indirect - adds one more level of pointer blocks
- Data blocks - actual data is here
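To make the inode pointer scheme concrete, here is a rough sketch that reports which class of pointer reaches a given logical block of a file. It assumes the EXT2 on-disk layout of 12 direct pointers followed by one single, one double and one triple indirect pointer, with 4-byte block numbers; the function name and parameters are made up for illustration.

```python
def pointer_level(file_block, block_size=4096, n_direct=12, ptr_size=4):
    """Return which inode pointer class maps logical block `file_block`.

    Assumes n_direct direct pointers, then single, double and triple
    indirect pointers, each block number taking ptr_size bytes.
    """
    per_block = block_size // ptr_size  # pointers that fit in one block
    if file_block < n_direct:
        return "direct"
    file_block -= n_direct
    levels = ("indirect", "double indirect", "triple indirect")
    for depth, name in enumerate(levels, start=1):
        span = per_block ** depth       # blocks reachable at this depth
        if file_block < span:
            return name
        file_block -= span
    raise ValueError("block index beyond what one inode can address")


for index in (0, 11, 12, 2000, 2_000_000):
    print(index, pointer_level(index))
```

With 4096-byte blocks, only the first 48KB of a file is reachable without an extra disk read for a pointer block, which is one reason larger blocks help I/O.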
- EXT3 - EXT2 with journaling
- Built on top of EXT2; it just adds journaling
- Supports all three modes of journaling
- ReiserFS - Journaling with good small file performance
- Maximum file size of 8TB
- Maximum filesystem size of 16TB
- Writeback journaling by default; supports ordered and journaled modes
- Metadata organized in B+ trees
- Allocates space for file size exactly, rather than in blocks
- Tail packing
- Small files and less-than-block-sized tails are stored in the B+ tree
- Small file accesses are much faster
- Slightly more efficient use of storage space (~5% compared to EXT2)
- Overall performance hit (it’s always repacking)
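The space argument for tail packing comes down to simple arithmetic, sketched below. The helper names are hypothetical, and the sketch ignores the per-item overhead a real ReiserFS tree carries, so treat the numbers as an upper bound on the savings.

```python
def block_allocated(size, block_size=4096):
    """Bytes consumed when a file must occupy whole blocks (EXT2-style)."""
    blocks = -(-size // block_size)  # ceiling division
    return blocks * block_size


def split_tail(size, block_size=4096):
    """Return (full_blocks, tail_bytes); the tail is what tail packing
    would store in the tree instead of in a mostly empty block."""
    return divmod(size, block_size)


for size in (100, 4096, 10000):
    full, tail = split_tail(size)
    wasted = block_allocated(size) - size
    print(f"{size} bytes: {full} full blocks, {tail}-byte tail, "
          f"{wasted} bytes of slack if block-allocated")
```

A directory full of 100-byte files is the extreme case: block allocation spends a whole block on each, while the tails together fit in a handful of tree nodes.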
- XFS
- Maximum file size of 9 exabytes
- Maximum filesystem size of 18 exabytes (64-bit mode)
- Metadata organized in B+ trees
- Dynamically allocated inodes
- Block size 512 bytes to 64 kilobytes
- Uses writeback journaling
- Delayed allocation
- Data to be written is cached in RAM
- Space is reserved for data, but location is not
- Once enough data is collected (or an entire file) extents are found and data is written to the disk
- Helps improve allocation of single, contiguous regions for files
- Guaranteed rate I/O - allows apps to reserve bandwidth
- Physically arranged into allocation groups (much like EXT2 block groups)
- Each group is mostly independent, managing its own free space
- Act like transparent sub-filesystems
- Each allocation group has two B+ trees
- One manages extents (ranges) of free space
- Another to manage inodes
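Delayed allocation, as described in the XFS notes above, can be sketched as a toy allocator: writes only reserve space and sit in RAM, and a location is chosen at flush time, once the total size is known. Everything here is an assumption for illustration (the class name, the best-fit policy, a flat list of free extents instead of B+ trees); real XFS would also split the data across extents rather than give up.

```python
class DelayedAllocator:
    """Toy model of delayed allocation over a list of free extents."""

    def __init__(self, free_extents):
        self.free_extents = list(free_extents)  # (start_block, length) pairs
        self.reserved = 0                       # blocks promised but not placed
        self.pending = []                       # data blocks cached in RAM

    def free_blocks(self):
        return sum(length for _, length in self.free_extents) - self.reserved

    def write(self, block):
        """Cache the block and reserve space; no location is chosen yet."""
        if self.free_blocks() < 1:
            raise OSError("filesystem full")
        self.reserved += 1
        self.pending.append(block)

    def flush(self):
        """Pick one contiguous extent for everything cached so far."""
        need = len(self.pending)
        if need == 0:
            return None
        fits = [e for e in self.free_extents if e[1] >= need]
        if not fits:
            raise OSError("no single free extent is large enough")
        start, length = min(fits, key=lambda e: e[1])  # best fit (an assumption)
        self.free_extents.remove((start, length))
        if length > need:
            self.free_extents.append((start + need, length - need))
        self.reserved -= need
        self.pending.clear()
        return (start, need)


alloc = DelayedAllocator([(0, 2), (10, 8)])
for chunk in (b"a", b"b", b"c"):
    alloc.write(chunk)
print(alloc.flush())  # one extent covering all three blocks
```

Had each block been placed at write time, the first two could have gone into the 2-block hole and the third elsewhere, fragmenting the file.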
- UFS - BSD loves thee
- Physical Structure
- Boot block
- Only in first cylinder group
- 8k, contains information for booting from this filesystem
- Blank if said filesystem isn’t used for booting
- Superblock
- Size of the filesystem
- Label (name)
- Size in blocks
- Date of last update
- Cylinder group size
- Data blocks per cylinder group
- Summary
- Cylinder group structure
- Copy of the superblock
- Cylinder group header (free/used table, etc.)
- Inodes
- Data blocks
- ZFS - Hott
- Endian-neutral - magic allows the filesystem to work on SPARC and x86
- Fully 128-bit - More addressable bits than god can count
- Negates the need for external volume managers
- NTFS - very little known
- FAT 12/16/32, VFAT - like EXT, but stupid
- ISO 9660 - CD-ROM Filesystem
- Comparison
- Small files
- UFS and EXT2/3 have to allocate at least 1 kilobyte blocks; lost storage efficiency
- ReiserFS significantly faster
- Tweaks (see reference 4)
- Noatime
- ReiserFS - notail; disables tail packing, trading storage efficiency for speed
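To check whether tweaks such as noatime or notail are actually in effect, one option is to read the options column of the mount table. The sketch below parses text in the Linux `/proc/mounts` layout (device, mount point, type, options, two numeric fields); the `mount_options` helper and the sample lines are made up for illustration.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point, or None if it is not mounted.

    The last matching line wins, since a later mount on the same
    point hides the earlier one.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            found = fields[3].split(",")
    return found


# Hypothetical sample; on a Linux box, read the real thing with
# open("/proc/mounts").read() instead.
SAMPLE = """\
/dev/hda1 / ext3 rw,noatime 0 0
/dev/hda2 /home reiserfs rw,noatime,notail 0 0
tmpfs /dev/shm tmpfs rw 0 0
"""

for point in ("/", "/home", "/dev/shm"):
    opts = mount_options(SAMPLE, point)
    print(point, "noatime" in opts, "notail" in opts)
```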
- Tmpfs - /dev/shm
- Filesystem stored in RAM/swap space
- Like ramdisks, but hotter
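A common use of tmpfs is scratch files that never need to touch a disk. The sketch below, with hypothetical helper names, puts a temporary file on `/dev/shm` when that directory exists and is writable, and quietly falls back to the normal temp directory otherwise (for instance on a system without tmpfs mounted there).

```python
import os
import tempfile


def ram_backed_dir(candidate="/dev/shm"):
    """Return candidate if it is a writable directory, else the default temp dir."""
    if os.path.isdir(candidate) and os.access(candidate, os.W_OK):
        return candidate
    return tempfile.gettempdir()


def scratch_roundtrip(payload):
    """Write payload to a scratch file, read it back, and let the file vanish."""
    with tempfile.NamedTemporaryFile(dir=ram_backed_dir()) as scratch:
        scratch.write(payload)
        scratch.flush()
        scratch.seek(0)
        return scratch.read()


print(ram_backed_dir())
print(scratch_roundtrip(b"stored in RAM/swap, not on disk"))
```

Note that the existence of `/dev/shm` does not prove it is tmpfs; the mount table is the place to confirm that.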
- Stack mounting - mount an already-mounted filesystem somewhere else
- Not covered
- OCFS - Oracle filesystem
- Hybrid filesystem/raw device access
- Databases do their own caching/read ahead; having an FS cache is bad
- http://otn.oracle.com/tech/linux/pdf/Linux-FS-Performance-Comparison.pdf
Sources
- A Non-Technical Look Inside the EXT2 File System. - http://www.linuxgazette.com/issue21/gx/ext/layout.gif
- Filesystems HOWTO. - http://www.tldp.org/HOWTO/Filesystems-HOWTO.html
- Analysis of the EXT2fs Structure. - http://www.nondot.org/sabre/os/files/FileSystems/ext2fs/
- Advanced Filesystem Implementor’s Guide. - http://www-106.ibm.com/developerworks/library/l-fs.html
- ReiserFS Docs. - http://p-nand-q.com/download/rfstool/reiserfsdocs.html
- Scalability in the XFS File System. - http://oss.sgi.com/projects/xfs/papers/xfsusenix/index.html
- Getting Started with XFS Filesystems. - http://oss.sgi.com/projects/xfs/papers/gettingstartedwithxfs.pdf
- Understanding Filesystem Types. - http://uw713doc.sco.com/en/FSadmin/CONTENTS.html
- NodeWorks Encyclopedia: UFS. - http://pedia.nodeworks.com/U/UF/UFS/
- Wikipedia. - http://en.wikipedia.org