Filesystems

Jan 13, 2006

This is the outline to a seminar that I gave at CSH on filesystems. The impetus was simply my lack of knowledge of how filesystems work and the misconceptions (particularly about journaling) that most people have. Synopsis: use ReiserFS, but don’t believe the Gentoo kiddies who say you’ll get a massive performance boost; you won’t. ReiserFS is slightly faster than EXT3, a little more space efficient, and gets beyond the 2GB file size limit. I run EXT2 because it’s a touch faster than all the journaling filesystems and I know that my systems aren’t going to go down uncleanly. Most people, however, aren’t as meticulous as I am, so a journaling filesystem is a better fit.

  1. Filesystem writes
    1. Non-journaling (EXT2, FAT)
      1. Update meta-data (size)
      2. Update indirect mapping blocks
      3. Write data
    2. Journaling modes
      1. Journal
        1. Logs data and metadata changes
        2. Slowest, most secure
        3. Requires each write to be made twice
          1. Written to journal
          2. Then written to disk
        4. Log line only deleted once the transaction is done
      2. Ordered (default for EXT3)
        1. Only logs changes to metadata
        2. Writes data updates before metadata is changed
        3. If metadata wasn’t updated, the change never happened
      3. Writeback (default for ReiserFS and XFS)
        1. Only logs changes to metadata
        2. Uses regular write process
        3. Causes problems if metadata is written but data isn’t
  2. EXT2 - non-journaling Linux
    1. Maximum file size of 2GB
    2. Supports maximum filesystem size of 4TB
    3. Maximum file name size of 255 characters
    4. Supports regular files, directories, devices and symlinks
    5. Files created within a directory inherit attributes (permission, owner, group)
    6. Allows user defined block sizes
      1. Typically 1024, 2048 or 4096 bytes
      2. Larger blocks speed up I/O; fewer requests issued, less head movement
    7. Fast symlinks
      1. Target name stored in inode, not data area
      2. Max size of link is 60 characters
    8. Superblocks track clean/not clean state of filesystem
      1. Marked not clean when mounted read/write
      2. Marked clean on unmount or when remounted read-only
      3. Uses mount counter; after X mounts, ignores state and forces fsck
    9. Includes support for secure deletion; random data written to deleted blocks
    10. Physical structure - disk divided into a number of block groups (see Fig. 1)
      1. Each block group contains redundant filesystem info and the data for that group (see Fig. 2, reference 3)
        1. Super block - total number of blocks in filesystem, check counter, etc.
        2. Group descriptor - pointers to block bitmap, inode bitmap, and inode table
        3. Block bitmap - tells which blocks are in use
        4. Inode bitmap - tells which inodes are in use
        5. Inode table - each file has an inode; stores file attributes
          1. Inode change time (ctime)
          2. Modify time (mtime)
          3. Permissions
          4. Owner
          5. Type (regular file, directory, device, symlink)
          6. Where file is stored on disk (see Fig. 3)
            1. Fifteen pointers in each inode
            2. First twelve point to actual data blocks
            3. Thirteenth is indirect pointer - points to a block of pointers
            4. Fourteenth is doubly indirect - points to a block of pointers which point to blocks of pointers
            5. Fifteenth is triply indirect - adds one more level of pointer blocks
        6. Data blocks - actual data is here
  3. EXT3 - EXT2 with journaling
    1. Built on top of EXT2; it just adds journaling
    2. Supports all three modes of journaling
  4. ReiserFS - Journaling with good small file performance
    1. Maximum file size of 8TB
    2. Maximum filesystem size of 16TB
    3. Writeback journaling by default; supports ordered and journaled modes
    4. Metadata organized in B+ trees
    5. Allocates space for file size exactly, rather than in blocks
    6. Tail packing
      1. Small files and less-than-block-sized tails are stored in the B+ tree
      2. Small file accesses are much faster
      3. Slightly more efficient use of storage space (~5% compared to EXT2)
      4. Overall performance hit (it’s always repacking)
  5. XFS
    1. Maximum file size of 9 exabytes
    2. Maximum filesystem size of 18 exabytes (64-bit mode)
    3. Metadata organized in B+ trees
    4. Dynamically allocated inodes
    5. Block size 512 bytes to 64 kilobytes
    6. Uses writeback journaling
    7. Delayed allocation
      1. Data to be written is cached in RAM
      2. Space is reserved for data, but location is not
      3. Once enough data is collected (or an entire file) extents are found and data is written to the disk
      4. Helps improve allocation of single, contiguous regions for files
    8. Guaranteed rate I/O - allows apps to reserve bandwidth
    9. Physically arranged into allocation groups (much like EXT2 block groups)
      1. Each group is mostly independent, managing its own free space
      2. Act like transparent sub-filesystems
      3. Each allocation group has two B+ trees
        1. One manages extents (ranges) of free space
        2. Another to manage inodes
  6. UFS - BSD loves thee
    1. Physical Structure
      1. Boot block
        1. Only in first cylinder group
        2. 8k, contains information for booting from this filesystem
        3. Blank if said filesystem isn’t used for booting
      2. Superblock
        1. Size of the filesystem
        2. Label (name)
        3. Size in blocks
        4. Date of last update
        5. Cylinder group size
        6. Data blocks per cylinder group
        7. Summary
      3. Cylinder group structure
        1. Copy of the superblock
        2. Cylinder group header (free/used table, etc.)
        3. Inodes
        4. Data blocks
  7. ZFS - Hott
    1. Endian-neutral - magic allows the filesystem to work on SPARC and x86
    2. Fully 128-bit - More addressable bits than god can count
    3. Negates the need for external volume managers
  8. NTFS - very little known
  9. FAT 12/16/32, VFAT - like EXT, but stupid
  10. ISO 9660 - CD-ROM Filesystem
  11. Comparison
    1. Small files
      1. UFS and EXT2/3 have to allocate whole blocks of at least 1 kilobyte; small files waste storage
      2. ReiserFS significantly faster
  12. Tweaks (see reference 4)
    1. Noatime
    2. ReiserFS - notail; disables tail packing, trading space efficiency for speed
    3. Tmpfs - /dev/shm
      1. Filesystem stored in RAM/swap space
      2. Like ramdisks, but hotter
    4. Stack mounting - mount an already-mounted filesystem somewhere else
  13. Not covered
    1. OCFS - Oracle filesystem
      1. Hybrid filesystem/raw device access
      2. Databases do their own caching/read ahead; having an FS cache is bad
      3.  http://otn.oracle.com/tech/linux/pdf/Linux-FS-Performance-Comparison.pdf
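
The journaling modes in section 1 differ mainly in the order in which data, metadata and log entries reach the disk. A toy Python sketch of that ordering, assuming a single update made of one data write and one metadata write; the step names, the `crash_after` and `outcomes` helpers, and the worst-case ordering chosen for writeback are all my own simplifications, not how any real filesystem is implemented:

```python
# Toy model of one file update under the three journaling modes.
# Step order per mode is a simplification; "writeback" shows the worst case,
# where metadata reaches the disk before the data does.
STEPS = {
    "journal":   ["log_data", "log_metadata", "write_data", "write_metadata", "clear_log"],
    "ordered":   ["write_data", "log_metadata", "write_metadata", "clear_log"],
    "writeback": ["log_metadata", "write_metadata", "write_data", "clear_log"],
}


def crash_after(mode, n):
    """Run the first n steps of `mode`, crash, replay the journal, and
    return what a reader of the file sees: 'old', 'new' or 'garbage'."""
    data_written = False      # the new data is in its block
    metadata_updated = False  # the metadata points at the new block
    journal = []              # log entries not yet cleared

    for step in STEPS[mode][:n]:
        if step == "log_data":
            journal.append("data")
        elif step == "log_metadata":
            journal.append("metadata")
        elif step == "write_data":
            data_written = True
        elif step == "write_metadata":
            metadata_updated = True
        elif step == "clear_log":
            journal.clear()

    # Recovery after the crash: replay whatever is still in the journal.
    for entry in journal:
        if entry == "data":
            data_written = True
        else:
            metadata_updated = True

    if not metadata_updated:
        return "old"  # the change never happened
    return "new" if data_written else "garbage"


def outcomes(mode):
    """Every result reachable by crashing at any point during the update."""
    return {crash_after(mode, n) for n in range(len(STEPS[mode]) + 1)}


if __name__ == "__main__":
    for mode in STEPS:
        print(mode, sorted(outcomes(mode)))
```

Crashing at every possible point, only the writeback ordering can leave metadata pointing at data that was never written; the journal and ordered orderings end up with either the old file or the new one.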
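
The block pointers described under the EXT2 inode table put an upper bound on how many blocks one file can address. A rough sketch of that arithmetic in Python, assuming 4-byte block pointers; the function names are hypothetical, and the defaults (twelve direct pointers plus one single, one double and one triple indirect pointer) are an assumption that can be overridden to model a different layout:

```python
def addressable_blocks(block_size, direct=12, indirect_levels=(1, 2, 3), pointer_size=4):
    """Number of data blocks a single inode can reach.

    An indirect block holds block_size // pointer_size pointers, so a
    pointer with n levels of indirection reaches that count to the power n.
    The default layout and pointer size are assumptions, not read from disk.
    """
    per_block = block_size // pointer_size
    return direct + sum(per_block ** level for level in indirect_levels)


def addressable_bytes(block_size, **layout):
    """Same limit expressed in bytes; `layout` is passed to addressable_blocks."""
    return addressable_blocks(block_size, **layout) * block_size


if __name__ == "__main__":
    # Block sizes taken from the "typically 1024, 2048 or 4096 bytes" item.
    for size in (1024, 2048, 4096):
        print(size, addressable_blocks(size), addressable_bytes(size))
```

This is only the addressing limit implied by the pointers; other limits in the kernel and the on-disk format can make the real maximum file size smaller.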

 Sources

  1. A Non-Technical Look Inside the EXT2 File System. - http://www.linuxgazette.com/issue21/gx/ext/layout.gif
  2. Filesystems HOWTO. - http://www.tldp.org/HOWTO/Filesystems-HOWTO.html
  3. Analysis of the EXT2fs Structure. - http://www.nondot.org/sabre/os/files/FileSystems/ext2fs/
  4. Advanced Filesystem Implementor’s Guide. - http://www-106.ibm.com/developerworks/library/l-fs.html
  5. ReiserFS Docs - http://p-nand-q.com/download/rfstool/reiserfsdocs.html
  6. Scalability in the XFS File System. - http://oss.sgi.com/projects/xfs/papers/xfsusenix/index.html
  7. Getting Started with XFS Filesystems. - http://oss.sgi.com/projects/xfs/papers/gettingstartedwithxfs.pdf
  8. Understanding Filesystem Types. - http://uw713doc.sco.com/en/FSadmin/CONTENTS.html
  9. NodeWorks Encyclopedia: UFS. - http://pedia.nodeworks.com/U/UF/UFS/
  10. Wikipedia. - http://en.wikipedia.org