Linux Kernel Internals(3cont.)--Virtual Filesystem

3.5 Superblock and Mountpoint Management

Under Linux, information about mounted filesystems is kept in two separate structures: super_block and vfsmount. The reason is that Linux allows the same filesystem (block device) to be mounted under multiple mount points, which means that the same super_block can correspond to multiple vfsmount structures.

Let us look at struct super_block first, declared in include/linux/fs.h:



--------------------------------------------------------------------------------

struct super_block {
        struct list_head        s_list;         /* Keep this first */
        kdev_t                  s_dev;
        unsigned long           s_blocksize;
        unsigned char           s_blocksize_bits;
        unsigned char           s_lock;
        unsigned char           s_dirt;
        struct file_system_type *s_type;
        struct super_operations *s_op;
        struct dquot_operations *dq_op;
        unsigned long           s_flags;
        unsigned long           s_magic;
        struct dentry           *s_root;
        wait_queue_head_t       s_wait;

        struct list_head        s_dirty;        /* dirty inodes */
        struct list_head        s_files;

        struct block_device     *s_bdev;
        struct list_head        s_mounts;       /* vfsmount(s) of this one */
        struct quota_mount_options s_dquot;     /* Diskquota specific options */

        union {
                struct minix_sb_info    minix_sb;
                struct ext2_sb_info     ext2_sb;
                /* ..... all filesystems that need sb-private info ... */
                void                    *generic_sbp;
        } u;
        /*
         * The next field is for VFS *only*. No filesystems have any business
         * even looking at it. You had been warned.
         */
        struct semaphore s_vfs_rename_sem;      /* Kludge */

        /* The next field is used by knfsd when converting a (inode number based)
         * file handle into a dentry. As it builds a path in the dcache tree from
         * the bottom up, there may for a time be a subpath of dentrys which is not
         * connected to the main tree. This semaphore ensure that there is only ever
         * one such free path per filesystem. Note that unconnected files (or other
         * non-directories) are allowed, but not unconnected diretories.
         */
        struct semaphore s_nfsd_free_path_sem;
};


--------------------------------------------------------------------------------

The various fields in the super_block structure are:


s_list: a doubly-linked list of all active superblocks; note I don't say "of all mounted filesystems" because under Linux one can have multiple instances of a mounted filesystem corresponding to a single superblock.
s_dev: for filesystems which require a block device to be mounted on, i.e. for FS_REQUIRES_DEV filesystems, this is the i_dev of the block device. For others (called anonymous filesystems) this is an integer MKDEV(UNNAMED_MAJOR, i) where i is the index of the first unset bit in the unnamed_dev_in_use array, between 1 and 255 inclusive. See fs/super.c:get_unnamed_dev()/put_unnamed_dev(). It has been suggested many times that anonymous filesystems should not use the s_dev field.
s_blocksize, s_blocksize_bits: blocksize and log2(blocksize).
s_lock: indicates whether superblock is currently locked by lock_super()/unlock_super().
s_dirt: set when superblock is changed, and cleared whenever it is written back to disk.
s_type: pointer to struct file_system_type of the corresponding filesystem. Filesystem's read_super() method doesn't need to set it as VFS fs/super.c:read_super() sets it for you if fs-specific read_super() succeeds and resets to NULL if it fails.
s_op: pointer to super_operations structure which contains fs-specific methods to read/write inodes etc. It is the job of filesystem's read_super() method to initialise s_op correctly.
dq_op: disk quota operations.
s_flags: superblock flags.
s_magic: filesystem's magic number. Used by minix filesystem to differentiate between multiple flavours of itself.
s_root: dentry of the filesystem's root. It is the job of read_super() to read the root inode from the disk and pass it to d_alloc_root() to allocate the dentry and instantiate it. Some filesystems spell "root" other than "/" and so use the more generic d_alloc() function to bind the dentry to a name, e.g. pipefs mounts itself on "pipe:" as its own root instead of "/".
s_wait: waitqueue of processes waiting for superblock to be unlocked.
s_dirty: a list of all dirty inodes. Recall that if inode is dirty (inode->i_state & I_DIRTY) then it is on superblock-specific dirty list linked via inode->i_list.
s_files: a list of all open files on this superblock. Useful for deciding whether filesystem can be remounted read-only, see fs/file_table.c:fs_may_remount_ro() which goes through sb->s_files list and denies remounting if there are files opened for write (file->f_mode & FMODE_WRITE) or files with pending unlink (inode->i_nlink == 0).
s_bdev: for FS_REQUIRES_DEV, this points to the block_device structure describing the device the filesystem is mounted on.
s_mounts: a list of all vfsmount structures, one for each mounted instance of this superblock.
s_dquot: more diskquota stuff.
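
The anonymous device numbering described under s_dev can be illustrated with a small user-space model. This is only a sketch of the "first unset bit between 1 and 255" idea; the names get_unnamed_minor() and put_unnamed_minor() are made up for this example and are not the kernel's get_unnamed_dev()/put_unnamed_dev(), and returning 0 on exhaustion is an assumption of the model.

```c
#include <assert.h>
#include <limits.h>

#define UNNAMED_SLOTS 256   /* minors 0..255; minor 0 is never handed out */
#define BITS_PER_WORD ((int)(sizeof(unsigned long) * CHAR_BIT))

/* One bit per minor number, loosely modelled on unnamed_dev_in_use. */
static unsigned long in_use[(UNNAMED_SLOTS + BITS_PER_WORD - 1) / BITS_PER_WORD];

/* Return the lowest free minor in 1..255 and mark it used, or 0 if none is left. */
static int get_unnamed_minor(void)
{
    for (int i = 1; i < UNNAMED_SLOTS; i++) {
        unsigned long mask = 1UL << (i % BITS_PER_WORD);

        if (!(in_use[i / BITS_PER_WORD] & mask)) {
            in_use[i / BITS_PER_WORD] |= mask;
            return i;
        }
    }
    return 0;
}

/* Release a minor so that a later allocation can reuse it. */
static void put_unnamed_minor(int minor)
{
    assert(minor >= 1 && minor < UNNAMED_SLOTS);
    in_use[minor / BITS_PER_WORD] &= ~(1UL << (minor % BITS_PER_WORD));
}
```

Because the search always starts at 1, a released minor is reused before any higher number is handed out.
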
The superblock operations are described in the super_operations structure declared in include/linux/fs.h:



--------------------------------------------------------------------------------

struct super_operations {
        void (*read_inode) (struct inode *);
        void (*write_inode) (struct inode *, int);
        void (*put_inode) (struct inode *);
        void (*delete_inode) (struct inode *);
        void (*put_super) (struct super_block *);
        void (*write_super) (struct super_block *);
        int (*statfs) (struct super_block *, struct statfs *);
        int (*remount_fs) (struct super_block *, int *, char *);
        void (*clear_inode) (struct inode *);
        void (*umount_begin) (struct super_block *);
};


--------------------------------------------------------------------------------


read_inode: reads the inode from the filesystem. It is only called from fs/inode.c:get_new_inode() from iget4() (and therefore iget()). If a filesystem wants to use iget() then read_inode() must be implemented - otherwise get_new_inode() will panic. While inode is being read it is locked (inode->i_state = I_LOCK). When the function returns, all waiters on inode->i_wait are woken up. The job of the filesystem's read_inode() method is to locate the disk block which contains the inode to be read and use buffer cache bread() function to read it in and initialise the various fields of inode structure, for example the inode->i_op and inode->i_fop so that VFS level knows what operations can be performed on the inode or corresponding file. Filesystems that don't implement read_inode() are ramfs and pipefs. For example, ramfs has its own inode-generating function ramfs_get_inode() with all the inode operations calling it as needed.
write_inode: write inode back to disk. Similar to read_inode() in that it needs to locate the relevant block on disk and interact with buffer cache by calling mark_buffer_dirty(bh). This method is called on dirty inodes (those marked dirty with mark_inode_dirty()) when the inode needs to be sync'd either individually or as part of syncing the entire filesystem.
put_inode: called whenever the reference count is decreased.
delete_inode: called whenever both inode->i_count and inode->i_nlink reach 0. Filesystem deletes the on-disk copy of the inode and calls clear_inode() on VFS inode to "terminate it with extreme prejudice".
put_super: called at the last stages of umount(2) system call to notify the filesystem that any private information held by the filesystem about this instance should be freed. Typically this would brelse() the block containing the superblock and kfree() any bitmaps allocated for free blocks, inodes, etc.
write_super: called when superblock needs to be written back to disk. It should find the block containing the superblock (usually kept in sb-private area) and mark_buffer_dirty(bh). It should also clear sb->s_dirt flag.
statfs: implements fstatfs(2)/statfs(2) system calls. Note that the pointer to struct statfs passed as argument is a kernel pointer, not a user pointer, so we don't need to do any I/O to userspace. If not implemented then statfs(2) will fail with ENOSYS.
remount_fs: called whenever filesystem is being remounted.
clear_inode: called from VFS level clear_inode(). Filesystems that attach private data to inode structure (via generic_ip field) must free it here.
umount_begin: called during forced umount to notify the filesystem beforehand, so that it can do its best to make sure that nothing keeps the filesystem busy. Currently used only by NFS. This has nothing to do with the idea of generic VFS level forced umount support.
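
The way a generic layer calls through such a method table, and copes with methods a filesystem leaves out, can be sketched in user space. Everything below (the toy_ types, toy_vfs_statfs(), toy_sync_super()) is hypothetical and heavily reduced; it only mirrors two behaviours stated above: a missing statfs method yields ENOSYS, and write_super clears the dirty flag.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_sb;

struct toy_statfs {
    long f_bsize;
};

/* Reduced stand-in for struct super_operations: any method may be NULL. */
struct toy_super_ops {
    int  (*statfs)(struct toy_sb *, struct toy_statfs *);
    void (*write_super)(struct toy_sb *);
};

struct toy_sb {
    unsigned long s_blocksize;
    unsigned char s_dirt;
    const struct toy_super_ops *s_op;
};

/* Generic layer: call the filesystem's method only if it provides one. */
static int toy_vfs_statfs(struct toy_sb *sb, struct toy_statfs *buf)
{
    assert(sb != NULL && buf != NULL);
    if (!sb->s_op || !sb->s_op->statfs)
        return -ENOSYS;
    return sb->s_op->statfs(sb, buf);
}

/* Generic layer: write the superblock back only when it is dirty. */
static void toy_sync_super(struct toy_sb *sb)
{
    if (sb->s_dirt && sb->s_op && sb->s_op->write_super)
        sb->s_op->write_super(sb);
}

/* A toy filesystem's implementations of the two methods. */
static int demo_statfs(struct toy_sb *sb, struct toy_statfs *buf)
{
    buf->f_bsize = (long)sb->s_blocksize;
    return 0;
}

static void demo_write_super(struct toy_sb *sb)
{
    sb->s_dirt = 0;   /* write_super is expected to clear the dirty flag */
}

static const struct toy_super_ops demo_ops  = { demo_statfs, demo_write_super };
static const struct toy_super_ops empty_ops = { NULL, NULL };
```

A filesystem that installs empty_ops simply gets -ENOSYS from toy_vfs_statfs(), without the generic code needing to know anything about it.
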
So, let us look at what happens when we mount an on-disk (FS_REQUIRES_DEV) filesystem. The implementation of the mount(2) system call is in fs/super.c:sys_mount(), which is just a wrapper that copies the options, filesystem type and device name for the do_mount() function, which does the real work:


Filesystem driver is loaded if needed and its module's reference count is incremented. Note that during mount operation, the filesystem module's reference count is incremented twice - once by do_mount() calling get_fs_type() and once by get_sb_bdev() calling get_filesystem() if read_super() was successful. The first increment is to prevent module unloading while we are inside read_super() method and the second increment is to indicate that the module is in use by this mounted instance. Obviously, do_mount() decrements the count before returning, so overall the count only grows by 1 after each mount.
Since, in our case, fs_type->fs_flags & FS_REQUIRES_DEV is true, the superblock is initialised by a call to get_sb_bdev() which obtains the reference to the block device and interacts with the filesystem's read_super() method to fill in the superblock. If all goes well, the super_block structure is initialised and we have an extra reference to the filesystem's module and a reference to the underlying block device.
A new vfsmount structure is allocated and linked to sb->s_mounts list and to the global vfsmntlist list. The vfsmount field mnt_instances makes it possible to find all instances mounted on the same superblock as this one. The mnt_list field makes it possible to find all instances for all superblocks system-wide. The mnt_sb field points to this superblock and mnt_root has a new reference to the sb->s_root dentry.
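
The relationship between one superblock and several mount instances can be modelled in a few lines of user-space C. This is a sketch under simplifying assumptions: plain singly-linked lists stand in for the kernel's list_head fields, and all toy_ names (as well as the mnt_dirname member) are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_sb;

struct toy_mount {
    struct toy_sb    *mnt_sb;         /* the superblock this mount refers to */
    const char       *mnt_dirname;    /* illustrative: where it is mounted */
    struct toy_mount *next_instance;  /* next mount of the same superblock */
    struct toy_mount *next_global;    /* next mount system-wide */
};

struct toy_sb {
    struct toy_mount *s_mounts;       /* all mounts of this superblock */
};

static struct toy_mount *all_mounts;  /* plays the role of the global mount list */

/* Create a mount instance and link it on both lists. */
static struct toy_mount *toy_add_mount(struct toy_sb *sb, const char *dir)
{
    struct toy_mount *mnt;

    assert(sb != NULL);
    mnt = malloc(sizeof(*mnt));
    if (!mnt)
        return NULL;
    mnt->mnt_sb = sb;
    mnt->mnt_dirname = dir;
    mnt->next_instance = sb->s_mounts;
    sb->s_mounts = mnt;
    mnt->next_global = all_mounts;
    all_mounts = mnt;
    return mnt;
}

/* Count the mount instances of a single superblock. */
static int toy_count_instances(const struct toy_sb *sb)
{
    int n = 0;

    for (const struct toy_mount *m = sb->s_mounts; m; m = m->next_instance)
        n++;
    return n;
}

/* Count every mount instance in the system. */
static int toy_count_all(void)
{
    int n = 0;

    for (const struct toy_mount *m = all_mounts; m; m = m->next_global)
        n++;
    return n;
}
```

Mounting the same toy_sb twice yields two toy_mount objects whose mnt_sb pointers are equal, which is the situation the two-structure design exists to represent.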

3.6 Example Virtual Filesystem: pipefs

As a simple example of a Linux filesystem that does not require a block device for mounting, let us consider pipefs from fs/pipe.c. The filesystem's preamble is rather straightforward and requires little explanation:



--------------------------------------------------------------------------------

static DECLARE_FSTYPE(pipe_fs_type, "pipefs", pipefs_read_super,
        FS_NOMOUNT|FS_SINGLE);

static int __init init_pipe_fs(void)
{
        int err = register_filesystem(&pipe_fs_type);
        if (!err) {
                pipe_mnt = kern_mount(&pipe_fs_type);
                err = PTR_ERR(pipe_mnt);
                if (!IS_ERR(pipe_mnt))
                        err = 0;
        }
        return err;
}

static void __exit exit_pipe_fs(void)
{
        unregister_filesystem(&pipe_fs_type);
        kern_umount(pipe_mnt);
}

module_init(init_pipe_fs)
module_exit(exit_pipe_fs)


--------------------------------------------------------------------------------

The filesystem is of type FS_NOMOUNT|FS_SINGLE, which means it cannot be mounted from userspace and can only have one superblock system-wide. The FS_SINGLE flag also means that it must be mounted via kern_mount() after it is successfully registered via register_filesystem(), which is exactly what happens in init_pipe_fs(). The only bug in this function is that if kern_mount() fails (e.g. because kmalloc() failed in add_vfsmnt()) then the filesystem is left as registered but module initialisation fails. This will cause cat /proc/filesystems to Oops. (I have just sent a patch to Linus mentioning that although this is not a real bug today, as pipefs can't be compiled as a module, it should be written with the view that in the future it may become modularised.)
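
The shape of a fix for that failure path can be modelled outside the kernel. The sketch below uses stub functions (toy_register(), toy_unregister(), toy_kern_mount(), all hypothetical) in place of the real calls, and shows only the control flow: if the second step fails, the first step is undone before the error is returned. It is not the patch mentioned above.

```c
#include <assert.h>
#include <errno.h>

static int registered;          /* 1 while the toy filesystem type is registered */
static int mount_should_fail;   /* test knob: make toy_kern_mount() fail */

/* Stubs standing in for register/unregister and the in-kernel mount. */
static int toy_register(void)
{
    registered = 1;
    return 0;
}

static void toy_unregister(void)
{
    registered = 0;
}

static int toy_kern_mount(void)
{
    return mount_should_fail ? -ENOMEM : 0;
}

/* Initialisation that leaves nothing registered when it fails. */
static int toy_init_fs(void)
{
    int err;

    assert(!registered);        /* initialising twice is a caller bug */
    err = toy_register();
    if (err)
        return err;
    err = toy_kern_mount();
    if (err)
        toy_unregister();       /* undo step 1 so no stale entry is left behind */
    return err;
}
```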

The result of register_filesystem() is that pipe_fs_type is linked into the file_systems list so one can read /proc/filesystems and find the "pipefs" entry in there with the "nodev" flag indicating that FS_REQUIRES_DEV was not set. The /proc/filesystems file should really be enhanced to support all the new FS_ flags (and I made a patch to do so) but it cannot be done because it will break all the user applications that use it. Despite Linux kernel interfaces changing every minute (only for the better), when it comes to userspace compatibility Linux is a very conservative operating system which allows many applications to be used for a long time without being recompiled.

The result of kern_mount() is that:


A new unnamed (anonymous) device number is allocated by setting a bit in unnamed_dev_in_use bitmap; if there are no more bits then kern_mount() fails with EMFILE.
A new superblock structure is allocated by means of get_empty_super(). The get_empty_super() function walks the list of superblocks headed by super_block and looks for empty entry, i.e. s->s_dev == 0. If no such empty superblock is found then a new one is allocated using kmalloc() at GFP_USER priority. The maximum system-wide number of superblocks is checked in get_empty_super() so if it starts failing, one can adjust the tunable /proc/sys/fs/super-max.
A filesystem-specific pipe_fs_type->read_super() method, i.e. pipefs_read_super(), is invoked which allocates root inode and root dentry sb->s_root, and sets sb->s_op to be &pipefs_ops.
Then kern_mount() calls add_vfsmnt(NULL, sb->s_root, "none") which allocates a new vfsmount structure and links it into vfsmntlist and sb->s_mounts.
The pipe_fs_type->kern_mnt is set to this new vfsmount structure and it is returned. kern_mount() returns a vfsmount structure because even FS_SINGLE filesystems can be mounted multiple times, so their mnt->mnt_sb will point to the same thing, which would be silly to return from multiple calls to kern_mount().

Now that the filesystem is registered and mounted inside the kernel, we can use it. The entry point into the pipefs filesystem is the pipe(2) system call, implemented in the arch-dependent function sys_pipe(), but the real work is done by the portable fs/pipe.c:do_pipe() function. Let us look at do_pipe() then. The interaction with pipefs happens when do_pipe() calls get_pipe_inode() to allocate a new pipefs inode. For this inode, inode->i_sb is set to pipefs' superblock pipe_mnt->mnt_sb, the file operations i_fop is set to rdwr_pipe_fops and the number of readers and writers (held in inode->i_pipe) is set to 1. The reason why there is a separate inode field i_pipe instead of keeping it in the fs-private union is that pipes and FIFOs share the same code, and FIFOs can exist on other filesystems which use other access paths within the same union. Accessing the union through both paths is very bad C and can work only by pure luck. So, yes, 2.2.x kernels work only by pure luck and will stop working as soon as you slightly rearrange the fields in the inode.

Each pipe(2) system call increments a reference count on the pipe_mnt mount instance.

Under Linux, pipes are not symmetric (bidirectional or STREAM pipes), i.e. the two sides of the pipe have different file->f_op operations - the read_pipe_fops and write_pipe_fops respectively. A write on the read side returns EBADF and so does a read on the write side.
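
The asymmetry is easy to observe from user space with the standard pipe(2), read(2) and write(2) calls. The helper names below are made up for this sketch; on systems where pipes are bidirectional the calls may succeed instead, so the EBADF result is an expectation for Linux only.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Try to write to the read end of a fresh pipe.
 * Returns the errno of the failed write, 0 if the write unexpectedly
 * succeeded, or -1 if the pipe could not be created.
 */
static int write_to_read_end(void)
{
    int fds[2];
    int err = 0;

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[0], "x", 1) < 0)
        err = errno;            /* save before close() can disturb it */
    close(fds[0]);
    close(fds[1]);
    return err;
}

/* The mirror image: try to read from the write end. */
static int read_from_write_end(void)
{
    int fds[2];
    int err = 0;
    char c;

    if (pipe(fds) != 0)
        return -1;
    if (read(fds[1], &c, 1) < 0)
        err = errno;
    close(fds[0]);
    close(fds[1]);
    return err;
}
```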



3.7 Example Disk Filesystem: BFS

As a simple example of an on-disk Linux filesystem, let us consider BFS. The preamble of the BFS module is in fs/bfs/inode.c:



--------------------------------------------------------------------------------

static DECLARE_FSTYPE_DEV(bfs_fs_type, "bfs", bfs_read_super);

static int __init init_bfs_fs(void)
{
        return register_filesystem(&bfs_fs_type);
}

static void __exit exit_bfs_fs(void)
{
        unregister_filesystem(&bfs_fs_type);
}

module_init(init_bfs_fs)
module_exit(exit_bfs_fs)


--------------------------------------------------------------------------------

A special fstype declaration macro DECLARE_FSTYPE_DEV() is used which sets the fs_type->fs_flags to FS_REQUIRES_DEV to signify that BFS requires a real block device to be mounted on.

The module's initialisation function registers the filesystem with VFS and the cleanup function (only present when BFS is configured to be a module) unregisters it.

With the filesystem registered, we can proceed to mount it, which would invoke our fs_type->read_super() method which is implemented in fs/bfs/inode.c:bfs_read_super(). It does the following:


set_blocksize(s->s_dev, BFS_BSIZE): since we are about to interact with the block device layer via the buffer cache, we must initialise a few things, namely set the block size and also inform VFS via fields s->s_blocksize and s->s_blocksize_bits.
bh = bread(dev, 0, BFS_BSIZE): we read block 0 of the device passed via s->s_dev. This block is the filesystem's superblock.
Superblock is validated against BFS_MAGIC number and, if valid, stored in the sb-private field s->su_sbh (which is really s->u.bfs_sb.si_sbh).
Then we allocate inode bitmap using kmalloc(GFP_KERNEL) and clear all bits to 0 except the first two which we set to 1 to indicate that we should never allocate inodes 0 and 1. Inode 2 is root and the corresponding bit will be set to 1 a few lines later anyway - the filesystem should have a valid root inode at mounting time!
Then we initialise s->s_op, which means that from this point we can invoke the inode cache via iget(), which results in s_op->read_inode() being invoked. This finds the block that contains the specified (by inode->i_ino and inode->i_dev) inode and reads it in. If we fail to get the root inode then we free the inode bitmap, release the superblock buffer back to the buffer cache and return NULL. If the root inode was read OK, then we allocate a dentry with name / (as becometh root) and instantiate it with this inode.
Now we go through all inodes on the filesystem and read them all in order to set the corresponding bits in our internal inode bitmap and also to calculate some other internal parameters like the offset of last inode and the start/end blocks of last file. Each inode we read is returned back to inode cache via iput() - we don't hold a reference to it longer than needed.
If the filesystem was not mounted read-only, we mark the superblock buffer dirty and set s->s_dirt flag (TODO: why do I do this? Originally, I did it because minix_read_super() did but neither minix nor BFS seem to modify superblock in the read_super()).
All is well so we return this initialised superblock back to the caller at VFS level, i.e. fs/super.c:read_super().
After the read_super() function returns successfully, VFS obtains the reference to the filesystem module via call to get_filesystem(fs_type) in fs/super.c:get_sb_bdev() and a reference to the block device.
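
The inode bitmap handling in the steps above can be sketched in user space. This is a simplified model with invented names (struct toy_imap and the toy_imap_ helpers), using calloc() where the kernel code uses kmalloc() followed by clearing; it shows only the convention that bits 0 and 1 are pre-set so those inode numbers are never handed out, and that the root inode is number 2.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define TOY_ROOT_INO 2u

struct toy_imap {
    unsigned char *bits;      /* one bit per inode number */
    unsigned int   ninodes;
};

/* Mark an inode number as in use. */
static void toy_imap_set(struct toy_imap *map, unsigned int ino)
{
    assert(ino < map->ninodes);
    map->bits[ino / CHAR_BIT] |= (unsigned char)(1u << (ino % CHAR_BIT));
}

/* Return non-zero if the inode number is in use. */
static int toy_imap_test(const struct toy_imap *map, unsigned int ino)
{
    assert(ino < map->ninodes);
    return (map->bits[ino / CHAR_BIT] >> (ino % CHAR_BIT)) & 1;
}

/* Allocate a zeroed bitmap, then reserve inodes 0 and 1. Returns 0 or -1. */
static int toy_imap_init(struct toy_imap *map, unsigned int ninodes)
{
    assert(ninodes > TOY_ROOT_INO);
    map->bits = calloc((ninodes + CHAR_BIT - 1) / CHAR_BIT, 1);
    if (!map->bits)
        return -1;
    map->ninodes = ninodes;
    toy_imap_set(map, 0);
    toy_imap_set(map, 1);
    return 0;
}

/* Return the lowest unused inode number, or -1 if the map is full. */
static int toy_imap_find_free(const struct toy_imap *map)
{
    for (unsigned int i = 0; i < map->ninodes; i++)
        if (!toy_imap_test(map, i))
            return (int)i;
    return -1;
}
```

After initialisation the first free number is 2; once the root inode's bit is set during the scan, new files start at 3.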

Now, let us examine what happens when we do I/O on the filesystem. We already examined how inodes are read when iget() is called and how they are released on iput(). Reading inodes sets up, among other things, inode->i_op and inode->i_fop; opening a file will propagate inode->i_fop into file->f_op.

Let us examine the code path of the link(2) system call. The implementation of the system call is in fs/namei.c:sys_link():


The userspace names are copied into kernel space by means of getname() function which does the error checking.
These names are converted to nameidata using path_init()/path_walk() interaction with dcache. The result is stored in old_nd and nd structures.
If old_nd.mnt != nd.mnt then "cross-device link" EXDEV is returned - one cannot link between filesystems; in Linux this translates into: one cannot link between mounted instances of a filesystem (or, in particular, between filesystems).
A new dentry is created corresponding to nd by lookup_create() .
A generic vfs_link() function is called which checks if we can create a new entry in the directory and invokes the dir->i_op->link() method which brings us back to filesystem-specific fs/bfs/dir.c:bfs_link() function.
Inside bfs_link(), we check if we are trying to link a directory and if so, refuse with EPERM error. This is the same behaviour as standard (ext2).
We attempt to add a new directory entry to the specified directory by calling the helper function bfs_add_entry() which goes through all entries looking for unused slot (de->ino == 0) and, when found, writes out the name/inode pair into the corresponding block and marks it dirty (at non-superblock priority).
If we successfully added the directory entry then there is no way to fail the operation so we increment inode->i_nlink, update inode->i_ctime and mark this inode dirty as well as instantiating the new dentry with the inode.
Other related inode operations like unlink()/rename() etc. work in a similar way, so not much is gained by examining them all in detail.
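
The slot search performed when adding a directory entry can be illustrated with an in-memory array. The layout below (a 16-bit inode number followed by a fixed 14-byte name) and the name toy_add_entry() are assumptions made for this sketch; the real helper works on disk blocks through the buffer cache and marks them dirty.

```c
#include <assert.h>
#include <string.h>

#define TOY_NAMELEN 14

/* Fixed-size directory slot; ino == 0 marks the slot as unused. */
struct toy_dirent {
    unsigned short ino;
    char           name[TOY_NAMELEN];   /* not NUL-terminated when full */
};

/*
 * Store (name, ino) in the first unused slot.
 * Returns the slot index, or -1 if the name does not fit or no slot is free.
 */
static int toy_add_entry(struct toy_dirent *dir, int nslots,
                         const char *name, unsigned short ino)
{
    size_t len = strlen(name);

    assert(ino != 0);                   /* 0 is reserved for free slots */
    if (len == 0 || len > TOY_NAMELEN)
        return -1;
    for (int i = 0; i < nslots; i++) {
        if (dir[i].ino == 0) {
            memset(dir[i].name, 0, TOY_NAMELEN);
            memcpy(dir[i].name, name, len);
            dir[i].ino = ino;
            return i;
        }
    }
    return -1;
}
```

A hard link is then simply a second slot carrying the same inode number under a different name, which is why the link count is incremented only after the entry has been added.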


3.8 Execution Domains and Binary Formats

Linux supports loading user application binaries from disk. More interestingly, the binaries can be stored in different formats and the operating system's response to programs via system calls can deviate from the norm (the norm being the Linux behaviour) as required, in order to emulate formats found in other flavours of UNIX (COFF, etc.) and also to emulate the system call behaviour of other flavours (Solaris, UnixWare, etc.). This is what execution domains and binary formats are for.

Each Linux task has a personality stored in its task_struct (p->personality). The currently existing (either in the official kernel or as addon patch) personalities include support for FreeBSD, Solaris, UnixWare, OpenServer and many other popular operating systems. The value of current->personality is split into two parts:


high three bytes - bug emulation: STICKY_TIMEOUTS, WHOLE_SECONDS, etc.
low byte - personality proper, a unique number.
By changing the personality, we can change the way the operating system treats certain system calls, for example adding STICKY_TIMEOUTS to current->personality makes the select(2) system call preserve the value of the last argument (timeout) instead of storing the unslept time. Some buggy programs rely on buggy operating systems (non-Linux) and so Linux provides a way to emulate bugs in cases where the source code is not available and so bugs cannot be fixed.
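
The split of the personality value into a low byte and high flag bits can be expressed with a couple of masks. In this sketch the TOY_ constants and helper names are made up; the numeric flag values are assumptions chosen for illustration and should be checked against the kernel's personality header before relying on them.

```c
#include <assert.h>

#define TOY_PER_MASK        0x000000ffUL  /* low byte: personality proper */
#define TOY_STICKY_TIMEOUTS 0x04000000UL  /* assumed value of a bug-emulation flag */
#define TOY_WHOLE_SECONDS   0x02000000UL  /* assumed value of a bug-emulation flag */

/* The personality proper: a small unique number. */
static unsigned long per_id(unsigned long personality)
{
    return personality & TOY_PER_MASK;
}

/* Everything above the low byte: the bug-emulation flags. */
static unsigned long per_flags(unsigned long personality)
{
    return personality & ~TOY_PER_MASK;
}

/* Test one bug-emulation flag. */
static int per_has(unsigned long personality, unsigned long flag)
{
    assert((flag & TOY_PER_MASK) == 0);   /* flags live in the high three bytes */
    return (personality & flag) != 0;
}
```

Code that emulates a quirk would then branch on something like per_has(p, TOY_STICKY_TIMEOUTS), independently of which numbered personality is active.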

An execution domain is a contiguous range of personalities implemented by a single module. Usually a single execution domain implements a single personality but sometimes it is possible to implement "close" personalities in a single module without too many conditionals.

Execution domains are implemented in kernel/exec_domain.c and were completely rewritten for 2.4 kernel, compared with 2.2.x. The list of execution domains currently supported by the kernel, along with the range of personalities they support, is available by reading the /proc/execdomains file. Execution domains, except the PER_LINUX one, can be implemented as dynamically loadable modules.

The user interface is via the personality(2) system call, which sets the current process' personality or returns the value of current->personality if the argument is set to the impossible personality 0xffffffff. Obviously, the behaviour of this system call itself does not depend on personality.
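
From user space, the query form of the call looks as follows. This is a minimal Linux-only sketch that assumes the C library provides the personality() wrapper in sys/personality.h; the helper name query_personality() is made up.

```c
#include <assert.h>
#include <sys/personality.h>

/* Return the calling process's current personality without changing it. */
static int query_personality(void)
{
    return personality(0xffffffff);
}
```

Because the query does not modify anything, two consecutive calls in the same process are expected to return the same value.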

The kernel interface to execution domains registration consists of two functions:


int register_exec_domain(struct exec_domain *): registers the execution domain by linking it into single-linked list exec_domains under the write protection of the read-write spinlock exec_domains_lock. Returns 0 on success, non-zero on failure.
int unregister_exec_domain(struct exec_domain *): unregisters the execution domain by unlinking it from the exec_domains list, again using exec_domains_lock spinlock in write mode. Returns 0 on success.

The reason why exec_domains_lock is a read-write lock is that only registration and unregistration requests modify the list, whilst doing cat /proc/execdomains calls kernel/exec_domain.c:get_exec_domain_list(), which needs only read access to the list. Registering a new execution domain defines an "lcall7 handler" and a signal number conversion map. Actually, the ABI patch extends this concept of exec domain to include extra information (like socket options, socket types, address family and errno maps).
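
The registration interface can be modelled as a singly-linked list keyed by a range of personality numbers. The sketch below is a single-threaded user-space model, so the lock is deliberately left out; the toy_ names, the field names and the particular error codes are assumptions of this example, not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_exec_domain {
    const char *name;
    unsigned char pers_low, pers_high;   /* contiguous range of personalities */
    struct toy_exec_domain *next;
};

static struct toy_exec_domain *toy_domains;   /* head of the list */

/* Link a domain at the head of the list; refuse duplicates. */
static int toy_register_exec_domain(struct toy_exec_domain *ed)
{
    if (!ed || ed->next)
        return -EINVAL;
    assert(ed->pers_low <= ed->pers_high);
    for (struct toy_exec_domain *p = toy_domains; p; p = p->next)
        if (p == ed)
            return -EBUSY;
    ed->next = toy_domains;
    toy_domains = ed;
    return 0;
}

/* Unlink a domain; fail if it was never registered. */
static int toy_unregister_exec_domain(struct toy_exec_domain *ed)
{
    for (struct toy_exec_domain **pp = &toy_domains; *pp; pp = &(*pp)->next) {
        if (*pp == ed) {
            *pp = ed->next;
            ed->next = NULL;
            return 0;
        }
    }
    return -EINVAL;
}

/* Find the domain whose range covers the low byte of a personality. */
static struct toy_exec_domain *toy_lookup_exec_domain(unsigned long personality)
{
    unsigned long pers = personality & 0xff;

    for (struct toy_exec_domain *p = toy_domains; p; p = p->next)
        if (pers >= p->pers_low && pers <= p->pers_high)
            return p;
    return NULL;
}
```

In a threaded setting, the two modifying functions would take the lock for writing and the lookup for reading, which is the split the paragraph above describes.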

The binary formats are implemented in a similar manner, i.e. a single-linked list formats is defined in fs/exec.c and is protected by a read-write lock binfmt_lock. As with exec_domains_lock, the binfmt_lock is taken for reading on most occasions except for registration/unregistration of binary formats. Registering a new binary format enhances the execve(2) system call with new load_binary()/load_shlib() functions as well as the ability to core_dump(). The load_shlib() method is used only by the old uselib(2) system call while the load_binary() method is called by search_binary_handler() from do_execve(), which implements the execve(2) system call.

The personality of the process is determined at binary format loading by the corresponding format's load_binary() method using some heuristics. For example, to determine UnixWare7 binaries one first marks the binary using the elfmark(1) utility, which sets the ELF header's e_flags to the magic value 0x314B4455, which is detected at ELF loading time, and current->personality is set to PER_UW7. If this heuristic fails, then a more generic one is used, such as treating ELF interpreter paths like /usr/lib/ld.so.1 or /usr/lib/libc.so.1 as indicating an SVR4 binary, and personality is set to PER_SVR4. One could write a little utility program that uses Linux's ptrace(2) capabilities to single-step the code and force a running program into any personality.

Once personality (and therefore current->exec_domain) is known, the system calls are handled as follows. Let us assume that a process makes a system call by means of the lcall7 gate instruction. This transfers control to ENTRY(lcall7) of arch/i386/kernel/entry.S because it was prepared in arch/i386/kernel/traps.c:trap_init(). After appropriate stack layout conversion, entry.S:lcall7 obtains the pointer to exec_domain from current and then an offset of the lcall7 handler within the exec_domain (which is hardcoded as 4 in asm code so you can't shift the handler field around in the C declaration of struct exec_domain) and jumps to it. So, in C, it would look like this:



--------------------------------------------------------------------------------

static void UW7_lcall7(int segment, struct pt_regs * regs)
{
        abi_dispatch(regs, &uw7_funcs[regs->eax & 0xff], 1);
}


--------------------------------------------------------------------------------

where abi_dispatch() is a wrapper around the table of function pointers that implement this personality's system calls uw7_funcs.
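
The table-driven idea can be sketched in user space as follows. This is not the ABI patch's abi_dispatch(): the toy_ names, the register structure, the two sample handlers and the choice of returning -ENOSYS for an empty slot are all assumptions made to keep the example self-contained. Only the indexing by the low byte of eax is taken from the code above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_regs {
    unsigned long eax;   /* call number in the low byte */
    unsigned long ebx;   /* first argument */
};

typedef long (*toy_syscall_fn)(struct toy_regs *);

/* Two sample handlers for a made-up personality. */
static long toy_getpid(struct toy_regs *regs)
{
    (void)regs;
    return 42;
}

static long toy_echo(struct toy_regs *regs)
{
    return (long)regs->ebx;
}

#define TOY_NR_CALLS 256

/* One slot per possible low-byte value; unlisted slots stay NULL. */
static const toy_syscall_fn toy_funcs[TOY_NR_CALLS] = {
    [1]  = toy_echo,
    [20] = toy_getpid,
};

/* Pick the handler from the table and call it, or report a missing call. */
static long toy_dispatch(struct toy_regs *regs, const toy_syscall_fn *table)
{
    toy_syscall_fn fn;

    assert(regs != NULL && table != NULL);
    fn = table[regs->eax & 0xff];
    return fn ? fn(regs) : -ENOSYS;
}

/* Per-personality entry point: bind the dispatcher to this table. */
static long toy_lcall7(struct toy_regs *regs)
{
    return toy_dispatch(regs, toy_funcs);
}
```

Masking with 0xff keeps the index inside the 256-entry table whatever the upper bits of eax contain.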


