
Linux Kernel Internals(3)--Virtual Filesystem


3. Virtual Filesystem (VFS)


3.1 Inode Caches and Interaction with Dcache

In order to support multiple filesystems, Linux contains a special kernel interface level called VFS (Virtual Filesystem Switch). This is similar to the vnode/vfs interface found in SVR4 derivatives (which originally came from the BSD and Sun implementations).

Linux inode cache is implemented in a single file, fs/inode.c, which consists of 977 lines of code. It is interesting to note that not many changes have been made to it for the last 5-7 years: one can still recognise some of the code when comparing the latest version with, say, 1.3.42.

The structure of Linux inode cache is as follows:


A global hashtable, inode_hashtable, where each inode is hashed by the value of the superblock pointer and 32bit inode number. Inodes without a superblock (inode->i_sb == NULL) are added to a doubly linked list headed by anon_hash_chain instead. Examples of anonymous inodes are sockets created by net/socket.c:sock_alloc(), by calling fs/inode.c:get_empty_inode().
A global type in_use list (inode_in_use), which contains valid inodes with i_count>0 and i_nlink>0. Inodes newly allocated by get_empty_inode() and get_new_inode() are added to the inode_in_use list.
A global type unused list (inode_unused), which contains valid inodes with i_count = 0.
A per-superblock type dirty list (sb->s_dirty) which contains valid inodes with i_count>0, i_nlink>0 and i_state & I_DIRTY. When an inode is marked dirty, it is added to the sb->s_dirty list if it is also hashed. Maintaining a per-superblock dirty list of inodes makes it possible to sync inodes quickly.
Inode cache proper - a SLAB cache called inode_cachep. As inode objects are allocated and freed, they are taken from and returned to this SLAB cache.
The type lists are anchored from inode->i_list, the hashtable from inode->i_hash. Each inode can be on a hashtable and one and only one type (in_use, unused or dirty) list.

All these lists are protected by a single spinlock: inode_lock.
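
The corresponding declarations at the top of fs/inode.c look roughly like this (a simplified sketch of the 2.4-era source; unrelated declarations are omitted):


--------------------------------------------------------------------------------

/* Simplified sketch of the list declarations in fs/inode.c */
static LIST_HEAD(inode_in_use);             /* valid inodes with i_count > 0         */
static LIST_HEAD(inode_unused);             /* valid inodes with i_count == 0        */
static struct list_head *inode_hashtable;   /* hashed by (superblock, inode number)  */
static LIST_HEAD(anon_hash_chain);          /* anonymous inodes: inode->i_sb == NULL */

spinlock_t inode_lock = SPIN_LOCK_UNLOCKED; /* protects all of the lists above       */


--------------------------------------------------------------------------------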

The inode cache subsystem is initialised when inode_init() function is called from init/main.c:start_kernel(). The function is marked as __init, which means its code is thrown away later on. It is passed a single argument - the number of physical pages on the system. This is so that the inode cache can configure itself depending on how much memory is available, i.e. create a larger hashtable if there is enough memory.

The only stats information about inode cache is the number of unused inodes, stored in inodes_stat.nr_unused and accessible to user programs via files /proc/sys/fs/inode-nr and /proc/sys/fs/inode-state.

We can examine one of the lists from gdb running on a live kernel thus:



--------------------------------------------------------------------------------

(gdb) printf ""%d "", (unsigned long)(&((struct inode *)0)->i_list)
8
(gdb) p inode_unused
$34 = 0xdfa992a8
(gdb) p (struct list_head)inode_unused
$35 = {next = 0xdfa992a8, prev = 0xdfcdd5a8}
(gdb) p ((struct list_head)inode_unused).prev
$36 = (struct list_head *) 0xdfcdd5a8
(gdb) p (((struct list_head)inode_unused).prev)->prev
$37 = (struct list_head *) 0xdfb5a2e8
(gdb) set $i = (struct inode *)0xdfb5a2e0
(gdb) p $i->i_ino
$38 = 0x3bec7
(gdb) p $i->i_count
$39 = {counter = 0x0}


--------------------------------------------------------------------------------

Note that we subtracted 8 from the address 0xdfb5a2e8 to obtain the address of the struct inode (0xdfb5a2e0), in accordance with the definition of the list_entry() macro from include/linux/list.h.
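
The same subtraction is what the list_entry() macro does: it converts a pointer to an embedded list_head member back into a pointer to the containing structure. Its definition in include/linux/list.h (2.4-era sources) is essentially:


--------------------------------------------------------------------------------

#define list_entry(ptr, type, member) \
        ((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member)))

/* e.g. given a struct list_head *tmp taken from the inode_unused list,
 * recover the inode that contains it via its i_list linkage: */
struct inode *inode = list_entry(tmp, struct inode, i_list);


--------------------------------------------------------------------------------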

To understand how inode cache works, let us trace a lifetime of an inode of a regular file on ext2 filesystem as it is opened and closed:



--------------------------------------------------------------------------------

fd = open(""file"", O_RDONLY);
close(fd);


--------------------------------------------------------------------------------

The open(2) system call is implemented in the fs/open.c:sys_open() function and the real work is done by fs/open.c:filp_open(), which is split into two parts:


open_namei(): fills in the nameidata structure containing the dentry and vfsmount structures.
dentry_open(): given a dentry and vfsmount, this function allocates a new struct file and links them together; it also invokes the filesystem-specific f_op->open() method, which was set in inode->i_fop when the inode was read in open_namei() (which provided the inode via dentry->d_inode).
The open_namei() function interacts with the dentry cache via path_walk(), which in turn calls real_lookup(), which invokes the filesystem-specific inode_operations->lookup() method. The role of this method is to find the entry in the parent directory with the matching name and then do iget(sb, ino) to get the corresponding inode - which brings us to the inode cache. When the inode is read in, the dentry is instantiated by means of d_add(dentry, inode). While we are at it, note that for UNIX-style filesystems which have the concept of an on-disk inode number, it is the lookup method's job to map its endianness to the current CPU format, e.g. if the inode number in the raw (fs-specific) dir entry is in little-endian 32 bit format one could do:



--------------------------------------------------------------------------------

unsigned long ino = le32_to_cpu(de->inode);
inode = iget(sb, ino);
d_add(dentry, inode);


--------------------------------------------------------------------------------

So, when we open a file we hit iget(sb, ino) which is really iget4(sb, ino, NULL, NULL), which does:


Attempt to find an inode with matching superblock and inode number in the hashtable under protection of inode_lock. If the inode is found, its reference count (i_count) is incremented; if it was 0 prior to being incremented and the inode is not dirty, it is removed from whatever type list (inode->i_list) it is currently on (it has to be the inode_unused list, of course) and inserted into the inode_in_use type list; finally, inodes_stat.nr_unused is decremented.
If inode is currently locked, we wait until it is unlocked so that iget4() is guaranteed to return an unlocked inode.
If the inode was not found in the hashtable then it is the first time we encounter this inode, so we call get_new_inode(), passing it the pointer to the place in the hashtable where it should be inserted.
get_new_inode() allocates a new inode from the inode_cachep SLAB cache but this operation can block (GFP_KERNEL allocation), so it must drop the inode_lock spinlock which guards the hashtable. Since it has dropped the spinlock, it must retry searching the inode in the hashtable afterwards; if it is found this time, it returns (after incrementing the reference by __iget) the one found in the hashtable and destroys the newly allocated one. If it is still not found in the hashtable, then the new inode we have just allocated is the one to be used; therefore it is initialised to the required values and the fs-specific sb->s_op->read_inode() method is invoked to populate the rest of the inode. This brings us from inode cache back to the filesystem code - remember that we came to the inode cache when filesystem-specific lookup() method invoked iget(). While the s_op->read_inode() method is reading the inode from disk, the inode is locked (i_state = I_LOCK); it is unlocked after the read_inode() method returns and all the waiters for it are woken up.
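
The control flow described above can be condensed as follows (a heavily simplified sketch, not the actual fs/inode.c source: the find_actor/opaque arguments of iget4(), error handling and the details of waiting on a locked inode are omitted):


--------------------------------------------------------------------------------

/* Heavily simplified sketch of fs/inode.c:iget4() */
struct inode *iget_sketch(struct super_block *sb, unsigned long ino)
{
        struct list_head *head = inode_hashtable + hash(sb, ino);
        struct inode *inode;

        spin_lock(&inode_lock);
        inode = find_inode(sb, ino, head);       /* search the hash chain */
        if (inode) {
                __iget(inode);                   /* i_count++, move to inode_in_use */
                spin_unlock(&inode_lock);
                wait_on_inode(inode);            /* in case it is still I_LOCK */
                return inode;
        }
        spin_unlock(&inode_lock);

        /* not in the cache: allocate a new inode (may block), re-check the
         * hash chain, then let sb->s_op->read_inode() fill it in from disk */
        return get_new_inode(sb, ino, head);
}


--------------------------------------------------------------------------------
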
Now, let's see what happens when we close this file descriptor. The close(2) system call is implemented in the fs/open.c:sys_close() function, which calls do_close(fd, 1), which rips the descriptor out of the process's file descriptor table (replacing it with NULL) and invokes the filp_close() function, which does most of the work. The interesting things happen in fput(), which checks if this was the last reference to the file, and if so calls fs/file_table.c:_fput(), which calls __fput(), which is where interaction with dcache (and therefore with inode cache - remember dcache is a Master of inode cache!) happens. fs/dcache.c:dput() does dentry_iput(), which brings us back to inode cache via iput(inode), so let us understand fs/inode.c:iput(inode):


If the inode passed to us is NULL, we do absolutely nothing and return.
If there is a fs-specific sb->s_op->put_inode() method, it is invoked immediately with no spinlocks held (so it can block).
inode_lock spinlock is taken and i_count is decremented. If this was NOT the last reference to this inode then we simply check if there are too many references to it and so i_count can wrap around the 32 bits allocated to it and if so we print a warning and return. Note that we call printk() while holding the inode_lock spinlock - this is fine because printk() can never block, therefore it may be called in absolutely any context (even from interrupt handlers!).
If this was the last active reference then some work needs to be done.
The work performed by iput() on the last inode reference is rather complex so we separate it into a list of its own:


If i_nlink == 0 (e.g. the file was unlinked while we held it open) then the inode is removed from the hashtable and from its type list; if there are any data pages held in the page cache for this inode, they are removed by means of truncate_all_inode_pages(&inode->i_data). Then the filesystem-specific s_op->delete_inode() method is invoked, which typically deletes the on-disk copy of the inode. If there is no s_op->delete_inode() method registered by the filesystem (e.g. ramfs) then we call clear_inode(inode), which invokes s_op->clear_inode() if registered and, if the inode corresponds to a block device, drops the device's reference count via bdput(inode->i_bdev).
If i_nlink != 0 then we check whether there are other inodes in the same hash bucket: if there are none and the inode is not dirty, we delete it from its type list, add it to the inode_unused list and increment inodes_stat.nr_unused. If there are other inodes in the same hash bucket then we likewise delete it from the type list and add it to the inode_unused list. If this was an anonymous inode (NetApp .snapshot) then we delete it from the type list and clear/destroy it completely. Both branches are sketched below.
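
The last-reference path of iput() can be sketched roughly as follows (an abridged sketch, not the real source: the I_DIRTY case, the i_count overflow check and the anonymous-inode case are left out):


--------------------------------------------------------------------------------

/* Abridged sketch of fs/inode.c:iput() */
void iput_sketch(struct inode *inode)
{
        struct super_operations *op = NULL;

        if (!inode)
                return;
        if (inode->i_sb && inode->i_sb->s_op)
                op = inode->i_sb->s_op;
        if (op && op->put_inode)
                op->put_inode(inode);            /* called with no spinlocks held */

        /* drop a reference; returns true, with inode_lock held,
         * only when i_count reaches zero */
        if (!atomic_dec_and_lock(&inode->i_count, &inode_lock))
                return;

        if (!inode->i_nlink) {                   /* file was unlinked: destroy it */
                list_del(&inode->i_hash);
                INIT_LIST_HEAD(&inode->i_hash);
                list_del(&inode->i_list);
                spin_unlock(&inode_lock);
                if (inode->i_data.nrpages)
                        truncate_all_inode_pages(&inode->i_data);
                if (op && op->delete_inode)
                        op->delete_inode(inode); /* usually frees the on-disk inode */
                else
                        clear_inode(inode);
        } else {                                 /* keep it cached on inode_unused */
                list_del(&inode->i_list);
                list_add(&inode->i_list, &inode_unused);
                inodes_stat.nr_unused++;
                spin_unlock(&inode_lock);
        }
}


--------------------------------------------------------------------------------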


3.2 Filesystem Registration/Unregistration

The Linux kernel provides a mechanism for new filesystems to be written with minimum effort. The historical reasons for this are:


In the world where people still use non-Linux operating systems to protect their investment in legacy software, Linux had to provide interoperability by supporting a great multitude of different filesystems - most of which would not deserve to exist on their own but only for compatibility with existing non-Linux operating systems.
The interface for filesystem writers had to be very simple so that people could try to reverse engineer existing proprietary filesystems by writing read-only versions of them. Therefore Linux VFS makes it very easy to implement read-only filesystems; 95% of the work is to finish them by adding full write-support. As a concrete example, I wrote read-only BFS filesystem for Linux in about 10 hours, but it took several weeks to complete it to have full write support (and even today some purists claim that it is not complete because "it doesn't have compactification support").
The VFS interface is exported, and therefore all Linux filesystems can be implemented as modules.
Let us consider the steps required to implement a filesystem under Linux. The code to implement a filesystem can be either a dynamically loadable module or statically linked into the kernel, and the way it is done under Linux is very transparent. All that is needed is to fill in a struct file_system_type structure and register it with the VFS using the register_filesystem() function as in the following example from fs/bfs/inode.c:



--------------------------------------------------------------------------------

#include <linux/module.h>
#include <linux/init.h>
#include <linux/fs.h>

static struct super_block *bfs_read_super(struct super_block *, void *, int);

static DECLARE_FSTYPE_DEV(bfs_fs_type, "bfs", bfs_read_super);

static int __init init_bfs_fs(void)
{
return register_filesystem(&bfs_fs_type);
}

static void __exit exit_bfs_fs(void)
{
unregister_filesystem(&bfs_fs_type);
}

module_init(init_bfs_fs)
module_exit(exit_bfs_fs)


--------------------------------------------------------------------------------

The module_init()/module_exit() macros ensure that, when BFS is compiled as a module, the functions init_bfs_fs() and exit_bfs_fs() turn into init_module() and cleanup_module() respectively; if BFS is statically linked into the kernel, the exit_bfs_fs() code vanishes as it is unnecessary.

The struct file_system_type is declared in include/linux/fs.h:



--------------------------------------------------------------------------------

struct file_system_type {
const char *name;
int fs_flags;
struct super_block *(*read_super) (struct super_block *, void *, int);
struct module *owner;
struct vfsmount *kern_mnt; /* For kernel mount, if it's FS_SINGLE fs */
struct file_system_type * next;
};


--------------------------------------------------------------------------------

The fields thereof are explained thus:


name: human readable name, appears in /proc/filesystems file and is used as a key to find a filesystem by its name; this same name is used for the filesystem type in mount(2), and it should be unique: there can (obviously) be only one filesystem with a given name. For modules, name points into the module's address space and is not copied: this means cat /proc/filesystems can oops if the module was unloaded but the filesystem is still registered.
fs_flags: one or more (ORed) of the flags: FS_REQUIRES_DEV for filesystems that can only be mounted on a block device, FS_SINGLE for filesystems that can have only one superblock, FS_NOMOUNT for filesystems that cannot be mounted from userspace by means of mount(2) system call: they can however be mounted internally using kern_mount() interface, e.g. pipefs.
read_super: a pointer to the function that reads the super block during the mount operation. This function is required: if it is not provided, the mount operation (whether from userspace or in-kernel) will always fail, except in the FS_SINGLE case where it will Oops in get_sb_single(), trying to dereference a NULL pointer in fs_type->kern_mnt->mnt_sb (because fs_type->kern_mnt is NULL).
owner: pointer to the module that implements this filesystem. If the filesystem is statically linked into the kernel then this is NULL. You don't need to set this manually as the macro THIS_MODULE does the right thing automatically.
kern_mnt: for FS_SINGLE filesystems only. This is set by kern_mount() (TODO: kern_mount() should refuse to mount filesystems if FS_SINGLE is not set).
next: linkage into singly-linked list headed by file_systems (see fs/super.c). The list is protected by the file_systems_lock read-write spinlock and functions register/unregister_filesystem() modify it by linking and unlinking the entry from the list.
The job of the read_super() function is to fill in the fields of the superblock, allocate root inode and initialise any fs-private information associated with this mounted instance of the filesystem. So, typically the read_super() would do:


Read the superblock from the device specified via the sb->s_dev argument, using the buffer cache bread() function. If it anticipates reading a few more subsequent metadata blocks immediately then it makes sense to use breada() to schedule reading the extra blocks asynchronously.
Verify that the superblock contains the valid magic number and overall "looks" sane.
Initialise sb->s_op to point to a struct super_operations structure. This structure contains filesystem-specific functions implementing operations like "read inode", "delete inode", etc.
Allocate root inode and root dentry using d_alloc_root().
If the filesystem is not mounted read-only then set sb->s_dirt to 1 and mark the buffer containing superblock dirty (TODO: why do we do this? I did it in BFS because MINIX did it...)
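
A skeleton read_super() following these steps might look like this (a hypothetical example for an imaginary filesystem "myfs": all MYFS_*/myfs_* names are made up, error handling is minimal, and the on-disk layout is invented for illustration):


--------------------------------------------------------------------------------

/* Hypothetical example: a minimal 2.4-style read_super() for "myfs" */
static struct super_block *myfs_read_super(struct super_block *s, void *data, int silent)
{
        struct buffer_head *bh;
        struct inode *inode;
        kdev_t dev = s->s_dev;

        set_blocksize(dev, MYFS_BLOCKSIZE);             /* MYFS_* are made up */
        s->s_blocksize = MYFS_BLOCKSIZE;
        s->s_blocksize_bits = MYFS_BLOCKSIZE_BITS;

        bh = bread(dev, 0, MYFS_BLOCKSIZE);             /* read on-disk superblock */
        if (!bh)
                return NULL;
        if (((struct myfs_super_block *)bh->b_data)->s_magic != MYFS_MAGIC) {
                if (!silent)
                        printk("myfs: no filesystem found on dev %s\n", kdevname(dev));
                goto out;
        }

        s->s_magic = MYFS_MAGIC;
        s->s_op = &myfs_sops;                           /* struct super_operations */

        inode = iget(s, MYFS_ROOT_INO);                 /* read the root inode */
        if (!inode)
                goto out;
        s->s_root = d_alloc_root(inode);                /* allocate the root dentry */
        if (!s->s_root) {
                iput(inode);
                goto out;
        }
        brelse(bh);
        return s;
out:
        brelse(bh);
        return NULL;
}


--------------------------------------------------------------------------------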

3.3 File Descriptor Management

Under Linux there are several levels of indirection between the user file descriptor and the kernel inode structure. When a process makes an open(2) system call, the kernel returns a small non-negative integer which can be used for subsequent I/O operations on this file. This integer is an index into an array of pointers to struct file. Each file structure points to a dentry via file->f_dentry. And each dentry points to an inode via dentry->d_inode.

Each task contains a field tsk->files which is a pointer to struct files_struct defined in include/linux/sched.h:



--------------------------------------------------------------------------------

/*
* Open file table structure
*/
struct files_struct {
atomic_t count;
rwlock_t file_lock;
int max_fds;
int max_fdset;
int next_fd;
struct file ** fd; /* current fd array */
fd_set *close_on_exec;
fd_set *open_fds;
fd_set close_on_exec_init;
fd_set open_fds_init;
struct file * fd_array[NR_OPEN_DEFAULT];
};


--------------------------------------------------------------------------------

The file->f_count field is a reference count, incremented by get_file() (usually called by fget()) and decremented by fput() and by put_filp(). The difference between fput() and put_filp() is that fput() does more work usually needed for regular files, such as releasing flock locks, releasing the dentry, etc, while put_filp() only manipulates the file table structures, i.e. it decrements the count, removes the file from the anon_list and adds it to the free_list, under protection of the files_lock spinlock.

The tsk->files can be shared between parent and child if the child thread was created using the clone() system call with CLONE_FILES set in the clone flags argument. This can be seen in kernel/fork.c:copy_files() (called by do_fork()), which simply increments files->count if CLONE_FILES is set, instead of copying the descriptor table in the time-honoured tradition of classical UNIX fork(2).
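
The relevant part of copy_files() is roughly the following (a simplified sketch; allocation and copying of the new table are elided):


--------------------------------------------------------------------------------

/* Simplified sketch of kernel/fork.c:copy_files() */
static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
{
        struct files_struct *oldf = current->files;

        if (!oldf)
                return 0;                   /* kernel threads have no files */
        if (clone_flags & CLONE_FILES) {
                atomic_inc(&oldf->count);   /* share the whole descriptor table */
                tsk->files = oldf;
                return 0;
        }
        /* otherwise allocate a fresh files_struct and copy the fd array and
         * the open_fds/close_on_exec bitmaps into it (omitted here) */
        return 0;
}


--------------------------------------------------------------------------------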

When a file is opened, the file structure allocated for it is installed into the current->files->fd[fd] slot and the corresponding fd bit is set in the bitmap current->files->open_fds. All this is done under the write protection of the current->files->file_lock read-write spinlock. When the descriptor is closed, the fd bit is cleared in current->files->open_fds and current->files->next_fd is set equal to fd as a hint for finding the first unused descriptor next time this process wants to open a file.
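
Putting the pieces together, resolving a user file descriptor inside the kernel is simply a chain of pointer dereferences (a sketch; fget() performs the file_lock locking and reference counting described above):


--------------------------------------------------------------------------------

struct file *file = fget(fd);                     /* current->files->fd[fd], f_count++ */
if (file) {
        struct dentry *dentry = file->f_dentry;   /* descriptor -> dentry */
        struct inode  *inode  = dentry->d_inode;  /* dentry -> inode      */

        /* ... operate on the inode ... */

        fput(file);                               /* drop the reference taken by fget() */
}


--------------------------------------------------------------------------------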


3.4 File Structure Management

The file structure is declared in include/linux/fs.h:



--------------------------------------------------------------------------------

struct fown_struct {
int pid; /* pid or -pgrp where SIGIO should be sent */
uid_t uid, euid; /* uid/euid of process setting the owner */
int signum; /* posix.1b rt signal to be delivered on IO */
};

struct file {
struct list_head f_list;
struct dentry *f_dentry;
struct vfsmount *f_vfsmnt;
struct file_operations *f_op;
atomic_t f_count;
unsigned int f_flags;
mode_t f_mode;
loff_t f_pos;
unsigned long f_reada, f_ramax, f_raend, f_ralen, f_rawin;
struct fown_struct f_owner;
unsigned int f_uid, f_gid;
int f_error;

unsigned long f_version;

/* needed for tty driver, and maybe others */
void *private_data;
};


--------------------------------------------------------------------------------

Let us look at the various fields of struct file:


f_list: this field links file structure on one (and only one) of the lists: a) sb->s_files list of all open files on this filesystem, if the corresponding inode is not anonymous, then dentry_open() (called by filp_open()) links the file into this list; b) fs/file_table.c:free_list, containing unused file structures; c) fs/file_table.c:anon_list, when a new file structure is created by get_empty_filp() it is placed on this list. All these lists are protected by the files_lock spinlock.
f_dentry: the dentry corresponding to this file. The dentry is created at nameidata lookup time by open_namei() (or rather path_walk() which it calls) but the actual file->f_dentry field is set by dentry_open() to contain the dentry thus found.
f_vfsmnt: the pointer to vfsmount structure of the filesystem containing the file. This is set by dentry_open() but is found as part of nameidata lookup by open_namei() (or rather path_init() which it calls).
f_op: the pointer to file_operations which contains various methods that can be invoked on the file. This is copied from inode->i_fop which is placed there by filesystem-specific s_op->read_inode() method during nameidata lookup. We will look at file_operations methods in detail later on in this section.
f_count: reference count manipulated by get_file/put_filp/fput.
f_flags: O_XXX flags from the open(2) system call, copied there (with slight modifications by filp_open()) by dentry_open() after clearing O_CREAT, O_EXCL, O_NOCTTY and O_TRUNC - there is no point in storing these flags permanently since they cannot be modified by F_SETFL (or queried by F_GETFL) fcntl(2) calls.
f_mode: a combination of userspace flags and mode, set by dentry_open(). The point of the conversion is to store read and write access in separate bits so one could do easy checks like (f_mode & FMODE_WRITE) and (f_mode & FMODE_READ).
f_pos: a current file position for next read or write to the file. Under i386 it is of type long long, i.e. a 64bit value.
f_reada, f_ramax, f_raend, f_ralen, f_rawin: to support readahead - too complex to be discussed by mortals ;)
f_owner: owner of file I/O to receive asynchronous I/O notifications via SIGIO mechanism (see fs/fcntl.c:kill_fasync()).
f_uid, f_gid - set to user id and group id of the process that opened the file, when the file structure is created in get_empty_filp(). If the file is a socket, used by ipv4 netfilter.
f_error: used by NFS client to return write errors. It is set in fs/nfs/file.c and checked in mm/filemap.c:generic_file_write().
f_version - versioning mechanism for invalidating caches, incremented (using global event) whenever f_pos changes.
private_data: private per-file data which can be used by filesystems (e.g. coda stores credentials here) or by device drivers. Device drivers (in the presence of devfs) could use this field to differentiate between multiple instances instead of the classical minor number encoded in file->f_dentry->d_inode->i_rdev.
Now let us look at file_operations structure which contains the methods that can be invoked on files. Let us recall that it is copied from inode->i_fop where it is set by s_op->read_inode() method. It is declared in include/linux/fs.h:



--------------------------------------------------------------------------------

struct file_operations {
struct module *owner;
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
int (*readdir) (struct file *, void *, filldir_t);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *);
int (*release) (struct inode *, struct file *);
int (*fsync) (struct file *, struct dentry *, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
};


--------------------------------------------------------------------------------


owner: a pointer to the module that owns the subsystem in question. Only drivers need to set it to THIS_MODULE, filesystems can happily ignore it because their module counts are controlled at mount/umount time whilst the drivers need to control it at open/release time.
llseek: implements the lseek(2) system call. Usually it is omitted and fs/read_write.c:default_llseek() is used, which does the right thing (TODO: force all those who set it to NULL currently to use default_llseek - that way we save an if() in llseek())
read: implements read(2) system call. Filesystems can use mm/filemap.c:generic_file_read() for regular files and fs/read_write.c:generic_read_dir() (which simply returns -EISDIR) for directories here.
write: implements write(2) system call. Filesystems can use mm/filemap.c:generic_file_write() for regular files and ignore it for directories here.
readdir: ignored for regular files; implements the readdir(2) and getdents(2) system calls for directories.
poll: implements poll(2) and select(2) system calls.
ioctl: implements driver or filesystem-specific ioctls. Note that generic file ioctls like FIBMAP, FIGETBSZ, FIONREAD are implemented by higher levels so they never reach the f_op->ioctl() method.
mmap: implements the mmap(2) system call. Filesystems can use generic_file_mmap here for regular files and ignore it on directories.
open: called at open(2) time by dentry_open(). Filesystems rarely use this, e.g. coda tries to cache the file locally at open time.
flush: called at each close(2) of this file, not necessarily the last one (see the release() method below). The only filesystem that uses this is the NFS client, which flushes all dirty pages. Note that this can return an error which will be passed back to the userspace that made the close(2) system call.
release: called at the last close(2) of this file, i.e. when file->f_count reaches 0. Although defined as returning int, the return value is ignored by VFS (see fs/file_table.c:__fput()).
fsync: maps directly to fsync(2)/fdatasync(2) system calls, with the last argument specifying whether it is fsync or fdatasync. Almost no work is done by VFS around this, except to map file descriptor to a file structure (file = fget(fd)) and down/up inode->i_sem semaphore. Ext2 filesystem currently ignores the last argument and does exactly the same for fsync(2) and fdatasync(2).
fasync: this method is called when file->f_flags & FASYNC changes.
lock: the filesystem-specific portion of the POSIX fcntl(2) file region locking mechanism. The only bug here is that, because it is called before the fs-independent portion (posix_lock_file()), if it succeeds but the standard POSIX lock code then fails, the lock will never be released at the fs-dependent level.
readv: implements readv(2) system call.
writev: implements writev(2) system call.
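
As an illustration, the file methods for a simple disk-based filesystem are usually assembled from the generic helpers mentioned above. The following is a hypothetical example (the myfs_* names are made up):


--------------------------------------------------------------------------------

/* Hypothetical example: file_operations for regular files and directories on
 * a simple disk-based filesystem, built from generic VFS/page-cache helpers. */
static struct file_operations myfs_file_operations = {
        read:           generic_file_read,     /* read(2) via the page cache  */
        write:          generic_file_write,    /* write(2) via the page cache */
        mmap:           generic_file_mmap,     /* mmap(2) via the page cache  */
        fsync:          myfs_sync_file,        /* made-up fs-specific fsync   */
};

static struct file_operations myfs_dir_operations = {
        read:           generic_read_dir,      /* read(2) on a directory returns -EISDIR */
        readdir:        myfs_readdir,          /* made up: fills in entries via filldir  */
        fsync:          myfs_sync_file,
};


--------------------------------------------------------------------------------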