제4장 파일의 내부표현
As observed in Chapter 2, every file on a UNIX system has a unique inode. The inode contains the information necessary for a process to access a file, such as file ownership, access rights, file size, and location of the file's data in the file system. Processes access files by a well defined set of system calls and specify a file by a character string that is the path name. Each path name uniquely specifics a file. and the kernel converts the path name to the file’s inode.
This chapter describes the internal structure of files in the UNIX system, and the next chapter describes the system call interface to files. Section 4.1 examines the inode and how the kernel manipulates it, and Section 4.2 examines the internal structure of regular files and how the kernel reads and writes their data. Section 4.3 investigates the structure of directories, the files that allow the kernel to organize the file system as a hierarchy of files, and Section 4.4 presents the algorithm for converting user file names to inodes. Section 4.5 gives the structure of the super block, and Sections 4.6 and 4.7 present the algorithms for assignment of disk inodes and disk blocks to files. Finally, Section 4.8 talks about other file types in the system, namely, pipes and device files.
Figure 4.1. File System Algorithms
The algorithms described in this chapter occupy the layer above the buffer cache algorithms explained in the last chapter (Figure 4.1). The algorithm
iget returns a previously identified inode, possibly reading it from disk via the buffer cache, and the algorithm
iput releases the inode. The algorithm
bmap sets kernel parameters for accessing a file, The algorithm
namei converts a user-level path name to an inode, using the algorithms
freeallocate and free disk blocks for files, and algorithms
ifree assign and free inodes for files.
Inodes exist in a static form on disk, and the kernel reads them into an in-core inode to manipulate them. Disk inodes consist of the following fields:
- File owner identifier. Ownership is divided between an individual owner and a "group" owner and defines the set of users who have access rights to a file. The superuser has access rights to all files in the system.
- File type. Files may be of type regular, directory, character or block special, or FIFO (pipes).
- File access permissions. The system protects files according to three classes: the owner and the group owner of the file, and other users; each class has access rights to read, write, and execute the file, which can be set individually. Because directories cannot be executed, execution permission for a directory gives the right to search the directory for a file name.
- File access times, giving the time the file was last modified, when it was last accessed, and when the inode was last modified.
- Number of links to the file, representing the number of names the file has in the directory hierarchy. Chapter 5 explains file links in detail.
- Table of contents for the disk addresses of data in a file. Although users treat the data in a file as a logical stream of bytes, the kernel saves the data in discontiguous disk blocks. The inode identifies the disk blocks that contain the file's data.
- File Size. Data in a file is addressable by the number of bytes from the beginning of the file, starting from byte offset 0, and the file size is 1 greater than the highest byte offset of data in the file. For example. if a user creates a file and writes only I byte of data at byte offset 1000 in the file, the size of the file is 1001 bytes.
The inode does not specify the path name(s) that access the file.
Figure 4.2. Sample Disk inode
Figure 4.2 shows the disk inode of a sample file. This inode is that of a regular file owned by "mjb,” which contains 6030 bytes. The system permits "mjb" to read, write, or execute the file; members of the group "os" and all other users can only read or execute the file, not write it. The last time anyone read the file was on October 23. 1984, at 1:48 in the afternoon, and the last time anyone wrote the file was on October 22, 1984, at 10:30 in the morning. The inode was last changed on October 23, 1984, at 1:30 in the afternoon, although the data in the file was not written at that time. The kernel encodes the above information in the inode. Note the distinction between writing the contents of an inode to disk and writing the contents of a file to disk. The contents of a file change only when writing it. The contents of an inode change when changing the contents of a file or when changing its owner, permission, or link settings. Changing the contents of a file automatically implies a change to the inode, but changing the inode does not imply that the contents of the file change.
The in-core copy of the inode contains the following fields in addition to the fields of the disk inode:
- The status of the in-core inode, indicating whether
- the inode is locked,
- a process is waiting for the inode to become unlocked,
- the in-core representation of the inode differs from the disk copy as a result of a change to the data in the inode,
- the in-core representation of the file differs from the disk copy as a result of a change to the file data.
- the file is a mount point (Section 5.15).
- The logical device number of the file system that contains the file.
- The inode number. Since inodes are stored in a linear array on disk (recall Section 2.2.1), the kernel identifies the number of a disk inode by its position in the array. The disk inode does not need this field.
- Pointers to other in-core inodes. The kernel links inodes on hash queues and a free list in the same way that it links buffers on buffer hash queues and 00 the buffer free list. A hash queue is identified according to the inode's logical device number and inode number. The kernel can contain at most one in-core copy of a disk inode, but inodes can be simultaneously 00 a hash queue and on the free list.
- A reference count, indicating the number of instances of the file that are active (such as when invoked
Many fields in the inode are analogous to fields in the buffer header, and the management of inodes is similar to the management of buffers. The inode lock, when set, prevents other processes from accessing the inode; other processes set a flag in the inode when attempting to access it to indicate that they should be awakened when the lock is released. The kernel sets other flags to indicate discrepancies between the disk inode and the in-core copy. When the kernel needs to record changes to the file or to the inode, it writes the in-core copy of the inode to disk after examining these flags.
The most striking difference between an in-core inode and a buffer header is the in-core reference count, which counts the number of active instances of the file. An inode is active when a process allocates it, such as when
open()ing a file. An inode is on the free list only if its reference count is 0, meaning that the kernel can reallocate the in-core inode to another disk inode. The free list of inodes thus serves as a cache of inactive inodes: If a process attempts to access a file whose inode is not currently in the in-core inode pool, the kernel reallocates an in-core inode from the free list for its use. On the other hand, a buffer has no reference count; it is on the free list if and only if it is unlocked.
4.1.2 Accessing Inodes
Figure 4.3. Algorithm for Allocation of In-Core Inodes
The kernel identifies particular inodes by their file system and inode number and allocates in-core inodes at the request of higher-level algorithms. The algorithm
iget allocates an in-core copy of an inode (Figure 4.3); it is almost identical to the algorithm
getblk for finding a disk block in the buffer cache. The kernel maps the device number and inode number into a hash queue and searches the queue for the inode. If it cannot find the inode. it allocates one from the free list and locks it. The kernel then prepares to read the disk copy of the newly accessed inode into the in-core copy. It already knows the inode number and logical device and computes the logical disk block that contains the inode according to how many disk inodes fit into a disk block. The computation follows the formula
block num = ((inode number - 1) / number of inodes per block) + start block of inode list
where the division operation returns the integer part of the quotient. For example, assuming that block 2 is the beginning of the inode list and that there are 8 inodes per block then inode number 8 is in disk block 2, and inode number 9 is in disk block 3. If there are 16 inodes in a disk block, then inode number 8 and 9 are in disk block 2, and inode number 17 is the first inode in disk block 3.
When the kernel knows the device and disk block number, it reads the block using the algorithm
bread (Chapter 2), then uses the following formula to compute the byte offset of the inode in the block:
((inode number - 1) modulo (number of inodes per block)) * size of disk inode
For example, if each disk inode occupies 64 bytes and there are 8 inodes per disk block, then inode number 8 starts at byte offset 448 in the disk block. The kernel removes the in-core inode from the free list, places it on the correct hash queue, and sets its in-core reference count to 1. It copies the file type, owner fields, permission settings, link count, file size, and the table of contents from the disk inode to the in-core inode, and returns a locked inode.
The kernel manipulates the inode lock and reference count independently. The lock is set during execution of a system call to prevent other processes from accessing the inode while it is in use (and possibly inconsistent). The kernel releases the lock at the conclusion of the system call: an inode is never locked across system calls. The kernel increments the reference count for every active reference to a file. For example, Section 5.1 will show that it increments the inode reference count when a process
open()s a file. It decrements the reference count only when the reference becomes inactive, for example, when a process
close()s a file. The reference count thus remains set across multiple system calls. The lock is free between system calls to allow processes to share simultaneous access to a file; the reference count remains set between system calls to prevent the kernel from reallocating an active in-core inode. Thus, the kernel can lock and unlock an allocated inode independent of the value of the reference count. System calls other than open allocate and release inodes, as will be seen in Chapter 5.
Returning to algorithm
iget, if the kernel attempts to take an inode from the free list but finds the free list empty, it reports an error. This is different from the philosophy the kernel follows for disk buffers, where a process sleeps until a buffer becomes free: Processes have control over the allocation of inodes at user level via execution of
close() system calls, and consequently the kernel cannot guarantee when an inode will become available. Therefore, a process that goes to sleep waiting for a free inode to become available may never wake up. Rather than leave such a process "hanging," the kernel fails the system call. However, processes do not have such control over buffers: Because a process cannot keep a buffer locked across system calls, the kernel can guarantee that a buffer will become free soon, and a process therefore sleeps until one is available.
The preceding paragraphs cover the case where the kernel allocated an inode that was not in the inode cache. If the inode is in the cache, the process (A) would find it on its hash queue and check if the inode was currently, locked by another process (B). If the inode is locked, process A sleeps, setting a flag in the inode to indicate that it is waiting for the inode to become free. When process B later unlocks the inode. it awakens all processes (including process A) waiting for the inode to become free. When process A is finally able to use the inode, it locks the inode so that other processes cannot allocate it. If the reference count was previously 0, the inode also appears on the free list, so the kernel removes it from there: the inode is no longer free. The kernel increments the inode reference count and returns a locked inode.
To summarize, the
iget algorithm is used toward the beginning of system calls when a process first accesses a file. The algorithm returns a locked inode structure with reference count 1 greater than it had previously been. The in-core inode contains up-to-date information on the state of the file. The kernel unlocks the inode before returning from the system call so that other system calls can access the inode if they wish. Chapter 5 treats these cases in greater detail.
4.1.3 Releasing Inodes
Figure 4.4. Releasing an Inode:
When the kernel releases an inode (algorithm
iput, Figure 4.4), it decrements its in-core reference count. If the count drops to 0, the kernel writes the inode to disk if the in-core copy differs from the disk copy. They differ if the file data has changed, if the file access time has changed, or if the file owner or access permissions have changed. The kernel places the inode on the free list of inodes, effectively caching the inode in case it is needed again soon. The kernel may also release all data blocks associated with the file and free the inode if the number of links to the file is 0.
4.2 일반파일의 구조
As mentioned above, the inode contains the table of contents to locate a file's data on disk. Since each block on a disk is addressable by number, the table of contents consists of a set of disk block numbers. If the data in a file were stored in a contiguous section of the disk (that is, the file occupied a linear sequence of disk blocks), then storing the start block address and the file size in the inode would suffice to access all the data in the file. However, such an allocation strategy would not allow for simple expansion and contraction of file in the file system without running the risk of fragmenting free storage area on the disk. Furthermore, the kernel would have to allocate and reserve contiguous space in the file system before allowing operations that would increase the file size.
Figure 4.5. Allocation of Contiguous Files and Fragmentation of Free Space
For example, suppose a user creates three files, A, B and C, each consisting of 10 disk blocks of storage, and suppose the system allocated storage for the three files contiguously. If the user then wishes to add 5 blocks of data to the middle file, B, the kernel would have to copy file B to a place in the file system that had room for 15 blocks of storage. Aside from the expense of such an operation, the disk blocks previously occupied by file B’s data would be unusable except for files smaller than 10 blocks (Figure 4.5). The kernel could minimize fragmentation of storage space by periodically running garbage collection procedures to compact available storage, but that would place an added drain on processing power.
For greater flexibility, the kernel allocates file space one block at a time and allows the data in a file to be spread throughout the file system. But this allocation scheme complicates the task of locating the data. The table of contents could consist of a list of block numbers such that the blocks contain the data belonging to the file, but simple calculations show that a linear list of file blocks in the inode is difficult to manage. If a logical block contains 1K bytes, then a file consisting of 10K bytes would require an index of 10 block numbers, but a file containing 100K bytes would require an index of 100 block numbers. Either the size of the inode would vary according to the size of the file, or a relatively low limit would have to be placed on the size of a file.
To keep the inode structure small yet still allow large files, the table of contents of disk blocks conforms to that shown in Figure 4.6. The System V UNIX system runs with 13 entries in the inode table of contents, but the principles are independent of the number of entries. The blocks marked "direct" in the figure contain the numbers of disk blocks that contain real data. The block marked "single indirect" refers to a block that contains a list of direct block numbers. To access the data via the indirect block, the kernel must read the indirect block, find the appropriate direct block: entry, and then read the direct block to find the data. The block marked "double indirect" contains a list of indirect block numbers, and the block marked "triple contains a list of double indirect block numbers.
In principle, the method could be extended to support “quadruple indirect block,”“quintuple indirect blocks,” and so on, but the current structure has sufficed in practice. Assume that a logical block on the file system holds 1K bytes and that a block number is addressable by a 32 bit (4 byte) integer. Then a block can hold up to 256 block numbers. The maximum number of bytes that could be held in a file is calculated (Figure 4.1) at well over 16 gigabytes, using 10 direct blocks and 1 indirect, 1 double indirect, and 1 triple indirect block in the inode. Given that the file size field in the inode is 32 bits, the size of a file is effectively limited to 4 gigabytes ().
Processes access data in a file by byte offset. They work in terms of byte counts and view a file as a stream of bytes starting at byte address 0 and going up to the size of the file. The kernel converts the user view of bytes into a view of blocks: The file starts at logical block 0 and continues to a logical block number corresponding to the file size. The kernel accesses the inode and converts the logical file block into the appropriate disk block. Figure 4.8 gives the algorithm
bmap for converting a file byte offset into a physical disk block.
Figure 4.6. Direct and Indirect Blocks in Inode
Figure 4.7. Byte Capacity of a File - 1K Bytes Per Block
Figure 4.8. Conversion of Byte Offset to Block Number in File System
Consider the block layout for the file in Figure 4.9 and assume that a disk block contains 1024 bytes. If a process wants to access byte offset 9000, the kernel calculates that the byte is in direct block 8 in the file (counting from 0). It then accesses block number 367; the 808th byte in that block (starting from 0) is byte 9000 in the file. If a process wants to access byte offset 350,000 in the file, it must access a double indirect block, number 9156 in the figure. Since an indirect block has room for 256 block numbers, the find byte accessed via the double indirect block is byte number 272,384 (256K + 10K); byte number 350,000 in a file is therefore byte number 77,616 of the double indirect block. Since each single indirect block accesses 256K bytes, byte number 350,000 must be in the 0th single indirect block of the double indirect block — block number 331. Since each direct block in a single indirect block contains 1K bytes, byte number 77,616 of a single indirect block is in the 75th direct block in the single indirect block — block number 3333. Finally, byte number 350,000 in the file is at byte number 816 in block 3333.
Figure 4.9. Block layout of a Sample File and its Inode
Examining Figure 4.9 more closely, several block entries in the inode are 0, meaning that the logical block entries contain no data. This happens if no process ever wrote data into the file at any byte offsets corresponding to those blocks and hence the block numbers remain at their initial value, 0. No disk space is wasted for such block. Processes can cause such a block layout in a file by using the
write() system calls, as described in the next chapter. The next chapter also describes how the kernel takes care of
read() system calls that access such blocks.
The conversion or a large byte offset, particularly one that is referenced via the triple indirect block, is an arduous procedure that could require the kernel to access three disk blocks in addition to the inode and data block. Even if the kernel find the blocks in the buffer cache, the operation is still expensive, because the kernel must make multiple requests of the buffer cache and may have to sleep awaiting locked buffers. How effective is the algorithm in practice? That depends on how the system is used and whether the user community and job mix are such that the kernel accesses large files or small files more frequently. It has been observed [Mullender 84], however, that most files on UNIX systems contain less than 10K bytes, and many contain less than 1K bytes! (fn) Since 10K bytes of a file are stored in direct blocks. most file data can be accessed with one disk access. So in spite of the fact that accessing large files is an expensive operation, accessing common-sized files is fast.
Two extensions to the inode structure just described attempt to take advantage of file size characteristics. A major principle in the 4.2 BSD file system implementation [McKusick 84] is that the more data the kernel can access on the disk in a single operation, the faster file access becomes. That argues for having larger logical disk blocks, and the Berkeley implementation allows logical disk blocks of 4K or 8K bytes. But having larger block sizes on disk increases block fragmentation, leaving large portions of disk space unused. For instance, if the logical block size is 8K bytes, then a file of size 12K bytes uses 1 complete block and half of a second block. The other half of the second block (4K bytes) is wasted; no other file can use the space for data storage. If the sizes of files are such that the number of bytes in the last block of a file is uniformly distributed, then the average wasted space is half a block per file; the amount of wasted disk space can be as high as 45% for a file system with logical blocks of size 4K bytes [McKusick 84]. The Berkeley implementation remedies the situation by allocating a block fragment to contain the last data in a file. One disk block can contain fragments belonging to several files. An exercise in Chapter 5 explores some details of the implementation.
The second extension to the classic inode structure described here is to store file data in the inode (see Mullender 84). By expanding the inode to occupy an entire disk block, a small portion of the block can be used for the inode structures and the remainder of the block can store the entire file, in many cases, or the end of a file otherwise. The main advantage is that only one disk access is necessary to get the inode and its data if the file fits in the inode block.
Recall from Chapter 1 that directories are the files that give the file system its hierarchical structure; they play an important role in conversion of a file name to an inode number. A directory is a file whose data is a sequence of entries, each consisting of an inode number and the name of a file contained in the directory. A path name is a null terminated character string divided into separate components by the slash (“/”) character. Each component except the last must be the name of a directory, but the last component may be a non-directory file. UNIX System V restricts component names to a maximum of 14 characters; with 2 byte entry for the inode number, the size of a directory entry is 16 bytes.
Figure 4.10. Directory Layout for
Figure 4.10 depicts the layout or the directory "etc". Every directory contains the file names dot and dot-dot (“.”and “..”) whose inode numbers are those of the directory and its parent directory, respectively. The inode number or “.” in “/etc” is located at offset 0 in the file, and its value is 83. The inode number of “..” is located at offset 16, and its value is 2. Directory entries may be empty, indicated by an inode number of O. For instance, the entry at address 224 in “/etc” is empty, although it once contained an entry for a file named "crash". The program
mkfs initializes a file system so that “.”and “..” of the root directory have the root inode number or the file system.
The kernel stores data for a directory just as it stores data for an ordinary file, using the inode structure and levels of direct and indirect blocks. Processes may read directories in the same way they read regular files, but the kernel reserves exclusive right to write a directory, thus insuring its correct structure. The access permissions of a directory have the following meaning: read permission on a directory allows a process to read a directory; write permission allows a process to create new directory entries or remove old ones (via the
unlink() system calls), thereby altering the contents of the directory, execute permission allows a process to search the directory for a file name (it is meaningless to execute a directory). Exercise 4.6 explores the difference between reading and searching a directory.
4.4 경로이름을 아이노드로 변환
The initial access to a file is by its path name, as in the
chdir()(change directory), or
link()system calls. Because the kernel works internally with inodes rather than with path names, it converts the path names to inodes to access files. The algorithm
namei parses the path name one component at a time, converting each component into an inode based on its name and the directory being searched, and eventually return the inode or the input path name (Figure 4.11).
Recall from Chapter 2 that every process is associated with (resides in) a current directory; the
u area contains a pointer to the current directory inode. The current directory of the first process in the system, process 0, is the root directory. The current directory or every other process starts out as the current directory or its parent process at the time it was created (see Section 5.10). Processes change their current directory by executing the
chdir() (change directory) system call. All path name searches start from the current directory or the process unless the path name starts with the slash character, signifying that the search should start from the root directory. In either case, the kernel can easily find the inode where the path name search starts: The current directory is stored in the process
u area, and the system root inode is stored in a global variable.(fn)
namei uses intermediate inodes as it parses a path name; call them working inodes. The inode where the search starts is the first working inode. During each iteration of the
namei loop, the kernel makes sure that the working inode is indeed that or a directory. Otherwise, the system would violate the assertion that non-directory files can only be leaf nodes of the file system tree. The process must also have permission to search the directory (read permission is insufficient). The user ID of the process must match the owner or group ID of the file, and execute permission must be granted, or the file must allow search to all users. Otherwise the search fails.
Figure 4.11. Algorithm for Conversion of a Path Name to an node
The kernel does a linear search of the directory file associated with the working inode, trying to match the path name component to a directory entry name. Starting at byte offset 0, it converts the byte offset in the directory to the appropriate disk block according to algorithm
bmap and reads the block using algorithm
bread. It searches the block for the path name component, treating the contents of the block as a sequence of directory entries. If it finds a match, it records the inode number of the matched directory entry, releases the block(algorithm
brelse) and the old working inode (algorithm
iput), and allocates the inode of the matched component (algorithm
iget). The new inode becomes the working inode. If the kernel docs not match the path name with any names in the bJock. it releases the block, adjusts the byte offset by the number of bytes in a bJock. converts the new offset to a disk block number (algorithm
bmap), and reads the next block. The kernel repeats the procedure until it matches the path name component with a directory entry name, or until it reaches the end of the directory.
For example, suppose a process wants to open the file “/etc/passwd”. When the kernel starts parsing the file name, it encounters “/” and gets the system root inode. Making root its current working inode, the kernel gathers in the string "etc". After checking that the current inode is that of a directory (“/”) and that the process has the necessary permissions to search it, the kernel searches root for a file whose name is “etc”: It accesses the data in the root directory block by block and searches each block one entry at a time until it locates an entry for “etc”. On finding the entry, the kernel releases the inode for root (algorithm
iput) and allocates the inode for “etc” (algorithm
iget) according to the inode number of the entry just found. After ascertaining that “etc” is a directory and that it has the requisite search permissions, the kernel searches “etc” block by block for a directory structure entry for the file "passwd". Referring to Figure 4.10, it would find the entry for “passwd” as the ninth entry of the directory. On finding it, the kernel releases the inode for “etc”, allocates the inode for "passwd", and — since the path name is exhausted — returns that inode.
It is natural to question the efficiency of a linear search of a directory for a path name component. Ritchie points out (see page 1968 of [Ritchie 78b]) that a linear search is efficient because it is bounded by the size of the directory. Furthermore, early UNIX system implementations did not run on machines with large memory space, so there was heavy emphasis on simple algorithms such as linear search schemes. More complicated search schemes could require a different, more complex, directory structure, and would probably run more slowly on small directories than the linear search scheme.
So far, this chapter has described the structure of a file, assuming that the inode was previously bound 10 a file and that the disk blocks containing the data were already assigned. The next sections cover how the kernel assigns inodes and disk blocks. To understand those algorithms, let us examine the structure of the super block.
The super block consists of the following fields:
- the size of the file system,
- the number of free blocks in the file system,
- a list of free blocks available on the file system,
- the index or the next free block in the free block list,
- the size of the inode list,
- the number of free inodes in the file system,
- a list of free inodes in the file system,
- the index of the next free inode in the free inode list,
- lock fields for the free block and free inode lists,
- a flag indicating that the super block has been modified.
The remainder of this chapter will explain the use of the arrays, indices and locks. The kernel periodically writes the super block to disk if it had been modified so that it is consistent with the data in the file system.
4.6 새로운 파일에 아이노드 할당
iget을 사용하여 아이노드를 가져오며, 아이노드는 이전에 할당된 것이었다. 예를 들어
namei알고리즘에서, 커널은 경로이름과 디렉토리안의 이름과 매칭되는 것으로 아이노드 번호가 결정된다. 또 다른 알고리즘
ialloc은 새로 생성되는 파일에 디스크 아이노드를 할당하는 것이다.
2장에서 언급한 것처럼, 파일 시스템은 아이노드의 선형리스트를 가지고 있다. 만일 아이노드의 타입 필드가 0이라면, 아이노드가 free라는 뜻이다. 프로세스가 새 아이노드가 필요할 때, 커널은 이론적으로 free 아이노드를 아이노드 리스트에서 찾아야 한다. 하지만, 그런 검색 비용은 매우 높으며, 매 아이노드마다 최소 한번의 read 연산(아마도 디스크)이 필요하다. 따라서 성능을 향상하기 위해, 파일 시스템 슈퍼블럭에 free 아이노드 번호를 캐시하고 있는 배열을 저장한다.
그림 4.12. Algorithm for Assigning New Inodes
그림 4.12는 새로운 아이노드를 할당하는
ialloc 알고리즘을 보여준다.
- 커널은 다른 프로세스가 슈퍼블럭의 프리 아이노드 리스트를 lock하고 있는지 검증한다.(이유는 나중에 언급)
- 만일 슈퍼블럭의 아이노드 번호 리스트가 비어있지 않다면, 커널은 다음 아이노드를 할당,
iget을 사용하여 새로이 할당된 디스크 아이노드에 대해 인코어 아이노드를 할당하며(필요하다면 디스크에서 아이노드를 읽음),
- 디스크 아이노드를 인코어 아이노드로 복사,
- 아이노드 필드를 초기화,
- lock된 아이노드를 반환한다.
커널은 디스크 아이노드를 사용중이라고 업데이트 한다. 파일 타입 필드가 0이 아니라면, 디스크 아이노드가 할당되었을 가리킨다. 단순한 경우에서는 커널은 바로 아이노드를 가지지만, 더 많은 검사를 필요로하는 경쟁상태(race condition)가 존재한다(곧 설명함). 경쟁상태를 느슨하게 서술하자면, 몇몇 프로세스가 공통의 자료구조를 변경하고자 할 때, 경쟁상태가 발생하며, 모든 프로세스가 locking 규약을 잘 따르더라도, 프로세스가 실행된 순서에 따라 계산 결과가 달라진다. 여기서, 프로세스가 (다른 프로세스에서) 사용중인 아이노드를 얻을 수 있다는 점을 암시한다. 경쟁상태는 2장에서 정의한 상호배제 문제와 관련이 있다. 단, 락킹 기법은 상호배제 문제를 해결할 수 있지만, 모든 경쟁상태를 해결할 수는 없다.
슈퍼블럭의 프리 아이노드 리스트가 비어있다면, 커널은 디스크를 검색하여, 가능한 많은 프리 아이노드 번호를 슈퍼블럭에 넣는다. 커널은 디스크 상의 아이노드 리스트를 한블럭씩 읽고, 슈퍼블럭의 아이노드 번호 리스트에 가득 채우는데, 커널이 찾은 가장 큰 아이노드번호를 기억해 놓는다. 이 아이노드를 "기억된(remembered)" 아이노드라고 부르자. 슈퍼블럭에 마지막으로 저장된 것이기도 하다. 다음 번에 커널이 프리 아이노드를 찾을 때, 기억된 아이노드를 시작점으로 이용하면, 프리 아이노드가 존재하지 않는 블럭을 읽는 것을 방지하여 시간을 아낄 수 있다. 프리 아이노드의 번호를 획득한 이후에, 아이노드 할당 알고리즘을 처음부터 시작한다. 언제라도 커널이 디스크 아이노드를 할당하면, 슈퍼블럭 안에 기록된 프리 아이노드의 수를 감소시킨다.
Figure 4.13. Two Arrays of Free Inode Numbers
그림 4.13에서 프리 아이노드 번호를 저장한 두 배열, (a)와 (b)를 고려해보자. 만약 커널이 아이노드를 할당하려 할 때, 슈퍼블럭안의 프리 아이노드 리스트가 그림 4.13(a)처럼 생겼다면, 커널은 다음 유효 아이노드 번호를 위한 인덱스를 1만큼 감소시키고, 48번 아이노드를 가지고 온다. 만일 슈퍼블럭 내에 프리 아이노드의 리스트가 그림 4.13(b)의 첫번째 그림과 같다면, 배열이 비어있음을 알아차리고, 프리 아이노드를 가져오기 위해 디스크를 검색하는데, 기억된 아이노드 인 470에서 시작한다. 커널이 슈퍼블럭의 프리 리스트를 가득 채우면, 마지막 아이노드를 다음 검색을 위한 시작 번호로 기억한다. 커널은 디스크에서 가져온 아이노드(471번) 아이노드를 할당하고 하던 일을 계속한다.
Figure 4.14. Algorithm for Freeing Inode
ifree for freeing an inode is much simpler. After incrementing the total number of available inodes in the file system, the kernel checks the lock on the super block. If locked, it avoids race conditions by returning immediately: The inode number is not put into the super block, but it can be found on disk and is available for reassignment. If the list is not locked, the kernel checks if it has room for more inode numbers and, if it does, places the inode number in the list and returns. If the list is full, the kernel may not save the newly freed inode there: It compares the number of the freed inode with that of the remembered inode. If the freed inode number is less than the remembered inode number, it "remembers" the newly freed inode number, discarding the old remembered inode number from the super block. The inode is not lost, because the kernel can find it by searching the inode list on disk. The kernel maintains the super block list such that the last inode it dispenses from the list is the remembered inode. Ideally, there should never be free inodes whose inode number is less than the remembered inode number, but exceptions are possible.
Figure 4.15. Placing Free Inode Numbers into the Super Block
Consider two examples of freeing inodes. If the super block list of free inodes has room for more free inode numbers as in Figure 4.13(a), the kernel places the inode number on the list, increments the index to the next free inode, and proceeds. But if the list of free inodes is full as in Figure 4.15, the kernel compares the inode number it has freed to the remembered inode number that will start the next disk search. Starting with the free inode List in Figure 4.15(a), if the kernel frees inode 499, it makes 499 the remembered inode and evicts number 535 from the free list. If the kernel then frees inode number 601, it does not change the contents of the free list. When it later uses up the inodes in the super block free list, it will search the disk for free inodes starting from inode number 499, and find inodes 535 and 601 again.
Figure 4.16. Race Condition in Assigning Inodes
The preceding paragraph described the simple cases of the algorithms. Now consider the case where the kernel assigns a new inode and then allocates an in-core copy for the inode. The algorithm implies that the kernel could find that the inode had already been assigned. Although rare, the following scenario shows such a case (refer to Figures 4.16 and 4.17). Consider three processes, A, B, and C, and suppose that the kernel, acting on behalf of process A(fn), assigns inode I but goes to sleep before it copies the disk inode into the in-core copy. Algorithms
iget (invoked by
bread (invoked by
iget) give process A ample opportunity to go to sleep. While process A is asleep, suppose process B attempts to assign a new inode but discovers that the super block list of free inodes is empty. Process B searches the disk for free inodes, and suppose it starts its search for free inodes at an inode number lower than that of the inode that A is assigning. It is possible for process B to find inode I free on the disk since process A is still asleep, and the kernel does not know that the inode is about to be assigned. Process B, not realizing the danger, completes its search of the disk, fills up the super block with (supposedly) free inodes, assigns an inode, and departs from the scene. However, inode I is in the super block free list of inode numbers. When process A wakes up, it completes the assignment of inode J. Now suppose process C later requests an inode and happens to pick inode I from the super block free list. When it gets the in-core copy of the inode, it will find its file type set, implying that the inode was already assigned. The kernel checks for this condition and, finding that the inode has been assigned, tries to assign a new one. Writing the updated inode to disk immediately after its assignment in
ialloc makes the chance of the race smaller, because the file type field will mark the inode in use.
Figure 4.17. Race Condition in Assigning Inodes (continued)
Locking the super block list of inodes while reading in a new set from disk prevents other race conditions. If the super block list were not locked, a process could find it empty and try to populate it from disk, occasionally sleeping while waiting for I/O completion. Suppose a second process also tried to assign a new inode and round the list empty. It, too, would try to populate the list from disk. At best, the two processes are duplicating their efforts and wasting CPU power. At worst, race conditions of the type described in the previous paragraph would be more frequent. Similarly, if a process freeing an inode did not check that the list is locked, it could overwrite inode numbers already in the free list while another process was populating it from disk. Again, the race conditions described above would be more frequent. Although the kernel handles them satisfactorily, system performance would suffer. Use of the lock on the super block free list prevents such race conditions.
4.7 디스크 블럭의 할당
When a process writes data to a file, the kernel must allocate disk blocks from the file system for direct data blocks and, sometimes, for indirect blocks. The file system super block contains an array that is used to cache the numbers of free disk blocks in the file system. The utility program
mkfs(make file system) organizes the data blocks of a file system in a linked list, such that each link of the list is a disk block that contains an array of free disk block numbers, and one array entry is the number of the next block of the linked list. Figure 4.18 shows an example of the linked list, where the first block is the super block free list and later blocks on the linked list contain more free block number.
Figure 4.19. Algorithm for Allocating Disk Block
When the kernel wants to allocate a block from a file system (algorithm
alloc, Figure 4.19), it allocates the next available block in the super block list. Once allocated, the block cannot be reallocated until it becomes free. If the allocated block is the last available block in the super block cache, the kernel treats it as a pointer to a block that contains a list free blocks. It reads the block, populates the super block array with the new list of block numbers, and then proceeds to use the original block number. It allocates a buffer for the block and clears the buffer’s data (zeros it). The disk block has now been assigned, and the kernel has a buffer to work with. If the file system contains no free blocks, the calling process receives an error.
If a process
writes a lot of data to a file, it repeatedly asks the system for blocks to store the data, but the kernel assigns only one block at a time. The program
mkft tries to organize the original linked list of free block numbers so that block numbers dispensed to a file are near each other. This helps performance, because it reduces disk seek time and latency when a process reads a file sequentially. Figure 4.18 depicts block number in a regular pattern, presumably based on the disk rotation speed. Unfortunately, the order of block numbers on the free block linked lists breaks down with heavy use as processes files and remove them, because block numbers enter and leave the free list at random. The kernel makes no attempt to sort block numbers on the free list.
Figure 4.18. Linked List of Free Disk Block Numbers
free for freeing a block is the reverse of the one for allocating a block. If the super block list is not full, the block number of the newly freed block is placed on the super block list. If, however, the super block list is full, the newly freed block becomes a link block; the kernel writes the super block list into the block and writes the block to disk. It then places the block number of the newly freed block in the super block list: That block number is the only member of the list.
Figure 4.20. Requesting and Freeing Disk Blocks
Figure 4.20 shows a sequence of
free operations, starting with one entry on the super block free list. The kernel frees block 949 and places the block number on the free list. It then allocates a block and removes block number 949 from the free list. Finally, it allocates a block and removes block number 109 from the free list. Because the super block free list is now empty, the kernel replenishes the list by copying in the contents or block 109, the next link on the linked list. Figure 4.20(d) shows full super block list and the next link block, block 211.
The algorithms for assigning and freeing inodes and disk blocks are similar in that the kernel uses the super block as a cache: containing indices of free resources, block numbers, and inode numbers. It maintains a linked list of block numbers such that every free block number in the file system appears in some element of the linked, but it maintains no such list of free inodes. There are three reasons for the different treatment.
The kernel can determine whether an inode is free by inspection: If the file type field is clear, the inode is free. The kernel needs no other mechanism to describe free inodes. However, it cannot determine whether a block is free just by looking at it. It could not distinguish between a bit pattern that indicates the block is free and data that happened to have that bit pattern. Hence, the kernel requires an external method to identify free blocks, and traditional implementations have used a linked list.
Disk blocks lend themselves to the use of linked list: A disk block easily holds large list of free block numbers. But inodes have no convenient place for bulk storage of large lists of free inode numbers.
Users tend to consume disk block resources more quickly than they consume inodes, so the apparent lag in performance when searching the disk for free inodes is not as critical as it would be for searching for free disk blocks.
4.8 다른 종류의 파일
The UNIX system supports two other file types: pipes and special files. A pipe, sometimes called a fifo(for "first-in-first-out"), differs from a regular file in that its data is transient: Once data is read from a pipe, it cannot be read again. Also, the data is read in the order that it was written to the pipe, and the system allows no deviation from that order. The kernel stores data in a pipe the same way it stores data in an ordinary file, except that it uses only the direct blocks, not the indirect blocks. The next chapter will examine the implementation of pipes.
The last file types in the UNIX system are special files, including block device special files and character device special files. Both types specify devices, and therefore the file inodes do not reference any data. Instead, the inode contains two numbers known as the major and minor device numbers. The major number indicates a device type such as terminal or disk, and the minor number indicates the unit number of the device. Chapter 10 examines special devices in detail.
The inode is the data structure that describes the attributes of a file, including the layout of its data on disk. There are two versions of the inode: the disk copy that stores the inode information when the file is not in use and the in-core copy that records information about active files. Algorithms
ifree control assignment of a disk inode to a file during the
unlink() system calls (next chapter), and the algorithms
iput control the allocation of in-core inodes when a process accesses a file. Algorithm
bmap locales the disk blocks of a file, according to a previously supplied byte offset in the file. Directories are files that correlate file name components to inode numbers. Algorithm
namei converts file names manipulated by processes to inodes, used internally by the kernel. Finally, the kernel controls assignment of new disk blocks to a file using algorithms
The data structures discussed in this chapter consist of linked lists, hash queues, and linear arrays, and the algorithms that manipulate the data structures are therefore simple. Complications arise due to race conditions caused by the interaction of the algorithms, and the text has indicated some of these timing problems. Nevertheless, the algorithms are not elaborate and illustrate the simplicity of the system design.
The structures and algorithms explained here are internal to the kernel and are not visible to the user. Referring to the overall system architecture (Figure 2.1), the algorithms described in this chapter occupy the lower half of the file subsystem. The next chapter examines the system calls that provide the user interface to the file system, and it describes the upper half of the file subsystem that invokes the internal algorithms described here.
'책 이야기 > The Design of The UNIX Operating System' 카테고리의 다른 글
|05 파일 시스템을 위한 시스템 콜 (0)||2017.02.28|
|UNIX OS Design - 04 파일의 내부표현 (0)||2017.02.12|
|UNIX OS Design - 03 버퍼 캐쉬 (0)||2017.01.30|
|UNIX OS Design - 02 커널의 개요 (0)||2017.01.30|
|UNIX OS Design - 01 시스템의 개관 (0)||2017.01.30|
|UNIX OS Design - 00 서문 (0)||2017.01.30|
- DBMS 개발
- vim 자동 넘버링
- 구조와 원리
- OS Internal
- Windows via c/c++
- UNIX Internals
- Design of UNIX Operating System
- 컴퓨터 강좌
- OS 커널