18 File Access and Control

A large part of every operating system seems to be concerned with data management and file management, and UNIX turns out to be no exception.

Section Four of the source code contains thirteen files.

The first four contain common declarations needed by various of the other routines:

“file.h”

describes the structure of the “file” array;

“filsvs.h”

describes the structure of the “super block” for “mounted” file systems;

“ino.h”

describes the structure of “inodes” recorded on “mounted” devices;

“inode.h”

describes the structure of the “inode” array;

The next two files, “sys2.c” and “sys3.c” contain code for system calls. (“sys1.c” and “sys4.c” were presented in Section Two).

Tne next five files, “rdwri.c”, “subr.c”, “fio.c”, “alloc.c” and “iget.c”, together present the principal routines for file management, and provide a link between the i/o oriented system calls and the basic i/o routines.

The file “nami.c” is concerned with searching directories to convert file pathnames into “inode” references.

Finally, “pipe.c” is the “device driver” for pipes.

18.1 File Characteristics

A UNIX file is conceptually a named character string, stored on one of a variety of peripheral devices (or in the main memory), and accessible via mechanisms appropriate to the usual peripheral devices.

It will be noted that there is no record structure associated with UNIX files. However “newline” characters may be inserted into the file to define substrings analogous to records.

UNIX carries the ideas of device independence to their logical extreme by allowing the file name in effect to determine uniquely all relevant attributes of the file.

18.2 System Calls

The following system calls are provided expressly for file manipulation:

#

Name

Line

#

Name

Line

3

read

5711

14

mknod

5952

4

write

5720

15

chmod

3560

5

open

5765

16

chown

3575

6

close

5846

19

seek

5861

8

creat

5781

21

mount

6086

9

link

5909

22

umount

6144

10

unlink

3510

41

dup

6069

12

chdir

3538

42

pipe

7723

18.3 Control Tables

The arrays “file” and “inode” are essential components of the file access mechanism.

18.4 file (5507)

The array “file” is defined as an array of structures (also named “file”).

An element of the “file” array is considered to be unallocated if “f_count” is zero.

Each “open” or “creat” system call results in the allocation of an element of the “file” array. The address of this element is stored in an element of the calling process’s array “u.u_ofile”. It is the index of the newly allocated element of the latter array which is passed back to the user process. Descendants of a process created by “newproc” inherit the contents of the parent’s “u.u_ofile” array.

Each element of “file” includes a counter, “f_count”, to determine the number of current processes which reference it.

“f_count” is incremented by “newproc” (1878), “dup” (6079) and “falloc” (6857); it is decremented by “closef” (6657) and (if the file can’t be opened) by “openl” (5836).

The “f_flag” (5509) of the “file” element notes whether the file is open for reading and/or writing or whether it is a “pipe” or not. (Further discussion of “pipes” will be deferred till Chapter Twenty-One.)

The “file” structure also contains a pointer, “f_inode” (5511) to an entry in the “inode” table, and a 32 bit integer, “f_offset” (5512), which is a logical pointer to a character within the file.

18.5 inode (5659)

“inode” is defined as an array of structures (also named “inode”).

An element of the “inode” is considered to be unallocated if the reference count, “i_count”, is zero.

At each point in time, “inode” contains a single entry for each file which may be referenced for normal i/o operations, or which is being executed or which has been executed and has the “sticky” bit set, or which is the working directory for some process.

Several “file” table entries may point to a single “inode” entry. The inode entry describes the general disporition of the file.

18.6 Resources Required

Each file requires the dedication of certain system resources. When a file exists, but is not being referenced in any way, it requires:

(a)

a directory entry (16 characters in a directory file);

(b)

a disk “inode” entry (32 characters in a table stored on the disk);

(c)

zero, one or more blocks of disk storage (512 characters each).

In addition if the file is being referenced for some purpose, it requires

(d)

a core “inode” entry (32 characters in the “inode” array);

Finally if a user program has “opened” the file for reading or writing, a number of resources are required:

(e)

a “file” array entry (8 characters);

(f)

an entry in the user program’s “u.u_ofile” array (one word per file, pointing to a “file” array entry);

Mechanisms have to be set up for allocating and deallocating each of these resources in an orderly manner. The following table gives the names of the principal procedures involved:

resource

obtain

free

directory entry

namei

namei

disk “inode” entry

ialloc

ifree

disk storage block

alloc

free

core “inode” entry

iget

iput

“file” table entry

falloc

closef

“u_ofile” entry

ufalloc

close

18.7 Opening a File

When a program wishes to reference a file which already exists, it must “open” the file to create a “bridge” to the file. (Note that in UNIX, processes usually inherit the open files of their parents or predecessors, so that often all needed files are already implicitly open.) If the file does not already exist, it must be “created”.

This second case will be investigated first:

18.8 creat (5781)

5786:

“namei” (7518) converts a pathname into an “inode” pointer. “uchar” is the name of a procedure which recovers the pathname, character by character, from the user program data area;

5787:

A null “inode” pointer indicates either an error or that no file of that name already exists;

5788:

For error conditions, see “CREAT (II)” in the UPM;

5790:

“maknode” (7455) creates a core “inode” via a call on “ialloc” and then initialises it and enters it into the appropriate directory. Note the explicit resetting of the “sticky” bit

18.9 openl (5804)

This procedure is called by “open” (5774) and “creat” (5793, 5795), passing values of the third parameter, “trf”, of 0, 2 and 1 respectively. The value 2 represents the case where no file of the desired name already exists.

5812:

The second parameter, “mode”, can take the values 01 (“FREAD”), 02 (“FWRITE”) or 03 (“FREAD | FWRITE”) when “trf” is 0, but only 02 otherwise;

Whete a file of the desired name already exists, check the access permissions for the desired mode(s) of activity via calls on “access” (6746), which may set “u.u_error” as a side-effect;

5824:

If the file is being “created”, eliminate its previous contents via a call on “itrunc” (7414). The code here could be improved by changing the test to “(trf == 1)”. Verify that this would be so.

5826:

“prele” (7882) is used to “unlock” “inodes”. Where, you may ask, did the “inode” get “locked”, and why?

5827:

Note that “falloc” (6847) calls “ufalloc” (6824) as the first thing it does;

5831:

“ufalloc” leaves the user file identifying number “u.u_ar0[R0]”. Why does this statement occur where it does, instead of after line 5834?

5832:

“openi” (6702) is called to call handlers for special files, in cae any device specific actions are required (for disk files there is no action);

5839:

In the case of an error while making the “file” array entry, the “inode” entry is released by a call on “iput”.

It will be seen that responsibility is quite widely distributed. The “file” table entry is initialised by “falloc” and “openl”; the “inode” table entry, by “iget”, “ialloc” and “maknode”.

Note that “ialloc” clears out the “i_addr” array of a newly allocated “inode” and “itrunc” does the same for a pre-existing “inode”, so that after the “creat” system call, there are no disk blocks associated with the file, now classed as “small”.

18.10 open (5763)

We now turn to consider the case where a program wishes to reference a file which already exists.

“namei” is called (5770) with a second parameter of zero to locate the named file. (“u.u_arg[0]” contains the address in the user space of a character string which defines a file path name.)

“u.u_arg[1]” has to be incremented by one, because there is a mismatch between the user programming conventions and the internal data representations.)

18.11 openl revisited

“trf” is now zero, so access permissions are checked (5813) but the existing file (if any) is not deallocated (5824).

What is a little disconcerting here is that, apart from the call on “falloc” (5827), there is no direct call on any of the “resource allocation” routines. Of course, for an existing file, neither directory entry nor disk “inode” entry nor disk blocks need be allocated. The core “inode” entry is allocated (if necessary) as a side-effect of the call on “namei”, but ... where is it initialised?

18.12 close (5846)

The “close” system call is used to sever explicitly the connection between a user program and a file and thus can be regarded as the inverse of “open”.

The user program’s file identification is passed via r0. The value is validated by “getf” (6619), the “u.u_ofile” entry is erased, and a call is made on “closef”.

18.13 closef (6643)

“closef” is called by “close” (5854) and by “exit” (3230). (The latter is more common since most files do not get closed explicitly but only implicitly when the user program terminates.)

6649:

If the file is a pipe, reset the mode of the pipe and “wakeup” any process which is waiting for the pipe, either for information or for space;

6655:

If this is the last process to reference the file, call “closei” (6672) to handle any special end of file processing for special files and then call “iput”;

6657:

Decrement the “file” entry reference count. If this now zero, the entry is no longer allocated.

18.14 iput (7344)

“closei”, as its last action calls “iput”. This routine is in fact called from many places, whenever a connection to a core “inode” is to be severed and the reference count decremented.

7350:

If the reference count is one at this point, the “inode” is to be released. While this is happening, it should be locked.

7352:

If the number of “links” to the file is zero (or less) the file is to be deallocated (see below);

7357:

“iupdat” (7374) updates the accessed and update times as recorded on the disk “inode”;

7358:

“prele” unlocks the “inode”. Why should it be called here as well as at line 7363?

18.15 Deletion of Files

New files are automatically entered into the file directory as permanent files as soon as they are “opened”. Subsequent “closing” of a file does not automatically cause its deletion. As was seen at line 7352, deletion will occur when the field “i_nlink” of the core “inode” entry is zero. This field is set to one initially by “maknode” (7464) when the file is first created. It may be incremented by the system call “link” (5941) and decremented by the system call “unlink” (3529).

Programs which create temporary “work files” should remove these files before terminating, by executing an “unlink” system call. Note that the “unlink” call does not of itself remove the file. This can only happen when the reference count (“i_count”) is about to be decremented to zero (7350, 7362).

To minimise the problems associated with “temporary” files which survive program or system crashes, programmers should observe the conventions that:

(a)

temporary files should be “unlinked” immediately after they are opened;

(b)

temporary files should always be placed in the “tmp” directory. Unique file names can be generated by incorporating the process’s identifying number into the file name (See “getpid” (3480)).

18.16 Reading and Writing

It is of interest to work through an abbreviated summary of the code which is invoked when a user process performs a “read” system call before examining the code in detail.

   .... read (f, b, n); /*user program/

                 {trap occurs}
   2693 trap

                 {system call #3}
   5711 read ();
   5713   rdwr (PREAD);

Execution of the system call by the user process results in the activation of “trap” running in kernel mode. “trap” recognises system call #3, and calls (via “trapl”) the routine “read”, which calls “rdwr”.

  5731 rdwr

  5736 fp = getf (u.u_ar0[R0];
  5743 u.u_base = u.u_arg[0];
  5744 u.u_count = u.u_arg[1];
  5745 u.u_segflg = 0;
  5751 u.u_offset[1] = f2->f offset[1];
  5752 u.u_offset[0] = fp->f offset[0];
  5754 readi(fp->f inode);
  5756 dpadd(fp->f offset,
              u.u_arg[1]-u.u_count);

“rdwr” includes much code which is common to both “read” and “write” operations. It converts, via “getf” (6619), the file identification supplied by the user process into the address of an entry in the “file” array.

Note that the first parameter of the system call is passed in a different way from the remaining two parameters.

“u.u_segflg” is set to zero to indicate that the operation destination is in the user address space. After “readi” is called with a parameter which is an “inode” pointer, the final accounting is performed by adding the number of characters requested for transfer less the residual number not transferred (left in “u.u_count”) to the file offset.

  6221 readi

  6239 lbn = lshift (u.u_offset, -9);
  6240 on = u.u_offset[1] & 0777;
  6241 n = min (512 - on, u.u_count);
  6248 bn = bmap(ip, lbn);
  6250 dn = ip->i_dev;
  6258 bp = bread (dn, bn);
  6260 iomove (bp, on, n, B READ);
  6261 brelse (bp);

“readi” converts the file offset into two parts: a logical block number, “lbn”, and an index into the block, “on”. The number of characters to be transferred is the minimum of “u.u_count” and the number of characters left in the block (in which case additional block(s) must be read (not shown)) (and the number of characters remaining in the file (this case is not shown)).

“dn” is the device number which is stored within the “inode”. “bn” is the actual block number on the device (disk), which is computed by “bmap” (6415) using “lbn”.

The call on “bread” finds the required block, copying it into core from disk if necessary. “iomove” (6364) transfers the appropriate characters to their destination, and performs accounting chores.

18.17 rdwr (5731)

“read” and “write” perform similar operations and share much code. The two system calls, “read” (5711) and “write” (5720), call “rdwr” immediately to:

5736:

Convert the user program file identification to a pointer in the file table;

5739:

Check that the operation (read or write) is in accordance with the mode with which the file was opened;

5743:

Set up various standard locations in “u” with the appropriate parameters;

5746:

“pipes” get special treatment right from the start!

5755:

Call “readi” or “writei” as appropriate;

5756:

Update the file offset by, and set the value returned to the user program to, the number of characters actually transferred.

18.18 readi (6221)

6230:

If no characters are to be transferred, do nothing;

6232:

Set the “inode” flag to indicate that the “inode” has been accessed;

6233:

If the file is a character special file, call the appropriate device “read” procedure, passing the device identification as parameter;

6238:

Begin a loop to transfer data in amounts up to 512 characters at a time until (6262) either an irrecoverable error condition has been encountered or the requested number of characters has been transferred;

6239:

“lshift” (1410) concatenates the two words of the array “u.u_offset”, shifts right by nine places, and truncates to 16 bits. This defines the “logical block number” of the file which is to be referenced;

6240:

“on” is a character offset within the block;

6241:

“n” is determined initially as the minimum of the number of characters beyond “on” in the block, and the number requested for transfer. (Note that “min” (6339) treats its arguments as unsigned integers.)

6242:

If the file is not a special block file then ...

6243:

Compare the file offset with the current file size;

6246:

Reset “n” as the minimum of the characters requested and the remaining characters in the file;

6248:

Call “bmap” to convert the logical block number for the file to a physical block number for its host device. There will be more on “bmap” shortly. For now, note that “bmap” sets “rablock” as a side effect;

6250:

Set “dn” as the device identification from the “inode”;

6251:

If the file is a special block file then ...

6252:

Set “dn” from the “i_addr” field of the “inode” entry. (Presumably this will nearly always be the same as the “i_dev” field, so why the distinction?)

6253:

Set the “read ahead block” to the next physical block;

6255:

If the blocks of the file are apparently being read sequentially then ...

6256:

Call “breada” to read the desired block and to initiate reading of the “read ahead block”;

6258:

else just read the desired block;

6260:

Call “iomove” to transfer information from the buffer to the user area;

6261:

Return the buffer to the “av”-list.

18.19 writei

6303:

If less than a full block is being written the previous contents of the buffer must be read so that the appropriate part can be preserved, otherwise just get any available buffer;

6311:

There is no “write ahead” facility, but there is a “delayed write” for buffers whose final characters have not been changed;

6312:

If the file offset now points beyond the recorded end of file character, the file has obviously grown bigger!

6318:

Why is it necessary/desirable to set the “IUPD” flag again? (See line 6285.)

18.20 iomove (6364)

The comment at the beginning of this procedure says most of what needs to be said. “copyin”, “copyout”, “cpass” and “passc” may be found at lines 1244, 1252, 6542 and 6517 respectively.

18.21 bmap (6415)

A general description of the function of “bmap” may be found on Page 2 of “FILE SYSTEM (V)” of the UPM.

6423:

Files of more than 2**15 blocks (2**24 characters) are not supported;

6427:

Start with the “small” file algorithm (file is not greater than eight blocks i.e. 4096 characters);

6431:

If the block number is 8 or more, the “small” file must converted into a large file. Note this is a side effect of “bmap”, and should occur only when “bmap” has been called by “writei” (and never by “readi” – see line 6245). Thus all files start life as “small” files and are never explicitly changed to “large” files. Note also that the change is irreversible!

6435:

“alloc” (6956) allocates a block on device “d” from the device’s free list. It then assigns a buffer to this block and returns a pointer to the buffer header;

6438:

The eight buffer addresses in the “i_addr” array for the “inode” are copied into the buffer area and then erased;

6442:

“i_addr[0]” is set to point to the buffer which is set up for a “delayed” write;

6448:

The file is still small. Get the next block if necessary;

6456:

Note the setting of “rablock”;

18.22 Leftovers

You should investigate the following procedures for yourself:

   seek  (5861)   statl (6045)
   sslep (5979)   dup   (6069)
   fstat (6014)   owner (6791)
   stat  (6028)   suser (6811)