mirror of
git://git.9front.org/plan9front/plan9front
synced 2025-01-12 11:10:06 +00:00
1163 lines
31 KiB
Text
1163 lines
31 KiB
Text
.HTML "Fossil, an Archival File Server
|
|
... .FP times
|
|
... .fp 1 R R.nomath
|
|
... .fp 5 CW LucidaSansCW83
|
|
.TL
|
|
Fossil, an Archival File Server
|
|
.AU
|
|
Sean Quinlan
|
|
Jim McKie
|
|
Russ Cox
|
|
jmk,rsc@plan9.bell-labs.com
|
|
.AB
|
|
This paper describes the internals and
|
|
operation of Fossil, an archival file server built for Plan 9.
|
|
Fossil has not yet replaced the current Plan 9 file server
|
|
and
|
|
.CW kfs ,
|
|
but that is our eventual intent.
|
|
Both fossil and this documentation are
|
|
works in progress. Comments on either are most welcome.
|
|
.AE
|
|
.de HP
|
|
.LP
|
|
..
|
|
.NH 1
|
|
Introduction
|
|
.HP
|
|
Fossil is an archival file server built for Plan 9.
|
|
In a typical configuration, it maintains a traditional file system
|
|
in a local disk partition and periodically archives snapshots of the file system
|
|
to a Venti server. These archives are made available through
|
|
a file system interface.
|
|
Fossil can also be run without a Venti server, in which case the
|
|
snapshots (if any) occupy local disk space.
|
|
.PP
|
|
The bulk of this paper explains the underlying data structures:
|
|
Venti trees, the Venti archival file system format, and finally Fossil's
|
|
file system format.
|
|
The end of the paper discusses the architecture of the Fossil server.
|
|
.PP
|
|
The presentation of the data structures is very detailed, perhaps
|
|
too detailed for most readers.
|
|
The intent is to record all the details necessary to make structural
|
|
changes to the file system format.
|
|
Feel free to jump ahead when boredom strikes.
|
|
.NH 1
|
|
Venti trees and directory hierarchies
|
|
.HP
|
|
Venti [3] is an archival block storage server.
|
|
Once a block is stored, it can be retrieved by presenting the 20-byte
|
|
SHA1 hash of its contents, called a
|
|
.I score .
|
|
Blocks on Venti have a maximum length of about 56 kilobytes,
|
|
though in practice smaller blocks are used.
|
|
To store a byte stream of arbitrary length, Venti uses a hash tree.
|
|
Conceptually, the data stream is broken into fixed-size (say,
|
|
.I dsize -byte)
|
|
chunks, which are stored on the Venti server.
|
|
The resulting scores are concatenated into a new pointer stream, which is
|
|
broken into fixed size (say,
|
|
.I psize -byte)
|
|
chunks, which are stored on the Venti server.
|
|
.I Psize "" (
|
|
is different from
|
|
.I dsize
|
|
so that we can ensure that each pointer block holds an
|
|
integral number of pointers.)
|
|
This yields a new pointer stream, and so on, until there is a single block
|
|
and finally a single score describing the entire tree.
|
|
The resulting structure looks like:
|
|
.PS
|
|
.ps 8
|
|
.vs 10
|
|
boxht=0.1
|
|
boxwid=0.1
|
|
|
|
B0: box invis wid 1 "\f(CWVtDataType\fP"
|
|
move right 0.1
|
|
L0a: box wid 0.2
|
|
move right 0.1
|
|
L0b: box wid 0.2
|
|
move right 0.1
|
|
L0c: box invis wid 0.2 "..."
|
|
move right 0.1
|
|
|
|
L0d: box wid 0.2
|
|
move right 0.1
|
|
L0e: box wid 0.2
|
|
move right 0.2
|
|
L0f: box invis wid 0.2 "..."
|
|
move right 0.2
|
|
|
|
L0g: box wid 0.2
|
|
move right 0.1
|
|
L0h: box wid 0.2
|
|
move right 0.1
|
|
L0i: box invis wid 0.2 "..."
|
|
move right 0.1
|
|
|
|
L0j: box wid 0.2
|
|
move right 0.1
|
|
L0k: box wid 0.2
|
|
move right 0.1
|
|
L0l: box invis wid 0.2 "..."
|
|
move right 0.1
|
|
L0m: box wid 0.2
|
|
|
|
define boxppddd {
|
|
line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
|
|
line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
|
|
X: box invis at 0.1<$1.nw,$1.ne>
|
|
Y: box invis at 0.1<$1.sw,$1.se>
|
|
line -> from 0.5<X,Y> to $2.nw
|
|
X: box invis at 0.3<$1.nw,$1.ne>
|
|
Y: box invis at 0.3<$1.sw,$1.se>
|
|
line -> from 0.5<X,Y> to $3.nw
|
|
"..." at 0.7<$1.w,$1.e>
|
|
}
|
|
|
|
define boxppdddp {
|
|
line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
|
|
line from 0.4<$1.nw,$1.ne> to 0.4<$1.sw,$1.se>
|
|
line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
|
|
X: box invis at 0.1<$1.nw,$1.ne>
|
|
Y: box invis at 0.1<$1.sw,$1.se>
|
|
line -> from 0.5<X,Y> to $2.nw
|
|
X: box invis at 0.3<$1.nw,$1.ne>
|
|
Y: box invis at 0.3<$1.sw,$1.se>
|
|
line -> from 0.5<X,Y> to $3.nw
|
|
"..." at 0.6<$1.w,$1.e>
|
|
X: box invis at 0.9<$1.nw,$1.ne>
|
|
Y: box invis at 0.9<$1.sw,$1.se>
|
|
line -> from 0.5<X,Y> to $4.nw
|
|
}
|
|
|
|
define boxpdddp {
|
|
line from 0.2<$1.nw,$1.ne> to 0.2<$1.sw,$1.se>
|
|
line from 0.8<$1.nw,$1.ne> to 0.8<$1.sw,$1.se>
|
|
X: box invis at 0.1<$1.nw,$1.ne>
|
|
Y: box invis at 0.1<$1.sw,$1.se>
|
|
line -> from 0.5<X,Y> to $2.nw
|
|
"..." at 0.5<$1.w,$1.e>
|
|
X: box invis at 0.9<$1.nw,$1.ne>
|
|
Y: box invis at 0.9<$1.sw,$1.se>
|
|
line -> from 0.5<X,Y> to $3.nw
|
|
}
|
|
|
|
bhd=0.4
|
|
L1abc: box wid 0.5 at 0.5<L0a, L0b>+(0,bhd)
|
|
boxppddd(L1abc, L0a, L0b)
|
|
L1def: box wid 0.5 at 0.5<L0d, L0e>+(0,bhd)
|
|
boxppddd(L1def, L0d, L0e)
|
|
L1ghi: box wid 0.5 at 0.5<L0g, L0h>+(0,bhd)
|
|
boxppddd(L1ghi, L0g, L0h)
|
|
L1jklm: box wid 0.5 at 0.5<L0j, L0k>+(0,bhd)
|
|
boxppdddp(L1jklm, L0j, L0k, L0m)
|
|
B1: box invis wid 1 "\f(CWVtPointerType0\fP" at B0+(0,bhd)
|
|
|
|
L2abcdef: box wid 0.5 at 0.5<L1abc,L1def>+(0,bhd)
|
|
boxppddd(L2abcdef, L1abc, L1def)
|
|
L2ghijklm: box wid 0.5 at 0.5<L1ghi,L1jklm>+(0,bhd)
|
|
boxpdddp(L2ghijklm, L1ghi, L1jklm)
|
|
B2: box invis wid 1 "\f(CWVtPointerType1\fP" at B1+(0,bhd)
|
|
|
|
L3atom: box wid 0.5 at 0.5<L2abcdef, L2ghijklm>+(0,bhd)
|
|
boxpdddp(L3atom, L2abcdef, L2ghijklm)
|
|
B3: box invis wid 1 "\f(CWVtPointerType2\fP" at B2+(0,bhd)
|
|
.PE
|
|
.LP
|
|
The leaves are the original data stream. Those blocks have type
|
|
.CW VtDataType .
|
|
The first pointer stream has type
|
|
.CW VtPointerType0 ,
|
|
the next has type
|
|
.CW VtPointerType1 ,
|
|
and so on.
|
|
The figure ends with a single block of type
|
|
.CW VtPointerType2 ,
|
|
but in general trees can have height up to
|
|
.CW VtPointerType6 .
|
|
For a
|
|
.I dsize
|
|
of 8192 bytes
|
|
and
|
|
.I psize
|
|
of 8180 bytes (409 pointers),
|
|
this gives a maximum stream size of approximately 10 zettabytes
|
|
(2\s-2\u73\d\s+2 or 10\s-2\u22\d\s+2 bytes).
|
|
.PP
|
|
Data blocks are truncated to remove trailing runs of zeros before
|
|
storage to Venti; they are zero-filled back to
|
|
.I dsize
|
|
bytes after retrieval from Venti.
|
|
Similarly, trailing runs of pointers to zero-length blocks are
|
|
removed from and added back to pointer blocks.
|
|
These simple rules happen to make it particularly efficient to store
|
|
large runs of zeros, as might occur in a data stream with ``holes:''
|
|
the zero-length block itself can be interpreted as a tree of any
|
|
depth encoding an all-zero data stream.
|
|
.PP
|
|
Reconstructing the data stream requires the score and type of the
|
|
topmost block in the tree, the data chunk size, the pointer chunk size,
|
|
and the data stream size.
|
|
(From the data stream size and the chunk sizes we could derive the
|
|
depth of the tree and thus the type of the topmost block, but it is convenient
|
|
to allow trees that are deeper than necessary.)
|
|
This information is kept in a 40-byte structure called a
|
|
.CW VtEntry :
|
|
.P1
|
|
VtEntry:
|
|
.ta +\w' 'u +\w' 'u
|
|
gen[4] \fRgeneration number\fP
|
|
psize[2] \fRsize of pointer blocks\fP
|
|
dsize[2] \fRsize of data blocks\fP
|
|
flags[1]
|
|
zero[5]
|
|
size[6] \fRlength of file\fP
|
|
score[20] \fRscore of root block in tree\fP
|
|
.P2
|
|
(In this notation,
|
|
.CW name[sz]
|
|
indicates a
|
|
.CW sz -byte
|
|
field called
|
|
.CW name .
|
|
Integers are stored in big-endian order.
|
|
.CW Size
|
|
really is a 48-bit field.)
|
|
.CW Flags
|
|
is made up of the following bit fields.
|
|
.P1
|
|
.ta +\w' 'u +\w' 'u
|
|
0x01 VtEntryActive \fRentry is allocated\fP
|
|
0x02 VtEntryDir \fRentry describes a Venti directory (q.v.)\fP
|
|
0x1C VtEntryDepthMask \fRmask for tree depth\fP
|
|
0x20 VtEntryLocal \fRreserved (q.v.)\fP
|
|
.P2
|
|
.LP
|
|
The depth of the described tree is stored in the 3 bits indicated:
|
|
a tree with a topmost node of type
|
|
.CW VtPointerType3
|
|
has depth 4.
|
|
.PP
|
|
With
|
|
.CW VtEntry
|
|
we can build more complicated data structures,
|
|
ones with multiple or nested data streams.
|
|
A data stream consisting of
|
|
.CW VtEntry
|
|
structures is called a Venti directory.
|
|
It is identical in structure to the Venti data stream
|
|
we described earlier except that the bottom-level type is
|
|
.CW VtDirType ,
|
|
and
|
|
the
|
|
.CW VtEntry
|
|
describing a Venti directory has the
|
|
.CW VtEntryDir
|
|
flag bit set.
|
|
The
|
|
.I dsize
|
|
for a Venti directory
|
|
is a multiple of 40 so that each data chunk holds
|
|
an integer number of
|
|
.CW VtEntry
|
|
structures.
|
|
By analogy with Venti directories,
|
|
we call the original data stream a
|
|
Venti file.
|
|
Note that Venti files are assumed
|
|
.I not
|
|
to contain pointers to other Venti blocks.
|
|
The only pointers to Venti blocks occur in
|
|
.CW VtEntry
|
|
structures in
|
|
Venti directories
|
|
(and in the internal hash tree structure of the
|
|
individual files and directories).
|
|
Note also that these directories are nothing more than pointer lists.
|
|
In particular, there are no names or metadata like in a file system.
|
|
.PP
|
|
To make it easier to pass hierarchies between applications,
|
|
the root of a hierarchy is described in a 300-byte structure
|
|
called a
|
|
.CW VtRoot :
|
|
.P1
|
|
VtRoot:
|
|
.ta +\w' 'u +\w' 'u
|
|
version[2] \f(CW2\fP
|
|
name[128] \fRname of structure (just a comment)\fP
|
|
type[128] \fRstring describing structure (\f(CWvac\fR)\f(CW
|
|
score[20] \fRpointer to \f(CWVtDirType\fP block\f(CW
|
|
blockSize[2] \fRmaximum block size in structure\fP
|
|
prev[20] \fRprevious \f(CWVtRoot\fP in chain, if any\fP
|
|
.P2
|
|
.LP
|
|
This structure is stored to Venti and its score is passed
|
|
between applications, typically in the form
|
|
``\fItype\f(CW:\fIrootscore\fR,''
|
|
where
|
|
.I type
|
|
is the type field from the
|
|
.CW VtRoot
|
|
structure, and
|
|
.I rootscore
|
|
is the score of the
|
|
.CW VtRoot
|
|
block.
|
|
.CW VtRoot
|
|
structures can be chained together using the
|
|
.I prev
|
|
field to encode an archival history
|
|
of the data structure.
|
|
.PP
|
|
For example, a small Venti hierarchy might look like:
|
|
.PS
|
|
.ps 8
|
|
.vs 10
|
|
boxwid=0.1
|
|
boxht=0.1
|
|
f=0.9
|
|
mb=0.16
|
|
|
|
VtRoot: [
|
|
right
|
|
B1: box
|
|
move right 0.1
|
|
"\f(CWVtRoot\fP" ljust
|
|
]
|
|
|
|
Root: [
|
|
right
|
|
B1: box fill f
|
|
B2: box fill f
|
|
B3: box fill f
|
|
move right 0.1
|
|
] with .nw at VtRoot.sw+(0.2,-.1)
|
|
Level1: [
|
|
RootMeta: [
|
|
box wid mb
|
|
]
|
|
MetaSource: [
|
|
right
|
|
B1: box wid 5*mb
|
|
] with .nw at RootMeta.sw+(0,-.1)
|
|
|
|
Source: [
|
|
right
|
|
B1: box fill f
|
|
B2: box fill f
|
|
B3: box fill f
|
|
B4: box fill f
|
|
B5: box fill f
|
|
B6: box fill f
|
|
B7: box fill f
|
|
B8: box fill f
|
|
] with .nw at MetaSource.sw+(0,-.1)
|
|
SB1: box invis at Source.B1
|
|
SB2: box invis at Source.B2
|
|
SB3: box invis at Source.B3
|
|
] with .nw at Root.sw+(0.4,-.1)
|
|
Level2: [
|
|
MetaSource: [
|
|
right
|
|
B1: box wid 5*mb
|
|
]
|
|
Source: [
|
|
right
|
|
B1: box fill f
|
|
B2: box fill f
|
|
B3: box fill f
|
|
B4: box fill f
|
|
B5: box fill f
|
|
B6: box fill f
|
|
B7: box fill f
|
|
B8: box fill f
|
|
] with .nw at MetaSource.sw+(0,-.1)
|
|
File: box wid 0.8 with .nw at Source.sw+(0,-.1)
|
|
] with .nw at Level1.sw+(0.6,-.1)
|
|
|
|
line -> from VtRoot.B1 down boxwid/2+0.1+boxwid/2 then to Root.w
|
|
line -> from Root.B3 down boxwid/2+0.1+boxwid/2 then to Level1.RootMeta.w
|
|
line -> from Root.B2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level1.MetaSource.w
|
|
line -> from Root.B1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level1.Source.w
|
|
|
|
line -> from Level1.SB3 down boxwid/2+0.1+boxwid/2 then to Level2.MetaSource.w
|
|
line -> from Level1.SB2 down boxwid/2+0.1+boxwid+0.1+boxwid/2 then to Level2.Source.w
|
|
line -> from Level1.SB1 down boxwid/2+0.1+boxwid+0.1+boxwid+0.1+boxwid/2 then to Level2.File.w
|
|
|
|
[
|
|
KEY: box wid 1.5 invis "Key"
|
|
line from KEY.sw to KEY.se
|
|
k = -.1
|
|
kk=0.5
|
|
A: [
|
|
box wid 4*boxwid
|
|
"Venti file" ljust with .w at last box .w+(kk,0)
|
|
] with .nw at KEY.sw+(0,2*k)
|
|
B: [
|
|
box fill f
|
|
"Venti entry (\f(CWVtEntry\fP)" ljust with .w at last box .w+(kk,0)
|
|
] with .nw at A.sw+(0,k)
|
|
C: [
|
|
right
|
|
CC: box fill f
|
|
box fill f
|
|
box fill f
|
|
box fill f
|
|
"Venti directory" ljust with .w at CC.w+(kk,0)
|
|
] with .nw at B.sw+(0,k)
|
|
D: [
|
|
line -> right 3*boxwid
|
|
"Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
|
|
] with .nw at C.sw+(0,k)
|
|
] with .nw at VtRoot.nw+(3,0)
|
|
.PE
|
|
.LP
|
|
Venti files are shown as white boxes, while directories are shown
|
|
as shaded boxes. Each shaded square represents a
|
|
.CW VtEntry .
|
|
Arrows represent pointers from
|
|
.CW VtEntry
|
|
structures to other
|
|
Venti files or directories.
|
|
.PP
|
|
The hierarchical structure provided by Venti files and directories
|
|
can be used as the base for more complicated data structures.
|
|
Because this structure captures all the information
|
|
about pointers to other blocks, tools written to traverse
|
|
Venti hierarchies can traverse the more complicated
|
|
data structures as well.
|
|
For example,
|
|
.I venti/copy
|
|
(see
|
|
.I venti (1))
|
|
copies a Venti hierarchy from one Venti server to another,
|
|
given the root
|
|
.CW VtEntry .
|
|
Because the traditional file system described in later sections is
|
|
layered on a Venti hierarchy,
|
|
.I venti/copy
|
|
can copy it without fully understanding its structure.
|
|
.NH 1
|
|
Vac file system format
|
|
.HP
|
|
The Venti archive format
|
|
.I vac
|
|
builds a traditional file system using a Venti hierarchy.
|
|
Each vac file is implemented as a Venti file;
|
|
each vac directory is implemented as a Venti
|
|
directory and a Venti file to provide traditional file system metadata.
|
|
The metadata is stored in a structure called a
|
|
.CW DirEntry :
|
|
.P1
|
|
DirEntry:
|
|
.ta +\w' 'u +\w' 'u
|
|
magic[4] \f(CW0x1c4d9072\fP (DirMagic)\fP
|
|
version[2] \f(CW9\fP
|
|
elem[s] \fRname (final path element only)\fP
|
|
entry[4] \fRentry number for Venti file or directory\fP
|
|
gen[4] \fRgeneration number\fP
|
|
mentry[4] \fRentry number for Venti file holding metadata\fP
|
|
mgen[4] \fRgeneration number\fP
|
|
qid[8] \fRunique file serial number\fP
|
|
uid[s] \fRowner\fP
|
|
gid[s] \fRgroup\fP
|
|
mid[s] \fRlast modified by\fP
|
|
mtime[4] \fRlast modification time\fP
|
|
ctime[4] \fRcreation time\fP
|
|
atime[4] \fRlast access time\fP
|
|
mode[4] \fRmode bits\fP
|
|
.P2
|
|
The notation
|
|
.CW name[s]
|
|
denotes a string stored as a two-byte length
|
|
and then that many bytes.
|
|
The above describes Version 9 of the
|
|
.CW DirEntry
|
|
format. Versions 7 and 8 are very similar; they can be
|
|
read by the current
|
|
.I vac
|
|
source code but are not written.
|
|
Earlier versions were not widespread.
|
|
A
|
|
.CW DirEntry
|
|
may be followed by optional extension sections, though none
|
|
are currently used.
|
|
The
|
|
.CW mode
|
|
bits include bits commonly used by
|
|
Unix and Windows, in addition to those used by Plan 9.
|
|
.PP
|
|
The
|
|
.CW entry
|
|
field is an index into the parallel Venti directory.
|
|
The
|
|
.CW gen
|
|
field must match the
|
|
.CW gen
|
|
field in the corresponding
|
|
.CW VtEntry
|
|
in the directory;
|
|
it is used to detect
|
|
stale indices.
|
|
Similarly,
|
|
.CW mentry
|
|
and
|
|
.CW mgen
|
|
are the index and generation number
|
|
for the metadata Venti file,
|
|
if the
|
|
.CW DirEntry
|
|
describes a vac directory.
|
|
.PP
|
|
The relation between Venti files and directories and
|
|
vac files and directories can be seen in this figure:
|
|
.PS
|
|
.ps 8
|
|
.vs 10
|
|
boxwid=0.1
|
|
boxht=0.1
|
|
f=0.9
|
|
mb=0.16
|
|
|
|
VtRoot: [
|
|
right
|
|
B1: box
|
|
move right 0.1
|
|
"\f(CWVtRoot\fP" ljust
|
|
]
|
|
|
|
SuperRoot: [
|
|
right
|
|
B1: box fill f
|
|
move right 0.1
|
|
"fs root block" ljust
|
|
] with .nw at VtRoot.sw + (0.2, -.2)
|
|
Root: [
|
|
right
|
|
B1: box fill f
|
|
B2: box fill f
|
|
B3: box fill f
|
|
move right 0.1
|
|
"root directory info block" ljust
|
|
] with .nw at SuperRoot.sw+(0.2, -.2)
|
|
Level1: [
|
|
RootMeta: [
|
|
box wid mb
|
|
move right 0.1
|
|
"root metadata" ljust
|
|
]
|
|
MetaSource: [
|
|
right
|
|
B1: box wid mb
|
|
B2: box wid mb
|
|
B3: box wid mb
|
|
B4: box wid mb
|
|
B5: box wid mb
|
|
] with .nw at RootMeta.sw+(0,-.2)
|
|
MB1: box wid mb invis at MetaSource.B1
|
|
MB2: box wid mb invis at MetaSource.B2
|
|
MB3: box wid mb invis at MetaSource.B3
|
|
MB4: box wid mb invis at MetaSource.B4
|
|
MB5: box wid mb invis at MetaSource.B5
|
|
|
|
Source: [
|
|
right
|
|
B1: box fill f
|
|
B2: box fill f
|
|
B3: box fill f
|
|
B4: box fill f
|
|
B5: box fill f
|
|
B6: box fill f
|
|
B7: box fill f
|
|
B8: box fill f
|
|
] with .nw at MetaSource.sw+(0,-.1)
|
|
SB1: box invis at Source.B1
|
|
SB2: box invis at Source.B2
|
|
SB3: box invis at Source.B3
|
|
SB4: box invis at Source.B4
|
|
SB5: box invis at Source.B5
|
|
SB6: box invis at Source.B6
|
|
SB7: box invis at Source.B7
|
|
SB8: box invis at Source.B8
|
|
] with .nw at Root.sw+(0.4,-.2)
|
|
Level2: [
|
|
MetaSource: [
|
|
right
|
|
B1: box wid mb
|
|
B2: box wid mb
|
|
B3: box wid mb
|
|
B4: box wid mb
|
|
B5: box wid mb
|
|
]
|
|
Source: [
|
|
right
|
|
B1: box fill f
|
|
B2: box fill f
|
|
B3: box fill f
|
|
B4: box fill f
|
|
B5: box fill f
|
|
B6: box fill f
|
|
B7: box fill f
|
|
B8: box fill f
|
|
] with .nw at MetaSource.sw+(0,-.1)
|
|
File: box wid 0.8 with .nw at Source.sw+(0,-.2)
|
|
] with .nw at Level1.sw+(0.6,-.2)
|
|
|
|
line -> from VtRoot.B1 down boxwid/2+0.2+boxwid/2 then to SuperRoot.w
|
|
line -> from SuperRoot.B1 down boxwid/2+0.2+boxwid/2 then to Root.w
|
|
line -> from Root.B3 down boxwid/2+0.2+boxwid/2 then to Level1.RootMeta.w
|
|
line -> from Root.B2 down boxwid/2+0.2+boxwid+0.2+boxwid/2 then to Level1.MetaSource.w
|
|
line -> from Root.B1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level1.Source.w
|
|
|
|
line -> from Level1.SB3 down boxwid/2+0.2+boxwid/2 then to Level2.MetaSource.w
|
|
line -> from Level1.SB2 down boxwid/2+0.2+boxwid+0.1+boxwid/2 then to Level2.Source.w
|
|
line -> from Level1.SB1 down boxwid/2+0.2+boxwid+0.1+boxwid+0.2+boxwid/2 then to Level2.File.w
|
|
|
|
arrowwid = arrowwid/2
|
|
arrowht = arrowht/2
|
|
line -> from Level1.MB1 to Level1.SB1.n
|
|
line -> from Level1.MB2 to Level1.SB2.n
|
|
line -> from Level1.MB2 to Level1.SB3.n
|
|
line -> from Level1.MB4 to Level1.SB7.n
|
|
line -> from Level1.MB5 to Level1.SB5.n
|
|
arrowwid = arrowwid * 2
|
|
arrowht = arrowht * 2
|
|
|
|
box dashed with .nw at Level1.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
|
|
box dashed with .nw at Level2.MetaSource.nw+(-.05,.05) wid 0.8+.05*2 ht .3+.05*2
|
|
box dotted with .nw at Level2.File.nw+(-.05,.05) wid 0.8+0.05*2 ht .1+.05*2
|
|
|
|
[
|
|
KEY: box wid 1.5 invis "Key"
|
|
line from KEY.sw to KEY.se
|
|
k = -.1
|
|
kk=0.5
|
|
A: [
|
|
box wid 4*boxwid
|
|
"Venti file" ljust with .w at last box .w+(kk,0)
|
|
] with .nw at KEY.sw+(0,2*k)
|
|
B: [
|
|
box fill f
|
|
"Venti entry (\f(CWEntry\fP)" ljust with .w at last box .w+(kk,0)
|
|
] with .nw at A.sw+(0,k)
|
|
C: [
|
|
right
|
|
CC: box fill f
|
|
box fill f
|
|
box fill f
|
|
box fill f
|
|
"Venti directory" ljust with .w at CC.w+(kk,0)
|
|
] with .nw at B.sw+(0,k)
|
|
D: [
|
|
line -> right 3*boxwid
|
|
"Venti pointer (score)" ljust with .w at last line .w+(kk, 0)
|
|
] with .nw at C.sw+(0,k)
|
|
DD: [
|
|
box dotted wid 4*boxwid
|
|
"Vac file" ljust with .w at last box .w+(kk,0)
|
|
] with .nw at D.sw+(0,k)
|
|
E: [
|
|
box wid mb
|
|
"Vac entry (\f(CWDirEntry\fP)" ljust with .w at last box .w+(kk,0)
|
|
] with .nw at DD.sw+(0,k)
|
|
G: [
|
|
box dashed wid 4*boxwid
|
|
"Vac directory" ljust with .w at last box .w+(kk,0)
|
|
] with .nw at E.sw+(0,k)
|
|
H: [
|
|
arrowwid = arrowwid/2
|
|
arrowht = arrowht/2
|
|
line -> right 1.5*boxwid
|
|
"Vac pointer (integer index)" ljust with .w at last line .w+(kk, 0)
|
|
arrowwid = arrowwid * 2
|
|
arrowht = arrowht * 2
|
|
] with .nw at G.sw+(0,k)
|
|
] with .nw at VtRoot.nw+(3,0)
|
|
.PE
|
|
.LP
|
|
In reality, the story is slightly more complicated.
|
|
The metadata file in a Vac directory
|
|
is not just the concatenation of
|
|
.CW DirEntry
|
|
structures.
|
|
Instead, it is the concatenation of
|
|
.CW MetaBlocks .
|
|
A
|
|
.CW MetaBlock
|
|
contains some number of
|
|
.CW DirEntry
|
|
structures along with a sorted index to make it easy
|
|
to look for a particular
|
|
.CW DirEntry
|
|
by its
|
|
.CW elem
|
|
field.
|
|
The details are in the source code.
|
|
.PP
|
|
As shown in the diagram,
|
|
the root directory of the file system is summarized by
|
|
three
|
|
.CW VtEntry
|
|
structures describing
|
|
the Venti directory for the children of the root,
|
|
the Venti file for the metadata describing the children of the root,
|
|
and a Venti file holding metadata for the root directory itself.
|
|
These
|
|
.CW VtEntry
|
|
structures are placed in a Venti directory of their own,
|
|
described by the single
|
|
.CW VtEntry
|
|
in the
|
|
root block.
|
|
.NH 1
|
|
Fossil file system format
|
|
.HP
|
|
Fossil uses the vac format, with some small changes.
|
|
The changes only affect the data on the local disk; the data
|
|
archived to Venti is exactly in vac format.
|
|
.PP
|
|
Blocks stored on local disk may contain scores pointing at local disk
|
|
blocks or at Venti blocks.
|
|
Local block addresses are stored as 20-byte scores in which the first 16 bytes
|
|
are all zero and the last 4 bytes specify a block number in the disk.
|
|
Before a block is archived, all the
|
|
blocks it points to must be archived, and the local scores in the block
|
|
must be changed to Venti scores.
|
|
Using block addresses rather than content hashes for local data
|
|
makes the local file system easier to manage: if a local block's contents
|
|
change, the pointer to the block does not need to change.
|
|
.NH 2
|
|
Snapshots
|
|
.HP
|
|
Fossil is an archival file server.
|
|
It takes periodic snapshots of the file system,
|
|
which are made accessible through the file system.
|
|
Specifically, the active file system is presented in
|
|
.CW /active .
|
|
Ephemeral snapshots (those that are kept on local disk and eventually deleted)
|
|
are presented in
|
|
\f(CW/snapshot/\fIyyyy\f(CW/\fImmdd\f(CW/\fIhhmm\fR,
|
|
where
|
|
.I yyyy
|
|
is the full year,
|
|
.I mm
|
|
is the month number,
|
|
.I dd
|
|
is the day number,
|
|
.I hh
|
|
is the hour,
|
|
and
|
|
.I mm
|
|
is the minute.
|
|
Archival snapshots (those that are archived to Venti and persist forever)
|
|
are presented in
|
|
\f(CW/archive/\fIyyyy\f(CW/\fImmdds\fR,
|
|
where
|
|
.I yyyy ,
|
|
.I mm ,
|
|
and
|
|
.I dd
|
|
are year, month, and day as before,
|
|
and
|
|
.I s
|
|
is a sequence number if more than one
|
|
archival snapshot is done in a day.
|
|
For the first snapshot,
|
|
.I s
|
|
is null.
|
|
For the subsequent snapshots,
|
|
.I s
|
|
is
|
|
.CW .1 ,
|
|
.CW .2 ,
|
|
.CW .3 ,
|
|
etc.
|
|
.PP
|
|
To implement the snapshots, the file server maintains a
|
|
current
|
|
.I epoch
|
|
for the active file system.
|
|
Each local block has a label that records, among other things,
|
|
the epoch in which the block was allocated.
|
|
If a block was allocated in an epoch earlier than the current one,
|
|
it is immutable and treated as copy-on-write.
|
|
Taking a snapshot can be accomplished by
|
|
recording the address of the current root block and then
|
|
incrementing the epoch number.
|
|
Notice that the copy-on-write method makes
|
|
snapshots both time efficient and space efficient.
|
|
The only time cost is waiting for all current file system
|
|
requests to finish and then incrementing a counter.
|
|
After a snapshot, blocks only get copied when they are
|
|
next modified, so the per-snapshot
|
|
space requirement is proportional
|
|
to the amount of new data rather than the total
|
|
size of the file system.
|
|
.PP
|
|
The blocks in the archival snapshots are moved to Venti,
|
|
but the blocks in the ephemeral snapshots take up space
|
|
in the local disk file.
|
|
To allow reclamation of this disk space, the file system
|
|
maintains a
|
|
.I low
|
|
.I epoch ,
|
|
which is the epoch of the earliest ephemeral snapshot
|
|
still available.
|
|
Fossil only allows access to snapshots with epoch numbers
|
|
between the
|
|
low epoch and the current epoch
|
|
(also called the high epoch).
|
|
Incrementing the low epoch thus makes old
|
|
snapshots inaccessible.
|
|
The space required to store those snapshots can then
|
|
be reclaimed, as described below.
|
|
.NH 2
|
|
Local blocks
|
|
.HP
|
|
The bulk of the local disk file is the local blocks.
|
|
Each block has a 14-byte label associated with it, of the format:
|
|
.P1
|
|
Label:
|
|
.ta +\w' 'u +\w' 'u
|
|
state[1] \fRblock state\fP
|
|
type[1] \fRblock type\fP
|
|
epoch[4] \fRallocation epoch\fP
|
|
epochClose[4] \fRclose epoch\fP
|
|
tag[4] \fRrandom tag\fP
|
|
.P2
|
|
.LP
|
|
The
|
|
.CW type
|
|
is an analogue of the block types described earlier,
|
|
though different names are used, to distinguish between
|
|
pointers blocks in a hash tree for a data stream
|
|
and pointer blocks for a directory stream.
|
|
The
|
|
.CW epoch
|
|
was mentioned in the last section.
|
|
The other fields are explained below.
|
|
.PP
|
|
There are two distinguished blocks states
|
|
.CW BsFree
|
|
.CW 0x00 ) (
|
|
and
|
|
.CW BsBad
|
|
.CW 0xFF ), (
|
|
which mark blocks that are available for allocation
|
|
and blocks that are bad and should be avoided.
|
|
If
|
|
.CW state
|
|
is not one of these values, it is a bitwise
|
|
.I or ' `
|
|
of the following flags:
|
|
.P1
|
|
.ta +\w' 'u +\w' 'u
|
|
0x01 BsAlloc \fRblock is in use\fP
|
|
0x02 BsCopied \fRblock has been copied\fP
|
|
0x04 BsVenti \fRblock has been stored on Venti\fP
|
|
0x08 BsClosed \fRblock has been unlinked from active file system\fP
|
|
.P2
|
|
.LP
|
|
The flags are explained as they arise in the discussions below.
|
|
.PP
|
|
It is convenient to store some extra fields in the
|
|
.CW VtEntry
|
|
structure when it describes a Venti file or directory
|
|
stored on local disk.
|
|
Specifically, we set the
|
|
.CW VtEntryLocal
|
|
flag bit
|
|
and then use the bytes 7-16 of the score (which would
|
|
otherwise be zero, since it is a local score) to hold these fields:
|
|
.P1
|
|
.ta +\w' 'u +\w' 'u
|
|
archive[1] \fRboolean: this is an archival snapshot\fP
|
|
snap[4] \fRepoch number if root of snapshot\fP
|
|
tag[4] \fRrandom tag\fP
|
|
.P2
|
|
.LP
|
|
The extended
|
|
.CW VtEntry
|
|
structure is called an
|
|
.CW Entry .
|
|
The
|
|
.CW tag
|
|
field
|
|
in the
|
|
.CW Label
|
|
and the
|
|
.CW Entry
|
|
is used to identify dangling pointers or other file system corruption:
|
|
all the local blocks in a hash tree must
|
|
have tags matching the tag in the
|
|
.CW Entry .
|
|
If this
|
|
.CW Entry
|
|
points at the root of a snapshot,
|
|
the
|
|
.CW snap
|
|
field is the epoch of the snapshot.
|
|
If the snapshot is intended to be archived to Venti,
|
|
the
|
|
.CW archive
|
|
field is non-zero.
|
|
.NH 2
|
|
Block reclamation
|
|
.HP
|
|
The blocks in the active file system form a tree: each
|
|
block has only one parent.
|
|
Once a copy-on-write block
|
|
.I b
|
|
is replaced by its copy, it is no longer
|
|
needed by the active file system.
|
|
At this point,
|
|
.I b
|
|
is unlinked from the active file system.
|
|
We say that
|
|
.I b
|
|
is now
|
|
.I closed :
|
|
it is needed only for snapshots.
|
|
When a block is closed, the
|
|
.CW BsClosed
|
|
bit is set in its state, and the current epoch (called the block's closing epoch)
|
|
is stored in the
|
|
.CW epochClose
|
|
label field.
|
|
(Open blocks have an
|
|
.CW epochClose
|
|
of
|
|
.CW ~0 ).
|
|
.PP
|
|
A block is referenced by snapshots with epochs
|
|
between the block's allocation epoch and its closing epoch.
|
|
Once the file system's low epoch grows to be greater than or equal to the block's
|
|
closing epoch, the block is no longer needed for any snapshots
|
|
and can be reused.
|
|
.PP
|
|
In a typical configuration, where nightly archival snapshots
|
|
are taken and written to Venti, it is desirable to reclaim
|
|
the space occupied by now-archived blocks if possible.
|
|
To do this, Fossil keeps track of whether the pointers
|
|
in each block are unique to that block.
|
|
When a block
|
|
.I bb
|
|
is allocated, a pointer to
|
|
.I bb
|
|
is written into exactly one active block (say,
|
|
.I b ).
|
|
In the absence of snapshots, the pointer to
|
|
.I bb
|
|
will remain unique to
|
|
.I b ,
|
|
so that if the pointer is zeroed,
|
|
.I bb
|
|
can be immediately reused.
|
|
Snapshots complicate this invariant:
|
|
when
|
|
.I b
|
|
is copied-on-write, all its pointers
|
|
are no longer unique to it.
|
|
At time of the copy, the
|
|
.CW BsCopied
|
|
state bit in the block's label
|
|
is set to note the duplication of the pointers contained within.
|
|
.NH 2
|
|
Disk layout
|
|
.HP
|
|
The file system header describes the file system layout and has this format:
|
|
.P1
|
|
.ta +\w' 'u +\w' 'u
|
|
Header:
|
|
magic[4] \fR0x3776AE89 (HeaderMagic)\fP
|
|
version[2] \fR1 (HeaderVersion)\fP
|
|
blockSize[2] \fIfile system block size\fP
|
|
super[4] \fRblock offset of super block\fP
|
|
label[4] \fRblock offset of labels\fP
|
|
data[4] \fRdata blocks\fP
|
|
end[4] \fRend of file system\fP
|
|
.P2
|
|
.LP
|
|
The corresponding file system layout is:
|
|
.PS
|
|
.ps 8
|
|
.vs 9
|
|
boxwid=0.75
|
|
boxht=0.15
|
|
Empty: box "empty" ht 0.25
|
|
Header: box "header" with .n at Empty.s
|
|
Empty2: box "empty" with .n at Header.s
|
|
Super: box "super block" with .n at Empty2.s
|
|
Label: box "label" "blocks" with .n at Super.s ht 0.25
|
|
Data: box "data" "blocks" with .n at Label.s ht 0.3
|
|
" 0" ljust at Empty.ne
|
|
" 128kB" ljust at Header.ne
|
|
" \f5super\fP \(mu \f(CWblockSize\fP" ljust at Super.ne
|
|
" \f5label\fP \(mu \f(CWblockSize\fP" ljust at Label.ne
|
|
" \f5data\fP \(mu \f(CWblockSize\fP" ljust at Data.ne
|
|
" \f5end\fP \(mu \f(CWblockSize\fP" ljust at Data.se
|
|
"" at (-1,0)
|
|
"" at (6,0)
|
|
.PE
|
|
.LP
|
|
The numbers to the right of the blocks are byte offsets
|
|
of the boundaries.
|
|
.LP
|
|
The super block describes the file system itself and looks like:
|
|
.P1
|
|
.ta +\w' 'u +\w' 'u
|
|
Super:
|
|
magic[4] \fR0x2340A3B1 (SuperMagic)\fP
|
|
version[2] \fR1 (SuperVersion)\fP
|
|
epochLow[4] \fRfile system low epoch\fP
|
|
epochHigh[4] \fRfile system high (active) epoch\fP
|
|
qid[8] \fRnext qid to allocate\fP
|
|
active[4] \fRdata block number: root of active file system\fP
|
|
next[4] \fRdata block number: root of next file system to archive\fP
|
|
current[4] \fRdata block number: root of file system currently being archived\fP
|
|
last[20] \fRVenti score of last successful archive\fP
|
|
name[128] \fRname of file system (just a comment)\fP
|
|
.P2
|
|
.LP
|
|
.NH 1
|
|
Fossil server
|
|
.HP
|
|
The Fossil server is a user-space program that runs on a standard Plan 9 kernel.
|
|
.NH 2
|
|
Process structure
|
|
.PP
|
|
The file server is structured as a set of processes synchronizing
|
|
mostly through message passing along queues.
|
|
The processes are given names, which can be seen in the output of
|
|
.CW ps
|
|
.CW -a .
|
|
.PP
|
|
.CW Listen
|
|
processes announce on various network addresses.
|
|
A
|
|
.CW con
|
|
process handles each incoming connection, reading 9P requests
|
|
and adding them to a central message queue.
|
|
.CW Msg
|
|
processes remove 9P requests from the queue,
|
|
handle them, and write the responses to the appropriate
|
|
file descriptors.
|
|
.PP
|
|
The
|
|
.CW disk
|
|
process handles disk I/O requests made by the other processes.
|
|
The
|
|
.CW flush
|
|
process writes dirty blocks from the in-memory block cache to disk.
|
|
The
|
|
.CW unlink
|
|
process frees previously linked blocks once the blocks that point at them
|
|
have been written to disk.
|
|
.PP
|
|
A
|
|
.CW consI
|
|
reads from each console file (typically a pipe posted in
|
|
.CW /srv ),
|
|
adding the typed characters to the input queue.
|
|
The
|
|
.CW cons
|
|
process echoes input and runs the commands, saving
|
|
output in a ring buffer.
|
|
Because there is only one
|
|
.CW cons
|
|
process, only one console command may be executing at a time.
|
|
A
|
|
.CW consO
|
|
process copies this ring buffer to each console file.
|
|
.PP
|
|
The
|
|
.CW periodic
|
|
process runs periodic events, like
|
|
flushing the root metadata to disk or
|
|
taking snapshots of the file system.
|
|
.NH 2
|
|
Block cache
|
|
.HP
|
|
Fossil maintains an in-memory block cache which
|
|
holds both local disk blocks and Venti blocks.
|
|
Cache eviction follows a least recently used policy.
|
|
Dirty blocks are restricted to at most half the cache.
|
|
This can be changed by editing
|
|
.CW DirtyPercentage
|
|
in
|
|
.CW dat.h .
|
|
.PP
|
|
The block cache uses soft updates [1] to ensure that the on-disk
|
|
file system is always self-consistent.
|
|
Thus there is no
|
|
.I halt
|
|
console command
|
|
and no need to check a file system
|
|
that was shut down without halting.
|
|
.NH 2
|
|
Archiving
|
|
.HP
|
|
A background process writes blocks in archival snapshots to Venti.
|
|
Although
|
|
.CW /archive/\fIyyyy\fP/\fImmdds\fR
|
|
is a copy of only
|
|
.CW /active
|
|
at the time of the snapshot,
|
|
the archival process archives the
|
|
entire file tree rather than just
|
|
the subtree rooted at
|
|
.CW /active .
|
|
The snapshots
|
|
.CW /snapshot/\fIyyyy\fP/\fImmdd\fP/\fIhhmm
|
|
are stored as empty directories.
|
|
Once all the blocks have been archived,
|
|
a
|
|
.CW VtRoot
|
|
header for the file system is archived.
|
|
The score of that header is recorded in
|
|
.CW super.score
|
|
and also printed on the file server console.
|
|
The score can used by
|
|
.I flfmt
|
|
to restore a file system (see
|
|
.I fossil (4)).
|
|
.NH 2
|
|
Contrast with the old file server
|
|
.HP
|
|
The most obvious difference between Fossil and the
|
|
old Plan 9 file server [2] is that Fossil uses a Venti server as
|
|
its archival storage in place of a WORM juke box.
|
|
There are a few other architectural differences to be
|
|
aware of.
|
|
.PP
|
|
Fossil is a user-level program run on a standard kernel.
|
|
.PP
|
|
Fossil does not have any way to concatenate, stripe, or
|
|
mirror disk files. For functionality similar to the old file server's
|
|
configuration strings, use the experimental file stack device
|
|
(see
|
|
.I fs (3)).
|
|
.PP
|
|
Fossil speaks only 9P2000. Old 9P (aka 9P1) is not supported.
|
|
.PP
|
|
... XXX words about converting an old file system to fossil?
|
|
.NH 1
|
|
References
|
|
.LP
|
|
[1] Gregory R. Ganger, Marshall Kirk McKusick, Craig A. N. Soules,
|
|
and Yale N. Patt.
|
|
``Soft Updates: A Solution to the Metadata Update Problem
|
|
in File Systems,''
|
|
.I "ACM Transactions on Computer Systems" ,
|
|
Vol 18., No. 2, May 2000, pp. 127\-153.
|
|
.LP
|
|
[2] Sean Quinlan, ``A Cached WORM File System,''
|
|
.I "Software\(emPractice and Experience" ,
|
|
Vol 21., No 12., December 1991, pp. 1289\-1299.
|
|
.LP
|
|
[3] Sean Quinlan and Sean Dorward, ``Venti: A New Approach to Archival Storage,''
|
|
.I "Usenix Conference on File and Storage Technologies" ,
|
|
2002.
|