.TH VENTI 8
.SH NAME
venti \- archival storage server
.SH SYNOPSIS
.B venti/venti
[
.B -Ldsw
]
[
.B -a
.I address
]
[
.B -B
.I blockcachesize
]
[
.B -c
.I config
]
.PP
.B " "
[
.B -C
.I lumpcachesize
]
[
.B -h
.I httpaddress
]
[
.B -I
.I indexcachesize
]
[
.B -W
.I webroot
]
.SH DESCRIPTION
Venti is a SHA1-addressed archival storage server.
See
.IR venti (7)
for a full introduction to the system.
This page documents the structure and operation of the server.
.PP
A venti server requires multiple disks or disk partitions,
each of which must be properly formatted before the server
can be run.
.SS Disk
The venti server maintains three disk structures, typically
stored on raw disk partitions:
the append-only
.IR "data log" ,
which holds, in sequential order,
the contents of every block written to the server;
the
.IR index ,
which helps locate a block in the data log given its score;
and optionally the
.IR "bloom filter" ,
a concise summary of which scores are present in the index.
The data log is the primary storage.
To improve robustness, it should be stored on
a device that provides RAID functionality.
The index and the bloom filter are optimizations
employed to access the data log efficiently and can be rebuilt
if lost or damaged.
.PP
The data log is logically split into sections called
.IR arenas ,
typically sized for easy offline backup
(e.g., 500MB).
A data log may comprise many disks, each storing
one or more arenas.
Such disks are called
.IR "arena partitions" .
Arena partitions are filled in the order given in the configuration.
.PP
The index is logically split into block-sized pieces called
.IR buckets ,
each of which is responsible for a particular range of scores.
An index may be split across many disks, each storing many buckets.
Such disks are called
.IR "index sections" .
.PP
The index must be sized so that no bucket is full.
When a bucket fills, the server must be shut down and
the index made larger.
Since scores appear random, each bucket will contain
approximately the same number of entries.
Index entries are 40 bytes long. Assuming that a typical block
being written to the server is 8192 bytes and compresses to 4096
bytes, the active index is expected to be about 1% of
the active data log.
Storing smaller blocks increases the relative index footprint;
storing larger blocks decreases it.
To allow variation in both block size and the random distribution
of scores to buckets, the suggested index size is 5% of
the active data log.
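.PP
As a rough worked example of the estimate above
(the 100GB data log is a hypothetical figure, not a recommendation):
.IP
.EX
index entry size           40 bytes
stored (compressed) block  4096 bytes
expected index fraction    40 / 4096 = about 1%

active data log            100GB (assumed)
expected active index      about 1GB
suggested index size       5% of 100GB = 5GB
.EE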
.PP
The (optional) bloom filter is a large bitmap that is stored on disk but
also kept completely in memory while the venti server runs.
It helps the venti server efficiently detect scores that are
.I not
already stored in the index.
The bloom filter starts out zeroed.
Each score recorded in the bloom filter is hashed to choose
.I nhash
bits to set in the bloom filter.
A score is definitely not stored in the index if any of its
.I nhash
bits is not set.
The bloom filter thus has two parameters:
.I nhash
(maximum 32)
and the total bitmap size
(maximum 512MB, 2\s-2\u32\d\s+2 bits).
.PP
The bloom filter should be sized so that
.I nhash
\(mu
.I nblock
\(<=
0.7 \(mu
.IR b ,
where
.I nblock
is the expected number of blocks stored on the server
and
.I b
is the bitmap size in bits.
The false positive rate of the bloom filter when sized
this way is approximately 2\s-2\u\-\fInhash\fR\d\s+2.
Values of
.I nhash
less than 10 are not very useful;
values greater than 24 are probably a waste of memory.
.I Fmtbloom
(see
.IR venti-fmt (8))
can be given either
.I nhash
or
.IR nblock ;
if given
.IR nblock ,
it will derive an appropriate
.IR nhash .
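.PP
A sketch of the sizing rule, using hypothetical values of
.I nblock
and
.I nhash
chosen only for illustration:
.IP
.EX
nblock = 10,000,000 blocks (assumed)
nhash  = 16 (assumed)

b >= nhash * nblock / 0.7
   = 160,000,000 / 0.7
   = about 229,000,000 bits
   = about 28.6 million bytes of bitmap

false positive rate = about 2^(-16), or roughly 1 in 65,000
.EE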
.SS Memory
Venti can make effective use of large amounts of memory
for various caches.
.PP
The
.I "lump cache
holds recently-accessed venti data blocks, which the server refers to as
.IR lumps .
The lump cache should be at least 1MB but can profitably be much larger.
The lump cache can be thought of as the level-1 cache:
read requests handled by the lump cache can
be served instantly.
.PP
The
.I "block cache
holds recently-accessed
.I disk
blocks from the arena partitions.
The block cache needs to be able to simultaneously hold two blocks
from each arena plus four blocks for the currently-filling arena.
The block cache can be thought of as the level-2 cache:
read requests handled by the block cache are slower than those
handled by the lump cache, since the lump data must be extracted
from the raw disk blocks and possibly decompressed, but no
disk accesses are necessary.
.PP
The
.I "index cache
holds recently-accessed or prefetched
index entries.
The index cache needs to be able to hold index entries
for three or four arenas, at least, in order for prefetching
to work properly. Each index entry is 50 bytes.
Assuming 500MB arenas of
128,000 blocks that are 4096 bytes each after compression,
each arena needs about 6.4MB of index cache,
so the minimum index cache size is about 19MB to 26MB.
The index cache can be thought of as the level-3 cache:
read requests handled by the index cache must still go
to disk to fetch the arena blocks, but the costly random
access to the index is avoided.
.PP
The size of the index cache determines how long venti
can sustain its `burst' write throughput, during which time
the only disk accesses on the critical path
are sequential writes to the arena partitions.
For example, if you want to be able to sustain 10MB/s
for an hour, you need enough index cache to hold entries
for 36GB of blocks. Assuming 8192-byte blocks,
you need room for almost five million index entries.
Since index entries are 50 bytes each, you need 250MB
of index cache.
If the background index update process can make a single
pass through the index in an hour, which is possible,
then you can sustain the 10MB/s indefinitely (at least until
the arenas are all filled).
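.PP
The arithmetic behind the burst example, spelled out
(the text above rounds the results up):
.IP
.EX
10MB/s * 3600 s            = 36GB written in an hour
36GB / 8192 bytes          = about 4.4 million blocks
4.4 million * 50 bytes     = about 220MB of index entries
rounded up: 5 million * 50 = 250MB of index cache
.EE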
.PP
The
.I "bloom filter
requires memory equal to its size on disk,
as discussed above.
.PP
A reasonable starting allocation is to
divide memory equally (in thirds) between
the bloom filter, the index cache, and the lump and block caches;
the third of memory allocated to the lump and block caches
should be split unevenly, with more (say, two thirds)
going to the block cache.
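.PP
For instance, on a hypothetical machine that dedicates 1536MB
of memory to venti, the split suggested above would give
the bloom filter 512MB (its size on disk),
and the remaining two thirds could be expressed with the
configuration parameters described below
(the figures are illustrative, not tuned values):
.IP
.EX
icmem 512m
bcmem 341m
mem 171m
.EE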
.SS Network
The venti server announces two network services, one
(conventionally TCP port
.BR venti ,
17034) serving
the venti protocol as described in
.IR venti (7),
and one serving HTTP
(conventionally TCP port
.BR http ,
80).
.PP
The venti web server provides the following
URLs for accessing status information:
.TP
.B /index
A summary of the usage of the arenas and index sections.
.TP
.B /xindex
An XML version of
.BR /index .
.TP
.B /storage
Brief storage totals.
.TP
.BI /set/ variable
The current integer value of
.IR variable .
Variables are:
.BR compress ,
whether or not to compress blocks
(for debugging);
.BR logging ,
whether to write entries to the debugging logs;
.BR stats ,
whether to collect run-time statistics;
.BR icachesleeptime ,
the time in milliseconds between successive updates
of megabytes of the index cache;
and
.BR arenasumsleeptime ,
the time in milliseconds between reads while
checksumming an arena in the background.
The two sleep times should be (but are not) managed by venti;
they exist to provide more experience with their effects.
The other variables exist only for debugging and
performance measurement.
.TP
.BI /set/ variable / value
Set
.I variable
to
.IR value .
.TP
.BI /graph/ name / param / param / \fR...
A PNG image graphing the named run-time statistic over time.
The details of names and parameters are undocumented;
see
.B httpd.c
in the venti sources.
.TP
.B /log
A list of all debugging logs present in the server's memory.
.TP
.BI /log/ name
The contents of the debugging log with the given
.IR name .
.TP
.B /flushicache
Force venti to begin flushing the index cache to disk.
The response will not be sent until the flush
has completed.
.TP
.B /flushdcache
Force venti to begin flushing the arena block cache to disk.
The response will not be sent until the flush
has completed.
.PD
.PP
Requests for other files are served by consulting a
directory named in the configuration file
(see
.B webroot
below).
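.PP
For example, assuming the server's HTTP service is reachable at
.B http://$sysname
as in the example below, a session that reads a variable,
sets it, and forces an index cache flush might look like this
(the value
.B 1
is an assumed setting, not a documented one):
.IP
.EX
% hget http://$sysname/set/logging
% hget http://$sysname/set/logging/1
% hget http://$sysname/flushicache
.EE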
.SS Configuration File
A venti configuration file
enumerates the various index sections and
arenas that constitute a venti system.
The components are indicated by the name of the file, typically
a disk partition, in which they reside. The configuration
file is the only place where file names are used. Internally,
venti uses the names assigned when the components were formatted
with
.I fmtarenas
or
.I fmtisect
(see
.IR venti-fmt (8)).
In particular, only the configuration needs to be
changed if a component is moved to a different file.
.PP
The configuration file consists of lines in the form described below.
Lines starting with
.B #
are comments.
.TP
.BI index " name
Names the index for the system.
.TP
.BI arenas " file
.I File
is an arena partition, formatted using
.IR fmtarenas .
.TP
.BI isect " file
.I File
is an index section, formatted using
.IR fmtisect .
.PP
After formatting a venti system using
.IR fmtindex ,
the order of arenas and index sections should not be changed.
Additional arenas can be appended to the configuration;
run
.I fmtindex
with the
.B -a
flag to update the index.
.PP
The configuration file also holds configuration parameters
for the venti server itself.
These are:
.TF httpaddr netaddr
.TP
.BI mem " size
lump cache size
.TP
.BI bcmem " size
block cache size
.TP
.BI icmem " size
index cache size
.TP
.BI addr " netaddr
network address to announce venti service
(default
.BR tcp!*!venti )
.TP
.BI httpaddr " netaddr
network address to announce HTTP service
(default
.BR tcp!*!http )
.TP
.B queuewrites
queue writes in memory
(default is not to queue)
.TP
.BI webroot " dir
directory tree containing files for HTTP server
to consult for unrecognized URLs
.PD
.PP
The units for the various cache sizes above can be specified by appending a
.LR k ,
.LR m ,
or
.LR g
(case-insensitive)
to indicate kilobytes, megabytes, or gigabytes respectively.
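.PP
A sketch of a configuration that also sets the server parameters;
the file names, the HTTP port, and the
.B webroot
directory are hypothetical:
.IP
.EX
# index and its components
index main
isect /tmp/disks/isect0
arenas /tmp/disks/arenas
# server parameters
mem 10m
bcmem 20m
icmem 30m
addr tcp!*!venti
httpaddr tcp!*!8000
webroot /usr/web/venti
queuewrites
.EE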
.SS Command Line
Many of the options to Venti duplicate parameters that
can be specified in the configuration file.
The command line options override those found in a
configuration file.
Additional options are:
.TP
.BI -c " config
The server configuration file
(default
.BR venti.conf )
.TP
.B -d
Produce various debugging information on standard error.
Implies
.BR -s .
.TP
.B -L
Enable logging. By default all logging is disabled.
Logging slows server operation considerably.
.TP
.B -s
Do not run in the background.
Normally,
the foreground process will exit once the Venti server
is initialized and ready for connections.
.PD
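.PP
For example, a hypothetical invocation that reads a configuration
from an assumed path, overrides the HTTP address,
and stays in the foreground:
.IP
.EX
% venti/venti -c /tmp/venti.conf -h 'tcp!*!8000' -s
.EE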
.SH EXAMPLE
A simple configuration:
.IP
.EX
% cat venti.conf
index main
isect /tmp/disks/isect0
isect /tmp/disks/isect1
arenas /tmp/disks/arenas
mem 10M
bcmem 20M
icmem 30M
%
.EE
.PP
Format the index sections, the arena partition, and
finally the main index:
.IP
.EX
% venti/fmtisect isect0. /tmp/disks/isect0 &
% venti/fmtisect isect1. /tmp/disks/isect1 &
% venti/fmtarenas arenas0. /tmp/disks/arenas &
% wait
% venti/fmtindex venti.conf
%
.EE
.PP
Start the server and check the storage statistics:
.IP
.EX
% venti/venti
% hget http://$sysname/storage
.EE
.SH SOURCE
.B \*9/src/cmd/venti/srv
.SH "SEE ALSO"
.IR venti (1),
.IR venti (3),
.IR venti (7),
.IR venti-backup (8),
.IR venti-fmt (8)
.br
Sean Quinlan and Sean Dorward,
``Venti: a new approach to archival storage'',
.I "Usenix Conference on File and Storage Technologies" ,
2002.
.SH BUGS
Setting up a venti server is too complicated.
.PP
Venti should not require the user to decide how to
partition its memory usage.
.PP
Users of shells other than
.IR rc (1)
will not be able to use the program names shown.
One solution is to define
.B "V=$PLAN9/bin/venti"
and then substitute
.B $V/
for
.B venti/
in the paths above.