mirror of
git://git.9front.org/plan9front/plan9front
synced 2025-01-12 11:10:06 +00:00
415 lines
14 KiB
Text
415 lines
14 KiB
Text
.HTML "Namespaces as Security Domains"
|
|
.TL
|
|
Namespaces as Security Domains
|
|
.AU
|
|
Jacob Moody
|
|
.AB
|
|
We aim to explore the use of Plan 9 namespaces
|
|
as ways of building isolated processes. We present
|
|
here code for increasing the ability and granularity
|
|
for which a process may isolate itself from others
|
|
on the system.
|
|
.AE
|
|
.SH
|
|
Introduction
|
|
.PP
|
|
.FS
|
|
First presented in a slightly different form at the 9th International Workshop on Plan 9.
|
|
.FE
|
|
.LP
|
|
In a Plan 9 system the kernel exposes hardware and system
|
|
interfaces through a myriad of filesystem trees. These trees, or
|
|
sharp devices, replace the functionality of many would be system calls
|
|
through use of standard file system operations. A standard Plan 9 environment
|
|
is comprised of a composition of these individual devices together, the collection
|
|
of such being the processes namespace.
|
|
.LP
|
|
With these principles it is quite easy for a process to build a slim namespace using only
|
|
what it may need for operation. This could be done in service to reduce the "blast radius"
|
|
of awry or malicious code to some effect. But to be fully effective a process must also be able
|
|
to remove the ability to bootstrap these capabilities back. We will explore different ways of
|
|
building isolated namespaces, their pitfalls, ways to address those issues, along with new solutions.
|
|
.SH
|
|
Outside World
|
|
.LP
|
|
There have been many solutions for sandboxing within the UNIX™
|
|
world. There are more classical approaches such as
|
|
.CW zones ,
|
|
[Price04] and
|
|
.CW jails ,
|
|
that all provide an abstraction of building some number of
|
|
smaller full unix boxes out of a single physical host. However these
|
|
interfaces are presented more as a systems management tool, the mechanisms
|
|
for which an administrator creates and manages these resources is unergonomic
|
|
to use on a per-process basis. Instead it seems more the fashion now to isolate
|
|
specific pieces of the system, and expect it possible that each process on the system
|
|
may choose to manage its environment. The most successful execution of this idea in the
|
|
wild is the OpenBSD project's
|
|
.CW unveil
|
|
and
|
|
.CW pledge
|
|
[Beck18] system calls, allowing a processes to cut off specific parts of the filesystem or
|
|
system call interfaces. Linux namespaces [Biederman06] implement this idea by allowing a process
|
|
to fork off private versions of specific global resources. In both these cases the sandboxing
|
|
of a process is through gradual steps, removing potentially dangerous tools one by one.
|
|
.SH
|
|
Existing Work
|
|
.LP
|
|
Let us first define the resources we are restricting access to. The aforementioned gradual solutions
|
|
provide ways in which a process can remove itself from specific kernel interfaces. In plan9 the kernel
|
|
exposes almost all of its functionality through individual filesystems. These devices are accessed
|
|
globally by prefixing a path with a sharp('#'), and have conventional places they are bound within the
|
|
namespace.
|
|
.LP
|
|
A processes namespace in plan9 is typically constructed using a namespace file. These files
|
|
are a collection of namespace operations formatted as one would expect to see them in a shell script.
|
|
They typically begin by binding in some number of sharp devices in to their expected location.
|
|
.P1
|
|
bind #d /fd
|
|
bind -c #e /env
|
|
bind #p /proc
|
|
bind -c #s /srv
|
|
.P2
|
|
Then using the globals provided, in particular /srv, to bring in the rest of the root filesystem.
|
|
A process can at any point choose to construct itself a new namespace, but it must do so when changing
|
|
users. This is done in part to ensure that each filesystem that the program would like to use has
|
|
their chance to authenticate and be notified. Because this information is only exchanged on attach,
|
|
the new user must construct a namespace from scratch.
|
|
.LP
|
|
Many programs, like network services, wish to drop their current user and become the special user
|
|
.CW none
|
|
user on startup, and in doing so must rebuild their namespace. The conventional default namespace
|
|
files used is /lib/namespace, but most programs allow the user to specify an alternative with a
|
|
flag. It is here that we already can approximate a chroot style environment by changing the root
|
|
filesystem used in a namespace file.
|
|
.P1
|
|
bind #s /srv
|
|
mount /srv/myboot /root
|
|
bind -a /root /
|
|
.P2
|
|
By having another filesystem exposed in /srv/myboot and modifying the provided namespace file,
|
|
we've allowed this process to work within an entirely separate root filesystem.
|
|
.SH
|
|
RFNOMNT
|
|
.LP
|
|
The issue in using these namespaces as security barriers is that there is nothing preventing
|
|
a process from bootstrapping a resource back. While our example code places a different root filesystem
|
|
in the namespace, nothing is preventing that process or its children from potentially rebootstrapping
|
|
the real root filesystem back. For this issue there is a special rfork flag
|
|
.CW RFNOMNT
|
|
the prevents a process from accessing any almost any sharp device of consequence. This is done by
|
|
preventing a process from walking to a device by its location within '#'. This allows existing
|
|
binds of resources to continue working within the namespace but restricts a process from binding
|
|
in new resources from the kernel.
|
|
.LP
|
|
While effective we found this to be too large a hammer in practice. Doing as its name implies
|
|
.CW RFNOMNT
|
|
also prevents a process from performing any mounts or binds. This in practice creates a single
|
|
point in time in which a process gives up all of its control, instead of the idealized gradual
|
|
process. This makes it quite hard to make use of in practice, only a single program in a chain
|
|
may be the one to invoke
|
|
.CW RFNOMNT
|
|
or must hope that no other program further in the chain may want to make use of its namespace.
|
|
The interface itself feels very clunky, there is a nice gradual addition of these kernel devices
|
|
to the namespace why must the removal be all at once?
|
|
.SH
|
|
Chdev
|
|
.LP
|
|
We propose a new write interface through /dev/drivers
|
|
that functionally replaces
|
|
.CW RFNOMNT .
|
|
/dev/drivers now accepts writes in the form of
|
|
.P1
|
|
chdev op devmask
|
|
.P2
|
|
Devmask is a string of sharp device characters. Op specifies how
|
|
devmask is interpreted. Op is one of
|
|
.TS
|
|
lw(1i) lw(4.5i).
|
|
\f(CW&\fP T{
|
|
Permit access to just the devices specified in devmask.
|
|
T}
|
|
\f(CW&~\fP T{
|
|
Permit access to all but the devices specified in devmask.
|
|
T}
|
|
\f(CW~\fP T{
|
|
Remove access to all devices. Devmask is ignored.
|
|
T}
|
|
.TE
|
|
.LP
|
|
This allows a process to selectively remove access to
|
|
sections of sharp devices with quite a bit of control.
|
|
In order to mimic all of
|
|
.CW RFNOMNT 's
|
|
features, removing access to
|
|
.CW devmnt ,
|
|
which is not normally accessible directly,
|
|
disables the processes ability to perform mount
|
|
and bind operations.
|
|
.LP
|
|
For the implementation, we extended the existing
|
|
.CW RFNOMNT
|
|
flag attached to the process namespace group
|
|
into a bit vector. Each bit representing an index
|
|
into
|
|
.CW devtab .
|
|
The following function illustrates how this vector is set.
|
|
.P1
|
|
void
|
|
devmask(Pgrp *pgrp, int invert, char *devs)
|
|
{
|
|
int i, t, w;
|
|
char *p;
|
|
Rune r;
|
|
u64int mask[nelem(pgrp->notallowed)];
|
|
|
|
if(invert)
|
|
memset(mask, 0xFF, sizeof mask);
|
|
else
|
|
memset(mask, 0, sizeof mask);
|
|
|
|
w = sizeof mask[0] * 8;
|
|
for(p = devs; *p != 0;){
|
|
p += chartorune(&r, p);
|
|
t = devno(r, 1);
|
|
if(t == -1)
|
|
continue;
|
|
if(invert)
|
|
mask[t/w] &= ~(1<<t%w);
|
|
else
|
|
mask[t/w] |= 1<<t%w;
|
|
}
|
|
|
|
wlock(&pgrp->ns);
|
|
for(i=0; i < nelem(pgrp->notallowed); i++)
|
|
pgrp->notallowed[i] |= mask[i];
|
|
wunlock(&pgrp->ns);
|
|
}
|
|
.P2
|
|
Devmask is called from the write handler for /dev/drivers. This
|
|
bitmask is then consulted any time a name is resolved that begins
|
|
with '#'. This is done from within the
|
|
.CW namec ()
|
|
function using the following function to check
|
|
if a particular device
|
|
.CW r
|
|
is permitted.
|
|
.P1
|
|
int
|
|
devallowed(Pgrp *pgrp, int r)
|
|
{
|
|
int t, w, b;
|
|
|
|
t = devno(r, 1);
|
|
if(t == -1)
|
|
return 0;
|
|
|
|
w = sizeof(u64int) * 8;
|
|
rlock(&pgrp->ns);
|
|
b = !(pgrp->notallowed[t/w] & 1<<t%w);
|
|
runlock(&pgrp->ns);
|
|
return b;
|
|
}
|
|
.P2
|
|
.LP
|
|
We found that once removal is made to a core verb of these sharp
|
|
devices it becomes easy to start to view access to them
|
|
as capabilities. This is aided by system functionally already neatly
|
|
organized into the various devices themselves. For example, one could
|
|
say a process is capable of accessing the broader internet if it has access
|
|
to the
|
|
.CW devip
|
|
device. This access can either be direct via it's path under '#' or through a
|
|
location in the namespace where this device had already been bound. With these
|
|
changes, the entire capability list of a process is on display through just its
|
|
/proc/$pid/ns file. This
|
|
.CW ns
|
|
file would indicate if a particular device is bound and now also includes
|
|
the list of devices a process has access to.
|
|
.LP
|
|
In practice, this results in a pattern of binding
|
|
in a sharp device, making use of them and removing
|
|
them when no longer needed. A namespace file for
|
|
a web server could now look like
|
|
.P1
|
|
bind #s /srv
|
|
# /srv/www created by srvfs www /lib/www
|
|
mount /srv/www /lib/
|
|
unmount /srv
|
|
chdev -r s # chdev &~ s
|
|
.P2
|
|
In this example we have created a new root for the process by
|
|
using exportfs to expose a little piece of the boot namespace.
|
|
We unmount
|
|
.CW devsrv
|
|
and remove access to it with
|
|
.CW chdev
|
|
ensuring there is no way for our process to talk to the real
|
|
.CW /srv/boot .
|
|
This provides a nice succinct lifetime of access to
|
|
.CW devsrv
|
|
and makes the removal of these sharp devices as easy as
|
|
it is to use them in the first place.
|
|
.LP
|
|
Like
|
|
.CW RFNOMNT ,
|
|
.CW chdev
|
|
does not restrict access to sharp devices that had already been mounted.
|
|
This allows a process to use a subsection or only one piece of
|
|
sharp devices as well. One example of this may be to restrict a process
|
|
to just a single network stack
|
|
.P1
|
|
bind '#I1' /net
|
|
chdev -r I
|
|
.P2
|
|
.SH
|
|
/srv/clone
|
|
.LP
|
|
With this
|
|
.CW chdev
|
|
mechanism, the ability for a device to provide isolation of its
|
|
own became more powerful. Partially illustrated in the previous
|
|
.CW devip
|
|
example.
|
|
.CW Devsrv ,
|
|
the sharp device providing named pipes, was an ideal target for
|
|
adding isolation. Devsrv provides a bulletin board of all posted 9p services
|
|
for a given host. We wanted to provide a mechanism for a process, or
|
|
family tree of process to share a private
|
|
.CW devsrv
|
|
between themselves.
|
|
.LP
|
|
The design for this was borrowed from devip, one in which a process opens a
|
|
.CW clone
|
|
file to read its newly allocated slot number. This new 'board' appears as a sibling directory
|
|
to the
|
|
.CW clone
|
|
it was spawned from. This new board is itself a fully functioning
|
|
.CW devsrv
|
|
with its own clone file, making nesting to full trees of
|
|
.CW srvs
|
|
quite easy, and completely transparent. The following illustrates
|
|
how one could replace their global
|
|
.CW /srv
|
|
with a freshly allocated one.
|
|
.P1
|
|
</srv/clone {
|
|
s='/srv/'^`{read}
|
|
bind -c $s /srv
|
|
exec p
|
|
}
|
|
.P2
|
|
Also like devip, once the last reference to the file descriptor returned by opening
|
|
.CW clone
|
|
is closed the board is closed and posters to that board receive an EOF. It is important
|
|
to bake this kind of ownership into the design, as self referential users of
|
|
.CW /srv
|
|
are quite common in current code.
|
|
.LP
|
|
This along with chdev can be used to create a sandbox for /srv quite easily,
|
|
the process allocates itself a new /srv then removes access to the global
|
|
root srv. This allows potentially untrusted process to still make use of the interface
|
|
without needing to worry about their access to the global state. The practice of having
|
|
new boards appear as subdirectories allows the entire state to easily be seen by inspecting the
|
|
root of devsrv itself.
|
|
.SH
|
|
Restricting Within a Mount
|
|
.LP
|
|
As shown earlier with the use of
|
|
.CW srvfs ,
|
|
an intermediate file server can be used to only service a small subsection of a larger
|
|
namespace. In that example we used this to expose only /lib/www from the host to processes
|
|
running a web server. This can be limited as the invocation of
|
|
.CW exportfs
|
|
can become more complicated if the user wishes to use multiple pieces from completely
|
|
separate places within the file tree. To address this a utility program
|
|
.CW protofs
|
|
was written to easily create convincing mimics of the filesystem it was run from.
|
|
.CW protofs
|
|
accepts a
|
|
.CW proto
|
|
file, a text file containing a description of file tree, and uses it to provide
|
|
dummy files mimicking the structure. These dummies can then be used by a process as targets
|
|
for bind mounts of its current namespace, providing the illusion of trimming all but select
|
|
pieces. This new root cannot be simply bound over the real one, that still allows an unmount
|
|
to escape back to the real system, but rexporting the namespace still works. To illustrate a
|
|
more involved setup than before:
|
|
.P1
|
|
# We want to provide our web server
|
|
# with /bin, /lib/www and /lib/git
|
|
; cat >>/tmp/proto <<.
|
|
bin d775
|
|
lib d775
|
|
www d775
|
|
git d775
|
|
.
|
|
; protofs -m /mnt/proto /tmp/prot
|
|
; bind /bin /mnt/proto/bin
|
|
; bind /lib/www /mnt/proto/lib/www
|
|
; bind /lib/git /mnt/proto/lib/git
|
|
# A private srv could be used, omitted for brevity
|
|
; srvfs webbox /mnt/proto
|
|
# Namespace file for using our new mini-root
|
|
; cat >>/tmp/ns <<.
|
|
mount #s/webbox /root
|
|
bind -b /root /
|
|
chdev -r s
|
|
.
|
|
; auth/newns -n /tmp/ns ls /
|
|
bin
|
|
lib
|
|
;
|
|
.P2
|
|
.SH
|
|
Future Work
|
|
.LP
|
|
While we think these bring us closer to namespaces as security boundaries,
|
|
there is still plenty of work and understanding to be done. One particular
|
|
item of interest is attempting some kind of isolation of
|
|
.CW devproc ,
|
|
possibly in a similar fashion to the
|
|
.CW /srv/clone
|
|
implementation, but attempts have yet to be made. The exact nature of
|
|
.CW namespace
|
|
files and how they relate to sandboxing as a whole has yet to be fully
|
|
worked out. There is clear potential, but it is likely additional abilities may
|
|
be required. It is somewhat difficult to synthesize a namespace entirely
|
|
from nothing, which is something we found ourselves reaching for when building
|
|
alternative roots to run processes within. There is potential for some merger
|
|
of
|
|
.CW proto
|
|
and
|
|
.CW namespace
|
|
files to provide a template of the current namespace to graft on to the next one.
|
|
.LP
|
|
Both
|
|
.CW chdev
|
|
and
|
|
.CW /srv/clone
|
|
are merged into 9front and their implementations are freely available as part of the base system.
|
|
.SH
|
|
References
|
|
.LP
|
|
[Beck18]
|
|
Bob Beck,
|
|
``Pledge, and Unveil, in OpenBSD'',
|
|
.I "BSDCan Slides"
|
|
Ottawa,
|
|
July, 2018.
|
|
.LP
|
|
[Price04]
|
|
Daniel Price,
|
|
Andrew Tucker,
|
|
``Solaris Zones: Operating System Support for Consolidating Commercial Workloads'',
|
|
.I "Proceedings of the 18th Large Installation System Administration Conference"
|
|
pp. 241-254,
|
|
Atlanta,
|
|
November, 2004.
|
|
.LP
|
|
[Biederman06]
|
|
Eric W. Biederman
|
|
``Multiple Instances of the Global Linux Namespaces'',
|
|
.I "Proceedings of the 2006 Linux Symposium Volume One"
|
|
pp. 102-112,
|
|
Ottawa, Ontario
|
|
July, 2006.
|