FreeBSD Manual Pages

home | help
EMILUA(7)		       Emilua reference			     EMILUA(7)

NAME
       Emilua -	Lua execution engine

DESCRIPTION
       The purpose of this manual is to	help you attack	the system. If you're
       trying to find security holes, this section should be a good overview
       on how the whole	system works.

       If you find any bug in the code,	please responsibly send	a bug report
       so the Emilua team can fix it.

MESSAGE	SERIALIZATION
       Emilua follows the advice from WireGuard	developers to avoid parsing
       bugs by avoiding	object serialization altogether. Sequenced-packet
       sockets with builtin framing are	used so	we always receive/send whole
       messages	in one API call.

       There is	a hard-limit (configurable at build time) on the maximum
       number of members you can send per message. This	limit would need to
       exist anyway to avoid DoS from bad clients.

       Another limitation is that no nesting is	allowed. You can either	send a
       single non-nil value or a non-empty dictionary where every member in it
       is a leaf from the root tree. The messaging API is part of the attack
       surface that bad	clients	can exploit. We	cannot afford a	single bug
       here, so	the code must be simple. By forbidding subtrees	we can ignore
       recursion complexities and simplify the code a lot.

       The struct used to receive messages follows:

	   enum	kind
	   {
	       boolean_true    = 1,
	       boolean_false   = 2,
	       string	       = 3,
	       file_descriptor = 4,
	       actor_address   = 5,
	       nil	       = 6
	   };

	   struct ipc_actor_message
	   {
	       union
	       {
		   double as_double;
		   uint64_t as_int;
	       } members[EMILUA_CONFIG_IPC_ACTOR_MESSAGE_MAX_MEMBERS_NUMBER];
	       unsigned	char strbuf[
		   EMILUA_CONFIG_IPC_ACTOR_MESSAGE_MAX_MEMBERS_NUMBER *	512];
	   };

       A variant class is needed to send the messages. Given a variant is
       needed anyway, we just adopt NaN-tagging	for its	implementation as that
       will make the struct members packed together and	no memory from the
       host process hidden among paddings will leak to the containers.

       The code	assumes	that no	signaling NaNs are ever	produced by the	Lua VM
       to simplify the NaN-tagging scheme[1][2]. The type is stored in the
       mantissa	bits of	a signaling NaN.

       If the first member is nil, then	we have	a non-dictionary value stored
       in members[1]. Otherwise, a nil will act	as a sentinel to the end of
       the dictionary. No sentinel will	exist when the dictionary is fully
       filled.

       read() calls will write to objects of this type directly	(i.e. no
       intermediate char[N] buffer is used) so we avoid	any complexity with
       code related to alignment adjustments.

       memset(buf, 0, s) is used to clear any unused member of the struct
       before a	call to	write()	so we avoid leaking memory from	the process to
       any container.

       Strings are preceded by a single	byte that contains the size of the
       string that follows. Therefore, strings are limited to 255 characters.
       Following from this scheme, a buffer sufficiently large to hold the
       largest message is declared to avoid any	buffer overflow. However, we
       still perform bounds checking to	make sure no uninitialized data	from
       the code	stack is propagated back to Lua	code to	avoid leaking any
       memory. The bounds checking function in the code	has a simple
       implementation that doesn't make	the code much more complex and it's
       easy to follow.

       To send file descriptors	over, SCM_RIGHTS is used. There	are a lot of
       quirks involved with SCM_RIGHTS (e.g. extra file	descriptors could be
       stuffed into the	buffer even if you didn't expect them).	The encoding
       scheme for the network buffer is	far simpler to use than	SCM_RIGHTS'
       ancillary data. Complexity-wise,	there's	far greater chance to
       introduce a bug in code related to SCM_RIGHTS than a bug	in the code
       that parses the network buffer.

       Code could be simpler if	we only	supported messaging strings over, but
       that would just defer the problem of secure serialization on the	user's
       back. Code should be simple, but	not simpler. By	throwing all
       complexity on the user's	back, the implementation would offer no
       security. At least we centralized the sensitive object serialization so
       only one	block of code need to be reviewed and audited.

SPAWNING A NEW PROCESS
       UNIX systems allow the userspace	to spawn new processes by a fork()
       followed	by an exec(). exec() really means an executable	will be
       available in the	container, but this assumption doesn't play nice with
       our idea	of spawning new	actors in an empty container.

       What we really want is to to perform a fork followed by no exec() call.
       This approach in	itself also has	its own	problems. exec() is the	only
       call that will flush the	address	space of the running process. If we
       don't exec() then the new process that was supposed to run untrusted
       code with no access to system resources will be able to read all
       previous	memory -- memory that will most	likely contain sensitive
       information that	we didn't want leaked. Other problems such as threads
       (supported by the Emilua	runtime) would also hinder our ability to use
       fork() without exec()ing.

       One simple approach to solve all	these problems is to fork() at the
       beginning of the	program	so we fork() before any	sensitive information
       is loaded in the	process' memory. Forking at a well known point also
       brings other benefits. We know that no thread has been created yet, so
       resources such as locks and the global memory allocator stay in a well
       defined state. By creating this extra process before much more extra
       virtual memory or file descriptor slots in our process table have been
       requested, we also make sure that further processes creation will be
       cheaper.

	    emilua program
	       emilua runtime (supervisor fork()ed near	main())

       Every time the main process wants to create an actor in a new process,
       it'll defer its job onto	the supervisor that was	fork()ed near main().
       An AF_UNIX+SOCK_SEQPACKET socket	is used	to orchestrate this process.
       Given the supervisor is only used to create new processes, it can use
       blocking	APIs that will simplify	the code a lot.	The blocking read() on
       the socket also means that it won't be draining any CPU resources when
       it's not	needed.	Also important is the threat model here. The main
       process is not trying to	attack the supervisor process. The supervisor
       is also trusted and it doesn't need to run inside a container.
       SCM_RIGHTS handling between the main process and	the supervisor is
       simplified a lot	due to these constraints.

       However some care is still needed to setup the supervisor. Each actor
       will initially be an exact copy of the supervisor process memory	and we
       want to make sure that no sensitive data	is leaked there. The first
       thing we	do right after creating	the supervisor is collecting any
       sensitive information that might	still exist in the main	process	(e.g.
       argv and	envp) and instructing the supervisor process to
       explicit_bzero()	them. This compromise is not as	good as	exec() would
       offer, but it's the best	we can do while	we limit ourselves to
       reasonably portable C code with few assumptions about dynamic/static
       linkage against system libraries, and other settings from the host
       environment.

       This problem doesn't end	here. Now that we assume the process memory
       from the	supervisor contains no sensitive data, we want to keep it that
       way. It may be true that	every container	is assumed as a	container that
       some hacker already took	over (that's why we're isolating them, after
       all), but one container shouldn't leak information to another one. In
       other words, we don't even want to load sensitive information regarding
       the setup of any	container from the supervisor process as that could
       leak into future	containers. The	solution here is to serialize such
       information (e.g. the init.script) such that it is only sent directly
       to the final process. Another AF_UNIX+SOCK_SEQPACKET socket is used.

       Now to the assumptions on the container process.	We do assume that
       it'll run code that is potentially dangerous and	some hacker might own
       the container at	some point. However the	initial	setup does not run
       arbitrary dangerous code	and it still is	part of	the trusted computing
       base. The problem is that we don't know whether the init.script will
       need to load sensitive information at any point to perform its job.
       That's why we setup the Lua VM that runs	init.script to use a custom
       allocator that will explicit_bzero() all	allocated memory at the	end.
       Allocations done	by external libraries such as libcap lie outside of
       our control, but	they rarely matter anyway.

       That's mostly the bulk of our problems and how we handle	them. Other
       problems	are summarized in the short list below.

       •   SIGCHLD would be sent to the	main process, but we cannot install
	   arbitrary signal handlers in	the main process as that's a property
	   from	the application	(i.e. signal handling disposition is not a
	   resource owned by the Emilua	runtime). The problem was already
	   solved by making the	actor a	child of the supervisor	process.

       •   We can't install arbitrary signal handlers in the container process
	   either as that would	break every module by bringing different
	   semantics depending on the context where it runs (host/container).
	   To handle PID1 automatically	we just	fork a new process and forward
	   its signals to the new child.

       •   "/proc/self/exe" is a resource inherited
	   <https://lwn.net/Articles/781013/> from the main process (i.e. a
	   resource that exists	outside	the container, so the container	is not
	   existing in a completely empty world), and could be exploited in
	   the container" . ETXTBSY will hinder	the ability from the container
	   to meddle with "/proc/self/exe", and	ETXTBSY	is guaranteed by the
	   existence of	the supervisor process (even if	the main process
	   exits, the supervisor will stay alive).

       The output from tools such as top start to become rather	cool when you
       play with nested	containers:

	    emilua program
	       emilua runtime (supervisor fork()ed near	main())
		  emilua runtime (PID1 within the new namespace)
		    emilua program
		       emilua runtime (supervisor fork()ed near	main())
		  emilua runtime (PID1 within the new namespace)
		     emilua program
			emilua runtime (supervisor fork()ed near main())

WORK LIFETIME MANAGEMENT
       For Linux namespaces, PID1 eases	our life a lot.	As soon	as any
       container starts	to act suspiciously we can safely kill the whole
       subtree of processes by sending SIGKILL to the PID1 that	started	it.

       For FreeBSD's Capsicum, PD_DAEMON is not	permitted in subprocesses that
       were placed into	capability mode. If all	references to a	procdesc file
       descriptor are closed, the associated process will be automatically
       terminated by the kernel.

       AF_UNIX+SOCK_SEQPACKET sockets are connection-oriented and simplify our
       work even further. We shutdown()	the ends of each pair such that
       they'll act unidirectionally just like pipes. When all copies of	one
       end die,	the operation on the other end will abort. The actor API
       translates to MPSC channels, so we never	ever send the reading end to
       any container (we only make copies of the sending end). The kernel will
       take care of any	tricky reference counting necessary (and SIGKILLing
       PID1 will make sure no unwanted end survives).

       The only	work left for us to do is pretty much to just orchestrate the
       internal	concurrency architecture of the	runtime	(e.g. watch out	for
       blocking	reads).	Given that we want to abort reads when all the copies
       of the sending end are destroyed, we don't keep any copy	to the sending
       end in our own process. Everytime we need to send our address over, we
       create a	new pair of sockets to send the	newly created sending end
       over. inbox will	unify the receipt of messages coming from any of these
       sockets.	You can	think of each newly created socket as a	new
       capability. If one capability is	revoked, others	remain unaffected.

       One good	actor could send our address further to	a bad actor, and there
       is no way to revoke access to the bad actor without also	revoking
       access to the good actor, but that is in	line with capability-based
       security	systems. Access	rights are transitive. In fact,	a bad actor
       could write 0-sized messages over the AF_UNIX+SOCK_SEQPACKET socket to
       trick us	into thinking the channel was already closed. We'll happily
       close the channel and there is no problem here. The system can happily
       recover later on	(and only this capability is revoked anyway).

FLOW CONTROL
       The runtime doesn't schedule any	read on	the socket unless the user
       calls inbox:receive(). Upon reading a new message the runtime will
       either wake the receiving fiber directly, or enqueue the	result in a
       buffer if no receiving fiber exists at the time (this can happen	if the
       user canceled the fiber,	or another result arrived and woke the fiber
       up already). inbox:receive() won't schedule any read on the socket if
       there's some result already enqueued in the buffer.

setns(fd, CLONE_NEWPID)
       We don't	offer any helper to spawn a program (i.e. system.spawn())
       within an existing PID namespace. That's	intentional (although one
       could still do it through init.script). setns(fd, CLONE_NEWPID) is
       dangerous. Only exec() will flush the address space for the process.
       The window of time that exists until exec() is called means that	any
       memory from the previous	process	could be read by a compromised
       container (cf. ptrace(2)).

TESTS
       A mix of	approaches is used to test the implementation.

       There's an unit test for	every class of good inputs. There are unit
       tests for accidental bad	inputs that one	might try to perform through
       the Lua API. The	unit tests always try to create	one scenario for
       buffered	messages and another for immediate delivery of the result.

       When support for	plugins	is enabled, fuzz tests are built as well. The
       fuzzers are generation-based. One fuzzer	will generate good input and
       test if the program will	accept all of them. Another fuzzer will	mutate
       a good input into a bad one (e.g. truncate the message size to attempt
       a buffer	overflow), and check if	the program rejects all	of them.

       There are some other tests as well (e.g.	ensure no padding exists
       between the members of the C struct we send over	the wire).

NOTES
       [1]    http://www.lua.org/source/5.2/lapi.c.html#lua_pushnumber

       [2]    https://github.com/LuaJIT/LuaJIT/blob/v2.0.5/src/lj_api.c#L569

Emilua 0.11.2			  2025-03-12			     EMILUA(7)
Want to link to this manual page? Use this URL:
<https://man.freebsd.org/cgi/man.cgi?query=emilua-internals-sandboxes&sektion=7&manpath=FreeBSD+Ports+14.3.quarterly>
home | help
Header And Logo

Peripheral Links

Site Navigation

FreeBSD Manual Pages

Header And Logo

Peripheral Links

Search

Site Navigation

FreeBSD Manual Pages