April 24, 2025
Thanks to Chih-Cheng Liang for the review.
The encrypted computer is a powerful concept. It opens up a new paradigm of applications where both the program memory and its instructions are encrypted. For instance, it can be used to instantiate a global encrypted state (think encrypted social networks) or encrypted smart contracts.
At phantom zone, we understand the immense value that an encrypted computer brings to the world. One of our core goals is to build it. As a first step towards realising our goal, we’re releasing our prototype of an encrypted RISC-V virtual machine: phantom.
Phantom’s FHE-VM implements a RISC-V virtual machine (VM) over the plaintext polynomials1 used in FHE schemes. Mainstream FHE schemes “encrypt” plaintext polynomials and manipulate them, while encrypted, through a set of APIs (such as arithmetic ops, automorphisms, bootstrapping, etc.). Our RISC-V VM implementation over plaintext polynomials validates the feasibility of the encrypted variant. It also provides a good approximation of the cost.
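For intuition, here is a purely illustrative sketch of the kind of API surface such schemes expose over ciphertexts. The trait and method names are hypothetical and do not correspond to any particular library’s interface.
/// Illustrative sketch of an FHE ciphertext API (hypothetical names).
trait FheCiphertext: Sized {
    /// Homomorphic addition of two ciphertexts.
    fn add(&self, other: &Self) -> Self;
    /// Multiplication by a plaintext polynomial, given by its coefficients.
    fn mul_plain(&self, plain: &[u64]) -> Self;
    /// Apply the automorphism X -> X^k to the encrypted polynomial.
    fn automorphism(&self, k: usize) -> Self;
    /// Bootstrapping: refresh the noise so further operations stay correct.
    fn bootstrap(&self) -> Self;
}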
To use the FHE-VM, developers simply describe their encrypted programs as Rust programs. Phantom then handles the compilation of the Rust program to a RISC-V binary and transforms the contents of this binary into plaintext polynomials. The transformed RISC-V binary is then executed on arbitrary inputs using the FHE-VM. The production version of phantom will also encrypt the transformed binary locally, on the developer’s device, thus allowing encrypted programs to be executed on encrypted inputs using the FHE-VM.
The Rust program can contain hardcoded secrets and, since the resulting program binary is encrypted, the secrets always remain hidden. It can also maintain an external encrypted memory (e.g. a global encrypted state), encrypted under a hardcoded secret, and mutate the memory as per the (hidden) program instructions and encrypted inputs. For an executor (i.e. a server), executing the Rust program on the FHE-VM produces a trace that remains unintelligible; hence the executor learns no information about the program or the input2.
After cloning Phantom, run the setup.sh script. It clones our FHE backend library, Poulpy, and its dependencies. Poulpy is used by the FHE-VM to implement the necessary operations.
The simple crate implements a Rust program that hardcodes a secret SALT. It takes as input a random message and outputs Sha256(message | SALT). The secret SALT is hardcoded with a static declaration.
/// Hardcoded hidden SALT
#[no_mangle]
static SALT: [u8; 32] = [
    175, 142, 86, 41, 61, 122, 186, 56, 50, 101, 187, 215, 124, 127, 14, 221, 109, 201, 110, 189,
    174, 1, 87, 170, 113, 193, 170, 115, 85, 51, 79, 172,
];
The main function is divided into 3 parts: read inputs, body, write outputs. Refer to this readme to learn more about declaring inputs and outputs, reading inputs, and writing outputs.
let mut out = Output::default();
// output = Sha256(message | SALT)
let mut hasher = Sha256::new();
hasher.update(input.message.as_slice());
hasher.update(SALT.as_slice());
out.hash_value = hasher.finalize().into();
out
The body section, seen above, executes functions on the inputs and produces outputs. Moreover, the body can only access data declared as static variables in the program; any external data should be fed as part of the input. The body cannot make any “external” calls during execution.
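Putting the three parts together, a guest program might be structured roughly as sketched below. The Input/Output structs and the read_input/write_output helpers here are stand-ins for the crate’s real I/O machinery described in the readme, not phantom’s actual API.
// Minimal, self-contained sketch of a guest program (hypothetical I/O helpers).
use sha2::{Digest, Sha256};

struct Input {
    message: [u8; 32],
}

#[derive(Default)]
struct Output {
    hash_value: [u8; 32],
}

/// Hardcoded hidden SALT (placeholder value for illustration).
#[no_mangle]
static SALT: [u8; 32] = [0u8; 32];

/// Stand-in: the real guest reads from the VM's input tape.
fn read_input() -> Input {
    Input { message: [0u8; 32] }
}

/// Stand-in: the real guest writes to the VM's output tape.
fn write_output(_out: &Output) {}

fn main() {
    // 1. read inputs
    let input = read_input();

    // 2. body: output = Sha256(message | SALT)
    let mut out = Output::default();
    let mut hasher = Sha256::new();
    hasher.update(input.message.as_slice());
    hasher.update(SALT.as_slice());
    out.hash_value = hasher.finalize().into();

    // 3. write outputs
    write_output(&out);
}
The host side (src/main.rs) then compiles this guest and runs it on the FHE-VM: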
// the rust program is declared in root/src/guest/main.rs
// compile guest/main.rs to risc-v target
let compiler = CompileOpts::new("guest");
let elf_bytes = compiler.build();
// initialise phantom
//
// the `init` function transforms the risc-v binary
// into suitable format for execution on FHE-VM
let pz = Phantom::init(elf_bytes);
// prepare input
let mut input_message = [0u8; 32];
thread_rng().fill_bytes(&mut input_message);
let input = Input {
    message: input_message,
};
// input is provided as buffer of bytes
let input_buffer = to_u8_slice(&input);
// encrypted vm runs for `max_cycles`
//
// this is necessary because vm cannot know when to halt
let max_cycles = 8725;
// create an encrypted vm instance
let mut enc_vm = pz.encrypted_vm(input_buffer, max_cycles);
// execute the vm
enc_vm.execute();
// read output
let output_tape = enc_vm.output_tape();
The snippet above is taken from the src/main.rs file in the simple crate and shows how to compile and execute the Rust program on the encrypted VM (i.e. the FHE-VM).
The spec of the FHE-VM is available here and the specs of the core routines used in the FHE-VM are available under this directory. Our FHE-VM is a work in progress and the specs are expected to change. We are aware that the spec documents are not self-explanatory and encourage others to open issues for any related questions/discussions.
As per the current specs, the time taken per cycle is expected to be 600ms for the RV32I target (32-bit base ISA) and 4s for the RV32IM target (32-bit base + “M” multiplication extension). Our estimates assume effective parallelization of the routines3. A breakdown of the runtime estimation is provided here.
The most expensive routines are the arithmetic operations. For example, Div/Rem dominate the cost of a cycle on the RV32IM target, while Add/Sub take a significant chunk. Improving the efficiency of the arithmetic operations will reduce the cost per cycle.
Can 32-bit integers be mapped to a different representation in which arithmetic operations are significantly faster? If so, is the representation compatible with the other RISC-V instructions?
Mapping 32-bit integers to a different representation to speed up arithmetic operations is not a new idea; prior works investigate the same question. But it’s unknown whether these representations are compatible with the other instructions in RV32I/M. If not, what other compatible representations exist?
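As a concrete but purely illustrative example of such a representation, the sketch below maps a 32-bit integer into a residue number system (RNS), where addition and multiplication become independent limb-wise operations modulo small primes. The moduli are an arbitrary choice, and whether an RNS-style encoding can also handle comparisons, shifts, and the rest of RV32I/M efficiently is exactly the open question above.
// Illustrative sketch of mapping a 32-bit integer into a residue number
// system (RNS). The moduli are an arbitrary choice of small primes whose
// product exceeds 2^32, so every u32 has a unique representation (CRT).
const MODULI: [u64; 4] = [251, 257, 263, 269];

#[derive(Clone, Copy, Debug, PartialEq)]
struct Rns([u64; 4]);

impl Rns {
    fn from_u32(x: u32) -> Self {
        let mut limbs = [0u64; 4];
        for (limb, m) in limbs.iter_mut().zip(MODULI.iter()) {
            *limb = x as u64 % *m;
        }
        Rns(limbs)
    }

    // Addition and multiplication are limb-wise: no carries propagate
    // between limbs, which is what makes the representation attractive.
    fn add(self, other: Self) -> Self {
        let mut r = [0u64; 4];
        for i in 0..4 {
            r[i] = (self.0[i] + other.0[i]) % MODULI[i];
        }
        Rns(r)
    }

    fn mul(self, other: Self) -> Self {
        let mut r = [0u64; 4];
        for i in 0..4 {
            r[i] = (self.0[i] * other.0[i]) % MODULI[i];
        }
        Rns(r)
    }
}

fn main() {
    let (a, b) = (Rns::from_u32(123_456), Rns::from_u32(7));
    // Limb-wise ops; note that comparisons, shifts, and 2^32 wrap-around
    // are not limb-wise, which is where compatibility with the rest of
    // RV32I/M becomes the open question.
    println!("{:?}", a.add(b));
    println!("{:?}", a.mul(b));
}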
We’ve designed, to the best of our knowledge, the first FHE-based oblivious RAM that supports both read and write operations and that remains efficient for any reasonable RAM size. For instance, writing to an 8KB RAM is expected to take only a few milliseconds.
However, translating the address from an integer to an index of the RAM requires relatively expensive routines such as homomorphic decomposition and circuit bootstrapping. To give an intuition, the address ciphertext is decomposed into multiple limb ciphertexts (\(\text{Address} \to (k_0, k_1, ..., k_i)\)). Each limb ciphertext is then converted to a monomial ciphertext with the limb at the exponent of the monomial (i.e. \(\text{Enc}(k_i) \to \text{Enc}(X^{k_i})\)). The address is decomposed into limbs serially, and each limb requires 2 programmable bootstrappings (PBS). Circuit bootstrapping converts a limb ciphertext to a monomial ciphertext and requires multiple PBS operations, which can be processed in parallel. Therefore, the bottleneck remains the homomorphic decomposition (more information related to the FHE-RAM is provided in the appendix).
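The sketch below shows, over plaintext values, what the translation computes: decompose the address into base-B limbs, where each limb then becomes the exponent of a monomial. In the FHE-VM this is done homomorphically (the decomposition via PBS, the limb-to-monomial step via circuit bootstrapping); the base B here is an illustrative choice, not phantom’s actual parameter.
// Plaintext analogue of address translation. Each limb k_i produced below
// becomes the exponent of a monomial X^{k_i}; homomorphically this is
// Enc(k_i) -> Enc(X^{k_i}) via circuit bootstrapping. B is illustrative.
const B: u32 = 16;

/// Decompose `address` into base-B limbs (least significant first).
/// In the FHE-VM this step costs 2 PBS per limb and runs serially.
fn decompose(mut address: u32, num_limbs: usize) -> Vec<u32> {
    let mut limbs = Vec::with_capacity(num_limbs);
    for _ in 0..num_limbs {
        limbs.push(address % B);
        address /= B;
    }
    limbs
}

fn main() {
    // address 0xBEEF -> base-16 limbs (least significant first): [15, 14, 14, 11]
    let limbs = decompose(0xBEEF, 4);
    println!("{:?}", limbs);
}
This motivates the following question: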
How to construct a better address translation circuit? Or, more broadly: how to construct an efficient FHE-RAM for which the index-encoding step is FHE-friendly?
It’s unclear whether a pre-fetch operation can be leveraged to fetch the value at an address before the cycle in which the value is used; this remains an open path to further improving our RAM.
The other problem with our RAM construction is that the cost per read/write is asymptotically O(N) (where N is the size of the RAM). Thus, since the RAM is read from / written to at every cycle, the FHE-VM will remain very inefficient for memory-heavy programs.
How to construct a concretely efficient FHE-RAM whose cost per read/write is sublinear (or better, logarithmic) in the RAM size?
It turns out there already exists a construction for FHE-RAM with logarithmic cost per operation: DE-PIR. Although DE-PIR has been improved in follow-up works, it still remains very inefficient in practice. So a “concretely efficient” FHE-RAM scheme is still an open question.
Programs executed on the FHE-VM remain fully unintelligible. There’s no way for someone to peek at their execution trace or memory, or to selectively interrupt the execution. While this introduces a powerful new paradigm for application development, it comes with a few important caveats.
First, allowing a program to accept arbitrary inputs can be risky. Programs, at the very least, must authenticate their inputs. For instance, if the program generates signed outputs based on its inputs—and those signatures are valuable to others—it must verify that the inputs are valid. Otherwise, the program could unintentionally produce correctly signed outputs for invalid or malicious inputs.
Second, a program’s output holds no value unless it can attest to its own validity. Therefore a program should also produce, alongside its output, a publicly verifiable attestation (e.g. using digital signatures, MACs, etc.).
This brings us to the following 3-stage structure of any program:
During the authentication process, the input is verified. If the authentication fails, the authentication flag is set to false; otherwise it’s set to true. The input is then passed to the second stage, where the FHE-VM executes the actual program and produces the output. In the final attestation stage, the system reads the output and the authentication flag. If the authentication flag is false, the output is zeroed out. The system then produces the final output along with an attestation.
Different programs may require different authentication and attestation routines (e.g. digital signatures, AEADs, MACs, SNARKs, etc.). And since these routines have a constant runtime (they are cryptographic functions), they are well suited to the FHE circuit model. Thus, the resulting structure will have an authentication circuit and an attestation circuit, with a general computation layer (i.e. the FHE-VM) sandwiched in between.
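A plaintext sketch of this sandwich structure is given below; verify_input, run_program, and attest are placeholders for whatever primitives a concrete program picks (signatures, MACs, SNARKs, etc.), not phantom’s actual API.
// Plaintext sketch of the authenticate -> execute -> attest sandwich.

/// Stage 1: authentication circuit. It returns a flag instead of aborting,
/// since the encrypted execution cannot reveal whether verification passed.
fn verify_input(input: &[u8], tag: &[u8]) -> bool {
    // placeholder check; a real program would verify a signature or MAC here
    !input.is_empty() && tag.len() == 32
}

/// Stage 2: the general computation layer (in phantom, the FHE-VM).
fn run_program(input: &[u8]) -> Vec<u8> {
    input.iter().map(|b| b.wrapping_add(1)).collect()
}

/// Stage 3: attestation circuit. Zero out the output if authentication
/// failed, then attach a publicly verifiable attestation over the result.
fn attest(output: &[u8], auth_ok: bool) -> (Vec<u8>, Vec<u8>) {
    // multiply by the flag rather than branching, mirroring how the zeroing
    // would be expressed as a circuit
    let mask = auth_ok as u8;
    let masked: Vec<u8> = output.iter().map(|b| *b * mask).collect();
    let attestation = vec![0u8; 64]; // placeholder for a signature over `masked`
    (masked, attestation)
}

fn main() {
    let (input, tag) = (vec![1u8, 2, 3], vec![9u8; 32]);
    let auth_ok = verify_input(&input, &tag);
    let output = run_program(&input);
    let (final_output, attestation) = attest(&output, auth_ok);
    println!("{:?}, attestation bytes: {}", final_output, attestation.len());
}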
We have yet to investigate efficient FHE circuits for the necessary cryptographic functions.
Although no one has explored it through our lens, there already exists prior work that improves the efficiency of FHE circuits for different cryptographic functions. This will clearly come in handy in our investigation.
A practical encrypted computer is within our reach. But we’ll have to solve many problems, some of them mentioned above, to finally construct one. Our core focus will remain solving them. Stay on the lookout for more updates.
A RAM with \(D\) elements is stored in a hypercube of dimension \(\log_{N}(D)\), each dimension storing up to \(N\) ciphertexts. Each ciphertext is an encryption of a polynomial in \(\mathbb{Z}_{N}[X]\) (i.e. a polynomial of degree \(N\) with integer coefficients). The address \(k\) is represented using the tuple \((X^{k_{0,0}}, X^{k_{0,1}}, \dots, X^{k_{i,j}})\) where \((k_{0,0}, k_{0,1}, \dots, k_{i,j})\) is \(k\) decomposed first with base \(N\) and then with base \(B\), for some \(B \lt N\) (i.e. \(k = \sum_{i}^{\log_{N}(k)}k_{i}N^{i}\) where \(k_i = \sum_{j}^{\log_{B}(N)}k_{i,j}B^j\)).
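As a small worked example of the two-level decomposition, take illustrative parameters \(N = 16\) and \(B = 4\) and address \(k = 201\): the base-\(N\) decomposition gives \(201 = 9 + 12 \cdot 16\), so \((k_0, k_1) = (9, 12)\); decomposing each limb with base \(B\) gives \(9 = 1 + 2 \cdot 4\) and \(12 = 0 + 3 \cdot 4\), so the address is represented by the tuple \((X^{1}, X^{2}, X^{0}, X^{3})\).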
The read operation iteratively folds the database over its dimensions in a tree-like fashion, while the write operation performs the inverse, walking the tree backward. The amortized cost per element is \(\mathcal{O}(3D/N)\) key-switches for a read and \(\mathcal{O}(2D\log_{2}(N)/N)\) key-switches for a write.
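For intuition, here is a plaintext analogue of the read-side fold. Homomorphically each selection is carried out with the monomial ciphertexts and key-switches; here it is an ordinary index, and the parameters are illustrative.
// Plaintext analogue of the tree-style read: a database of D = N^d entries
// is folded one dimension at a time, each fold selecting one entry out of N
// using the corresponding base-N limb of the address (least significant
// limb first).
fn read(db: &[u64], n: usize, limbs: &[usize]) -> u64 {
    let mut layer = db.to_vec();
    for &limb in limbs {
        // fold: every group of `n` consecutive entries collapses to the one
        // selected by this limb
        layer = layer.chunks(n).map(|chunk| chunk[limb]).collect();
    }
    assert_eq!(layer.len(), 1);
    layer[0]
}

fn main() {
    // D = 16 entries, N = 4; address k = 11 has base-4 limbs [3, 2]
    let db: Vec<u64> = (0..16).collect();
    assert_eq!(read(&db, 4, &[3, 2]), 11);
}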
While reads/writes, given the address as \(\texttt{Enc}(X^{k_{0,0}}, X^{k_{0,1}}, \dots, X^{k_{i,j}})\), remain efficient, translating \(\texttt{Enc}(k)\) to \(\texttt{Enc}(X^{k_{0,0}}, X^{k_{0,1}}, \dots, X^{k_{i,j}})\) is expensive, as it requires digit decomposition and circuit bootstrapping.
Concretely, \(\texttt{Enc}(k)\) is first decomposed into the limb ciphertexts \(\texttt{Enc}(k_{0,0}), \texttt{Enc}(k_{0,1}), \dots, \texttt{Enc}(k_{i,j})\), which requires \(2\log_{B}(k)\) sequential PBS operations. \(B\) is chosen such that the subsequent circuit bootstrapping step, which converts \(\texttt{Enc}(k_{i,j}) \to \texttt{Enc}(X^{k_{i,j}})\), remains practical. Circuit bootstrapping requires \(\log_B(k)\) PBS, one per \(k_{i,j}\), and these PBS can be processed in parallel.
The plaintext polynomial is \(\mathbb{T}_{N}[X] = \mathbb{R}[X]/(X^{N}+1) \mod 1\), defined over the torus. This is the common plaintext domain of all mainstream FHE schemes.↩︎
The executor executes the VM for some maximum number of cycles.↩︎
This does not rule out additional parallelisation opportunities. We don’t yet know how to parallelize certain routines, but they may well be parallelizable.↩︎