High quality visualisation is key to a great design experience. Our Visualise 360 service uses cloud-based ray-tracing to deliver custom panoramas of a user's bathroom or kitchen design, without needing the help of a computer graphics expert.
In previous posts, we've alluded to the complexity involved in developing, deploying, and managing a high-performance GPU service of this kind. We're always working to improve everything we do, so we've recently deployed a major upgrade to the technology behind this service to deliver a faster, more reliable visualisation experience.
Viable cloud rendering requires significant use of low-overhead, high-performance languages. Building and maintaining this service has been difficult since its inception, with our initial production version consisting of:
- A complex CUDA layer;
- Thousands of lines of C++ code to prepare and run renders;
- C++ wrappers over C libraries;
- C wrappers over C++ libraries; and,
- Cython to provide a callable interface to the renderer for integration.
Using C++ so heavily prevented many team members, who mostly have experience of higher-level languages, from contributing to the codebase. It also led to occasional difficult-to-debug memory-safety and correctness issues of the kind that puzzled even our more experienced C++ developers.
At this point we wondered if there was a better way. Could we eliminate the numerous technology boundaries, allowing more integrated development of this service while mitigating the high cost of the low-level, performance-critical code we were deploying?
Rust is a relatively young language. Originally heavily supported by Mozilla for use in the Firefox web browser, it has since been enthusiastically adopted by many other organisations tired of the old approaches to building performant system software. It aims to be:
A language empowering everyone to build reliable and efficient software.
More specifically, it achieves this through three main areas of focus: performance, reliability, and productivity.
One key goal of the Rust language is high performance without compromise. In practice, of course, code written in any language can vary dramatically in performance depending on optimisation, algorithmic complexity, and platform restrictions. Pre-empting the inevitable comparisons with C, we believe that in Rust:
- It's always possible to write code as fast as C.
- It's almost always possible to write code as fast as C without using unsafe.
- It's usually possible to write code as fast as C using convenient abstractions.
- It's sometimes possible to do this and end up faster than C.
That is to say, if one wishes to sweat the details of performance in Rust, one will invariably manage to produce code at least as fast as C. If one does not, it's usually going to be very fast regardless.
While important, performance remains secondary in the design of the language. First and foremost, always, are two facets of reliability: safety and correctness.
Rust has a fairly unusual model of variable lifetimes. In Rust, data and resources have a single owner. That owner may lend any number of immutable references ("borrows") to the resource, or a single mutable reference. The owner of a resource may also pass ownership to elsewhere, e.g. as a function argument or return value. The Rust compiler (specifically, the borrow checker) then ensures that no borrows outlive the underlying resource.
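The ownership rules above can be sketched in a few lines of safe Rust (the names here are illustrative, not from our codebase):

```rust
fn sum(v: Vec<i32>) -> i32 {
    // `sum` now owns `v`; the vector is dropped when the function returns.
    v.iter().sum()
}

fn main() {
    let mut scores = vec![10, 20, 30]; // `scores` is the single owner

    // Any number of immutable borrows may coexist...
    let first = &scores[0];
    let len = scores.len();
    println!("first = {first}, len = {len}");

    // ...but a mutable borrow is exclusive. This compiles only because the
    // immutable borrows above are not used again past this point.
    scores.push(40);

    // Ownership can also be passed along, e.g. as a function argument:
    let total = sum(scores);
    // Using `scores` here would now be a compile error: the value was moved.
    println!("total = {total}");
}
```

Code that breaks these rules, such as pushing to `scores` while `first` is still in use, simply does not compile; the borrow checker reports the conflict at build time.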
While this may sound very restrictive, it actually yields several useful properties. In practice, code which violates these restrictions is generally very difficult to reason about, and accidentally doing so can result in major correctness defects (in safe, garbage-collected languages) or disastrous memory-safety errors (in non-garbage-collected languages like C and C++). By enforcing these constraints at compile time, an entire class of common errors is all but eliminated: invalid resources can no longer be used, and data cannot unexpectedly change underneath you.
Low level languages often rely on direct use of pointers, which Rust restricts to blocks annotated as “unsafe”. This allows one to bend the rules of the language, but with a little discipline it is easy to prevent this from infecting the rest of the codebase.
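A minimal sketch of this discipline: raw-pointer work lives inside a small `unsafe` block, behind a function whose signature is entirely safe. (This particular example is contrived; real Rust would just index the slice, but it shows the confinement pattern.)

```rust
/// Returns the first byte of a slice, confining raw-pointer use to one
/// small, auditable spot. Callers never see any `unsafe`.
fn first_element(bytes: &[u8]) -> Option<u8> {
    if bytes.is_empty() {
        return None;
    }
    let ptr = bytes.as_ptr();
    // SAFETY: `ptr` points into a non-empty slice, so reading the first
    // byte is in bounds and properly aligned.
    Some(unsafe { *ptr })
}

fn main() {
    println!("{:?}", first_element(&[7, 8, 9])); // Some(7)
    println!("{:?}", first_element(&[]));        // None
}
```

The rest of the codebase calls `first_element` like any other safe function, so any memory-safety bug can only originate in the few lines inside the `unsafe` block.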
Although it has a steep initial learning curve, Rust can be a very productive language to work in.
Its powerful and expressive type system allows offloading a lot of the burden of state and resource management to the compiler instead of tracking it in your application code (or, worse, in the developer's head).
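For instance, a job's lifecycle can be encoded as an enum (the states below are hypothetical, for illustration); the compiler then forces every state to be handled, so "forgot the failure case" bugs surface at build time rather than being tracked by convention:

```rust
// Each variant carries only the data that is valid in that state, so
// invalid combinations are unrepresentable.
enum JobState {
    Queued,
    Rendering { progress: u8 },
    Done { output_url: String },
    Failed { reason: String },
}

fn describe(state: &JobState) -> String {
    // A `match` must be exhaustive: adding a new variant later breaks the
    // build here until the new case is handled.
    match state {
        JobState::Queued => "waiting in queue".to_string(),
        JobState::Rendering { progress } => format!("rendering: {progress}%"),
        JobState::Done { output_url } => format!("done: {output_url}"),
        JobState::Failed { reason } => format!("failed: {reason}"),
    }
}

fn main() {
    let state = JobState::Rendering { progress: 42 };
    println!("{}", describe(&state)); // rendering: 42%
}
```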
The power of the language, combined with the development culture around it, strongly encourages building and using powerful, low- or zero-cost abstractions. Rust's futures are one example of such an abstraction - providing very efficient, composable, asynchronous operations with (almost) no language support or obligatory heavyweight runtime required.
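Iterator chains are another classic zero-cost abstraction: the high-level pipeline below typically compiles down to essentially the same machine code as the hand-written loop beneath it.

```rust
// High-level: a declarative pipeline over the data.
fn sum_of_even_squares(xs: &[i64]) -> i64 {
    xs.iter().filter(|&&x| x % 2 == 0).map(|&x| x * x).sum()
}

// Low-level: the equivalent explicit loop. The optimiser generally
// produces comparable code for both versions.
fn sum_of_even_squares_loop(xs: &[i64]) -> i64 {
    let mut total = 0;
    for &x in xs {
        if x % 2 == 0 {
            total += x * x;
        }
    }
    total
}

fn main() {
    let data = [1, 2, 3, 4, 5];
    assert_eq!(sum_of_even_squares(&data), sum_of_even_squares_loop(&data));
    println!("{}", sum_of_even_squares(&data)); // 4 + 16 = 20
}
```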
Rust in Production
These factors made Rust seem like a promising tool to help alleviate many long-standing difficulties with our cloud rendering service.
We initially began replacing a small component of our stack using Rust, but it quickly became clear that a larger effort would allow a great reduction in complexity. In the end, all the C, C++, and Python components of the service were rebuilt, with Rust used from task loading through to dispatching GPU operations.
Rust can integrate directly with C libraries using its Foreign Function Interface without overhead. Bindgen, a build-time tool, lets us generate FFI declarations from C header files, making it easy to use libraries without official support for Rust, like Nvidia's OptiX.
Interacting with native libraries is innately "unsafe" in Rust, but thanks to its powerful type system and other mechanisms of abstraction we can totally confine the complexity of interacting with a library as powerful as OptiX to a small region of carefully-vetted code. This allows the rest of the program to proceed with the usual safety and correctness guarantees of Rust.
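The shape of this pattern, with a toy stand-in rather than OptiX itself: an `extern "C"` declaration (of the kind bindgen generates from a header; here we declare C's standard `abs` by hand) is wrapped in one small safe function, and the rest of the program never touches `unsafe`.

```rust
use std::os::raw::c_int;

// FFI declaration, as bindgen would emit from a C header. `abs` from the
// C standard library stands in for a real native API such as OptiX.
extern "C" {
    fn abs(input: c_int) -> c_int;
}

/// The safe surface the rest of the program sees: all the `unsafe` is
/// confined to this one carefully-vetted function.
fn magnitude(x: i32) -> i32 {
    // `abs(i32::MIN)` overflows in C, so rule that input out first.
    assert!(x != i32::MIN);
    // SAFETY: C `abs` has no other preconditions for the values we pass.
    unsafe { abs(x as c_int) as i32 }
}

fn main() {
    println!("{}", magnitude(-42)); // 42
}
```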
[Chart: production renders over a day]
[Chart: production queue submissions (blue) and retrievals (orange) over a day]
In addition to substantially simplifying the codebase, we also saw major improvements in the development process. Now, a single “cargo build” can produce a native executable for any of our developers' environments, and a single CI pipeline can build a ready-to-deploy Docker image. This lets any of our developers get up and running quickly, and allows rapid iteration of features, both locally and on test environments.
As with any big technology decision, these changes were not without trade-offs. We have faced two main challenges with this new codebase:
While progress on the language and surrounding libraries and tooling has been rapid, it's hard to avoid the fact that Rust and its ecosystem are still relatively immature. This means that, in some cases, support for some technologies and practices can be patchy, unstable, or simply nonexistent. Although powerful, the Futures async abstraction in particular has lately been in flux. We judged the benefits to our development and deployment processes to be worth the costs of potential code churn due to language and library evolution, but this depends heavily on the development team and company needs.
Being a less established language can make it tough to hire new developers with Rust experience. Its relatively steep initial learning curve might also put off existing developers from upskilling to work on the codebase. This can lead to more friction on-boarding developers than projects in more common or higher-level languages.
That being said, a productive developer on our rendering service would previously have needed to be at least reasonably proficient in Python, C, and C++, which often made shipping major new features prohibitively difficult. Though Rust does have substantial complexity as a language, for us it's at least just one language. We also find it generally easier for developers to pick up and apply, to production quality, than C or C++.
We're now running our new, Rust-based service in production, processing well over a thousand bespoke design renders every day.
Some teething problems at launch aside, the project has been everything we hoped: it's delivered a performant, maintainable service which is easier to reason about, extend, and deploy.