Harmony - Confession 77

2017.11.15 15:56:28

This is a blog entry about Shirakumo's sound system, Harmony. While Harmony was released some months back, I spent the last few weeks rewriting large parts of it from the ground up, and I think it's worth writing a short article describing how it's built and what you can do with it. So, if you're interested in doing sound processing and playback in Lisp, this is for you.

The need for Harmony arose because I could not find a suitably powerful sound solution in Lisp. I first tried a pure Lisp solution, but could not figure out how to make it fast without sacrificing the design. So, in the interest of performance, I set out to write a C library that does the essential sound computations for me. This library is called libmixed.

I wanted to keep libmixed as simple and straightforward as possible. As such, it does not do any sound file reading or writing, synthesising, or sound output to devices. Instead, it concerns itself only with the processing of sound samples. Another important factor in its design was that it should be easy to extend with further capabilities, and that it should allow processing steps to be combined. This led to the segment pipeline model.

In libmixed, you assemble a set of segments (audio producers, consumers, or transforms) that perform the computations you need. You connect them through buffers so that they can transparently exchange audio data. This produces a directed acyclic graph where each vertex is a segment and each edge is a buffer. This graph can then be serialised, by sorting it topologically, into a simple sequence that dictates the order in which the segments should be run.
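To make the serialisation step concrete, here is a minimal sketch of a topological sort (Kahn's algorithm) over a segment graph. It is not libmixed's code: the segment names and the representation of each buffer as a `(FROM . TO)` cons are assumptions made purely for illustration.

```lisp
(defun serialise-pipeline (segments edges)
  "Return SEGMENTS ordered so that every segment comes after all segments
feeding into it. EDGES is a list of (FROM . TO) conses, one per buffer.
This is a hypothetical helper, not part of libmixed or cl-mixed."
  (let ((in-degree (make-hash-table :test #'equal))
        (queue '())
        (order '()))
    ;; Count how many buffers flow into each segment.
    (dolist (segment segments)
      (setf (gethash segment in-degree) 0))
    (dolist (edge edges)
      (incf (gethash (cdr edge) in-degree)))
    ;; Segments without inputs (producers) can run first.
    (dolist (segment segments)
      (when (zerop (gethash segment in-degree))
        (push segment queue)))
    (setf queue (nreverse queue))
    (loop while queue
          do (let ((segment (pop queue)))
               (push segment order)
               ;; Running SEGMENT fills its output buffers, so each consumer
               ;; has one fewer unfinished input.
               (dolist (edge edges)
                 (when (and (equal (car edge) segment)
                            (zerop (decf (gethash (cdr edge) in-degree))))
                   (setf queue (append queue (list (cdr edge))))))))
    ;; Any segment left over is part of a cycle and can never run.
    (unless (= (length order) (length segments))
      (error "The segment graph contains a cycle."))
    (nreverse order)))
```

For the two-file reverb pipeline described below, `(serialise-pipeline '(:packer :reverb :mixer :unpacker-a :unpacker-b) '((:unpacker-a . :mixer) (:unpacker-b . :mixer) (:mixer . :reverb) (:reverb . :packer)))` yields `(:unpacker-a :unpacker-b :mixer :reverb :packer)`.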

Unfortunately, audio data comes in a variety of formats. Samples are frequently encoded as signed or unsigned integers of 8, 16, 24, or 32 bits, or as floats or doubles. The data might have multiple audio channels, and the samples can be either interleaved (LRLRLR..) or sequential (LLL..RRR..). The sample rate might differ as well. All of this can make it quite difficult to pass data between different audio components. Libmixed's solution to this problem is to force all buffers to use the same sample rate, to encode samples as floats, and to represent only a single channel per buffer. Almost none of the segments in a pipeline thus have to worry about any of these discrepancies, which reduces complexity tremendously. To allow easy interaction with foreign components, it also includes an unpacker and a packer segment.
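As an illustration of what normalising the sample encoding involves, the sketch below maps signed (two's-complement) or unsigned (offset-binary) integer samples of any bit width onto floats in [-1, 1). The function names are hypothetical, and dividing by 2^(bits-1) is one common convention, assumed here; libmixed's actual conversion constants may differ.

```lisp
(defun sample->float (sample bits signed)
  "Convert one integer SAMPLE of width BITS to a single-float in [-1, 1).
SIGNED selects two's-complement (T) or offset-binary (NIL) input."
  (let ((half (expt 2 (1- bits))))
    (if signed
        (/ (float sample 1.0f0) half)
        ;; Unsigned samples are centred on HALF, so shift them to zero first.
        (/ (float (- sample half) 1.0f0) half))))

(defun normalise-samples (samples bits signed)
  "Convert a sequence of integer SAMPLES to a vector of single-floats."
  (map '(vector single-float)
       (lambda (sample) (sample->float sample bits signed))
       samples))
```

With this convention, `(normalise-samples '(-32768 0 16384) 16 t)` gives `#(-1.0 0.0 0.5)`, and the unsigned 8-bit midpoint 128 maps to `0.0`.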

The packer takes a set of buffers and information about the sample representation, and packs the buffers' data into a single C array, ensuring the proper sample format, layout, and sample rate. The unpacker does the opposite. Thus, if for example you have a library that decodes an audio file, you most likely need to add an unpacker segment to the pipeline that converts the audio data from the library into the proper internal format.
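The layout part of the unpacker's and packer's job can be sketched as splitting interleaved frames into one buffer per channel, and merging them back. This is a simplified, hypothetical illustration: it assumes the samples are already floats and leaves out the sample-format and sample-rate conversion that the real segments also perform.

```lisp
(defun unpack-interleaved (data channels)
  "Split interleaved DATA (L R L R ...) into a list of CHANNELS vectors,
one per channel, roughly as an unpacker segment would."
  (let* ((frames (floor (length data) channels))
         (buffers (loop repeat channels
                        collect (make-array frames :element-type 'single-float))))
    (loop for frame below frames
          do (loop for channel below channels
                   for buffer in buffers
                   do (setf (aref buffer frame)
                            (aref data (+ (* frame channels) channel)))))
    buffers))

(defun pack-interleaved (buffers)
  "Inverse of UNPACK-INTERLEAVED: merge per-channel BUFFERS into a single
interleaved vector, roughly as a packer segment would."
  (let* ((channels (length buffers))
         (frames (length (first buffers)))
         (data (make-array (* frames channels) :element-type 'single-float)))
    (loop for channel from 0
          for buffer in buffers
          do (loop for frame below frames
                   do (setf (aref data (+ (* frame channels) channel))
                            (aref buffer frame))))
    data))
```

Unpacking the stereo data `#(1.0 -1.0 0.5 -0.5)` yields a left buffer `#(1.0 0.5)` and a right buffer `#(-1.0 -0.5)`, and packing those two buffers restores the original vector.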

So, for a very simple example of taking two audio files, mixing them together, applying a reverb effect, and then playing them back, the pipeline would need to look something like this:

[Figure: simple pipeline]

We can get away with assigning the same two buffers to both unpackers here by using a special property of the basic-mixer segment. Instead of processing the two unpackers manually in our pipeline, we can set the segment property on the basic-mixer's inputs, which tells the basic-mixer to process those segments itself. This way, the mixer runs the segment that produces each input right as it mixes that input in, which removes the need to allocate individual buffers for each input to the mixer. This is one of the design decisions that still bothers me a bit, but I found it necessary after discovering that I would otherwise need to allocate a huge number of buffers to allow playback of many sources simultaneously.
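The buffer-saving trick can be sketched like this: each source is modelled as a function that fills a buffer on demand, and the mixer calls each source itself, accumulating the result before the next source reuses the same scratch buffer. All names here are hypothetical stand-ins for the unpacker and basic-mixer segments, not cl-mixed's API.

```lisp
(defun make-constant-source (value)
  "Hypothetical stand-in for an unpacker: fills BUFFER with VALUE."
  (lambda (buffer samples)
    (fill buffer value :end samples)))

(defun mix-sources (sources output samples)
  "Add the output of every source in SOURCES into OUTPUT. Each source is a
function of (BUFFER SAMPLES) that the mixer calls itself, so one scratch
buffer suffices no matter how many sources there are."
  ;; Allocated per call here for brevity; a real mixer would keep it around.
  (let ((scratch (make-array (length output) :element-type 'single-float
                                             :initial-element 0.0f0)))
    (fill output 0.0f0 :end samples)
    (dolist (source sources output)
      ;; Let the source produce its samples on demand into the shared buffer...
      (funcall source scratch samples)
      ;; ...and accumulate them before the next source overwrites it.
      (loop for i below samples
            do (incf (aref output i) (aref scratch i))))))
```

Mixing two constant sources of `0.25` and `0.5` into a four-sample output buffer leaves every sample at `0.75`, while only a single scratch buffer was ever allocated.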

As it currently stands, libmixed includes segments to mix audio by either just adding samples, or through 3D spatial positioning of the source. It also includes segments to change volume and pan, to fade in and out, to generate simple sawtooth, square, triangle, or sine waves, and to include LADSPA plugins in the pipeline. I'd like to add a bunch more effects segments to it to make it more useful for real-time sound processing, but I haven't felt the motivation to get into that yet. If you're interested in sound processing and would be willing to do this, let me know!
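For reference, the four basic waveforms mentioned can be sketched as functions of the phase, given as a fraction of one period. This is only an illustration in Lisp; libmixed's generators are written in C, and the function names, phase convention, and lack of band-limiting here are assumptions that may not match the actual segments.

```lisp
(defun wave-sample (shape phase)
  "Value in [-1, 1] of waveform SHAPE at PHASE, a fraction of a period in [0, 1)."
  (ecase shape
    (:sine (sin (* 2 pi phase)))
    (:square (if (< phase 0.5) 1.0d0 -1.0d0))
    (:sawtooth (- (* 2 phase) 1))
    (:triangle (- 1 (* 4 (abs (- phase 0.5)))))))

(defun generate-wave (shape frequency samplerate samples)
  "Return a vector of SAMPLES single-floats of SHAPE at FREQUENCY Hz."
  (let ((buffer (make-array samples :element-type 'single-float)))
    (dotimes (i samples buffer)
      ;; Position within the current period, wrapped into [0, 1).
      (let ((phase (mod (/ (* i frequency) samplerate) 1)))
        (setf (aref buffer i) (float (wave-sample shape phase) 1.0f0))))))
```

At one cycle per four samples, `(generate-wave :sawtooth 1 4 4)` gives `#(-1.0 -0.5 0.0 0.5)` and `(generate-wave :square 1 4 4)` gives `#(1.0 1.0 -1.0 -1.0)`.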

Basically, the idea of libmixed boils down to this: there are segments that have properties, inputs, and outputs. You can write and read properties, and connect buffers to the inputs and outputs. You can then tell the segment to process a number of samples, and it will read from its input buffers and write to its output buffers. All of this works through a struct that contains a set of function pointers to perform these actions. It is thus very easy to add further segments to libmixed, even from an outside library: simply produce a struct that holds the appropriate function pointers to the functions that do what you want. This is also how cl-mixed allows you to write segments in Lisp.
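The "struct of function pointers" idea translates naturally into a Lisp structure whose slots hold closures. The sketch below is purely illustrative: the slot names, the calling conventions, and the volume segment are hypothetical and do not reproduce libmixed's actual segment struct or cl-mixed's API.

```lisp
(defstruct segment
  "A segment as a bundle of functions, in the spirit of libmixed's struct
of function pointers. Slot names are illustrative only."
  (process (constantly nil) :type function)      ; (lambda (samples))
  (set-input (constantly nil) :type function)    ; (lambda (index buffer))
  (set-output (constantly nil) :type function)   ; (lambda (index buffer))
  (set-property (constantly nil) :type function) ; (lambda (name value))
  (get-property (constantly nil) :type function)) ; (lambda (name))

(defun make-volume-segment (&optional (volume 1.0f0))
  "Build a hypothetical one-input, one-output segment scaling samples by VOLUME."
  (let ((input nil) (output nil))
    (make-segment
     :set-input (lambda (index buffer)
                  (declare (ignore index))
                  (setf input buffer))
     :set-output (lambda (index buffer)
                   (declare (ignore index))
                   (setf output buffer))
     :set-property (lambda (name value)
                     (ecase name (:volume (setf volume value))))
     :get-property (lambda (name)
                     (ecase name (:volume volume)))
     ;; Read SAMPLES values from the input buffer, write them scaled to the output.
     :process (lambda (samples)
                (dotimes (i samples)
                  (setf (aref output i) (* volume (aref input i))))))))
```

A caller only ever goes through the slots: connect buffers with `segment-set-input` and `segment-set-output`, set `:volume` through `segment-set-property`, then `(funcall (segment-process segment) samples)`. Any new segment type just supplies a different set of closures.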

Ascending from the C world to the C+L world then leads us to cl-mixed, the binding and wrapper library for libmixed. It takes care of all the hairy low-level work of interacting with the C library, tracking and allocating foreign memory, and so forth. As mentioned, it also gives you a simple interface to write your own segments in Lisp, which is really handy for prototyping an effect.

While libmixed is a neat framework to base your sound processing around, it doesn't exactly make most of the common tasks very convenient. Usually you have some audio files that you would like to play back, and maybe apply some effects to them. This is where Harmony comes in.

Harmony takes libmixed's generic view of segments and extends it to include sources, drains, and mixers. Sources are segments with no inputs, drains are segments without outputs, and mixers take a number of inputs that can vary at run time. It also greatly simplifies pipeline construction by handling buffer allocation for you. It does this with the help of a graph library called Flow. More on that later. Harmony also gives you a sound server object that handles the mixing in the background, allowing you to focus on just adding, removing, and changing sources in your program. Finally, Harmony includes a number of pre-made sources and drains that either connect to other libraries or provide native implementations. Currently, it supports playing back MP3, WAV, FLAC, and raw buffers, and supports outputting to out123, OpenAL, WASAPI, CoreAudio, ALSA, PulseAudio, and raw buffers.

The easiest way to get started is to use the harmony-simple system, which assembles a default pipeline for you, allowing you to just directly play some stuff.

 (ql:quickload :harmony-simple)
 (harmony-simple:play #p"my-cool-music.mp3" :music)
 (harmony-simple:play #p"kablammo.wav" :sfx)

Assembling your own pipeline isn't very difficult, though. It comes down to telling Harmony how to connect the inputs and outputs between segments. An optimal buffer layout is computed automatically from the segments' properties and the graph that you describe through the connections. To do this, Harmony uses the Flow library to describe segments in a graph. Unlike most graph libraries, Flow assigns distinct input and output ports to each vertex. These ports have properties like their arity, direction, and so on. For instance, the basic-mixer from the previously illustrated pipeline would be a vertex with one input port of arbitrary arity and two output ports of single arity. Flow then employs a simple algorithm to assign colours to the edges in such a way that no two inputs of a vertex share a colour, and no input and output of a vertex share a colour unless the vertex is marked as in-place. Each colour then corresponds to one buffer. This kind of allocation problem has cropped up in a couple of places, so I've been able to use Flow for it in other projects as well. I don't think it's important to know how the algorithm works, but in case you're interested, the source is commented and pretty short.
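To give a feel for the kind of constraints involved, here is a greedy edge-colouring sketch. It is not Flow's algorithm or API: edges are hypothetical `(FROM . TO)` conses, keeping a vertex's outputs apart is an extra assumption, and it only enforces these local, per-vertex constraints. It does not check whether two buffers are live at the same time elsewhere in the serialised order, which a complete allocator would also need to account for.

```lisp
(defun edges-conflict-p (a b in-place)
  "True if edges A and B, each a (FROM . TO) cons, may not share a colour."
  (or (eql (cdr a) (cdr b))   ; two inputs of the same vertex
      (eql (car a) (car b))   ; two outputs of the same vertex (assumption)
      ;; An input and an output of the same vertex, unless it works in place.
      (and (eql (cdr a) (car b)) (not (member (car b) in-place)))
      (and (eql (cdr b) (car a)) (not (member (car a) in-place)))))

(defun colour-edges (edges &key in-place)
  "Greedily give each edge in EDGES the smallest colour (buffer index) not
used by an already-coloured conflicting edge. Returns colours parallel to EDGES."
  (let ((coloured '()))  ; list of (edge . colour) for edges handled so far
    (dolist (edge edges)
      (let ((taken (loop for (other . colour) in coloured
                         when (edges-conflict-p edge other in-place)
                           collect colour)))
        (push (cons edge (loop for c from 0
                               unless (member c taken) return c))
              coloured)))
    (mapcar #'cdr (nreverse coloured))))
```

For a chain `src -> volume -> sink`, the two edges get colours `(0 1)`, but if `:volume` is passed in `:in-place`, both can share buffer `0`.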

How to write new sources and segments, and how to assemble your own pipeline, are already illustrated pretty succinctly in the documentation, so I suggest you check it out if you're interested in working on that kind of thing. Harmony is primarily geared towards use in games, where simple pipelines and immediate playback of a variety of audio sources are necessary. However, I can also see it being used in some kind of digital audio workstation, where a graphical user interface could let you put together segments and configure them, mapping to a libmixed pipeline underneath.

I feel like I've been rambling on tangents for a bit here, but I suppose the reason is that Harmony doesn't really do all that much. At best, it smooths over the last remaining rough corners of libmixed's C heritage and adds some useful segments to the library. All the concepts used and all the sound technology behind it lie more in libmixed's hands, though, and I think I've already explained all of that.

So, to close off: if you're thinking of doing some kind of digital audio processing in Lisp, keep Harmony and cl-mixed in mind. I'm also more than open to feedback and suggestions, so if you have any ideas or things you'd like to add to the projects, head over to the issue trackers on GitHub, or talk to us on Freenode/#shirakumo.

Written by shinmera