Following up on the discussion yesterday, today I did a case study on spatialisation in Logic Pro to get a sense of the workflow in DAW (Digital Audio Workstation) software, and whether the stratified approach makes sense there.
The Logic Pro 9 project is based on a MIDI file of the 1st movement of Beethoven's String Quartet Op. 18 No. 1, retrieved from Kunst der Fuge, an eminent online resource for classical music MIDI files.
The four voices are all synthesized using the Audio Unit instrument Synful Orchestra v. 2.5.2. Synful uses reconstructive phrase modeling to map gestural performance data to the synthesis engine. Further details on Synful can be found here.
On the master channel two effect processes are applied as inserts: the AudioEase Altiverb v. 6 convolution reverb (Audio Unit plug-in) and the Equalizer that ships with Logic.
The resulting audio file is provided above. The Logic project can be downloaded here.
From this quick test the layered approach does seem relevant for spatialisation within a DAW context, and the various layers are easily identified. The additional layers proposed by Spat, as discussed yesterday, also seem relevant to this example:
Authoring: Synful offers possibilities for describing the positioning of the musicians on the virtual stage. In this project the virtual musicians are positioned in the standard way for a string quartet: centered on the stage with Violin 1 – Violin 2 – Viola – Cello from left to right (as seen from the audience), and Violin 2 and Viola slightly further back on the stage than Violin 1 and Cello. Neither musicians nor listener move, so the scene description is static. If dynamic repositioning were desired, it would be difficult to achieve with Synful: the Synful VST plug-in does not offer any parameters for automation, and the Audio Unit seems equally limited in that respect.
Source pre-processing: Synful uses the positions of the musicians to emulate Interaural Time Difference (ITD) and Interaural Level Difference (ILD). Localization cues are further improved through emulation of early reflections. In this test Synful has been set to synthesize early reflections from four walls.
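Synful's actual DSP is proprietary, but the ITD cue it emulates can be roughly illustrated with Woodworth's classic spherical-head approximation. This is only a sketch of the underlying principle, not Synful's implementation; the head radius and all names below are my own assumptions.

```python
import math

HEAD_RADIUS = 0.0875    # m, a commonly used average head radius (assumption)
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def itd_woodworth(azimuth_rad: float) -> float:
    """Interaural time difference in seconds for a rigid spherical head
    (Woodworth's approximation); azimuth 0 is straight ahead, pi/2 is
    fully to one side."""
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (math.sin(azimuth_rad) + azimuth_rad)

# A violin placed 30 degrees to the right of the listener:
itd_us = itd_woodworth(math.radians(30)) * 1e6
print(f"ITD: {itd_us:.0f} microseconds")  # roughly 260 microseconds
```

Delaying one channel by this amount (and attenuating it somewhat for ILD) is the crudest form of these cues; real implementations make the ILD frequency-dependent.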
Room modeling: processing of the room model is split between Synful (early reflections) and Altiverb (convolution reverb). The Altiverb impulse responses contain both early reflections and late reverb. Gain levels for early reflections have been suppressed in Altiverb, so that they are left out in favour of the early reflections generated in Synful.
Encoding and decoding: the Logic session is stereo, and hence limited in its capacity for surround reproduction compared to the multi-speaker surround setups possible in Max, Jamoma and Spat. Still, as playback was done over headphones rather than stereo speakers, a binaural post-processing stereo plug-in was applied at the end of the insert chain on the master strip, after the post-processing discussed below.
Post-processing of output signals: EQ is applied to the output signal. This was required to improve the balance between the 1st violin and the viola and cello, as the system seemed to emphasize low and mid-range frequencies. This was adjusted partly by raising the 1st violin gain in the mix, and partly with shelf filters.
Hardware abstraction layer: Logic is set to use built-in audio.
Hardware layer: although the mix was intended for playback over stereo speakers, the case study was carried out using headphones. The binaural post-processing stereo plug-in was assumed to offer interchangeability between speakers and headphones. In reality the binaural post-processing was felt to affect not only spatial qualities but also the EQ curve of the signal, resulting in a more mellow mix with less high-frequency content. A MultiMeter plug-in for spectral analysis of the mastered signal seemed to confirm that the binaural plug-in reduced spectral energy above approximately 1 kHz.
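The MultiMeter observation can be reproduced offline by comparing the fraction of spectral energy above 1 kHz before and after processing. In this sketch a one-pole lowpass merely stands in for whatever the binaural plug-in actually does; all parameter values and names are assumptions.

```python
import numpy as np

fs = 48000
rng = np.random.default_rng(1)
x = rng.standard_normal(fs)  # 1 s of white noise standing in for the mix

def one_pole_lowpass(sig, fc, fs):
    """Crude one-pole lowpass, a stand-in for the plug-in's HF roll-off."""
    a = np.exp(-2 * np.pi * fc / fs)
    y = np.empty_like(sig)
    acc = 0.0
    for i, v in enumerate(sig):
        acc = (1.0 - a) * v + a * acc
        y[i] = acc
    return y

def energy_above(sig, f_cut, fs):
    """Fraction of total spectral energy above f_cut, via an FFT."""
    spec = np.abs(np.fft.rfft(sig)) ** 2
    freqs = np.fft.rfftfreq(len(sig), 1.0 / fs)
    return spec[freqs > f_cut].sum() / spec.sum()

y = one_pole_lowpass(x, 1000.0, fs)
print(energy_above(x, 1000.0, fs), energy_above(y, 1000.0, fs))
```

For white noise at 48 kHz almost all energy sits above 1 kHz before filtering; after the lowpass the fraction drops markedly, which is the kind of shift the MultiMeter display suggested.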
Compared to the layered model proposed in Peters et al. (2009), Spat further details the DSP processing part of the spatialisation. The identification of source pre-processing (early reflections), room modeling (convolution reverb) and post-processing (EQ) all appeared relevant in this example. The separation of early reflections and late reverb also makes sense in terms of how they contribute to the spatial impression: while early reflections contribute to the localization of the source within the space, the late reverb mainly provides cues about the acoustic properties of the room the sound is situated in, offering a general colorization and blurring of the source that will often be considered aesthetically pleasing. Spat seems to offer a more precise model for specifying and processing reverberation than the model we proposed at SMC 2009.
The Synful plug-in simulates early reflections, but provides no modeling of late reverb, instead assuming that it will be handled by a subsequent reverb unit. The fact that Eric Lindemann of Synful previously worked at Ircam might help explain the separation of early and late reverb and the inclusion of early reflections in the Synful plug-in.
The added configuration options in Altiverb v. 6 compared to earlier versions enable suppression of its early reflections, in this example replaced by the output from Synful. Substituting Synful's early reflections for Altiverb's has its limitations, though: instead of convolving only the dry signal with the impulse response, the dry signal plus the early reflections from Synful are convolved, and the resulting late reverb can be expected to be denser than if Altiverb alone were used.
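Since convolution is linear, convolving (dry + early reflections) with the impulse response equals the dry convolution plus delayed, scaled copies of it: every synthesized reflection injects an extra copy of the full reverb tail. A minimal numpy sketch, with arbitrary delays and gains and a noise-like decaying IR standing in for Altiverb:

```python
import numpy as np

fs = 8000  # deliberately low rate to keep the sketch fast
dry = np.zeros(fs)
dry[0] = 1.0  # unit impulse standing in for the dry signal

# hypothetical early reflections at 10-30 ms, Synful-style (gains arbitrary)
early = dry.copy()
for delay, gain in [(80, 0.5), (160, 0.35), (240, 0.25)]:
    early[delay] += gain

# noise-like impulse response with exponential decay, standing in for Altiverb
rng = np.random.default_rng(0)
ir = rng.standard_normal(fs) * np.exp(-np.arange(fs) / (0.3 * fs))

wet_dry_only = np.convolve(dry, ir)
wet_with_early = np.convolve(early, ir)

# Each reflection injects a delayed, scaled copy of the full IR, so the
# late tail carries extra energy (and density) relative to the dry-only case.
tail = slice(fs // 2, fs)
ratio = np.sum(wet_with_early[tail] ** 2) / np.sum(wet_dry_only[tail] ** 2)
print(ratio)
```

On this toy example the late tail ends up with noticeably more energy when the early reflections are included in the convolved signal, which is the densification described above.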
As a final observation, this example illustrates that the layered approach is not necessarily strictly mirrored in the signal processing flow. In this example binaural encoding has to be considered as belonging to the decoding layer; still, the binaural plug-in is inserted at the end of the signal chain, after the EQ post-processing.
Concluding this post, I have also spent time today looking into and reading up on how surround processing is handled in Logic. It is restricted to established consumer/prosumer formats (mono, stereo, quadraphonic, 5.1, 7.1), offering no possibilities for arbitrary configurations of speakers and channels. But for the formats it does cater for, I have to say that I am pretty impressed by what it has to offer. The up-scaling of mono and stereo effects for multichannel processing very much resembles ideas we have within the Jamoma team for Jamoma Multicore effect processing.
I remain impressed with the latest version of the IRCAM Spat Max/MSP library for live spatialisation.
I have for some years maintained a small Jamoma UserLib with Jamoma module wrappers for a few of the functionalities offered by Spat. So far the purpose has been to make available to Jamoma some essential functionalities and externals from Spat that are not freely (or GNU LGPL-compatibly) available elsewhere: air filtering and binaural decoding of ambisonic B-format signals. Up until now these were based on prior (3.x) versions of Spat. Today I updated one of them, the air filter, to work with Spat 4.1.5. The other module requires more work, and might be abandoned in favour of other and better implementations. It is tempting to start a much more thorough Jamoma wrapping of all of Spat, as a supplement to the other functionalities for spatialisation already present.
The PDF documentation of Spat currently seems to be in a transitional state between versions 3.x and 4.x. This is actually a good thing, for me at least, as it reveals more of the inner logic of the system than the objects provided in Spat 4.1.5 do.
I notice structural ideas related to the stratified approach to spatialisation proposed in a paper by Nils Peters, myself and others at the SMC 2009 conference:
Stratified model according to Peters et al (2009).
In our model, the DSP processing required for spatialisation was structured according to two layers, the encoding and decoding layers.
Spat expands, or further details, this layered model. In Spat, signal processing is divided into four successive stages, separating directional effects from temporal effects:
Pre-processing of input signals (Source)
Room effects module (reverberator) (Room)
Directional distribution module (Panning)
Output equalization module (Decoding)
Quoting the documentation “the reunion of these four modules constitutes a full processing chain from sound pickup to the output channels, for one source or sound event. Each one of these four modules works independently from the others (they have distinct control syntaxes) and can be used individually. Each module has a number of attributes that allow to vary its configuration (for instance, varying complexities for the room effect module or different output channel configurations for the directional distribution module). This modularity allows easy configuration of Spat~ according to the reproduction format, to the nature of input signals, or to hardware constraints (e.g. available processing power).”
Our encoding layer covers the source, room and half of the panning modules, while our decoding layer covers the Spat decoding module as well as the remainder of the panning module. One could imagine the model having three main layers: authoring, signal processing and hardware, with further subdivision as follows:
Update: SVG files apparently do not work too well with RSS readers, so the last image shows up cropped or not at all in the feeds I have tested so far.
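The mapping of Spat's four stages onto the encoding and decoding layers can also be sketched as a plain signal chain in code. The stage bodies below are placeholders that only record ordering, not Spat's actual DSP, and all names are mine:

```python
def stage(name):
    """A pass-through DSP stage that records its name, so the ordering
    of the processing chain can be inspected."""
    def f(state):
        signal, trace = state
        return signal, trace + [name]
    return f

def chain(stages):
    """Compose stages left to right into a single callable."""
    def run(state):
        for s in stages:
            state = s(state)
        return state
    return run

# Spat's four successive stages, grouped under the two DSP layers of
# Peters et al. (2009); note that panning straddles both layers.
source = stage("source")      # pre-processing of input signals
room = stage("room")          # room effects (reverberator)
panning = stage("panning")    # directional distribution
decoding = stage("decoding")  # output equalization / decoding

encoding_layer = chain([source, room, panning])
decoding_layer = chain([decoding])

process = chain([encoding_layer, decoding_layer])
signal, order = process(([0.0], []))
print(order)  # ['source', 'room', 'panning', 'decoding']
```

The point of the sketch is simply that each stage works independently and only the composition order defines the full chain, mirroring the modularity described in the quoted documentation.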
ControllerMate is a controller programming tool that allows custom functionality to be added to standard HID devices (joysticks, trackballs, gamepads, keyboards, and more).
Using a graphical interface and drag-and-drop editing, users can program controller buttons to perform complex keyboard and mouse sequences. Programming is accomplished using “building blocks”. Each type of building block performs a different type of function. Building blocks can be individually configured and linked together to perform an endless variety of tasks.
I’m slowly starting work on a number of papers, in collaboration with fellow Jamoma developers. In the usual way it starts off with a bit of procrastination, and downloading the latest version of MacTeX, TeX for Mac, seemed a valid excuse.
For quite a while I have used TextMate for work on LaTeX and ConTeXt, but the latest version of TeXShop (included as part of MacTeX) comes with templates and a palette that seem to make it quite efficient for typing up equations and other kinds of stuff using obscure commands whose syntax I never remember.
Earlier this fall I invested in the latest version of Final Cut. I mainly do video editing when working on documentation of projects, and I’m currently working on documentation of the sound installation I did at Fjell Festning a little more than a year ago.
The possibilities for exchange of projects back and forth between Final Cut and SoundTrack (or Logic) are truly great, but come with a few gotchas.
Having captured the video documentation, I started out organizing similar shots into groups, and then decided in separate sequences which takes to use from most of them. Afterwards I composed the main flow of the documentation video by organizing these nested sequences in the main sequence, and deciding where to start and end each of them.
So far so good. Only afterwards did I discover that the sound and video of nested sequences are not exported to SoundTrack. The video part I can circumvent by rendering the video, but I can't find any way of getting the audio across.
For this project it is not too much of a problem, as I am going to scrap most of the audio anyway, instead using separate audio recordings done at the space.
But for the future it's worth remembering. And I haven't managed to find a way to un-nest the sequences.
That is when I realized how spoiled I am in Max, with its ability to encapsulate and de-encapsulate subpatches.
In a discussion with a student today, different ways of listening to, thinking about, analyzing and creating sound came up. In a few minutes the following were identified; more could probably be added (insert bibliography here):
Language. Symbolic meaning
References to or representations of external physical objects and phenomena
Vertical composition: The textural surface of sound. Combination of simultaneous layers. Spectral distribution of material
Horizontal composition: Disposition in time of elements, form, fluctuations and energies
The gestural and bodily qualities of sound. How it resonates and engages (or not) with the body
The different modes of listening are not mutually exclusive or competing; it is more like viewing the same sculpture from different angles.