The Role of Audio in Virtual Reality

As virtual reality (VR) is implemented more widely and new use cases accumulate daily, it is clear that the soundfield is a critical part of the immersive experience. How do you create a convincing 3D sound experience in virtual reality? In this article, TEECOM defines best practices for the acquisition, production, and distribution of audio when creating VR content, including audio/video synchronization and spatialization.

Realistic and synchronized audio will increase the user’s perception of immersion in the virtual reality environment and reduce physiological side effects common to today’s productions, including headaches, motion sickness, and fatigue. The following best practices may seem elementary, but each step is important enough to be respected as a separate element in the production chain. These guidelines should not be considered comprehensive, but they provide a starting point for further exploration as this field evolves. Existing audio production methodology must be tailored to the particular demands of a successful VR mix.

How to Create a Convincing Virtual Reality 3D Sound Experience — ©ISE2017

Production Guidelines for Realistic 3D Sound

To create a convincing virtual reality experience and mitigate the physiological side effects of wearing a virtual reality headset, the complete audio signal chain requires more attention than it received in earlier kinds of production. Every step in the audio production workflow must be scrutinized. The audio mixer for VR must be able to monitor and play back the mixed and processed audio on headphones in real time. This includes any compression applied on export, head-related transfer function (HRTF) processing, and psychoacoustic calibration.

Audio Acquisition

Avoid audio assets in MP3 format or other lossy-audio formats. This includes existing audio libraries and new audio acquisitions. The processing required for VR productions will create audible artifacts in the audio elements and prevent the audio from being placed accurately in a soundfield.

Avoid recordings of localized sounds (shoes, clicks, door squeaks, and other sound effects) containing room noise/reverb/convolution. The effects and processing required to create the VR room environment will be applied to every sound you mix when it is processed for listening in a 360-degree (4πr²) environment. If multiple convolutions are applied to a recording, the spatialization is distorted and accurate localization will be impossible.
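To illustrate the point about multiple convolutions, here is a minimal sketch in Python. It assumes a toy exponentially decaying tail as a stand-in for a real room response. The names `decaying_tail`, `recorded_room`, and `virtual_room` are hypothetical and chosen only for this illustration. It compares a dry click placed in the virtual room with a click that already carries room sound.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def decaying_tail(length, decay):
    """Toy room response (an assumption): an exponentially decaying tail."""
    return [decay ** n for n in range(length)]


dry_click = [1.0]                       # ideal dry, close-miked transient
recorded_room = decaying_tail(8, 0.5)   # room sound baked into the recording
virtual_room = decaying_tail(8, 0.5)    # room applied later by the VR renderer

# Dry asset: only the virtual room shapes the sound.
dry_in_vr = convolve(dry_click, virtual_room)

# Wet asset: the recorded room and the virtual room are both applied.
wet_in_vr = convolve(convolve(dry_click, recorded_room), virtual_room)

print("dry asset in VR room:", [round(x, 3) for x in dry_in_vr])
print("wet asset in VR room:", [round(x, 3) for x in wet_in_vr])
```

In this toy case the dry asset keeps a single sharp onset followed by the virtual room's decay, while the wet asset comes out almost twice as long and its second sample is as loud as its first. That smeared onset is the kind of cue that makes a sound harder to localize.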

Use, at minimum, 24-bit, 48 kHz recordings. Recordings at 16-bit, 44.1 kHz may not meet the heavy processing demands of VR. Higher sample rate recordings are less susceptible to phase distortion caused by the multiple layers of filters applied during post-production. Higher bit depths will sound more natural when processed for convolution.

You may have to rethink which sound libraries you are using and how you are capturing audio assets. Microphone location will be key in the recording process. Placement should be close enough to the sound source to prevent room sound and any other undesired sounds from being recorded, but far enough away to capture the entire sound rather than a single element of it.

If you are a sound effects library creator, issue a sound library with robust content that is resilient to heavy post-processing. Ideally, sounds should be natively recorded as 24-bit/96 kHz, or at least 24-bit/48 kHz, uncompressed audio files.
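The acquisition minimums above can be checked automatically before an asset enters a project. The following is a minimal sketch using Python's standard `wave` module. The function names `meets_vr_minimum` and `check_wav` are hypothetical, and the thresholds simply restate the 24-bit, 48 kHz minimum given above.

```python
import wave

MIN_SAMPLE_WIDTH_BYTES = 3     # 24-bit samples
MIN_SAMPLE_RATE_HZ = 48_000    # 48 kHz


def meets_vr_minimum(sample_width_bytes, sample_rate_hz):
    """Return True if the format is at least 24-bit and 48 kHz."""
    return (sample_width_bytes >= MIN_SAMPLE_WIDTH_BYTES
            and sample_rate_hz >= MIN_SAMPLE_RATE_HZ)


def check_wav(path):
    """Read a WAV header and test it against the minimums above.

    Caveat: the standard-library reader handles PCM WAV files only, and
    some Python versions reject WAVE_FORMAT_EXTENSIBLE headers, so a
    production tool would likely need a more capable audio library.
    """
    with wave.open(path, "rb") as wav_file:
        return meets_vr_minimum(wav_file.getsampwidth(),
                                wav_file.getframerate())
```

For example, `check_wav("footstep.wav")` (a hypothetical file name) would return `False` for a 16-bit, 44.1 kHz file. A header check of this kind only confirms the container format. It cannot tell whether the audio was upsampled from a lossy or lower-resolution source.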

Post-Processing and the Digital Audio Workstation (DAW)

Avoid importing assets in a way which alters the phase or distorts the high-frequency content of the audio. Audio should be imported native to the file format or converted only to a similar, uncompressed format (e.g., converting .wav to .aiff or .wav64).

Avoid applying processing that alters phase or removes high-frequency content, such as dynamic compression, file compression, low-quality 3D spatialization plug-ins, and complex or heavy equalization. All effects should be phase-linear.
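As background on what "phase-linear" means in practice: a finite impulse response (FIR) filter whose coefficients are symmetric delays every frequency by the same amount, so it shifts the signal in time without smearing phase relationships. The sketch below designs a small windowed-sinc low-pass filter in plain Python and checks that property. The function names, the 101-tap length, and the 20 kHz cutoff are assumptions made for illustration, not settings recommended by this article.

```python
import math


def windowed_sinc_lowpass(num_taps, cutoff_hz, sample_rate_hz):
    """Design a Hann-windowed sinc low-pass FIR filter (odd num_taps)."""
    center = (num_taps - 1) / 2
    fc = cutoff_hz / sample_rate_hz  # cutoff in cycles per sample
    taps = []
    for n in range(num_taps):
        x = n - center
        if x == 0:
            ideal = 2 * fc
        else:
            ideal = math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)  # normalize so the gain at 0 Hz is 1
    return [t / gain for t in taps]


def is_linear_phase(taps, tolerance=1e-12):
    """Symmetric coefficients imply linear phase (constant group delay)."""
    return all(abs(a - b) <= tolerance
               for a, b in zip(taps, reversed(taps)))


SAMPLE_RATE_HZ = 96_000
taps = windowed_sinc_lowpass(num_taps=101, cutoff_hz=20_000,
                             sample_rate_hz=SAMPLE_RATE_HZ)

# A symmetric N-tap filter delays all frequencies by (N - 1) / 2 samples.
group_delay_samples = (len(taps) - 1) / 2
group_delay_ms = 1000 * group_delay_samples / SAMPLE_RATE_HZ

print("linear phase:", is_linear_phase(taps))
print(f"group delay: {group_delay_samples:.0f} samples "
      f"({group_delay_ms:.3f} ms)")
```

The trade-off is latency: the constant delay grows with filter length, which matters when the processed audio must also be monitored in real time.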

If you are a DAW product creator or plug-in manufacturer, consider releasing a set of commonly used video post-production plug-ins with little or no phase alteration. Create a plug-in that analyzes audio assets and recommends whether each asset can withstand heavy processing without degrading the sound in a way that harms the VR experience.

Ensure your VR production tools operate natively at a minimum of 24-bit, 48 kHz (24-bit/96 kHz recommended). This increased bit depth and sample rate will help mitigate damaging effects from the intense processing involved in creating the VR audio environment.

Rendering for VR Playback and Exporting from the DAW

This step may be separated into two processes, depending on the exported formats and the post-processing methodology.

Avoid using a single-mix format for all playback situations. For instance, do not mix for 5.1 and process the 5.1 mix for VR. Create separate mixes for each playback format.
Use a VR algorithm that lends itself to conversion to an optimized HRTF (O-HRTF) or personalized HRTF (P-HRTF) with as little distortion as possible.

Distribution to the End User

Avoid lossy compression when processing the content for distribution. Your production will suffer if all the work and care taken to preserve accurate audio is lost in the distribution chain.

Ideally, a sample rate of 48 kHz or higher will be preserved in the final audio format. This helps avoid the phase distortion introduced by the band-limiting filters required for 44.1 kHz audio, which affects spatialization[1].
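A short worked calculation shows why the band-limiting filter is harder to build at 44.1 kHz. The filter has to pass the audible band and reject everything from the Nyquist frequency (half the sample rate) upward, so the room left for it to roll off shrinks as the sample rate drops. The sketch assumes a 20 kHz passband edge, a common convention rather than a figure from this article. The function name is hypothetical.

```python
AUDIBLE_BAND_EDGE_HZ = 20_000  # assumed upper edge of the passband


def transition_band_hz(sample_rate_hz,
                       passband_edge_hz=AUDIBLE_BAND_EDGE_HZ):
    """Width available for the filter to roll off before Nyquist."""
    nyquist_hz = sample_rate_hz / 2
    return nyquist_hz - passband_edge_hz


for rate_hz in (44_100, 48_000, 96_000):
    print(f"{rate_hz} Hz: Nyquist {rate_hz / 2:.0f} Hz, "
          f"transition band {transition_band_hz(rate_hz):.0f} Hz")
```

Under that assumption, 44.1 kHz leaves 2,050 Hz for the roll-off, 48 kHz leaves 4,000 Hz, and 96 kHz leaves 28,000 Hz. A narrower transition band calls for a steeper filter. The arithmetic does not say whether the result is audible; that is the question the cited paper examines.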

Playback Hardware: Headphones and Goggles

Use high-quality headphones capable of producing as accurate a soundstage and spatialization as possible. THX and Oculus discourage the use of earbuds; their tests have shown that earbuds do not provide an accurate soundstage, though formal research using P-HRTFs and a calibrated playback system has not supported these anecdotes. Further studies and tests are required.

Playback Software: HRTFs and Calibration

Incorporate P-HRTFs as part of the playback software.
Include a way of calculating the size of the listener’s head. This could be as simple as a measurement device built into the headband of the VR headset.

Include a final step of calibrating the audio after the P-HRTF and head size are applied. Each pair of headphones and each set of human ears will locate sounds in the VR world with slight to large variations. Furthermore, recent experiments have shown that P-HRTFs alone do not guarantee accurate results, because of how the human brain processes audio. Upwards of 25 percent of subjects using a P-HRTF alone do not localize sounds where the other 75 percent hear them. These localization errors range from a sound that should be localized being reproduced as a diffuse blob, to a sound that should be in front of the listener seeming to come from inside the listener’s head or from behind.
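To see why the head-size measurement suggested above matters, consider the interaural time difference (ITD), the difference in arrival time between the two ears that the brain uses to judge direction. The sketch below uses Woodworth's spherical-head approximation, a textbook model brought in here for illustration and not a method this article prescribes. The function name, the sample radii, and the speed of sound are assumptions.

```python
import math

SPEED_OF_SOUND_M_PER_S = 343.0  # assumed: dry air at about 20 °C


def interaural_time_difference_s(head_radius_m, azimuth_deg):
    """Woodworth's spherical-head ITD estimate for a distant source.

    azimuth_deg is measured from straight ahead (0) to directly
    beside one ear (90); the approximation is not used outside that
    range here.
    """
    if not 0 <= azimuth_deg <= 90:
        raise ValueError("azimuth_deg must be between 0 and 90")
    theta = math.radians(azimuth_deg)
    return (head_radius_m / SPEED_OF_SOUND_M_PER_S
            * (theta + math.sin(theta)))


# Hypothetical small, average, and large head radii in meters.
for radius_m in (0.080, 0.0875, 0.095):
    itd_us = 1e6 * interaural_time_difference_s(radius_m, 90)
    print(f"radius {radius_m * 100:.2f} cm: ITD at 90 degrees "
          f"is about {itd_us:.0f} microseconds")
```

In this model the ITD scales in direct proportion to head radius. A renderer that assumes one fixed head size therefore presents timing cues that are slightly wrong for most listeners, which is one reason a measurement step and a final calibration are both worthwhile.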

Audio must be in sync with the video. Target a maximum audio latency of roughly 2-3 ms. Much of the content currently available has up to 10 ms of audio latency.

The following table summarizes the above:

Audio Serves the Immersive Experience

It is only when these elements, along with video resolution with respect to visual acuity and frame rates, are considered that a convincing and non-harmful virtual reality experience is achieved.

The experience must be served above all other considerations. We need to recognize the limitations of our tools and play within the boundaries of those limitations. All the elements that make up an experience must serve the intent of the creator and the experience. Let the art guide the technology. As George Massenburg says, “If we think it can be better, we should at least try.”

[1] Jackson, Helen M.; Capp, Michael D.; Stuart, J. Robert. The Audibility of Typical Digital Audio Filters in a High-Fidelity Playback System. AES Paper 9174. 2014.