When Apple’s virtual assistant, Siri, was first released in 2011, it was the first voice recognition tool with basic contextual knowledge of user information. Fast forward to 2022, and most consumer electronics brands ship some kind of digital virtual assistant within their products. As brands explore how the voice of a virtual assistant can engage audiences, it will become increasingly critical to create differentiated voice experiences that set them apart from the competition.
One issue that some consumers notice is the uniformity of these virtual assistant voices (typically monotone, somewhat robotic-sounding female voices), making it difficult to distinguish one brand’s assistant from another’s. Industry leaders in virtual assistant technology, like Amazon’s Alexa and Apple’s Siri, relay information with a default female voice. Likewise, automated telephone directories typically use a similar guidance voice.
Amazon found that test users of their Alexa product responded more strongly to a female voice than to the male equivalent. Microsoft, while developing their Cortana virtual assistant, found that a female voice best embodied the expected qualities of an assistant: helpful, supportive and trustworthy.
However, the real reason that most of the voices we associate with faceless artificial intelligence are female is that the original text-to-speech systems were trained mostly on female voices. Early developers believed that women tend to articulate vowel sounds more clearly, and that the higher pitch of a woman’s voice makes speech easier to understand.
Of course, today there are exceptions, and various virtual assistant voice options are becoming available. I predict that over the coming years, development in virtual assistant voice technology will lead to major brands having their own, unique, marketable voices.
Rat Scabies is a musician, best known as the drummer of The Damned. He has been in and around studios as both a musician and a producer since the 1970s. In 1995, my dad toured with The Damned playing the Hammond organ, and he has kept in contact with Rat ever since. Recently I had a conversation with Rat about how the recording industry and the recording process have changed in line with technological developments since the 1970s.
I was somewhat surprised by Rat’s acceptance of some new technologies, specifically his use of electronic drum kits rather than a live acoustic kit. In my mind, I had this picture of Rat laying down his drum tracks on a live kit with an abundance of cymbals and toms. In reality, Rat considers today’s electronic drum kits to produce as authentic a sound as a live kit, with far more versatility: for a fraction of the price of a single live snare drum, one can purchase a sound pack containing hundreds of sampled snares. Working in MIDI also allows far easier quantisation and editing of individual elements of the kit, making recording and editing much faster and less strenuous. With a live kit, one must consider phase between microphones, and any quantisation must be applied across all elements of the kit at once. Traditionally, a recording with a significant error would be discarded; with MIDI, it can simply be edited afterwards to remove the error.
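To illustrate why MIDI quantisation is so painless compared with editing multitracked audio, here is a minimal Python sketch of the core idea: snapping slightly early or late note onsets to a 16th-note grid. The pattern and note numbers are invented for illustration, and this is not any particular DAW’s algorithm.

```python
# Minimal sketch of MIDI-style quantisation: snap note onsets to a
# 16th-note grid. Times are in beats; notes are (onset, pitch) pairs.

def quantise(notes, grid=0.25):
    """Snap each note onset to the nearest multiple of `grid` beats."""
    return [(round(onset / grid) * grid, pitch) for onset, pitch in notes]

# A slightly sloppy drum pattern (kick = 36, snare = 38 in General MIDI)
played = [(0.02, 36), (0.48, 38), (1.07, 36), (1.52, 38)]
print(quantise(played))  # -> [(0.0, 36), (0.5, 38), (1.0, 36), (1.5, 38)]
```

Because each MIDI note is an independent event, each drum hit can be moved individually; with multi-miked audio, the equivalent edit would have to be applied across every microphone’s track at once to preserve phase.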
Unsurprisingly, Rat considers the arrival of the computer and the development of DAWs to have had the most significant effect on the recording process: it has made recording accessible to anyone. Nowadays, Rat does most of his recording from home, saving the large sums he would once have paid in studio costs. The development of virtual instruments, and even virtual musicians, has also enabled individuals to create pieces of music that in the past would have required whole ensembles. It opens up music composition and production to a new, ‘untrained’ demographic. Rat used the example of The Doors: rather unremarkable musicians (they certainly wouldn’t compare technically to many of their classically trained counterparts), but very capable of creating great, popular music. These days, you don’t even have to be able to play an instrument to make music!
I really enjoyed my conversation with Rat, and it was fascinating to hear his perspective on developments in recording technology and the current state of the music recording industry. Rat is due to rejoin The Damned for a reunion tour in 2022, and I look forward to seeing him at the Manchester show (yes, I already have my ticket!).
In recent years, especially in light of the COVID pandemic, event promoters have been forced to transition from live events to digital events. This has driven some major advances in audio-visual technology, producing the resources necessary to deliver a digital experience similar to what you would get at a live event.
Travis Scott’s Fortnite Concert
Given the restrictions on large gatherings during the pandemic, musical artists have been experimenting with digital performances. On 23 April 2020, over 12 million Fortnite players attended a 10-minute virtual performance by Travis Scott, using Fortnite’s multiplayer online video game servers as a massive ‘venue’.
Watch the performance here:
The NBA Restart
The 2019-20 NBA season restarted in July 2020, after shutting down for over four months. With no fans allowed into the league’s “bubble” to attend the games physically, the NBA opted for a digital approach. Using Microsoft Teams’ Together mode, which applies AI segmentation technology to bring people together into a shared environment like a conference room, coffee shop or, in this case, an arena, the NBA gave fans the opportunity to watch games while being projected live onto in-arena video boards where the crowd would typically sit, enabling them to offer their support and be seen and heard despite being unable to attend physically.
To enhance this at-home fan experience, the NBA made several changes to their broadcast. They installed dozens of additional cameras in new positions closer to the court to showcase intimate, never-before-seen angles. They also placed high-fidelity microphones around the court to capture enhanced sounds like the squeak of trainers and the bounce of the ball.
The Future of Digital Events
One of the biggest concerns with digital events is maintaining the engagement of the participants or audience. Boredom is a problem even at live events, but at digital events it is critical to hold the audience’s interest, as bored participants can simply switch off at the push of a button. I think a lot of work will be required to engage a digital audience for as long as a live audience, particularly for musical performances.
While the future of live events remains uncertain, I think that digital events will shape the way consumers engage and interact with live content, and some elements of digital events may be adopted by live events when they return in their full glory. However, I think that digital events lack the spectacle of a live experience, and that will be difficult to replicate.
“Augmented reality, the ability to provide additional information through visual, auditory, even touch, all those technologies are evolving rapidly” – Michael Knappe (Head of Technology and Lifestyle Audio at Harman International)
Audio augmented reality (AAR) is an interactive experience that enhances the real world through digital audio stimuli, and it is becoming the new selling feature of headphones. Instead of having to look at your phone for information like texts, notifications and emails, your headphones can simply deliver that information into your environment through sound. The technology has many theoretical applications and is already being implemented in some interesting ways.
PairPlay
PairPlay is a role-playing interactive adventure game in which you and a partner split a set of AirPods (one ear each) and each hear opposite sides of the same story. The app uses immersive audio through various challenges, turning the user’s home (or wherever else they decide to play) into a rich cinematic universe where they can become secret agents, ghost hunters, robots and more!
Here’s the game in action:
The app is free to download from the App Store, but isn’t so free if you don’t own a pair of AirPods (it is designed specifically for them). It is also currently only available in English, and lacks built-in accessibility features, such as captions for the hard of hearing.
Translate with Google Pixel Buds
Google Pixel Buds work with Android devices to offer real-time translation of up to 40 languages into the user’s ear. There are two modes: conversation mode, which allows you to hold a two-way conversation, and transcribe mode, which lets you hear spoken language translated into your ear, accompanied by a transcript on your phone. This application of AAR quite literally breaks down language barriers and opens up a new world of opportunity for users.
Here’s Google’s promotional tutorial video showing the earbud translation in action:
VoiceMap
VoiceMap is a publishing platform and marketplace for location-aware audio tours. The service allows users to create tours of their local area, sharing stories, and enables visitors to truly immerse themselves in a location with guidance and information from locals. It gives a voice to people who have a far deeper and more personal connection with a place than a traditional tour company.
Notably, the company has worked with the Society of London Theatre to produce a tour of Theatreland in the West End narrated by Sir Ian McKellen. Iain Manley, the founder of VoiceMap, describes the company’s process of developing the tours: “We do everything from mapping it out, producing and editing a script, selecting voice artists, recording, editing the audio, then adding sound effects, through to the final publication.”
Here’s Iain telling the story of VoiceMap:
The future of AAR
Although often overshadowed by virtual reality (VR), AR is expected to dominate the market. According to estimates, the AR market will be worth $70-75bn by 2023, while VR is expected to be worth $10-15bn. AAR plays a massive role in AR, and as headphone technology improves and headphones become more comfortable and practical to wear for long periods, I can see AAR implementations becoming key sources of information in our day-to-day lives.
As part of a university project in my Bachelor’s degree, two fellow students and I re-soundtracked an iconic scene from Stanley Kubrick’s ‘2001: A Space Odyssey’. Due to copyright, the scene cannot be published here, but it is available on Google Drive via the link below:
For this project, the priority was to immerse the viewer in a realistic sonic environment. One would expect to hear some constant noise on a spaceship due to its life support and control systems. Using Ableton’s Wavetable synthesizer, I produced an evolving waveform from the two modulating oscillators shown below:
To give the waveform an atonal, machine-like rumble, I recorded a minute-long clip of two of the waveforms oscillating one semitone apart, as shown:
This atonality keeps the oscillators musically neutral, leaving scope for music to be added without potential tonal clashes.
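Ableton’s Wavetable patch itself can’t be reproduced here, but the underlying idea, two oscillators a semitone apart summed into a slowly beating, atonal drone, can be sketched in a few lines of numpy. The base frequency and plain sine waveforms below are illustrative stand-ins for the actual wavetable patch:

```python
import numpy as np
from scipy.io import wavfile

sr = 44100                      # sample rate (Hz)
t = np.arange(sr * 60) / sr     # one minute of audio
f0 = 55.0                       # base frequency (illustrative choice)
f1 = f0 * 2 ** (1 / 12)         # one semitone higher

# Summing two oscillators a semitone apart: the irrational frequency
# ratio (~1.059) keeps the result atonal and produces a slow,
# machine-like beating rumble rather than a musical interval.
osc = np.sin(2 * np.pi * f0 * t) + np.sin(2 * np.pi * f1 * t)
osc /= np.abs(osc).max()        # normalise to avoid clipping

wavfile.write("rumble.wav", sr, (osc * 32767).astype(np.int16))
```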
I exported the minute-long clip as a .wav file and resampled it in Ableton’s Simpler to produce two identical waveforms differing slightly in pitch. Since the resonant frequencies of smaller bodies are higher than those of larger bodies, higher frequencies are likely to dominate in sound conveyed through a smaller body. For this reason, I decided that the noise on the main ship and the pod should differ in pitch, with the pod pitched higher due to its smaller body. Syncing these sounds to their respective locations provides continuity between audio and video; as the location of the scene changes, so does the background audio.
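A sampler like Simpler shifts pitch by changing the sample’s playback rate, so pitch and duration change together. A crude numpy illustration of that resampling idea, assuming a one-dimensional `ship_rumble` array like the drone generated above, might look like this:

```python
import numpy as np

def pitch_shift(signal, semitones):
    """Crude resampling pitch shift, as a sampler does when a sample is
    played back at a different rate: pitch and duration change together."""
    ratio = 2 ** (semitones / 12)           # playback-rate multiplier
    idx = np.arange(0, len(signal), ratio)  # read positions in the source
    return np.interp(idx, np.arange(len(signal)), signal)

# e.g. a copy of the ship rumble a few semitones up for the smaller pod
# pod_rumble = pitch_shift(ship_rumble, +3)
```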
Perceived Silence
Given the objective of immersing the viewer in a realistic sonic environment, the sections of the scene outside both ships, in the vacuum of space, should be silent. However, total silence gave no emphasis to the dark emptiness of space as depicted in the scene, and detracted from the dramatic suspense of the unfolding events. Some audio was therefore necessary to emphasise the atmosphere whilst still being perceived as silence. For the sake of continuity, I used the same waveform as the background noise in the ships, applying Logic’s equaliser audio unit to strip away the higher frequencies so that only a slow, LFO-like movement remained, and a sub-bass audio unit to lower the frequency to a minimally perceivable level. This was inspired by a similar effect in Alfonso Cuarón’s ‘Gravity’, which uses LFOs to give an impression of the absence of sound whilst maintaining an ambience. This technique is referred to as perceived silence. Traditionally in cinema, silences were difficult to achieve due to the constant sound of the film projector. Although that issue is no longer prevalent, a typical cinema audience will always produce some noise, referred to as ‘popcorn sound’. Implementing some audio as perceived silence reduces the impact of this ‘popcorn sound’ on the audience’s audible experience.
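Logic’s plug-ins can’t be shown here, but the processing chain, a steep low-pass filter followed by a heavy level drop, can be sketched with scipy. The cutoff and gain values below are illustrative, not the settings used in the project:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def perceived_silence(signal, sr, cutoff=40.0, gain_db=-30.0):
    """Reduce a bed of noise to a barely perceivable sub-bass rumble:
    low-pass everything above `cutoff` Hz, then pull the level right down."""
    sos = butter(4, cutoff, btype="lowpass", fs=sr, output="sos")
    return sosfilt(sos, signal) * 10 ** (gain_db / 20)
```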
To add to the hectic cacophony of sound just before Dave evacuates the pod, I found samples of various siren tones on Splice (an online royalty-free sample library). I then positioned them in the timeline to provide an audible response to Dave’s interactions with his control panel.
As part of a university project in my Bachelor’s degree, two fellow students and I re-soundtracked an iconic scene from Stanley Kubrick’s ‘2001: A Space Odyssey’. Due to copyright, the scene cannot be published here, but it is available on Google Drive via the link below:
My role in ADR and dialogue editing was the recording, syncing and mixing of Dave’s speech. As with all ADR, my primary objective was to deliver the cleanest, most intelligible dialogue possible, while ensuring continuity between video and audio. Dave’s dialogue was recorded using a Rode NT5 cardioid condenser microphone through a Scarlett 2i2 audio interface connected to a MacBook Pro running Logic Pro X. A sample rate of 48kHz and a bit depth of 24-bit were set within Logic Pro X to meet the industry standard for film and TV work.
In the original scene, Dave speaks with an American accent. American pronunciation tends to be simpler, with vowel sounds that are shorter than in spoken British English. Therefore, so that the ADR would sound natural, I had to shorten my vowel pronunciation and talk faster in the recordings.
2001: A Space Odyssey is notorious for its sparsity of dialogue. Although the dialogue in this particular scene lasts only for the first two minutes, Dave remains the main focus throughout the rest of the scene, and some unscripted audio recording was therefore necessary to capture his presence and emotional state. Dave’s breaths and movements can be heard throughout the scene, so I recorded a thirty-second take of myself breathing heavily beside the microphone and, using cues from the film, synchronised the breaths in my recording with Dave’s apparent sighs and breaths. This provided an element of realism and consistency throughout the non-dialogue sections which would otherwise have been lacking.
As part of a university project in my Bachelor’s degree, two fellow students and I re-soundtracked an iconic scene from Stanley Kubrick’s ‘2001: A Space Odyssey’. Due to copyright, the scene cannot be published here, but it is available on Google Drive via the link below:
The scene depicts Dave (the astronaut protagonist) attempting to re-enter his spaceship from an external pod after retrieving the body of his colleague, who was left free-floating away from the ship following an attack by its artificially intelligent computer, HAL.
The scene is understandably tense, with Dave initially attempting to reason and argue with HAL following HAL’s refusal to allow him back onto the ship. Dave then makes preparations to re-enter the ship from his pod without his space helmet. From Dave’s perspective, this would be an incredibly stressful situation: he is certainly aware that leaving the pod for the vacuum of space is likely to lead to his death, but equally, remaining in the pod is not an option. The scene climaxes with Dave blowing the hatch of the pod to eject himself into the ship’s airlock, closing the outer door and re-pressurising the airlock.
Our objective was to reflect Dave’s emotional state throughout the scene in the soundtrack, and to depict the visualised sonic environment realistically, without sounding clichéd or forced. Following his argument with HAL, we wanted to convey Dave’s shock, anxiety and stress audibly, suggesting his temporary dissociation from reality by applying reverberation to his voice and introducing dissonant arpeggiated music. This is followed by a period of relative calm and focus as Dave is forced to release his dead colleague in order to access the airlock. As Dave approaches the airlock, the rising tension is reflected in a single-note, bass-dominant synthesised ostinato, influenced by the music associated with the approaching shark in Steven Spielberg’s Jaws. This sound builds to a crescendo as the scene progresses, with an arpeggiated sequence introduced as he aligns the pod hatch with the airlock. As he activates the controls to blow the hatch, the music morphs into an alarm sound, matching the flashing alarm light behind him in the pod. Further alarms are introduced as Dave progresses through the evacuation sequence, culminating in a cacophony of alarms and dissonant music as he prepares to exit the pod. The scene finishes with Dave ejecting into the perceived silence of the vacuum of the airlock, before closing the airlock to the sound of re-pressurisation.
Mastering is the final step in audio post-production: balancing all the elements of a track so it sounds consistent across all platforms. Historically, this has been done by a mastering engineer, blending science and personal taste to produce a cohesive, balanced final piece. Due to the critical listening required, and depending on the engineer’s experience, mastering can be time-consuming and expensive.
In recent years, the development of automated mastering services has enabled artists with smaller budgets to access professional sounding masters without the need to pay for human engineers.
LANDR is a cloud-based music creation platform developed by MixGenius (an AI company based in Montreal). Its flagship product is its automated mastering service, first launched in 2014. The product remains in constant development through evolving AI, and its current iteration uses Synapse, the most sophisticated AI mastering engine yet:
“With years of research, 19 million mastered tracks and over 1 million hours of music, Synapse is the most sophisticated AI-powered mastering engine yet. Improved clarity, smarter compression and superior loudness give your music instant, professional polish at a price that works for your budget.”
CloudBounce is another automated online mastering service, founded in 2015 as a product of the Abbey Road Red Program. CloudBounce boasts one of the most advanced machine listening algorithms, and employs audio processing tools such as a compressor, EQ, limiter and stereo imaging “to make it sound powerful and crystal clear”.
Ozone is iZotope’s one-stop-shop plug-in for mastering music. It applies integrated machine learning along with extensive user input and customisation to produce industry-standard masters tailored to the artist’s preference.
iZotope’s latest iteration, Ozone 9, isn’t a solitary, simple plug-in, but rather a full suite of various tools typically present in a mastering studio.
The Future of Mastering
In the end, the future of mastering comes down to the musician. Different musicians may prioritise different things when it comes to obtaining a master. One of the most significant factors is turnaround time. With an automated mastering engine, a master of an individual track can be completed in 5-10 minutes. On the other hand, a manual master, done by a mastering engineer, can take anything from a few hours up to a few days.
Another significant factor is the personalisation and customisation of the master. With a human engineer, the artist can discuss their specifications and provide creative input. Although some tailoring options are available, automated mastering services do not currently provide the same level of customisation. It is worth noting, however, that as more tracks are mastered using automated services, the machine learning algorithms behind them gain experience and improve, much as a human engineer does. It is therefore all but certain that automated mastering services will keep improving.
Finally, one must consider price: with online mastering, you typically pay monthly or annually to access the service, which gives you a certain number of masters, or even an unlimited number, within that timeframe. With LANDR, $25 per month buys an unlimited number of mastered versions of your tracks. A studio-based mastering engineer, on the other hand, will typically charge $20 to $100 per track (excluding the top-notch engineers, who charge a significant premium).
At the end of the day, both are feasible options for obtaining masters, and there is no right or wrong. I think there is a future in both!
Three fellow students (Matt, Callum and Gareth) and I were tasked with producing a short podcast to showcase our ability to produce engaging media content. We chose to discuss artificial intelligence and its applications in the music industry.
We recorded the podcast using a Sennheiser MKH 416 P48 and a Rode NTG2 in the configuration shown below, connected to a laptop running Reaper through a Focusrite Scarlett 18i8 audio interface. A presenter was seated on either side of each boom microphone.
Our conversation covered four AI related talking points:
What is AI, and what is its history?
Applications of AI in music mastering
Applications of AI in music composition
The future of AI in music
My topic of expertise was the history of AI, and I briefly explained the principles behind how it works. I studied Electronic Engineering for my Bachelor’s degree, covering whole modules on the principles behind neural networks and machine learning.
We recorded a 25-minute conversation; however, some sections were not so relevant to the topic, and others didn’t flow well. We cut these sections out, using moments of silence as cut points so the edits would not sound abrupt. This reduced the podcast to just under 15 minutes.
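We found those cut points by ear in Reaper, but the same idea can be sketched programmatically: scan the recording in short windows and flag any whose RMS level falls below a threshold. The window length and threshold below are illustrative values, not the ones we used:

```python
import numpy as np

def find_silences(signal, sr, win=0.05, thresh_db=-50.0):
    """Return (start, end) times in seconds of windows whose RMS level
    falls below `thresh_db` - candidate cut points for editing."""
    n = int(sr * win)
    silences = []
    for i in range(0, len(signal) - n, n):
        rms = np.sqrt(np.mean(signal[i:i + n] ** 2))
        level = 20 * np.log10(rms + 1e-12)  # avoid log(0)
        if level < thresh_db:
            silences.append((i / sr, (i + n) / sr))
    return silences
```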
We compressed the recording from each microphone to reduce its dynamic range and increase clarity, and used EQ to remove some high-frequency noise present in the recordings.
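For readers unfamiliar with compression, here is a deliberately simplified Python sketch of the gain law a compressor applies: any level above the threshold is scaled down by the ratio. Real compressors, including the ones we used in Reaper, add attack and release smoothing, which this sketch omits:

```python
import numpy as np

def compress(signal, thresh_db=-20.0, ratio=4.0):
    """Memoryless compressor sketch: attenuate the portion of the level
    above `thresh_db` by `ratio`, reducing dynamic range."""
    level_db = 20 * np.log10(np.abs(signal) + 1e-12)
    over = np.maximum(level_db - thresh_db, 0.0)     # dB above threshold
    gain_db = -over * (1.0 - 1.0 / ratio)            # gain reduction
    return signal * 10 ** (gain_db / 20)
```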
We added the same music to both the beginning and end to provide a sense of continuity. We then added music throughout to distract from some slight noise present in the recording which we were unable to eliminate using EQ.
Binaural audio refers to audio captured and delivered in such a way that the listener hears the sound exactly as they would in the real world. Sound waves reach each of the listener’s ears at different times and with different volumes, and using this information the brain can localise the origin of the sound. The listener perceives sound emanating from all directions, with different sounds positioned at various locations within the space, giving an impression of multiple sound sources. Binaural audio technology enables the creation of such immersive spatial audio experiences for headphone users.
Recording Binaural Audio
Traditionally, recordings are made using either mono or stereo microphone techniques. Binaural recording systems instead emulate the physics of the human head, placing two microphones in ear-like cavities on either side of a dummy head. The dummy head replicates the density and shape of a human head, so the microphones capture the sound exactly as a human ear would. When a binaural recording is played back over headphones, there is a clear distinction between the left and right perspectives. The brain scrutinises these interaural differences and localises sound sources relative to the dummy head, perceiving a 3D soundscape.
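A true binaural recording captures the full head-related filtering of the dummy head, but the two strongest localisation cues, interaural time difference (ITD) and interaural level difference (ILD), can be faked in a few lines of numpy. The constants below (a maximum delay of roughly 0.7 ms and a level difference of up to about 6 dB between the ears) are rough illustrative values, not a real HRTF:

```python
import numpy as np

def binaural_pan(signal, sr, azimuth_deg):
    """Crude binaural placement using only interaural time and level
    differences (no HRTF filtering). Positive azimuth = source right."""
    az = np.radians(azimuth_deg)
    itd = 0.0007 * np.sin(az)            # up to ~0.7 ms delay between ears
    delay = int(abs(itd) * sr)           # delay in samples for the far ear
    far_gain = 10 ** (-6.0 * abs(np.sin(az)) / 20)  # up to ~6 dB quieter
    near = signal
    far = np.concatenate([np.zeros(delay), signal])[:len(signal)] * far_gain
    left, right = (far, near) if azimuth_deg > 0 else (near, far)
    return np.stack([left, right], axis=1)  # stereo (N, 2) array
```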
Binaural Technologies
Neumann KU 100 Dummy Head
“The KU 100 is a dummy head microphone for a truly immersive binaural listening experience with headphones. KU 100 recordings played back over high quality headphones thus give the listeners an experience almost identical to what they would have heard with their own ears at the recording position, with stunning lateral and vertical localization and a breathtaking sense of space and a room decay that surrounds the listener.”
dearVR MICRO – Sennheiser
“The free full version of dearVR MICRO enables you to position signals in any location left, right, above, below, in front, or behind your head – instead of only left or right.”
Binaural Mix
Below is my short binaural mix, produced in Logic Pro X, utilising Logic’s native binaural panning plug-in. The mix immerses the listener in an outdoor, natural woodland space. The audio stems were taken from the BBC Sound Effects Library and Splice.