CSIT December 2023 Mini Challenge: Audio pre-processing

The CSIT Mini Challenge is a monthly challenge organised by CSIT, aimed at raising awareness of CSIT's various tech focus areas. Participants earn a digital badge on completing the challenge, which they can feature on their social media platforms.

December Mini Challenge

This month’s Mini Challenge focuses on audio data pre-processing, an important part of the data science workflow. It involves cleaning and transforming raw data into a format suitable for analysis and modelling.

This improves the accuracy of models and reduces the risk of biased or unreliable results, allowing data scientists to obtain meaningful insights and make informed decisions.

We are given three tasks involving audio data pre-processing, in which we have to manipulate sound data to uncover hidden information. I used Jupyter Notebook for most of this Mini Challenge, as it's easy to use and the challenge mostly involves Python.

The challenge files can be downloaded here.

Task 1

Task 1 Description

In the first task, we are asked to use basic audio manipulation techniques to obtain a geographic location. The resources provided are as follows:

From the learning resources provided, it seems that we have to extract the audio into a 1-D sound array we can manipulate, reverse that array, speed it up or slow it down, and write it back to an audio file.

The following code can be used to obtain the result:

import soundfile as sf
import librosa
y, sr = librosa.load('Task_1/T1_audio.wav') # Obtain 1-D sound array and sampling rate
y_rev = y[::-1] # Reverse the 1-D sound array
rev_y_fast = librosa.effects.time_stretch(y_rev, rate = 2.0) # Double the speed of the audio
sf.write('output1.wav', rev_y_fast, sr) # Output the audio at the original sampling rate

The resultant audio file can be found here.
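The two core operations, reversal and speeding up, are easy to sanity-check on a toy array. Note that librosa's time_stretch preserves pitch, whereas the naive decimation shown here (a hypothetical shortcut, not what the solution above uses) raises the pitch by an octave:

```python
import numpy as np

# Toy stand-in for the 1-D sound array returned by librosa.load
y = np.arange(8, dtype=np.float32)

y_rev = y[::-1]      # reverse playback order
y_fast = y_rev[::2]  # naive 2x speed-up by dropping every other sample
```

Running this gives `y_rev = [7, 6, 5, 4, 3, 2, 1, 0]` and `y_fast = [7, 5, 3, 1]`, which is why time_stretch is the better choice when the pitch needs to stay intact.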

We are given the coordinates 67.9222N, 26.5046E. Googling them shows that they point to Lapland, a region in Finland known as the home of Santa Claus.

Task 2

Task 2 Description

In the second task, we are given four audio files which sound like snippets from popular music tracks, but with additional noises resembling bird sounds. The learning resources given are as follows:

The first resource covers several ways of visualising audio, while the second explains spectrograms in more detail. We can run the following code to generate and view a spectrogram of each audio file:

import matplotlib.pyplot as plt
import librosa.display

# Load each audio file and plot its spectrogram
for name in ['a', 'b', 'c', 'd']:
    x, sr = librosa.load(f'Task_2/T2_audio_{name}.wav')
    X = librosa.stft(x)                    # STFT = short-time Fourier transform
    Xdb = librosa.amplitude_to_db(abs(X))  # convert amplitude to decibels
    plt.figure(figsize=(14, 5))
    librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')  # plot the spectrogram
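As an aside, the amplitude_to_db step is essentially a 20·log10 conversion relative to a reference amplitude. A minimal NumPy sketch of the idea (librosa's actual implementation additionally clamps tiny values and can apply a top_db floor):

```python
import numpy as np

def amplitude_to_db_sketch(magnitude, ref=1.0, amin=1e-5):
    # Convert an amplitude (magnitude) spectrogram to decibels
    return 20.0 * np.log10(np.maximum(magnitude, amin) / ref)

mags = np.array([1.0, 0.1, 0.01])
db = amplitude_to_db_sketch(mags)  # 0 dB, -20 dB, -40 dB
```

Each factor of 10 in amplitude corresponds to 20 dB, which is why quiet details become visible on a dB-scaled spectrogram.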

We then obtain the following spectrograms:

From audio a:

Audio A interpretation

This spells “MINS”.

From audio b:

Audio B interpretation

This spells “NOON”.

From audio c:

Audio C interpretation

This spells “SEVEN”.

From audio d:

Audio D interpretation

This spells “PAST”.

Solution

Rearranging the words, we get “SEVEN MINS PAST NOON”, so our answer is simply 12:07.

Task 3

Task 3 Description

In the third task, we are given an audio file that is unintelligible. The learning resources provided are as follows:

We can first obtain a power spectrogram of the audio to see how it has been manipulated:

import numpy as np

y, sr = librosa.load('Task_3/C.Noisy_Voice.wav')
S = np.abs(librosa.stft(y))
fig, ax = plt.subplots()
img = librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max), y_axis='log', x_axis='time', ax=ax)
ax.set_title('Power spectrogram')
fig.colorbar(img, ax=ax, format="%+2.0f dB")

We obtain the following image:

Task 3 Power Spectrogram

Clearly, additional noise has been added above 2048Hz and below 20Hz. To hear the concealed audio, we need to filter out this unwanted noise, which we can do with the scipy library:

from scipy.io import wavfile
from scipy.fft import rfft, rfftfreq, irfft

samplerate, data = wavfile.read('Task_3/C.Noisy_Voice.wav')
N = len(data)
yf = rfft(data)
xf = rfftfreq(N, 1 / samplerate)

# The maximum frequency is half the sample rate
points_per_freq = len(xf) / (samplerate / 2)
target_idx = int(points_per_freq * 100) # target 100Hz frequency
target_idx2 = int(points_per_freq * 2000) # target 2000Hz frequency
yf[:target_idx] = 0 # Filter out all frequencies below 100Hz
yf[target_idx2:] = 0 # Filter out all frequencies above 2000Hz
new_sig = irfft(yf) # Apply inverse FFT
norm_new_sig = np.int16(new_sig * (32767 / new_sig.max())) # Normalise the signal to the int16 range
wavfile.write("clean.wav", samplerate, norm_new_sig) # Write the signal to a file
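Zeroing FFT bins works, but it can introduce ringing artifacts; a common alternative is a Butterworth band-pass filter from scipy.signal. A sketch using the same 100–2000 Hz band (the cutoffs come from reading the spectrogram, not from the task itself):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(data, samplerate, low_hz=100.0, high_hz=2000.0, order=6):
    # Design a band-pass filter and apply it forwards and backwards (zero phase)
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=samplerate, output="sos")
    return sosfiltfilt(sos, data)

# Toy check: 50 Hz hum (outside the band) plus a 440 Hz tone (inside it)
sr = 8000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 440 * t)
filtered = bandpass(sig, sr)
```

After filtering, the 50 Hz component is strongly attenuated while the 440 Hz tone passes through largely untouched.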

Upon listening to the produced audio file, it is too soft to be heard clearly, so we can normalise its volume with a code snippet adapted from the Stack Overflow resource:

from pydub import AudioSegment

def match_target_amplitude(sound, target_dBFS):
    change_in_dBFS = target_dBFS - sound.dBFS
    return sound.apply_gain(change_in_dBFS)

sound = AudioSegment.from_file("clean.wav", "wav")
normalized_sound = match_target_amplitude(sound, -20.0)
normalized_sound.export("normalizedAudio.wav", format="wav")
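pydub's apply_gain works in decibels: a gain of g dB scales the raw samples by 10^(g/20). A quick NumPy illustration of that relationship (the sample values here are made up):

```python
import numpy as np

samples = np.array([0.10, -0.20, 0.05])
gain_db = 6.0
scaled = samples * 10 ** (gain_db / 20.0)  # +6 dB is roughly a doubling of amplitude
```

This is why match_target_amplitude only needs to compute the dB difference between the current and target loudness and apply it once.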

The audio obtained can be found here. Since it's not in English, we can use OpenAI's Whisper model to translate the audio to English.

After following the setup instructions on the Whisper GitHub page, we can run the following command in the terminal. Here, we use the large-v2 model, as recommended in the task description.

whisper normalizedAudio.wav --task translate --model large-v2

We obtain the following results:

Task 3 Whisper Result

These are lyrics from the song “Do You Want to Build a Snowman?”. We can guess that the answer object is a snowman.

Conclusion

I had a lot of fun with these challenges and enjoyed learning how to use the librosa, whisper and scipy libraries. I also appreciate the resources and hints provided, which helped us solve the challenges more efficiently.