Voice Print with Voice Features and LSH

VXRL
Jul 26, 2024

Twitter/X: darkfloyd1014

“In this world, we always assume everything is legitimate and follow our thoughts. Unfortunately, it isn’t.”

The United States presidential election is approaching, and some notable fraud cases are coming with it. A fake, AI-generated voice of current President Joe Biden was found in some announcements. To be honest, it is not a new trick.

However, AI does help in this area by making the cloned voice nearly perfect. There is a company called ElevenLabs. I subscribed to its starter plan, and it is easy to add a voice by gathering public voice samples or recording them ourselves; the legitimate purposes are e-learning, marketing, and social media production.

Instant Voice Cloning: Add the Voice

We experimented with this news article in Japanese and gathered voice samples of our friend Jan Lam, a member of the “Soft and Hard” music group and current anchor of the early morning program 903 Clear Day.

(24日、パリ五輪サッカー男子1次リーグ 日本5―0パラグアイ)

ラフプレーもいとわない南米王者を相手に、日本は冷静にプレーすることをチームで共有していた。

光ったのは中盤の山本、藤田、三戸を中心とした「欧州組」のタフさだ。相手のタックルを恐れずにボールを受ける。

前半、相手が退場者を出したのも偶然ではない。「(ラフプレーは)欧州で慣れていた。反応しなければ、相手はイライラする」とオランダでプレーするFW斉藤。辛抱強くプレーし、チームで最も身長が低い164センチの三戸が2得点。屈強な相手を苦にしなかった

We shared some samples with Jan Lam (example: voice) in person, suggesting that he may need to consider securing intellectual property rights for his voice, and he replied with a “headache” icon. In other words, he has no idea how to control this and no way to do so.

Leading music and media companies, including Spotify and YouTube, have identified ways to detect music infringement in uploaded video or audio.

How can we generate a voiceprint and check whether that voiceprint appears in unauthorized media? Our approach follows this methodology:

  1. Extract the features of the voice
  2. Identify and produce the voiceprint in two parts: stable and variable
  3. Generate a Locality-Sensitive Hash (LSH) for a later similarity check

1. Extract Voice Features

We import the librosa library and extract the following features from the audio signal, each with a brief description:

Pitch: The average and standard deviation of the pitch values (librosa.piptrack).
Tone: The average and standard deviation of the tone values (tonal centroid features, librosa.feature.tonnetz).
Cadence: The average and standard deviation of the cadence values (tempo estimate in beats per minute, librosa.feature.tempo).
Resonance: The average and standard deviation of the resonance values (STFT magnitudes).
Articulation: The average and standard deviation of the articulation values (spectral centroid).
Accent: The average and standard deviation of the accent values (spectral bandwidth).
Rhythm: The average and standard deviation of the rhythm values (RMS energy).
Stress: The average and standard deviation of the stress values (spectral contrast).
Intonation: The average and standard deviation of the intonation values (spectral roll-off).
Skewness: The skewness of the audio signal (raw waveform samples).
Kurtosis: The kurtosis of the audio signal (raw waveform samples).
Peak count: The number of peaks in the audio signal (scipy.signal.find_peaks on the raw waveform).

import librosa
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import skew, kurtosis

def extract_features(file_path):
    # Load the audio file (librosa resamples to 22,050 Hz mono by default)
    signal, sr = librosa.load(file_path)

    # Extract pitch features
    # piptrack returns (freq_bins, frames) matrices that are zero in every bin
    # that is not a detected spectral peak, so these statistics include the zeros
    pitches, magnitudes = librosa.piptrack(y=signal, sr=sr)
    avg_pitch = np.mean(pitches)
    std_pitch = np.std(pitches)

    # Extract tone features (tonal centroid features)
    tone = librosa.feature.tonnetz(y=signal, sr=sr)
    avg_tone = np.mean(tone)
    std_tone = np.std(tone)

    # Extract cadence features
    # beat_track results are computed but not used below
    tempo, beats = librosa.beat.beat_track(y=signal, sr=sr)
    # A single global tempo estimate in BPM, so its standard deviation is 0.0
    cadence = librosa.feature.tempo(y=signal, sr=sr)
    avg_cadence = np.mean(cadence)
    std_cadence = np.std(cadence)

    # Extract resonance features (STFT magnitudes)
    freqs = np.abs(librosa.stft(signal))
    avg_resonance = np.mean(freqs)
    std_resonance = np.std(freqs)

    # Extract articulation features (spectral centroid)
    articulation = librosa.feature.spectral_centroid(y=signal, sr=sr)
    avg_articulation = np.mean(articulation)
    std_articulation = np.std(articulation)

    # Extract accent features (spectral bandwidth)
    accent = librosa.feature.spectral_bandwidth(y=signal, sr=sr)
    avg_accent = np.mean(accent)
    std_accent = np.std(accent)

    # Extract speech pattern features:
    # RMS energy, spectral contrast, and spectral roll-off
    rhythm = librosa.feature.rms(y=signal)
    stress = librosa.feature.spectral_contrast(y=signal, sr=sr)
    intonation = librosa.feature.spectral_rolloff(y=signal, sr=sr)
    avg_rhythm = np.mean(rhythm)
    std_rhythm = np.std(rhythm)
    avg_stress = np.mean(stress)
    std_stress = np.std(stress)
    avg_intonation = np.mean(intonation)
    std_intonation = np.std(intonation)

    # Extract additional features from the raw waveform samples
    skewness = skew(signal)
    kurtosis_value = kurtosis(signal)
    peak_count = len(find_peaks(signal)[0])

    # Return the extracted features
    return {
        'pitch': (avg_pitch, std_pitch),
        'tone': (avg_tone, std_tone),
        'cadence': (avg_cadence, std_cadence),
        'resonance': (avg_resonance, std_resonance),
        'articulation': (avg_articulation, std_articulation),
        'accent': (avg_accent, std_accent),
        'rhythm': (avg_rhythm, std_rhythm),
        'stress': (avg_stress, std_stress),
        'intonation': (avg_intonation, std_intonation),
        'skewness': skewness,
        'kurtosis': kurtosis_value,
        'peak_count': peak_count
    }

# Example usage
file_path = 'example_xxx.mp3'
features = extract_features(file_path)
for feature, values in features.items():
    print(f'{feature}: {values}')
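One caveat about the pitch numbers: the matrices returned by librosa.piptrack are zero in every bin that is not a detected spectral peak, so averaging the whole matrix mixes real pitch candidates with many zeros. A minimal sketch of one alternative, keeping only the strongest candidate in each frame, is shown below. The helper name strongest_pitch_stats and the toy matrices are hypothetical; in practice the two arguments would be the arrays returned by piptrack.

```python
from statistics import mean, pstdev


def strongest_pitch_stats(pitches, magnitudes):
    """Mean and population std of the strongest pitch candidate per frame.

    Both inputs are 2-D sequences shaped (freq_bins, frames), the layout
    returned by librosa.piptrack. Frames whose strongest bin has pitch 0
    are treated as unvoiced and skipped.
    """
    per_frame = []
    # zip(*matrix) walks the matrix column by column, i.e. frame by frame.
    for pitch_col, mag_col in zip(zip(*pitches), zip(*magnitudes)):
        best_bin = max(range(len(mag_col)), key=mag_col.__getitem__)
        if pitch_col[best_bin] > 0:
            per_frame.append(pitch_col[best_bin])
    if not per_frame:
        return 0.0, 0.0
    return mean(per_frame), pstdev(per_frame)


# Toy data (hypothetical): 3 frequency bins x 3 frames, middle frame silent.
toy_pitches = [[0.0, 0.0, 0.0], [110.0, 0.0, 130.0], [0.0, 0.0, 0.0]]
toy_magnitudes = [[0.1, 0.0, 0.2], [0.9, 0.0, 0.8], [0.0, 0.0, 0.1]]

naive_mean = mean(v for row in toy_pitches for v in row)
print("mean over all bins:", round(naive_mean, 2))
print("strongest-per-frame (mean, std):",
      strongest_pitch_stats(toy_pitches, toy_magnitudes))
```

On the toy data the all-bins mean is pulled far below the two real pitch values, which is the same effect that makes the pitch averages in the results look low for a human voice.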

We have carried out a simple test for two example reporters; these are the results from our voice samples. The first part of the result is for Reporter “M,” and the second part is for Reporter “J.” Both work at leading media companies and have been reporting news for years.

We took a news excerpt about CrowdStrike, added the above reporters’ voices, produced the voice recording samples, and ran voicefeature.py to extract their voice features, as shown below:

CrowdStrike has published a post incident review (PIR) of the buggy update it published that took down 8.5 million Windows machines last week. The detailed post blames a bug in test software for not properly validating the content update that was pushed out to millions of machines on Friday. CrowdStrike is promising to more thoroughly test its content updates, improve its error handling, and implement a staggered deployment to avoid a repeat of this disaster. Shares in cybersecurity firm CrowdStrike have plunged more than 13% as the company reckons with the aftermath of causing a global IT outage.
Chief security officer Shawn Henry said the incident had been a "gut punch" for the firm, which had previously been one of the most trusted names in the industry.
"We let down the very people we committed to protect, and to say we’re devastated is a huge understatement," he said.
Many businesses are still recovering after a faulty "content update" last week crashed 8.5 million Microsoft Windows computers around the world.
C:\Users\hacker\Downloads\elevenlabs>python voicefeature.py
pitch: (36.53258, 286.47055)
tone: (0.006881430642110973, 0.10962254489555823)
cadence: (107.666015625, 0.0)
resonance: (0.23388442, 1.1635059)
articulation: (2111.375471239768, 1245.6204469669924)
accent: (1940.907835711859, 625.6070902853011)
rhythm: (0.035560157, 0.023899402)
stress: (22.27156976318561, 12.17001492882612)
intonation: (3827.7638095609095, 2300.3757454273295)
skewness: 0.6346163611970942
kurtosis: 4.095271054585651
peak_count: 111518

C:\Users\hacker\Downloads\elevenlabs>python text2speech.py
Audio stream saved successfully.

C:\Users\hacker\Downloads\elevenlabs>python voicefeature.py
pitch: (17.879423, 180.54068)
tone: (-0.0032661926663031656, 0.11338307019748853)
cadence: (123.046875, 0.0)
resonance: (0.56253284, 2.8386395)
articulation: (2167.783986424077, 1490.7466958581986)
accent: (1963.9484483029048, 650.7770598958132)
rhythm: (0.08746687, 0.05713356)
stress: (23.80613517804654, 12.556464310543193)
intonation: (3944.874079451224, 2445.8176474432753)
skewness: -0.71328553026434
kurtosis: 3.538731521979879
peak_count: 123775

C:\Users\hacker\Downloads\elevenlabs>

Based on the above features, we made a preliminary comparison with the help of Claude 3 Sonnet. We asked two simple questions: which voice sounds 1) sweeter and 2) more melodic? Going by the measurements, Reporter “M” is deemed a bit sweeter, and Reporter “J” has a more melodic and expressive delivery.

To compare their voiceprints and determine which voice might sound sweeter, we can analyze some of the key acoustic features:

Pitch: The pitch values for "M" (mean 36.53, standard deviation 286.47) are higher than those of "J" (mean 17.88, standard deviation 180.54). A higher pitch generally sounds brighter and more youthful, which could be perceived as sweeter.
Tone: Both voices have relatively low tone values (close to 0), indicating a neutral tone quality.
Cadence: "J" has a slightly higher cadence (123.05 beats per minute) compared to "M" (107.67 bpm), which could contribute to a more dynamic and lively delivery.
Resonance: "J" has higher resonance values (mean 0.56, standard deviation 2.84) compared to "M" (mean 0.23, standard deviation 1.16), suggesting a more resonant and possibly richer voice quality.
Articulation: Both voices have similar articulation values, indicating similar levels of precision in speech production.
Accent: The accent values are also similar for both voices.
Rhythm: "J" has a higher rhythm value (0.088) compared to "M" (0.036), suggesting a more rhythmic and expressive delivery.
Stress: The stress values are close for both voices, indicating similar stress patterns.
Intonation: "J" has a slightly higher intonation value (3944.87) compared to "M" (3827.76), which could contribute to a more melodic and expressive delivery.
Skewness and Kurtosis: These values provide information about the distribution of the acoustic features, but it's difficult to directly interpret their impact on perceived sweetness.
Considering these factors, the voice referred to as "M" may be perceived as a bit sweeter because of its higher pitch values, while "J" shows a slightly more dynamic cadence, higher resonance, a more rhythmic delivery, and potentially more expressive intonation patterns.
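To make this comparison less subjective, the mean values printed by voicefeature.py can also be compared directly. The sketch below ranks the features by how much the two reporters differ. The function name relative_difference and the metric itself (absolute difference divided by the larger magnitude) are our own assumptions for illustration, and the tone feature is left out because its means sit near zero with opposite signs.

```python
# Mean values copied from the voicefeature.py output above (rounded).
m_means = {
    "pitch": 36.53, "cadence": 107.67, "resonance": 0.234,
    "articulation": 2111.38, "accent": 1940.91, "rhythm": 0.0356,
    "stress": 22.27, "intonation": 3827.76,
}
j_means = {
    "pitch": 17.88, "cadence": 123.05, "resonance": 0.563,
    "articulation": 2167.78, "accent": 1963.95, "rhythm": 0.0875,
    "stress": 23.81, "intonation": 3944.87,
}


def relative_difference(a, b):
    """|a - b| divided by the larger magnitude; 0.0 means identical."""
    largest = max(abs(a), abs(b))
    return 0.0 if largest == 0 else abs(a - b) / largest


diffs = {name: relative_difference(m_means[name], j_means[name])
         for name in m_means}
# Feature names ordered from most different to most similar.
ranked = sorted(diffs, key=diffs.get, reverse=True)

for name in ranked:
    print(f"{name:13s} {diffs[name]:.2f}")
```

With these numbers, rhythm, resonance, and pitch differ the most, while articulation, accent, and intonation differ by only a few percent. Rhythm (RMS energy) and resonance (STFT magnitude) both scale with recording loudness, so part of that gap may reflect recording level rather than the voice itself.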

2. Voice Print: Stable and Variable

Alright, let us go back to the voiceprint. A voiceprint can be divided into stable and variable categories. For the stable voiceprint, the features are more or less similar across different recordings in the same language; for the variable voiceprint, the features depend on the speech and content.

Based on the voiceprint statistics provided, some features are more likely to remain consistent for the same speaker, even when speaking different content in the same language. In contrast, other features may vary more significantly.

Features that are likely to be similar/stable:

Pitch: The overall pitch statistics (mean and standard deviation of the pitch values) are an individual characteristic that tends to be consistent for a given speaker, regardless of the content being spoken.
Tone: The tone quality, reflected in the tone value, is an inherent feature of a speaker’s voice and is likely to remain similar across different speech content.
Resonance: The resonance values, which relate to the vocal tract configuration and voice quality, are generally consistent for an individual speaker.
Articulation: The articulation values, which measure the precision of speech production, are influenced by the speaker’s articulatory habits and are likely to be similar across different content.
Accent: The accent values, which capture the speaker’s regional or linguistic accent, are expected to be consistent for the same speaker and language.
Skewness and Kurtosis: These statistical measures of the distribution of acoustic features are likely to be similar for the same speaker, as they reflect inherent characteristics of the voice.

Features that may vary/be variable:

Cadence: The cadence or speech rate can vary depending on the content being spoken, the speaker’s emotional state, or the intended delivery style.
Rhythm: The rhythmic patterns can differ based on the linguistic content, such as the syllable structure, stress patterns, and phrasing.
Stress: The stress values, which measure the emphasis on certain syllables or words, can vary based on the specific content being spoken and the speaker’s intended emphasis.
Intonation: The intonation patterns, which reflect the melodic contours of speech, can differ based on the content, emotional context, and intended prosody.
Peak Count: The number of peaks in the audio signal can vary depending on the specific content being spoken, as it is influenced by factors like syllable count, pauses, and other speech characteristics.

In summary, features like pitch, tone, resonance, articulation, accent, skewness, and kurtosis are more likely to remain consistent for the same speaker, even with different speech content in the same language. On the other hand, features like cadence, rhythm, stress, intonation, and peak count may vary more significantly based on the specific content being spoken and the speaker’s intended delivery.
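The split can be turned into two flat vectors ready for hashing. Here is a minimal sketch, assuming the dictionary layout returned by extract_features; the names STABLE_KEYS, VARIABLE_KEYS, and flatten are hypothetical helpers, and the grouping simply mirrors the two lists above.

```python
STABLE_KEYS = ["pitch", "tone", "resonance", "articulation",
               "accent", "skewness", "kurtosis"]
VARIABLE_KEYS = ["cadence", "rhythm", "stress", "intonation", "peak_count"]


def flatten(features, keys):
    """Concatenate the selected features into one flat list of floats.

    (mean, std) pairs contribute two numbers, scalar features one.
    """
    vector = []
    for key in keys:
        value = features[key]
        if isinstance(value, (tuple, list)):
            vector.extend(float(v) for v in value)
        else:
            vector.append(float(value))
    return vector


# Reporter "M", values copied from the voicefeature.py output (rounded).
features_m = {
    "pitch": (36.53258, 286.47055),
    "tone": (0.0068814, 0.1096225),
    "cadence": (107.666015625, 0.0),
    "resonance": (0.23388442, 1.1635059),
    "articulation": (2111.3754712, 1245.6204470),
    "accent": (1940.9078357, 625.6070903),
    "rhythm": (0.035560157, 0.023899402),
    "stress": (22.2715698, 12.1700149),
    "intonation": (3827.7638096, 2300.3757454),
    "skewness": 0.6346164,
    "kurtosis": 4.0952711,
    "peak_count": 111518,
}

stable_m = flatten(features_m, STABLE_KEYS)
variable_m = flatten(features_m, VARIABLE_KEYS)
print(len(stable_m), len(variable_m))  # 12 9
```

Keeping the key order fixed matters: two voiceprints are only comparable if every position in the vector holds the same feature.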

3. Generate LSH and Similarity Check

We feed the numbers from Reporter “M” and Reporter “J” into the following LSH.py program to generate the hashes and carry out a similarity check. Here is a discussion of how the number of hash functions should be selected.

import numpy as np

# Define the feature vectors
m_stable_features = np.array([36.53258, 286.47055, 0.006881430642110973, 0.10962254489555823, 0.23388442, 1.1635059, 2111.375471239768, 1245.6204469669924, 1940.907835711859, 625.6070902853011, 0.6346163611970942, 4.095271054585651])
m_variable_features = np.array([107.666015625, 0.0, 0.035560157, 0.023899402, 22.27156976318561, 12.17001492882612, 3827.7638095609095, 2300.3757454273295, 111518])

j_stable_features = np.array([17.879423, 180.54068, -0.0032661926663031656, 0.11338307019748853, 0.56253284, 2.8386395, 2167.783986424077, 1490.7466958581986, 1963.9484483029048, 650.7770598958132, -0.71328553026434, 3.538731521979879])
j_variable_features = np.array([123.046875, 0.0, 0.08746687, 0.05713356, 23.80613517804654, 12.556464310543193, 3944.874079451224, 2445.8176474432753, 123775])

# Define the hash function
# Scale x by p, wrap it into [0, m), then return 0 for the lower half
# of that range and 1 for the upper half
def hash_function(x, p, m):
    return ((x * p) % m) // (m // 2)

# Define the LSH function
# Each hash value counts how many features land in the upper half
def lsh(feature_vector, p, m, num_hashes):
    hash_values = []
    for i in range(num_hashes):
        p_i = p[i]
        hash_value = 0
        for j in range(len(feature_vector)):
            hash_value += hash_function(feature_vector[j], p_i, m)
        hash_values.append(hash_value)
    return tuple(hash_values)

# Parameters for LSH
p = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] # Multipliers, one per hash function (only the first num_hashes are used)
m = 100 # Hash range
num_hashes = 3 # Number of hash functions


# Calculate LSH for stable features
m_stable_lsh = lsh(m_stable_features, p, m, num_hashes)
j_stable_lsh = lsh(j_stable_features, p, m, num_hashes)

# Calculate LSH for variable features
m_variable_lsh = lsh(m_variable_features, p, m, num_hashes)
j_variable_lsh = lsh(j_variable_features, p, m, num_hashes)

# Calculate LSH similarity: the fraction of hash positions where both signatures agree
def lsh_similarity(lsh1, lsh2):
    matches = sum(1 for x, y in zip(lsh1, lsh2) if x == y)
    return matches / len(lsh1)

stable_similarity = lsh_similarity(m_stable_lsh, j_stable_lsh)
variable_similarity = lsh_similarity(m_variable_lsh, j_variable_lsh)

# Print the results
print("M Stable Features LSH:", m_stable_lsh)
print("J Stable Features LSH:", j_stable_lsh)
print("M Variable Features LSH:", m_variable_lsh)
print("J Variable Features LSH:", j_variable_lsh)
print("Stable Features LSH Similarity:", stable_similarity)
print("Variable Features LSH Similarity:", variable_similarity)

Here is the result of the comparison. None of the three hash values match between the two reporters, in either the stable or the variable features, so the similarity computed by lsh_similarity is 0.0 in both cases (0 matches out of 3 hashes). This means their voices are pretty different from each other in both voiceprints:

C:\Users\hacker\Downloads\elevenlabs>python LSH.py
M Stable Features LSH: (1.0, 5.0, 2.0)
J Stable Features LSH: (7.0, 4.0, 6.0)
M Variable Features LSH: (0.0, 1.0, 3.0)
J Variable Features LSH: (1.0, 3.0, 2.0)
Stable Features LSH Similarity: 0.0
Variable Features LSH Similarity: 0.0
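For comparison, a more common LSH family for cosine similarity is random-hyperplane hashing (often called SimHash), where each hash bit records on which side of a random hyperplane the vector falls. The sketch below is not the method used in LSH.py; it is a minimal illustration under our own assumptions, namely the function names, the seed, the 64-bit signature length, and the signed log scaling used to stop large-valued features from dominating.

```python
import math
import random


def squash(vector):
    """Signed log scaling so large-valued features do not dominate (an assumption)."""
    return [math.copysign(math.log1p(abs(x)), x) for x in vector]


def make_hyperplanes(dim, num_bits, seed=42):
    """Draw num_bits random Gaussian normal vectors of length dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def simhash(vector, hyperplanes):
    """One bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def bit_agreement(sig_a, sig_b):
    """Fraction of signature positions on which both signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Stable feature vectors from LSH.py above (values rounded).
m_stable = [36.53258, 286.47055, 0.0068814, 0.1096225, 0.23388442, 1.1635059,
            2111.3754712, 1245.6204470, 1940.9078357, 625.6070903,
            0.6346164, 4.0952711]
j_stable = [17.879423, 180.54068, -0.0032662, 0.1133831, 0.56253284, 2.8386395,
            2167.7839864, 1490.7466959, 1963.9484483, 650.7770599,
            -0.7132855, 3.5387315]

planes = make_hyperplanes(dim=len(m_stable), num_bits=64)
sig_m = simhash(squash(m_stable), planes)
sig_j = simhash(squash(j_stable), planes)

print("bit agreement M vs J:", bit_agreement(sig_m, sig_j))
print("bit agreement M vs M:", bit_agreement(sig_m, sig_m))
```

With random hyperplanes, the probability that two vectors agree on a given bit is 1 - θ/π, where θ is the angle between them, so the bit agreement estimates angular similarity and more bits reduce the variance of that estimate. Any decision threshold would still need to be calibrated on real recordings.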

Summary

This is just a 101 experiment on how we produce a voiceprint for comparison. Audio fingerprinting is widely applied to detect copyrighted video and audio in the media and music industries. We could build our own voiceprint app for our family, friends, and co-workers: whenever a call comes in, we can check the caller's voice against the stored voiceprint to help confirm that the one we are talking to is not an AI-generated voice.

However, there are other ways to find out whether a voice or video is AI-generated. The adversary usually gathers samples from the public domain and social networks, and some may even compromise one’s internal network to get online meeting recordings. If in doubt, please call the person directly and verify how he/she replies via an alternative channel, which the adversary can rarely get hold of.

Reference

https://engineering.atspotify.com/2023/10/introducing-voyager-spotifys-new-nearest-neighbor-search-library/

https://www.usenix.org/conference/usenixsecurity23/presentation/deng-jiangyi-voiceprint

https://www.pinecone.io/learn/series/faiss/locality-sensitive-hashing/

Written by VXRL

VXRL Team was founded by a group of enthusiastic security researchers who provide information security services and contribute to the community. https://www.vxrl.hk
