75 lines
2.3 KiB
Markdown
75 lines
2.3 KiB
Markdown
|
|
# Conversational Speech generator tool
|
||
|
|
|
||
|
|
Tool to generate multiple-end audio tracks to simulate conversational speech
|
||
|
|
with two or more participants.
|
||
|
|
|
||
|
|
The input to the tool is a directory containing a number of audio tracks and
|
||
|
|
a text file indicating how to time the sequence of speech turns (see the Example
|
||
|
|
section).
|
||
|
|
|
||
|
|
Since the timing of the speaking turns is specified by the user, the generated
|
||
|
|
tracks may not be suitable for testing scenarios in which there is unpredictable
|
||
|
|
network delay (e.g., end-to-end RTC assessment).
|
||
|
|
|
||
|
|
Instead, the generated pairs can be used when the delay is constant (obviously
|
||
|
|
including the case in which there is no delay).
|
||
|
|
For instance, echo cancellation in the APM module can be evaluated using two-end
|
||
|
|
audio tracks as input and reverse input.
|
||
|
|
|
||
|
|
By indicating negative and positive time offsets, one can reproduce cross-talk
|
||
|
|
(aka double-talk) and silence in the conversation.
|
||
|
|
|
||
|
|
### Example
|
||
|
|
|
||
|
|
For each end, there is a set of audio tracks, e.g., a1, a2 and a3 (speaker A)
|
||
|
|
and b1, b2 (speaker B).
|
||
|
|
The text file with the timing information may look like this:
|
||
|
|
|
||
|
|
```
|
||
|
|
A a1 0
|
||
|
|
B b1 0
|
||
|
|
A a2 100
|
||
|
|
B b2 -200
|
||
|
|
A a3 0
|
||
|
|
A a4 0
|
||
|
|
```
|
||
|
|
|
||
|
|
The first column indicates the speaker name, the second contains the audio track
|
||
|
|
file names, and the third the offsets (in milliseconds) used to concatenate the
|
||
|
|
chunks. An optional fourth column contains positive or negative integral gains
|
||
|
|
in dB that will be applied to the tracks. It's possible to specify the gain for
|
||
|
|
some turns but not for others. If the gain is left out, no gain is applied.
|
||
|
|
|
||
|
|
Assume that all the audio tracks in the example above are 1000 ms long.
|
||
|
|
The tool will then generate two tracks (A and B) that look like this:
|
||
|
|
|
||
|
|
**Track A**
|
||
|
|
```
|
||
|
|
a1 (1000 ms)
|
||
|
|
silence (1100 ms)
|
||
|
|
a2 (1000 ms)
|
||
|
|
silence (800 ms)
|
||
|
|
a3 (1000 ms)
|
||
|
|
a4 (1000 ms)
|
||
|
|
```
|
||
|
|
|
||
|
|
**Track B**
|
||
|
|
```
|
||
|
|
silence (1000 ms)
|
||
|
|
b1 (1000 ms)
|
||
|
|
silence (900 ms)
|
||
|
|
b2 (1000 ms)
|
||
|
|
silence (2000 ms)
|
||
|
|
```
|
||
|
|
|
||
|
|
The two tracks can be also visualized as follows (one characheter represents
|
||
|
|
100 ms, "." is silence and "*" is speech).
|
||
|
|
|
||
|
|
```
|
||
|
|
t: 0 1 2 3 4 5 6 (s)
|
||
|
|
A: **********...........**********........********************
|
||
|
|
B: ..........**********.........**********....................
|
||
|
|
^ 200 ms cross-talk
|
||
|
|
100 ms silence ^
|
||
|
|
```
|