Introduction
Musical instrument signals generally consist of a transient portion and a
steady-state or quasi-periodic portion. The transient part is usually the
attack of the signal, and the steady state is the portion that follows the
attack. When investigating time-variant signals it is critical to make use
of both time and frequency domain analysis techniques. Some important
features of musical signals include duration, amplitude modulation, pitch,
spectral harmonicity, spectral envelope, spectral centroid and the like.
Attack time in particular is considered a salient feature of musical timbre
(Eagleson and Eagleson 1947; Saldanha and Corso 1964; Elliot 1975) and has
been thought to be a dominant feature of musical instruments. However, it
has also been found that the attack time and the note-to-note transients of
a signal are neither sufficient nor necessary for recognizing musical
instruments (Kendall 1986). This controversial finding supports the
importance of the steady-state portion of a signal.

 

 

 

This chapter mainly describes the implementation of the signal processing
algorithms used in the software system for extracting features that depict
these transient and stationary characteristics in the frequency and time
domains. The frequency domain analysis section of this chapter is primarily
based on the discrete Fourier transform (DFT). The DFT-based spectral
analysis algorithms discussed include the short-time Fourier transform,
spectral centroid, spectral smoothness and the tracking of partials over
time. In the time domain analysis section I will mainly describe the
implementation of algorithms including pitch detection with interpolation
and period averaging based on the autocorrelation function. Other modules
discussed are amplitude envelope, amplitude modulation, attack time
computation and noise content analysis.


 

 

 

      Frequency Domain Analysis
        DFT and STFT
The spectral analysis part of feature extraction is primarily based on the
discrete Fourier transform (DFT). The continuous-time and discrete-time
versions of the Fourier transform are shown below.

(2.1)

(2.2)
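For reference, the two transforms can be written in their standard textbook
forms as follows. The symbols used here (x(t), x[n], f, k, N) are
assumptions and may differ from the notation of equations 2.1 and 2.2.

```latex
\begin{align}
X(f) &= \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt \\
X[k] &= \sum_{n=0}^{N-1} x[n]\, e^{-j 2\pi k n / N},
        \qquad k = 0, 1, \ldots, N-1
\end{align}
```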

To extract transitory spectral characteristics, the short-time Fourier
transform (STFT) was used (Allen 1977; Allen and Rabiner 1977). The basic
algorithm is as follows.

(2.3)

As seen in figure 2.1, the STFT can be described simply as windowing the
signal and taking the FFT of each windowed segment. Various window types
with different side-lobe and main-lobe characteristics are available in the
program. The Hamming window has been shown to work particularly well with
musical signals (De Poli, Piccialli and Roads 1991). See the appendix for
details regarding windowing and its side-lobe and main-lobe characteristics.

Figure 2.1 Short time Fourier transform and Spectral Peak Detection
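The windowing-then-transform idea can be sketched as below. This is a
minimal illustration, not the program's implementation: the function names
(`hamming`, `dft`, `stft`) and the parameters `frame_len` and `hop` are
hypothetical, and a direct O(N^2) DFT stands in for the FFT so that the
example needs only the standard library. In practice an FFT routine would
be used instead.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n (standard 0.54 / 0.46 coefficients)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft(frame):
    """Direct DFT of one frame; a stand-in for an FFT routine."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def stft(signal, frame_len, hop):
    """Return one complex spectrum per frame (hypothetical helper)."""
    window = hamming(frame_len)
    spectra = []
    # Slide the window over the signal in steps of `hop` samples.
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spectra.append(dft(frame))
    return spectra


# Usage: a sinusoid with exactly 8 cycles per 64-sample frame.
signal = [math.sin(2 * math.pi * 8 * t / 64) for t in range(256)]
spectra = stft(signal, frame_len=64, hop=32)
magnitudes = [abs(v) for v in spectra[0][:33]]  # bins from DC to Nyquist
peak_bin = magnitudes.index(max(magnitudes))
```

With a hop smaller than the frame length the frames overlap, which gives a
finer time resolution for observing transient behavior.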


 

 

 

        Spectral Peak Detection and Tracking
Pitched musical instruments display a high degree of harmonic spectral
quality when analyzed for frequency content. Most tend to have quasi-integer
harmonic relationships between spectral peaks and the fundamental frequency.
In voice, the spectral envelope displays mountain-like contours whose peaks
are known as formants. The locations of the formants distinctively describe
vowels. This is also evident in violins, but the number of formants is
greater and the formant locations change very little with time, unlike the
voice, which varies substantially for each vowel. Woodwinds such as the
bassoon and oboe, on the other hand, have fewer formants than the voice, but
tend to have stronger and clearer spectral contours that perceptually
characterize the woodwind family (Cook 1999). Generally, musical instruments
like the plucked string (figure 2.2) exhibit lower energy in the high
frequency bins. The higher partials normally have less energy and also die
out faster than lower ones over time.

Figure 2.2 Plucked string spectrum

Using the short time Fourier transform,
I have implemented a spectral peak detection and tracking method, extracting
quasi-integer related harmonics from the spectrum. The peak picking algorithm
takes into consideration magnitude and frequency information to select
the most prominent and harmonically behaving peaks. To help in the search
for spectral peaks, various threshold values are used as described below.


 

 

 

The spectral peak detection algorithm is divided into four main steps. The
first pass roughly locates possible peaks, where the roughness of the search
is controlled via a threshold value. The threshold value dictates the degree
of “peakiness” that is allowed for a local maximum to be considered a
possible peak. The second pass filters out peaks that may have been
erroneously selected in step 1. The third pass looks for any broken harmonic
sequence by analyzing the harmonic relationships of the currently selected
peaks. In this pass, peaks that may have been deleted or missed in the
previous two passes are inserted. The final pass performs a further harmonic
analysis of the selected peaks, ultimately leaving a set of peaks that are
most probably harmonics. A mean and scalable standard deviation error method
is applied to control inharmonicity.


 

Figure 2.3 Peak detection algorithm
          Step 1: Rough Peak Detection
In the rough peak detection algorithm, possible peaks are picked using
negative and positive slope threshold values to guide the selection process.
As shown in figure 2.4, the polarity of the slope of the spectrum is computed
from bin to bin (DC to Nyquist) using the basic assumption that a transition
from positive to negative slope calls for the possibility of a peak.

Figure 2.4 Rough search for peaks

The following conditions help in the selection of a peak:

    The slope must change polarity, positive to negative.
    The magnitude difference between the peak candidate and the current bin’s magnitude component (X[k]-X[k+4]) must be greater than a threshold value (see the example in figure 2.5).
    A new peak candidate search occurs only after there is a slope change from negative to positive and when a threshold value as shown in figure 2.6 is exceeded.

Refer to the flowcharts in the appendix for details.
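The three conditions above can be sketched roughly as follows. This is a
simplified illustration under stated assumptions and does not reproduce the
flowcharts in the appendix: the function name `rough_peaks` and the two
parameters `drop_threshold` and `rise_threshold` are hypothetical, and the
magnitude drop is checked against the next bin rather than a fixed offset.

```python
def rough_peaks(mag, drop_threshold, rise_threshold):
    """Return bin indices of rough peak candidates in a magnitude spectrum.

    mag            -- magnitudes from DC to Nyquist
    drop_threshold -- required fall from candidate to a later bin (assumed)
    rise_threshold -- required rise from the preceding valley (assumed)
    """
    peaks = []
    candidate = None   # bin index of the current peak candidate
    valley = mag[0]    # lowest magnitude seen since the last accepted peak
    for k in range(1, len(mag) - 1):
        valley = min(valley, mag[k])
        # Condition 1: slope changes polarity from positive to negative.
        is_local_max = mag[k - 1] < mag[k] >= mag[k + 1]
        # Condition 3: the rise out of the valley must exceed a threshold,
        # which rejects small transitional (noise) peaks.
        if is_local_max and mag[k] - valley > rise_threshold:
            if candidate is None or mag[k] > mag[candidate]:
                candidate = k
        # Condition 2: the spectrum must fall far enough below the candidate.
        if candidate is not None and mag[candidate] - mag[k + 1] > drop_threshold:
            peaks.append(candidate)
            candidate = None
            valley = mag[k + 1]
    return peaks


# Usage: two clear peaks (bins 2 and 7) and one small ripple (bin 4).
spectrum = [0, 1, 5, 1, 1.2, 1, 0.5, 4, 0.2, 0]
found = rough_peaks(spectrum, drop_threshold=2, rise_threshold=2)
```

Lowering either threshold makes the search rougher, so more local maxima
survive as candidates for the later steps to filter.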

 

Figure 2.5 Actual peak assessment
Figure 2.6 Transitional peaks (noise)

 

          Step 2: Prominent Peak Search
In step 2, prominent peaks are located from the set of potential peaks found
in step 1. The purpose is to filter out local peaks which may be present
between stronger partial candidates, as shown in figure 2.7. The search for
prominent peaks is done in the following way:

    The bin with the maximum magnitude is found.
    Relative to the position of the peak with maximum amplitude, peaks are analyzed moving towards DC.
    Relative to the position of the peak with maximum amplitude, peaks are analyzed moving towards the Nyquist frequency.

Figure 2.7 Prominent peak search

Local maxima or peaks are picked out using an adaptive threshold value that
reflects the relationship between a prominent peak (possible partial) and
its neighboring peaks, as shown in figure 2.7. For example, a 50% threshold
value requires neighboring peaks to have at least half the magnitude of the
prominent peak (possible partial). Refer to the appendix for details on the
algorithm.
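One possible reading of this search is sketched below. The details of the
adaptive threshold are given only in the appendix, so this sketch makes an
assumption: a neighboring peak is kept when its magnitude is at least
`ratio` times that of the most recently kept peak in the direction of
travel. The function name `prominent_peaks` and its parameters are
hypothetical.

```python
def prominent_peaks(bins, mags, ratio=0.5):
    """Filter step-1 candidates, keeping only prominent peaks.

    bins  -- candidate bin indices in ascending order
    mags  -- magnitude of each candidate
    ratio -- relative threshold, e.g. 0.5 for a 50% threshold (assumed)
    """
    if not bins:
        return []
    # Start from the candidate with the maximum magnitude.
    top = max(range(len(bins)), key=lambda i: mags[i])
    keep = [top]
    for step in (-1, 1):  # -1: towards DC, +1: towards Nyquist
        reference = mags[top]
        i = top + step
        while 0 <= i < len(bins):
            if mags[i] >= ratio * reference:
                keep.append(i)
                reference = mags[i]  # threshold adapts to the last kept peak
            i += step
    return [bins[i] for i in sorted(keep)]


# Usage: weak candidates at bins 15 and 25 sit between stronger partials.
selected = prominent_peaks([10, 15, 20, 25, 30, 40], [8, 1, 10, 2, 6, 4])
```

Adapting the reference as the search moves outward lets naturally weaker
high partials survive while small peaks between strong partials are dropped.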

 

 

 

          Step 3: Harmonic Break Search
The third step is called the harmonic break search. Here, I have tried to
determine whether some “potential partials” were deleted or missed in the
previous steps. This may occur when harmonically related peaks temporarily
have little energy or are simply much weaker than the stronger ones, but are
nevertheless harmonic. The harmonic break search is divided into the
following sub-routines:

    Analyze the harmonic relationship between the current partial candidates by computing the mean bin spacing between all prominent peaks.

    (2.4)

    Detect any harmonic breaks, or discontinuities, between prominent peaks.

    If discontinuities are found, go back to steps 1 and 2 and do a refined search for possible peaks between pairs of prominent peaks.

Figure 2.8 Harmonic break search

In the second sub-step of the harmonic break search, harmonic
discontinuities are detected using a pair of threshold values limiting the
range of harmonic deviation. Hence, the algorithm expects the possibility of
a peak within the threshold bounds computed in sub-step 2 (figure 2.8). Refer
to the appendix for more details on the algorithm.
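The first two sub-routines can be sketched as follows. This is an
illustration under assumptions rather than the appendix algorithm: the pair
of threshold values is modeled as a single symmetric `tolerance` around the
mean bin spacing, and the function name `find_harmonic_breaks` is
hypothetical. Each reported window is the bin range in which a refined
search (steps 1 and 2 again) would look for a missing peak.

```python
def find_harmonic_breaks(peak_bins, tolerance=0.25):
    """Return the mean bin spacing and search windows for missing peaks.

    peak_bins -- prominent peak bins in ascending order (at least two)
    tolerance -- allowed relative deviation from the mean spacing (assumed)
    """
    pairs = list(zip(peak_bins, peak_bins[1:]))
    spacings = [b - a for a, b in pairs]
    mean_spacing = sum(spacings) / len(spacings)
    lower = mean_spacing * (1 - tolerance)
    upper = mean_spacing * (1 + tolerance)
    windows = []
    for a, b in pairs:
        # A gap wider than the upper bound suggests a missing partial.
        if b - a > upper:
            windows.append((a + lower, a + upper))
    return mean_spacing, windows


# Usage: the partial expected near bin 40 is missing.
mean_spacing, windows = find_harmonic_breaks([10, 20, 30, 50, 60])
```

Note that the gaps themselves inflate the mean spacing (12.5 here instead of
10), which is one reason a tolerance band is needed around it.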


 

 

 

          Step 4: Harmonicity Analysis
Finally, in step 4 an overall harmonicity verification is performed. In this
last step, the first few peaks (selectable in software) are used as a guide
to determine the final set of partials. The first few peaks of the spectrum
are chosen because in highly pitch-salient signals the lower harmonics are
usually stronger and more stable than the higher ones.

The idea is to use the Gaussian normal distribution, employing the mean,
variance and standard deviation, to eliminate inharmonic or misbehaving
partials. A peak that lies outside a left and right threshold bound is
considered inharmonic and misbehaving. A mean bin spacing value denoting the
bin distance between neighboring peak candidates is computed to render the
variance and standard deviation. As the lower partials generally tend to be
more stable and have more energy, the first K (K: integer > 0) peaks are used
for the computation of the standard deviation. A scaled version of the
standard deviation is then used as a criterion for evaluating the
inharmonicity of each partial candidate. The scaled standard deviation is
increased or decreased to control the permitted spread of each peak. In other
words, the scaled standard deviation directly determines the amount of
inharmonicity tolerated when selecting the final set of peaks. The scalar
that controls the scaled standard deviation is a value between 0 and 1, where
1 is equivalent to limiting the peaks to the original un-scaled standard
deviation.

This method is implemented by computing an ideal sequence of harmonics using
the data acquired above. Hence the ideal harmonic series is a sequence of
partials as shown below.

(2.5)

The ideal set of harmonics and the actual set of harmonics are compared, and
the error (equation 2.6) for each peak is computed and verified against the
scaled standard deviation for final assessment. Peaks that have excessive
error values are deleted from the final set of peaks, and the remaining ones
are finally considered harmonics. See the appendix for more details on the
algorithm.

(2.6)

Equation 2.6 shows the error between the ideal and actual bins, where M is
the number of ideal peaks and N is the number of actual peaks in the
spectrum. M and N may have different values, as partials may be missing from
the actual set of peaks.
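A minimal sketch of this verification is given below. It rests on several
assumptions that the text does not confirm: the ideal harmonic m is placed
at m times the mean bin spacing, the error of a peak is its distance to the
nearest ideal harmonic, and the population standard deviation is used. The
function name `harmonicity_filter` and the parameters `k_first` and `scale`
are hypothetical.

```python
import statistics


def harmonicity_filter(peak_bins, k_first=4, scale=1.0):
    """Keep only peaks that lie close to an ideal harmonic series.

    peak_bins -- candidate peak bins in ascending order
    k_first   -- number of low peaks used as the guide (needs at least 2)
    scale     -- scalar between 0 and 1 applied to the standard deviation
    """
    guide = peak_bins[:k_first]
    spacings = [b - a for a, b in zip(guide, guide[1:])]
    mean_spacing = statistics.mean(spacings)
    sigma = statistics.pstdev(spacings)
    limit = scale * sigma  # permitted spread around each ideal harmonic
    kept = []
    for b in peak_bins:
        # Nearest ideal harmonic number (assumed ideal bins: m * mean_spacing).
        m = max(1, round(b / mean_spacing))
        error = abs(b - m * mean_spacing)
        if error <= limit:
            kept.append(b)
    return kept


# Usage: bin 57 lies 3 bins away from the nearest ideal harmonic (60).
peaks = [10, 21, 30, 41, 50, 57, 70]
loose = harmonicity_filter(peaks, k_first=5, scale=1.0)
strict = harmonicity_filter(peaks, k_first=5, scale=0.5)
```

A smaller `scale` narrows the permitted spread, so slightly mistuned peaks
(bins 21 and 41 here) are rejected as well. If the guide peaks are perfectly
evenly spaced the standard deviation is zero and only exact matches pass, so
a practical version would likely need a minimum spread.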


 

 

 

          Partial Tracking between Frames
Once harmonics have been evaluated in each frame (a frame is equal to the
length of the FFT), they are combined to render a spectrogram. Frame-to-frame
partial movement is determined using a harmonic continuity criterion, as
shown in figure 2.9.

Figure 2.9 Partial tracking between frames

The harmonic continuity criterion is explained as follows. Each harmonic in a
frame is allowed to sway in frequency within a set of error margin values.
Hence, as shown in figure 2.9, four of the harmonics make a continuous
harmonic path (k, k+1, k+2, k+3). However, the harmonic in frame k+4 exceeds
the allowed error margin and breaks the previous harmonic path. At frame k+4
a new path is created and the path which started at frame k is discontinued.
The harmonic continuity criterion is helpful in observing movements of the
harmonics over time and frequency.
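The continuity criterion for a single harmonic can be sketched as below.
This is a simplified illustration: the function name `track_partial` and the
single symmetric `margin` are assumptions, whereas the text describes a set
of error margin values.

```python
def track_partial(freqs_per_frame, margin):
    """Split one harmonic's frame-by-frame frequencies into continuous paths.

    freqs_per_frame -- frequency of the harmonic in successive frames
    margin          -- maximum allowed change between frames (assumed)
    Returns a list of paths, each a list of (frame_index, frequency).
    """
    paths = []
    current = []
    for frame, freq in enumerate(freqs_per_frame):
        # A jump larger than the margin ends the current path.
        if current and abs(freq - current[-1][1]) > margin:
            paths.append(current)
            current = []
        current.append((frame, freq))
    if current:
        paths.append(current)
    return paths


# Usage: frames 0 to 3 form one path; the jump at frame 4 starts a new one.
paths = track_partial([440, 442, 441, 443, 470, 471], margin=5)
```

This mirrors the situation in figure 2.9, where the path that started at
frame k is discontinued at frame k+4 and a new path begins there.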

Source: http://silvertone.princeton.edu/~park/thesis/dartmouth/html/ch2-1.html