Modelling pitch reception with adaptive resonance theory artificial neural
networks.
Most modern pitch-perception theories incorporate a pattern-recognition
scheme to
extract pitch. Typically, this involves matching the signal to be classified
against a harmonic-series template for each pitch to find the one with the
best
fit. Although often successful, such approaches tend to lack generality
and may
well fail when faced with signals with much depleted or inharmonic components.
Here, an alternative method is described, which uses an adaptive resonance
theory
(ART) artificial neural network (ANN). By training this with a large number
of
spectrally diverse input signals, we can construct more robust pitch-templates
which can be continually updated without having to re-code knowledge already
acquired by the ANN. The input signal is Fourier-transformed to produce
an
amplitude spectrum. A mapping scheme then transforms this to a distribution
of
amplitude within 'semitone bins'. This pattern is then presented to an ARTMAP
ANN
consisting of an ART2 and ART1 unsupervised ANN linked by a map field. The
system
was trained with pitches ranging over three octaves (C[sub 3] to C[sub 6])
on a
variety of instruments and developed a desirable insensitivity to phase,
timbre
and loudness when classifying.
KEYWORDS: ART, ARTMAP, pitch perception, pattern recognition.
1. Introduction
This paper describes a computer system that models aspects of human pitch
perception using an adaptive resonance theory (ART) artificial neural network
(ANN). ART was introduced in order to analyze how brain networks can learn
about
a changing world in real time in a rapid but stable fashion. Here, ART will
be
used to self-organize musical pitch by using a supervised ANN called ARTMAP
(Carpenter et al., 1991a).
Section 2 briefly describes the auditory system and outlines the various
pitch-perception theories. Section 3 describes an ART system we have developed
that is capable of determining pitch on a wide variety of musical instruments.
Section 4 presents experiments that were undertaken using this system.
2. Aspects of Musical Pitch
Sound waves are conducted via the outer and middle ears to the basilar membrane
in the inner ear (cochlea). This membrane varies in mechanical structure
along
its length in such a way that a signal component of a particular frequency
will
cause a particular place on it to vibrate maximally. An array of neurons
along
the length of the membrane is thus able to convey a rough frequency analysis
to
the brain via the auditory nerve. Historically, therefore, pitch-perception
was
explained in terms of simple spectral cues by the 'place' theory (Helmholtz,
1863).
However, it was found that the pitch of complex periodic sounds corresponds
to
the fundamental frequency, independently of the presence or absence of energy
in
the sound spectrum at this frequency. Consequently, the 'periodicity' theory
(Schouten, 1940) was introduced, which explained perception of sounds in
terms of
the timing of nerve impulses. Later still, the inadequacy of these cochlea-based
explanations for pitch perception of complex tones was revealed by musical
intelligibility tests (Houtsma & Goldstein, 1972) which demonstrated
that the
pitch of complex tones made up of a random number of harmonics can be heard
equally well whether the subject is presented with them monotically (all
in one
ear) or dichotically (different harmonics sent to each ear). Therefore,
neither
energy at the fundamental frequency nor fundamental periodicities in the
cochlea
output are necessary for a subject to determine the pitch of a periodic
sound.
This implies that some pitch processing takes place at a higher level than
the
cochlea. Following this discovery, three pattern-recognition theories were
published that attempted to explain how the brain learned to extract pitch
from
complex sounds.
2.1. Pattern-recognition Theories
De Boer (1956) suggested how a template model could predict the pitch of
both
harmonic and inharmonic sounds. He argued that the brain, through wide contact
with harmonic stimuli, could be specifically tuned to such harmonic patterns.
Thus, an inharmonic stimulus could be classified by matching the best-fitting
harmonic template. For example, consider three components 1000, 1200 and
1400 Hz,
frequency-shifted by 50 Hz to give 1050, 1250 and 1450 Hz. The reported
resulting
pitch sensation for this set of components is 208 Hz. Similarly, using de
Boer's
harmonic template-matching scheme, the best-fitting pitch would also be
208 Hz,
where 1050, 1250 and 1450 Hz are the closest-fitting harmonics to the 5th,
6th
and 7th harmonics of 208 Hz, i.e. 1040, 1248 and 1456 Hz.
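The essence of this best-fit search is easily illustrated in code. The following Python sketch is our illustration only: the least-squares scoring and the restricted search grid are assumptions, not de Boer's actual formulation.

    # Hypothetical sketch: score candidate fundamentals by how closely a
    # harmonic-series template fits a set of component frequencies.
    def template_fit_error(f0, components):
        # Sum of squared deviations between each component and the
        # nearest harmonic of the candidate fundamental f0.
        error = 0.0
        for f in components:
            n = max(1, round(f / f0))       # nearest harmonic number
            error += (f - n * f0) ** 2
        return error

    components = [1050.0, 1250.0, 1450.0]   # the shifted components above
    # Search candidate fundamentals from 150 to 300 Hz in 0.1 Hz steps.
    best = min((k / 10.0 for k in range(1500, 3000)),
               key=lambda f0: template_fit_error(f0, components))
    print(round(best))   # 208: the components fit harmonics 5, 6 and 7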
The optimum-processor theory (Goldstein, 1973), the virtual-pitch theory
(Terhardt, 1972) and the pattern-transformation theory (Wightman, 1973)
are all
quantified elaborations on de Boer's ideas. Perhaps the most closely related
theory is the optimum-processor theory which explicitly incorporates these
ideas.
Goldstein includes a hypothetical neural network called the 'optimum-processor',
which finds the best-fitting harmonic template for the spectral patterns
supplied
by its peripheral frequency analyzer. The fundamental frequency is obtained
in a
maximum-likelihood way by calculating the number of harmonics which match
the
stored harmonic template for each pitch and then choosing the winner. The
winning
harmonic template corresponds to the perceived pitch.
Terhardt, in his virtual-pitch theory, distinguishes between two kinds of
pitch
mode: spectral pitch (the pitch of a pure tone) and virtual pitch (the pitch
of a
complex tone). The pitch percept governed by these two modes is described,
respectively, by the spectral-pitch pattern and the virtual-pitch pattern.
The
spectral-pitch pattern is constructed by spectral analysis, extraction of
tonal
components, evaluation of masking effects and weighting according to the
principle of spectral dominance. The virtual-pitch pattern is then obtained
from
the spectral-pitch pattern by subharmonic-coincidence assessment.
Terhardt considers virtual pitch to be an attribute which is a product of
auditory Gestalt perception. A virtual pitch can be generated only if a
learning process has previously taken place. In vision, Gestalt perception
invokes the law of closure to explain how we often supply missing information
to
'close a figure' (see Figure 1). Terhardt (1974) argues that, "merely
from the
fact that, in vision, 'contours' may be perceived which are not present, one
can conclude that, in hearing, 'tones' may be perceptible which are not
present,
either."
Wightman presents a mathematical model of human pitch perception called
the
pattern-transformation model of pitch. It was inspired by what appears to
be a
close similarity between pitch perception and other classic pattern-recognition
problems. In his analogy between pitch and character recognition, he describes
characters as having a certain characteristic about them, regardless of
size,
orientation, type style, etc. For example, the letter C, as it is seen here,
has its
C-ness in common with other Cs, whether it is written by hand, printed in
a
newspaper or anywhere else. Even though the letter style can vary greatly
it is
still recognized as the letter C. Wightman argues that in music this is
also
true, e.g. middle C has the same pitch regardless of the instrument which
produces it. Hence, he concluded that the perception of pitch is a
pattern-recognition problem.
In the pattern-transformation model, pitch recognition is regarded as a
sequence
of transformations, which produce different so-called 'patterns of neural
activity'. A limited-resolution spectrum (called a peripheral activity pattern),
similar to that produced in the cochlea, is created from the input stimuli.
This
is then Fourier transformed to compute the autocorrelation function, resulting
in
a phase-invariant pattern. The final stage of the model incorporates the
pitch
extractor, which, although not explicitly described, is realized by a
pattern-matching algorithm.
The three models described here have been demonstrated to be closely related
mathematically (de Boer, 1977). In his paper, de Boer demonstrated that
under
certain conditions, Goldstein's optimum-processor theory can give the same
pitch
predictions as both Wightman's pattern-transformation theory and Terhardt's
virtual-pitch theory. If the spread of a single-frequency component is
substantial, Goldstein's theory is equivalent to Wightman's theory, and
if the
spread is zero it predicts the same pitches as Terhardt's theory.
2.2. Implementations of Pattern-recognition Theories
Implementations have been published of Terhardt's virtual-pitch theory (Terhardt
et al., 1982) and Goldstein's central processor theory (Duifhuis et al.,
1982).
Terhardt et al. (1982) present an algorithm which effectively reproduces
the
virtual-pitch theory mechanisms for determining pitch. They use a fast Fourier
transform (FFT) to calculate the power spectrum of the sampled signal. This
is
then analyzed in order to extract the tonal components of importance to
the
pitch-determining process, which effectively cleans up the spectrum so that
just
high-intensity frequency components remain. Evaluation of masking effects
follows, resulting in the discarding of irrelevant frequency components
and the
frequency-shifting of others. Weighting functions are then applied which
control
the extent to which tonal components contribute towards the pitch-extraction
process. The subsequent spectral-pitch pattern is then processed by a method
of
subharmonic summation to extract the virtual pitch.
The implementation of Goldstein's optimum-processor theory (Duifhuis et
al.,
1982) is often called the 'DWS pitch meter' (Scheffers, 1983). This has
three
stages of operation. In the first stage, a 128-point FFT is computed, producing
a
spectrum of the acoustic signal. A 'component-finder' algorithm then finds
the
relevant frequency components and then by interpolation pinpoints the actual
frequency positions more precisely. The third stage consists of an
optimum-processor scheme which estimates the fundamental frequency whose
template
optimally fits the set of resolved frequency components.
An elaboration of the DWS meter published a year later (Scheffers, 1983) included
a
better approximation of auditory frequency analysis and a modification to
the
'best-fitting fundamental' procedure. Scheffers reported that errors induced
by
the absence of low harmonics were significantly reduced and noted that the
model produced the same results that were originally predicted by Goldstein
in
1973.
Terhardt's subharmonic-summation pitch-extracting scheme has been used in
a large
number of other computer implementations. Hermes (1988) presented an alternative
method of computing the summation by applying a series of logarithmic frequency
shifts to the amplitude spectrum and adding component amplitudes together
to
produce the subharmonic sum spectrum. Many speech-processing algorithms
also use
types of subharmonic summation, e.g. Schroeder (1968), Noll (1970) and
Martin
(1982). A number of frequency-tracking algorithms also use variations on
the
subharmonic summation scheme, e.g. Piszczalski and Galler (1982) and Brown
(1991,
1992).
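The core of subharmonic summation is compact enough to sketch directly. The following is a simplified Python illustration of the idea; the number of harmonics summed and the geometrically decaying weights are our assumptions, not the exact parameters of Hermes (1988).

    # Simplified sketch of subharmonic summation: a candidate fundamental
    # scores by the weighted spectral amplitude at its harmonic positions.
    def subharmonic_sum(spectrum, f0, n_harmonics=8, decay=0.84):
        # 'spectrum' maps frequency in Hz (rounded) to amplitude.
        total = 0.0
        for n in range(1, n_harmonics + 1):
            total += (decay ** (n - 1)) * spectrum.get(round(n * f0), 0.0)
        return total

    # A tone with harmonics at 400-1000 Hz but no energy at 200 Hz still
    # scores highest at the missing fundamental of 200 Hz.
    spectrum = {400: 1.0, 600: 0.9, 800: 0.8, 1000: 0.7}
    print(max([100.0, 200.0, 400.0],
              key=lambda f0: subharmonic_sum(spectrum, f0)))   # 200.0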
However, these implementations generally involve complex algorithms whose
success
is often qualified by the need to set parameters, in an ad hoc fashion,
so that
the results fit empirical data from psychoacoustic experiments. In this
paper, we
propose an alternative system which uses an ART neural network called ARTMAP
to
classify the pitch of an acoustic signal from a Fourier spectrum of harmonics.
In
effect, such a network can fit psychoacoustic data itself by associating
input
signals with desired output states.
2.3. The Use of ANNs for Pitch Classification
ANNs offer an attractive alternative approach to pitch determination
by
attempting to improve on the widely used harmonic-ideal template (i.e. the
harmonic series) matched against the input data to find the pitch template
with
the closest fit. Although such methods work well in general, there are musical
instruments which do not readily fit into the harmonic-ideal category, e.g.
those
which produce a much depleted or inharmonic set of spectral components.
Such
spectrally ambiguous patterns may well confuse systems which use simple
comparisons of this kind. Of course, such algorithms may be extended to
cater for
a greater variety of instruments which do not fit the harmonic ideal; but
the
process is by no means a simple one, involving further pitch analysis and
re-coding of the computer implementations.
ANNs, on the other hand, offer an original way of constructing a more robust
harmonic template. This is achieved by training the ANN with a wide variety
of
spectrally different patterns, so that the information relevant to the
pitch-determining process can be extracted. Through this interaction with
a large
variety of pitches taken from different musical instruments the ANN can
learn to
become insensitive to spectral shape and hence timbre when determining pitch.
There are two ways in which the conventional pitch-template pattern can
be
improved:
(1) by training an ANN with various spectral examples for a single pitch
and then
using this information to predict the harmonic template for any pitch;
(2) by training the ANN with various spectral examples taken from a wide
variety
of different pitches.
The second approach was chosen as likely to be the more thorough. This decision
was somewhat intuitive, inspired by the considerable variation in spectral
patterns at different fundamental frequencies. We did not directly test the
first
approach experimentally. Previous results (Taylor & Greenhough, 1993)
demonstrate
that the second training scheme can indeed out-perform template matching
and
subharmonic summation.
3. Outline for an Adaptive Resonance Theory System
The model can be divided into three stages as illustrated in Figure 2. The
first
stage performs a Fourier transform on the sampled waveform to produce an
amplitude spectrum. The second stage maps this representation to an array
of
'semitone bins'. Finally, this is presented to an ARTMAP network which learns
to
extract the pitch from the signals. It is essential to the learning process
that
the network acquires an insensitivity to other factors such as spectral
shape
(and hence timbre) and overall amplitude (and hence loudness). All stages,
including the ANN simulation, were written by us using Objective-C on a
NeXT
Workstation. The next three sections describe these stages in more detail.
3.1. Sampling and Spectrum Analysis
The NeXT Workstation's on-board sampling chip was used for the acquisition
of the
pitch examples. This chip has a 12-bit dynamic range and has a fixed sampling
rate of 8012.8 Hz, called the 'CODEC converter rate'. We found that this
was
sufficient for the current work but a more sophisticated chip, ideally with
a
sampling rate of 44.1 kHz and a 16-bit dynamic range, should be used in
a working
application. The CODEC converter rate, according to Nyquist's theorem, can
resolve frequencies up to about 4000 Hz. (Although the human ear can detect
frequencies up to around 20 kHz, any spectral energy above 4000 Hz is likely
to
have little effect on the pitch-determination process (see, for example,
Ritsma
(1962)).) An FFT algorithm was then applied to 1024 samples, producing an
amplitude
spectrum of the sampled data. The resulting frequency spacing of 7.825 Hz
(8012.8/1024) represents a somewhat limited resolution, but this was not
a
problem in the context of these experiments.
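In outline, this first stage amounts to the following. The sketch below uses Python and NumPy purely for illustration (our implementation was written in Objective-C); the parameters are those given in the text.

    # First stage: 1024 samples at the 8012.8 Hz CODEC rate are
    # Fourier-transformed to give an amplitude spectrum.
    import numpy as np

    FS = 8012.8   # CODEC converter rate (Hz)
    N = 1024      # samples per analysis frame

    def amplitude_spectrum(frame):
        # Returns (frequencies, amplitudes); bin spacing is FS/N = 7.825 Hz.
        amps = np.abs(np.fft.rfft(frame)) / N
        freqs = np.fft.rfftfreq(N, d=1.0 / FS)
        return freqs, amps

    # Example: a sinusoid near C[sub 4] peaks in the bin nearest 262 Hz.
    t = np.arange(N) / FS
    freqs, amps = amplitude_spectrum(np.sin(2 * np.pi * 262.0 * t))
    print(freqs[np.argmax(amps)])   # about 258 Hz, given 7.825 Hz spacing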
3.2. Mapping from the Fourier Spectrum to a 'Semitone Bin' Representation
Groups of frequency components lying within bandwidths of a semitone are
mapped
to individual 'semitone bins' of a representative intensity. Thus, the Fourier
spectrum is transformed into a distribution of intensity over the semitones
of
the chromatic scale. However, to do this one has to take into account that the
bandwidth of a semitone, in Hz, varies with the centre frequency. For example,
the semitone bandwidth around G#[sub 2] is about 7 Hz while that around
C[sub 8] is about 242 Hz. The mapping must nevertheless ensure that equal
spectral intensities produce equal activation levels in these two semitone
bins, despite their differing bandwidths.
The mapping scheme used here has strongly weighted connections to the area
within
a bandwidth of a semitone around the semitone's centre frequency and weaker
connections outside this area up until the neighbouring semitone's centre
frequency (Figure 3). These weaker connections enable the network to be
more
robust when presented with a mistuned note or harmonic. Figure 4 summarizes
the
actual operations which are performed to produce the required mappings.
The
restriction to just three octaves of a diatonic scale was a consequence
of there
being a considerable variety of network architectures to investigate with
only
limited processing power. A much finer-grained input mapping is possible
and
would allow, for example, an exploration of the subtle pitch-shifts observed
with
groups of inharmonically related frequency components.
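A minimal sketch of this mapping follows (in Python, for illustration). The exact strong and weak connection weights are not reproduced here; the 1.0 and 0.5 values below are assumptions standing in for the operations summarized in Figure 4.

    # Semitone-bin mapping: full weight within a quarter-tone (50 cents)
    # of a bin centre, half weight out to the neighbouring centre.
    import numpy as np

    C3 = 130.81                                        # Hz
    bin_centres = C3 * 2.0 ** (np.arange(60) / 12.0)   # C[sub 3] up to B[sub 7]

    def to_semitone_bins(freqs, amps):
        bins = np.zeros(60)
        for f, a in zip(freqs, amps):
            if f <= 0.0:
                continue
            for k, centre in enumerate(bin_centres):
                cents = abs(1200.0 * np.log2(f / centre))
                if cents <= 50.0:
                    bins[k] += a              # strong connection
                elif cents < 100.0:
                    bins[k] += 0.5 * a        # weak connection
        # A full implementation must also compensate for the varying
        # bandwidth (in Hz) of the bins, as discussed above.
        return bins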
The input to the ANN therefore consists of 60 nodes representing the 60 chromatic
semitone bins in the range C[sub 3] to B[sub 7], i.e. 131 Hz to 3951 Hz
(see
Figure 5). The ANN's output layer is used to represent the pitch names of
the
C-major scale in the range C[sub 3] to C[sub 6] (i.e. 131 Hz to 1047 Hz)
and
therefore consists of 22 nodes (see Figure 6).
Input patterns and corresponding expected output patterns are presented
to the
ARTMAP network in pairs. The goal of the network is to associate each input
pattern with its appropriate output pattern by adjusting its weights (i.e.
by
learning). The network in this experiment has a training and a testing phase.
In
the training phase the network learns the training patterns presented to
it,
while in the testing phase it is asked to classify patterns which it has
not seen
before. Optimally, the ARTMAP network should learn all the training patterns
and
also classify all the test patterns correctly by generalizing using knowledge
it
has learned in the training phase.
3.3. The ART Architecture and Its Use in Pitch Classification
ANNs are currently the subject of much theoretical and practical interest
in a
wide range of applications. One of the main motivations for the recent revival
of
computational models for biological networks is the apparent ease, speed
and
accuracy with which biological systems perform pattern recognition and other
tasks. An ANN is a processing device, implemented either in software or
hardware,
whose design is inspired by the massively parallel structure and function
of the
human brain. It attempts to simulate this highly interconnected, parallel
computational structure consisting of many relatively simple individual
processing elements. Its memory comprises nodes, which are linked together
by
weighted connections. The nodes' activation levels and the strengths of
the
connections may be considered to be short-term memory and long-term memory
respectively. Knowledge is not programmed into the system but is gradually
acquired by the system through interaction with its environment.
The brain has been estimated to have between 10 and 500 billion neurons,
or
processing elements (Rumelhart & McClelland, 1986). According to one
opinion
(Stubbs, 1988), the neurons are arranged into approximately 1000 main modules,
each having 500 neural networks consisting of 100 000 neurons. There are
between
100 and several thousand axons (from other neurons), which connect to the
dendrites of each neuron. Neurons either excite or inhibit neurons to which
they
are connected (Eccles' law).
ANNs, on the other hand, typically consist of no more than a few hundred
nodes
(neurons). The ANN pitch-determination system described in this paper consists
of
just 304 neurons (60 input neurons, 22 output neurons and 200 + 22 neurons
connecting to the input layer and the output layer, respectively), with
16 884
weighted connections. It should be emphasized therefore that, generally,
ANNs are
not implemented on the same scale as biological neural networks and that
ANNs are
not considered to be exact descriptions of mammalian brain neural networks.
Rather, they take into account the essential features of the brain that
make it
such a good information processor.
In our system, we have used an ANN architecture called ARTMAP to perform
pattern
recognition on the semitone-bin distribution. The introduction of ART (Grossberg,
1976a, b) has led to the development of a number of ANN architectures including
ART1, ART2, ART3 and ARTMAP (Carpenter & Grossberg, 1987a, b, 1990;
Carpenter et al., 1991b). ART networks differ in topology from most other
ANNs
by having a set of top-down weights as well as bottom-up ones. They also
have the
advantage over most others of an ability to solve the 'stability-plasticity'
dilemma: once the network has learned a set of arbitrary input patterns
it can cope with learning additional patterns without destroying previous
knowledge. Most other ANNs have to be retrained in order to learn additional
input patterns.
ART1 self-organizes recognition codes for binary input patterns. ART2 does
the
same for binary and analogue input patterns, while ART3 is based on ART2
but
includes a model of the chemical synapse that solves the memory-search problem
of
ART systems. ARTMAP is a supervised ANN which links together two ART modules
(ART[sub a] and ART[sub b]) by means of an 'inter-ART associative memory',
called
a map field. ART[sub a] and ART[sub b] are both unsupervised ART modules
(e.g.
ART1, ART2, ART3, ART2-A etc.).
Our system consists of an ART2 and an ART1 ANN for the ART[sub a] and ART[sub
b]
modules, respectively. Vector a[sub p] encodes the pitch information in
the form
of semitone bins and the vector b[sub p] encodes its 'predictive consequence'
which corresponds, in this case, to the conventional name of that pitch,
e.g.
A[sub 4], C[sub 3] etc. The next four sections describe the unsupervised
ART[sub
a] and ART[sub b] modules which were used in our system, followed by an
experiment which demonstrates why the supervised learning scheme adopted
was
needed. Finally, details of the ARTMAP system are given.
3.3.1. ART1. ART1 is an ANN which self-organizes, without supervision, a
set of
binary input vectors. Vectors which are similar enough to pass the so-called
'vigilance test' are clustered together to form categories (also known as
exemplars or recognition codes). ART1 has been described as having similar
properties to those of the single-leader sequential clustering algorithm
(Lippmann, 1987). Briefly, the leader algorithm selects the first input as the
leader of the first class. The next input is then compared to this leader
and the
distance is calculated between the vectors. If this distance is below a
threshold
(set beforehand), then this input is assigned to the first class, otherwise
it
forms a new class and consequently becomes the leader for that class. The
algorithm continues in this fashion until all patterns have been assigned
to a
class. The number of classes produced by the leader algorithm depends on
both the
threshold value chosen and the distance measure used to compare input to
class
leaders.
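For comparison, the leader algorithm itself can be written in a few lines. The sketch below is ours; in particular, the Euclidean distance measure is one choice among several possible metrics.

    # Single-leader sequential clustering as described above. Leaders are
    # never updated, so the classes depend on presentation order.
    import numpy as np

    def leader_cluster(patterns, threshold):
        leaders, assignments = [], []
        for p in patterns:
            for i, leader in enumerate(leaders):
                if np.linalg.norm(p - leader) < threshold:   # close enough
                    assignments.append(i)
                    break
            else:                    # no leader matched: start a new class
                leaders.append(p)
                assignments.append(len(leaders) - 1)
        return assignments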
The leader algorithm differs from ART1 in that it does not attempt to improve
on
its leading pattern for a class (i.e. the weight vector), which would make
the
system more tolerant to future input patterns. Thus, the classes produced
by the
leader algorithm can vary greatly depending on the input presentation order,
whereas in ART1 this is not the case.
The ART1 ANN consists of an attentional subsystem comprising an input layer
(the
F[sub 1] field) and an output layer (the F[sub 2] field), fully interconnected
by
a set of bottom-up and top-down adaptive weights, and an orienting subsystem
which incorporates the reset mechanism (Figure 7). The bottom-up weights
constitute ART1's long-term memory and the top-down weights store the learned
expectations for each of the categories formed in the F[sub 2] field. The
top-down weights are crucial for recognition-code self-stabilization.
In brief, a binary input pattern is presented to the F[sub 1] field. This
pattern
is then multiplied by the bottom-up weights in order to compute the F[sub
2]
field activation. A winning F[sub 2] node is chosen from the F[sub 2] field
by
lateral inhibition. The top-down weights from this winning node and the
F[sub 1]
field pattern then take part in the vigilance test, to check whether the
match
between the input pattern and the stored expectation of the winning category
is
higher than the set vigilance level. If it is, then learning takes place
by
adapting the bottom-up and top-down weights for the winning F[sub 2] node,
in
such a way as to make the correlation between these weights and the input
pattern
greater. If not, the winning F[sub 2] node is reset (i.e. it is eliminated
from
competition) and the algorithm searches for another suitable category.
Uncommitted nodes (i.e. those which have not undergone previous learning) will
always
accommodate the learning of a new pattern. The algorithm continues in this
way
until all input patterns have found suitable categories.
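This search-and-learn cycle can be condensed into a short sketch. The following simplified fast-learning ART1 step is an illustration: it keeps only the learned expectation vectors and omits the separate bottom-up update and initialization details of the full model.

    # Simplified fast-learning ART1 cycle for one binary input vector I.
    import numpy as np

    def art1_present(I, categories, rho, beta=1.0):
        # categories: list of binary weight vectors (learned expectations).
        # The choice function ranks categories; reset moves down the ranking.
        scores = [np.minimum(I, w).sum() / (beta + w.sum()) for w in categories]
        for j in sorted(range(len(categories)), key=lambda j: -scores[j]):
            w = categories[j]
            match = np.minimum(I, w).sum() / I.sum()
            if match >= rho:                        # vigilance test passed
                categories[j] = np.minimum(I, w)    # learn: intersect with I
                return j
        categories.append(I.copy())    # an uncommitted node always accommodates
        return len(categories) - 1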
3.3.2. ART2. The ART2 ANN, like ART1, is self-organizing and unsupervised.
It is
the analogue counterpart of ART1, designed for the processing of analogue as
well as binary input patterns. ART2 attempts to pick out and enhance similar
signals embedded in various noisy backgrounds. For this purpose, ART2's
feature
representation fields F[sub 0] and F[sub 1] include several pre-processing
levels
and gain-control systems (Figure 8).
ART2 is made up of two major components, common in the design of all ART
systems,
namely the attentional subsystem and the orienting subsystem. The attentional
subsystem consists of two input representation fields F[sub 0] and F[sub
1] and a
category representation field F[sub 2], which are fully connected by a bottom-up
and a top-down set of weights. As in ART1, the bottom-up weights represent
the
system's long-term memory and the top-down weights store the learned expectations
for each F[sub 2] category. The orienting subsystem interacts with the
attentional subsystem to carry out an internally controlled search process.
In short, an analogue input vector is passed to the F[sub 0] field. This
pattern
is normalized and then a threshold function is applied which sets to zero
any
part of the input pattern which falls below a set threshold value. This
pattern
is then renormalized to produce the input to the F[sub 1] field. The F[sub 1]
field
also incorporates normalization and threshold operations which are applied
before
the F[sub 2] activation is calculated by a feed-forward multiplication
of the
output of the F[sub 1] field and the bottom-up weights. The F[sub 2] field
then
undergoes a process of lateral inhibition to find the F[sub 2] node with
the
highest activity. A vigilance test then takes place to check whether the
top-down
learned expectation from the winning F[sub 2] node matches the input pattern
well
enough. If it does, the F[sub 1] calculations are iterated a number of times
and
then learning takes place between the F[sub 1] field and the winning F[sub
2]
node. If not, the F[sub 2] node is reset and another F[sub 2] node is chosen.
This repeats until the vigilance test is passed: either a committed node is
found whose stored expectations match the input sufficiently, or an
uncommitted node is chosen (for which learning can always take place).
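The F[sub 0] pre-processing stage, for instance, reduces to the following sketch (theta is the threshold parameter; the small guard constant against division by zero is an implementation assumption).

    # ART2 input pre-processing as described: normalize, zero any
    # sub-threshold components (noise suppression), then renormalize.
    import numpy as np

    def art2_preprocess(x, theta=0.1):
        v = x / (np.linalg.norm(x) + 1e-12)        # normalize
        v[v < theta] = 0.0                         # threshold function
        return v / (np.linalg.norm(v) + 1e-12)     # renormalize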
3.3.3. Why supervised learning is needed. For our implementation, a supervised
learning network was needed. We found that an optimum level of vigilance
could
not be found to give a satisfactory solution using the unsupervised ART
networks.
This is demonstrated in Figure 9. Here, an ART2 ANN was trained with 14
patterns,
7 examples of different spectral shapes taken from 7 different musical
instruments for C[sub 3] (patterns 0-6) and the same for C[sub 4] (patterns
7-13). The instruments were, respectively, a tenor voice singing la, a tenor
voice singing me, a contralto voice singing la, a steel-string guitar strummed
Sul Tasto (by the neck), a steel-string guitar strummed Sul Ponticello (by
the
bridge), a French horn and a piano. Figure 9 shows how these patterns were
clustered together at various levels of vigilance. For example, at vigilance
level 0.98 input patterns 0, 1, 5, 6, 7, 8, 9, 10, 11, 12 and 13 were clustered
to the F[sub 2] category 0 and patterns 2, 3 and 4 were clustered to the
F[sub 2]
node 1. The optimum level of vigilance would be one which separated the
patterns
from different octaves (e.g. the C[sub 3]s from the C[sub 4]s) while clustering
together as much as possible patterns which are from the same octave. This
would
allow the network to distinguish between pitches of different octaves and
also
give some insensitivity to timbre of input patterns having the same pitch
by
picking out common features within the internal structures of such patterns.
It can be seen from Figure 9 that there was no level of vigilance which
could
separate the pitch examples from the two octaves. Even at high levels of
vigilance (i.e. 0.99995), patterns 1 and 9 were clustered together (1 being
the
tenor voice example singing me of C[sub 3] and 9 being the contralto voice
example singing la of C[sub 4]) when all other patterns were associated
with
different output nodes. Thus, a more advanced network was needed to control
the
vigilance of the ART network so that it would cluster together only input
patterns which have the same output state (i.e. have the same pitch). ARTMAP
was
able to perform such an operation. In an ARTMAP network the map field controls
ART[sub a]'s vigilance parameter. It associates pairs of vectors presented to
ART[sub a] and ART[sub b] and, subject to a baseline vigilance, reduces
redundancy by clustering together as many of the patterns presented to
ART[sub a] as possible that have the same 'predictive consequence' at
ART[sub b]. The mechanisms of the ARTMAP network are described below.
3.3.4. ARTMAP. ARTMAP is a self-organizing, supervised ANN which consists
of two
unsupervised ART modules, ART[sub a] and ART[sub b], and an inter-ART associative
memory, called a map field (Figure 10). ART[sub a] and ART[sub b] are linked
by
fully connected adaptive connections between ART[sub a]'s F[sub 2] layer
and the
map field, and non-adaptive, bidirectional, one-to-one connections from
the map
field to ART[sub b]'s F[sub 2] layer. The ART[sub b] network self-organizes
the
'predictive consequence' or 'desired output' patterns for each input pattern
presented to ART[sub a].
Briefly, a pair of vectors a[sub p] and b[sub p] are presented to ART[sub
a] and
ART[sub b] simultaneously. The ART[sub a] and the ART[sub b] networks choose
suitable output (F[sub 2]) categories for these vectors. The map field then
checks to see if the ART[sub a]'s choice can correctly predict the choice
at
ART[sub b]. If it can, then outstar learning between F[sub a,2] and the map
field takes place, i.e. learning takes place between the map-field node
corresponding to the winning F[sub b,2] node and the F[sub a,2] pattern.
Connections to all other F[sub b,2] nodes are inhibited. If not, the map field
increases ART[sub a]'s vigilance so that ART[sub a] does not choose the same
F[sub 2] category again but searches on until a suitable F[sub a,2]
category is found. If there are no suitable categories ART[sub a] chooses
an
uncommitted node, in which case learning can always take place.
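In outline, the map field's control of ART[sub a] amounts to the following match-tracking loop. This is a simplified sketch: 'arta_search' is a hypothetical helper standing in for a full ART[sub a] search at a given vigilance, returning the winning category and its match value.

    # Simplified ARTMAP match-tracking loop for one (input, target) pair.
    def artmap_present(a_p, b_category, map_links, arta_search,
                       base_vigilance=0.9, eps=1e-4):
        rho = base_vigilance
        while rho <= 1.0:
            j, match = arta_search(a_p, rho)   # winning F[sub a,2] node
            predicted = map_links.get(j)       # its current map-field link
            if predicted is None or predicted == b_category:
                map_links[j] = b_category      # outstar association learned
                return j
            rho = match + eps                  # match tracking: raise vigilance
        return None                            # no category could predict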
4. Pitch Classification Experiments Using ARTMAP
4.1. Training the ARTMAP System
The supervised ARTMAP system consisting of an ART2, a map field and an ART1
ANN
was trained to recognize pitches from four different musical instruments.
Instruments that are bright or sharp sounding produce a spectrum with relatively
strong higher harmonics, whereas instruments with a mellower tone tend to
have a
strong fundamental but weak higher harmonics. Figure 11 illustrates this
by
showing the amplitude spectrum of three different sound sources. Each instrument
(or vocal) was asked to play (or sing) a note with a pitch of C[sub 4].
The top
spectrum is that of a clarinet and contains a full spectrum of harmonics
with
varying amplitudes. The middle spectrum was taken from a vocal singing la
and the
bottom spectrum is taken from a vocal singing me. It can be seen that spectra
vary greatly between different sound sources and therefore, to determine
pitches
on a variety of instruments, the network was required to learn features
from a
range of possible spectra for each pitch and thus become insensitive to
spectral
shape (and hence timbre) when classifying.
The network was trained with a total of 194 patterns, which were examples
from a
C-major scale between C[sub 3] and C[sub 6] on the four instruments. Samples
were
taken at various times (determined randomly) within the duration of the
note. The
training patterns were divided up as follows:
(1) 22 patterns (C[sub 3]-C[sub 6]) from a 'tubular-bells' sound on a commercial
synthesizer.
(2) 22 patterns (C[sub 3]-C[sub 6]) from a 'strings' sound on a commercial
synthesizer.
(3) 22 patterns taken from an upright piano recorded digitally via a Toshiba
EM-420 microphone placed in front of it (roughly at the place the pianist's
head
would be).
(4) 128 patterns taken from an electric guitar. More training patterns are
needed
for a guitar because the spectrum of a particular note varies with the string
it
is played on, and with where and how that string is plucked. These patterns
can
be divided into two sets, consisting, respectively, of examples played Sul
Tasto
and Sul Ponticello. Each of these sets consists of patterns taken from every
note
in the C-major scale between C[sub 3] and C[sub 6], played on all possible
strings. Patterns were presented in ascending order of pitch for each of
the
first three cases. However, it was felt appropriate to present the many
repeated
notes from the different strings of the guitar in a quasi-random order.
A slow
learning rate was used to train the network with the 194 input patterns.
One hundred cycles were performed, which took about half an hour on a NeXT
Workstation.
The
194 patterns were clustered to 46 categories at the F[sub a,2] layer and then
mapped to 22 nodes corresponding to the names of each pitch in the F[sub b,2]
layer. Thus, the ART[sub a] network achieved a code compression of the
input
patterns of greater than 4 to 1.
4.2. The Test Data
The network was tested using 412 patterns. 194 of these were taken from
the
original training set but were sampled at different points in time, thus
producing a spectrum which was in general rather different from the original
training pattern. In some cases, e.g. tubular bells, the difference can
be quite
considerable.
The other test patterns consisted of:
(1) 108 patterns from the guitar. These included plucking the notes halfway
between Sul Tasto and Sul Ponticello positions, playing each note Sul Tasto
with
vibrato, and playing each note Sul Ponticello with vibrato. Vibrato involves
a
frequency modulation of the order of a percent or so at a rate of a few
hertz and
is a common characteristic of musical tones. It is normally accompanied
by
amplitude modulation of a similar order of magnitude. Together they present
a
realistic challenge to a pitch-determining system.
(2) 22 tubular-bells patterns with each note played with clearly audible
amplitude modulation.
(3) 22 notes from the strings sound played with amplitude modulation.
(4) 22 notes played softly, and 22 notes played with the sustain pedal depressed,
on an acoustic piano.
(5) 22 notes played on the piano sound of a commercial synthesizer.
4.3. Results
Overall the ARTMAP network made mistakes in about 5% of cases. Around 70%
of
these mistakes, in turn, were 'octave' mistakes, by which we mean a
classification by the network one octave above or below the correct pitch.
For
example, an octave mistake that arose with a soft-piano pattern was:
Correct classification B[sub 3]:
(1) B[sub 4]: intensity 796;
(2) B[sub 3]: intensity 795;
(3) G[sub 3]: intensity 693.
where (1) is first choice, (2) is second choice, etc. and the intensity
(the
measure of the network's certainty) is a score scaled to 1000. Here, it
can be
seen that there is ambiguity in the network's choice. Its preference for
B[sub 4]
over B[sub 3] is so slight as to be insignificant compared with statistical
variations inherent in the method. The competition is particularly close
in this
example. However, it was generally the case that when the network did make
a
mistake the winning intensity was rather low in itself, or very close (within
50
out of 1000) to the correct classification. In the latter case the network
could
be adjusted to output a 'don't know'. This, however, would mildly impair
its
estimated accuracy in other respects as it sometimes made 'correct'
classifications which could be considered ambiguous, e.g. from the strings
sound:
Correct classification E[sub 3]:
(1) E[sub 3]: intensity 853;
(2) E[sub 4]: intensity 839;
(3) C[sub 3]: intensity 492.
Generally, in the ambiguous 'correct' classifications (about 4%) the runner-up
was an octave away from the winner. Most of the 'correct' classifications,
however, were clear cut:
Correct classification A[sub 3]:
(1) A[sub 3]: intensity 980;
(2) E[sub 4]: intensity 318;
(3) F[sub 5]: intensity 279.
The network made mistakes on only two of the data sets, namely the upright
piano
and the tubular-bells sets. Most of the piano mistakes were made when the
note
was hit softly. Inspecting the Fourier spectrum in each of these cases revealed
that the harmonic components were of very low amplitude and partly obscured
by
low-level noise. Indeed, checking the quality of the recorded notes did
reveal a
rather poor signal-to-noise ratio.
The network made the remaining mistakes on the tubular-bells sound. The
pitch of
such sounds is in any case more difficult for listeners to determine, because
the
components are to some considerable degree inharmonic. The network classified
86%
of test tubular-bells sounds correctly, which we regard as a reasonable
performance. Increasing the number of training examples of this sound would
probably improve this figure greatly. The network was able to classify all
but
one of the synthesized piano sounds correctly (over 95% accuracy). This
is to be
compared with the real piano sounds where only 86% accuracy was achieved.
This is
not surprising as synthesized sounds tend to vary less dramatically with
how they
are played, whereas the spectrum of an acoustic piano note is quite sensitive
to,
for example, the velocity of striking.
4.4. Conclusions
The ARTMAP network has proved capable of determining pitches with a high
degree
of reliability over a wide range. A significant proportion of the small
number of
mistakes made are readily explicable in terms of noise or spectral inharmonicity
and could be largely eliminated by straightforward improvements in the quality
and quantity of the training patterns. The 'octave' mistakes which predominate
are particularly interesting and significant. Such mistakes are common amongst
musically sophisticated human listeners and reflect the inherent similarity
of
tones whose fundamental frequencies can be expressed as a simple, whole-number
ratio. Indeed, it could be argued that such similarity is the very foundation
of
western harmony.
The indications are that the capability of the system may be extended to
a great
variety of instruments. This would result in a generalized pitch determiner
which
exhibited a desirable insensitivity to the dynamic level and spectral shape
of
the input signal.
5. Recent and Future Work
The ARTMAP system described here was written as one program (object) in
Objective-C on a NeXT Workstation. Recently, by using object-oriented techniques,
we have
created different classes for each ART network, e.g. for ART1, ART2, ART2-A
(Carpenter et al., 1991b), SART (a modification of ART2-A developed by Taylor
(1994)) and a map field, all of which can be dynamically created at run-time
(Taylor & Greenhough, 1993). This allows more complicated networks to
be built by
simply instantiating the relevant classes needed for an application. A general
ARTMAP network that can link any combination of the above networks together
has
also been constructed, allowing great flexibility when applying different
ARTMAP
topologies to particular applications. This has led to the use of ARTMAP
in other
music-pattern classification tasks such as coping with polyphonic acoustic
signals (Taylor et al., 1993).
Also recently, a comparison of performance of three different ANN architectures
for pitch determination has been made. These comprise back-propagation and two
distinct ARTMAP architectures, each containing a different ART[sub a] module:
ART2-A and SART, respectively.
Each ANN was trained with a total of 198 examples of pitched notes from
the
C-major scale on 10 different instruments in the range C[sub 3] to C[sub
6].
Instruments were chosen to cover a wide variety of spectral shapes so that
the
network could pick out the characteristic features for each pitched note,
and yet
acquire an insensitivity to timbre. These instruments included soprano,
contralto
and tenor voices, saxophone and French horn, violin, acoustic guitar and
sitar,
piano and some examples of whistling.
In order to assess the system's robustness, the majority of testing examples
were
chosen from different instruments but some were taken from instruments from
the
training set and sung or played with vibrato. The test set altogether consisted
of 443 patterns. The network was also tested on the 198 training patterns,
and
showed that it had learned all the pitch examples. Test-set instruments
included
alto and bass voices, classical, 12-string and electric guitar, recorder,
clarinet, mandolin and steel drums, as well as some synthetic sounds (produced
by
computer) which consisted of a harmonic series without a fundamental. Other
test
examples were: soprano, contralto (x2) and tenor (x2) voices either sung
with
vibrato or taken from different singers, saxophone and violin with vibrato,
and
piano played softly.
For this experiment, the back-propagation network was investigated for a
large
number of input-parameter combinations to optimize it fully (2500 different
conditions in all, taking the equivalent of about 12 days of Sun SPARC Station
time). Each ART network was also optimized by first setting the optimum
theta
(threshold) value and then tweaking its vigilance level which in both cases
took
considerably less time than the back-propagation network (around 4 hours
in the
ART2-A's case and less than 30 minutes in SART's case).
It was found that the best back-propagation network achieved an absolute-pitch
classification rate of 98.0% and a chroma-pitch classification rate (i.e.
ignoring the octave) of 99.1%. In the same trial, the two ARTMAP networks
achieved rates of 97.7% and 99.6% (for SART) and 98.2% and 99.8% (for ART2-A),
respectively. It was also found that for these cases the back-propagation
network
took over 8 minutes (on a NeXT Workstation) to learn the training data.
The
ART2-A network took just over 2 minutes while the SART network took less
than 20
seconds to learn the training data. It was also found that all ANNs performed
better than straightforward pattern-matching against stored templates which
achieved an absolute-pitch classification rate of 97.1% and a chroma-pitch
classification rate of 98.2% (Taylor & Greenhough, 1993).
We found that although the ANN-based ARTMAP pitch-determining system has to be
trained, this is well worthwhile, since training takes only about 2 minutes in
the ART2-A's case (and just 20 seconds in SART's case) and results in a more
robust
pitch-classification ability. The use of the back-propagation ANN, however,
would
have to be offset against the training time and optimization problems if
it is to
be considered for use in a pitch-determining system.
Future work includes the tracking of vocal and instrumental solos in the
presence
of an accompaniment. This requires the ANN to reject everything in the source
except the main melody. We believe that this can be accomplished by teaching
the
ANN what a particular instrument sounds like (i.e. its spectral characteristics
and thus its timbre) for each pitch. The network can then 'listen' for
recurrences of this sound in the music, and thus classify just the pitches
of the
instrument it has been trained for. Preliminary experiments with some musical
instruments have shown that one particular ANN network (ARTMAP) can indeed
be
trained to perform such classifications. Such tracking systems may eventually
find application in certain contemporary music performances involving the
interaction of human players and computer systems, and perhaps in
ethnomusicological studies where it is required to transcribe previously
unnotated live or recorded music. The latter application may require the
handling
of a pitch continuum, in which case the input mapping system will need to
be
refined, as suggested in Section 3.2.
ILLUSTRATION: Figure 1. Visual analogy of virtual pitch. The visual system
perceives contours which are not physically present (Terhardt, 1974).
DIAGRAM: Figure 2. The adaptive resonance theory system. A Fourier transform
is
performed on the input waveform to produce an amplitude spectrum. This is
mapped
to a 'semitone-bin' distribution which is then fed into the ART neural network
architecture called ARTMAP, which learns to extract pitch.
DIAGRAM: Figure 3. The mapping scheme from a Fourier spectrum to a distribution
of 'semitone bins'. There are strong connections for frequencies within
+/- 1/4-tone (approximately +/- (F[sub s] - F[sub s-1])/2) of the semitone
frequency F[sub s] and weaker ones within +/- 1/2-tone.
Figure 4. Operations used to map frequency components in the Fourier spectrum
to
a semitone-bin distribution.
Figure 5. The input-layer representation for the ARTMAP ANN.
Figure 6. The output-layer representation for the ARTMAP ANN.
DIAGRAM: Figure 7. An ART1 ANN, consisting of an F[sub 1] field (input layer),
an
F[sub 2] field (output layer), a reset mechanism and two sets of fully connected
weights (bottom-up and top-down).
DIAGRAM: Figure 8. A simplified version of the ART2 ANN architecture consisting
of an attentional subsystem, containing three fields, F[sub 0], F[sub 1]
and
F[sub 2] and two sets of adaptive weights (bottom-up and top-down) and an
orienting subsystem incorporating the reset mechanism. Large filled circles
(gain-control nuclei) represent normalization operations carried out by
the
network.
Figure 9. The category representations made by the ART2 ANN for 14 input
patterns
(7 examples taken from different instruments for the note C[sub 3] and the
same
for C[sub 4]) at 9 different levels of vigilance.
[Table omitted from Figure 9: for each of the nine vigilance levels (0.98,
0.99, 0.9975, 0.998, 0.999, 0.9995, 0.9998, 0.99991 and 0.99995), it lists
which of the 14 input patterns were clustered to each F[sub 2] category. At
0.98 only two categories form (patterns 0, 1 and 5-13; patterns 2-4); as
vigilance rises the clusters fragment until, at 0.99995, nearly every pattern
occupies its own category, with patterns 1 and 9 still sharing one.]
DIAGRAM: Figure 10. A predictive ART, or ARTMAP, system includes two ART
modules
linked by a map field. Internal control structures actively regulate learning
and
information flow. ART[sub a] generally self-organizes input data (e.g. pitch
information) and ART[sub b] self-organizes desired output states (e.g. pitch
names).
DIAGRAM: Figure 11. The amplitude spectrum of three different musical instruments
for the note C[sub 4]. The top spectrum is that of a clarinet, the middle
spectrum was taken from a vocal singing la and the bottom spectrum is taken
from
a vocal singing me. It can be seen that spectra vary greatly between different
sound sources.
References
Brown, J.C. (1991) Musical frequency tracking using the methods of conventional
and 'narrowed' autocorrelation. The Journal of the Acoustical Society of
America,
89, 2346-2354.
Brown, J.C. (1992) Musical fundamental frequency tracking using a pattern
recognition method. The Journal of the Acoustical Society of America, 92,
1394-1402.
Carpenter, G.A. & Grossberg, S. (1987a) A massively parallel architecture
for a
self-organizing neural pattern recognition machine. Computer Vision, Graphics,
and Image Processing, 37, 54-115.
Carpenter, G.A. & Grossberg, S. (1987b) ART2: Self-organization of stable
category recognition codes for analog input patterns. Applied Optics, 26,
4919-4930.
Carpenter, G.A. & Grossberg, S. (1990) ART3: Hierarchical search using
chemical
transmitters in self-organizing pattern recognition architectures. Neural
Networks, 3, 129-152.
Carpenter, G.A., Grossberg, S. & Reynolds, J.H. (1991a) ARTMAP: Supervised
real-time learning and classification of nonstationary data by a self-organizing
neural network. Neural Networks, 4, 565-588.
Carpenter, G.A., Grossberg, S. & Rosen, D.B. (1991b) ART2-A: An adaptive
resonance algorithm for rapid category learning and recognition. Neural
Networks,
4, 493-504.
De Boer, E. (1956) On the Residue in Hearing. Unpublished doctoral dissertation,
University of Amsterdam.
De Boer, E. (1977) Pitch theories unified. In E.F. Evans & J.P. Wilson
(Eds),
Psychophysics and Physiology of Hearing, pp. 323-334. London: Academic.
Duifhuis, H., Willems, L.F. & Sluyter, R.J. (1982) Measurement of pitch
in
speech: An implementation of Goldstein's theory of pitch perception. The
Journal
of the Acoustical Society of America, 71, 1568-1580.
Goldstein, J.L. (1973) An optimum processor for the central formation of
pitch of
complex tones. The Journal of the Acoustical Society of America, 54, 1496-1516.
Grossberg, S. (1976a) Adaptive pattern classification and universal recoding,
I:
Parallel development and coding of neural feature detectors. Biological
Cybernetics, 23, 121-134.
Grossberg, S. (1976b) Adaptive pattern classification and universal recoding,
II:
Feedback, expectation, olfaction, and illusions. Biological Cybernetics,
23,
187-202.
Helmholtz, H.L.F. von (1863) On the Sensations of Tone as a Physiological
Basis
for the Theory of Music. Translated by A.J. Ellis from the 4th German edition,
1877, Longmans, London, 1885 (reprinted, New York: Dover, 1954).
Hermes, D.J. (1988) Measurement of pitch by subharmonic summation. The Journal
of
the Acoustical Society of America, 83, 257-264.
Houtsma, A.J.M. & Goldstein, J.L. (1972) The central origin of the pitch
of
complex tones: Evidence from musical interval recognition. The Journal of
the
Acoustical Society of America, 51, 520-529.
Lippmann, R.P. (1987) An introduction to computing with neural nets, IEEE
ASSP
Magazine, 3, 4-22.
Martin, Ph. (1982) Comparison of pitch detection by cepstrum and spectral
comb
analysis. Proceedings of IEEE International Conference on Acoustics, Speech
and
Signal Processing, pp. 180-183. ICASSP-82.
Noll, A.M. (1970) Pitch determination of human speech by the harmonic product
spectrum, the harmonic sum spectrum, and a maximum likelihood estimate.
In the
Microwave Institute (Eds), Symposium on Computer Processing in Communication
19,
779-797, Polytechnic University of Brooklyn, New York.
Piszczalski, M. & Galler, B.F. (1982) A computer model of music recognition.
In
M. Clynes (Ed.), Music, Mind and the Brain: The Neuropsychology of Music,
pp.
321-351. London: Plenum Press.
Ritsma, R.J. (1962) Existence region of the tonal residue, I. The Journal
of the
Acoustical Society of America, 34, 1224-1229.
Rumelhart, D.E. & McClelland, J.L. (1986) Parallel Distributed Processing,
Explorations in the Microstructure of Cognition, Vol. 1: Foundations, Cambridge,
MA: MIT Press.
Scheffers, M.T.M. (1983) Simulation of auditory analysis of pitch: an elaboration
on the DWS pitch meter. The Journal of the Acoustical Society of America,
74,
1716-1725.
Schouten, J.F. (1940) The residue and the mechanism of hearing. Proceedings
Koninklijke Nederlandse Akademie van Wetenschappen, 43, 991-999.
Schroeder, M.R. (1968) Period histogram and product spectrum: new methods
for
fundamental-frequency measurement. The Journal of the Acoustical Society
of
America, 43, 829-834.
Stubbs, D. (1988) Neurocomputers, M.D. Computing, 5, 14-24.
Taylor, I. (1994) Artificial Neural Network Types for the Determination
of
Musical Pitch. PhD thesis, University of Wales, College of Cardiff.
Taylor, I. & Greenhough, M. (1993) An object-oriented ARTMAP system
for
classifying pitch. Proceedings of International Computer Music Conference,
pp.
244-247.
Taylor, I., Page, M. & Greenhough, M. (1993) Neural networks for processing
musical signals and structures. Acoustics Bulletin, 18, 5-9.
Terhardt, E. (1972) Zur Tonhoehenwahrnehmung von Klaengen II: ein
Funktionsschema. Acustica, 26, 187-199.
Terhardt, E. (1974) Pitch, consonance, and harmony. The Journal of the Acoustical
Society of America, 55, 1061-1069.
Terhardt, E., Stoll, G. & Seewann, M. (1982) Algorithm for extraction
of pitch
and pitch salience from complex tonal signals. The Journal of the Acoustical
Society of America, 71, 679-688.
Wightman, F.L. (1973) The pattern-transformation model of pitch. The Journal
of
the Acoustical Society of America, 54, 407-416.
BY IAN TAYLOR & MIKE GREENHOUGH
I. Taylor and M. Greenhough, Department of Physics and Astronomy, University
of Wales College of Cardiff, PO Box 913, Cardiff CF2 3YB, UK. E-mail:
ijt@cm.cf.ac.uk and greenhough@cardiff.ac.uk
From: Taylor, I. & Greenhough, M., Modelling pitch reception with adaptive
resonance theory artificial neural networks. Connection Science, Vol. 6,
01-01-1994, p. 135.