Modelling pitch reception with adaptive resonance theory artificial neural
networks.

Most modern pitch-perception theories incorporate a pattern-recognition scheme to
extract pitch. Typically, this involves matching the signal to be classified
against a harmonic-series template for each pitch to find the one with the best
fit. Although often successful, such approaches tend to lack generality and may
well fail when faced with signals with much depleted or inharmonic components.
Here, an alternative method is described, which uses an adaptive resonance theory
(ART) artificial neural network (ANN). By training this with a large number of
spectrally diverse input signals, we can construct more robust pitch-templates
which can be continually updated without having to re-code knowledge already
acquired by the ANN. The input signal is Fourier-transformed to produce an
amplitude spectrum. A mapping scheme then transforms this to a distribution of
amplitude within 'semitone bins'. This pattern is then presented to an ARTMAP ANN
consisting of an ART2 and ART1 unsupervised ANN linked by a map field. The system
was trained with pitches ranging over three octaves (C[sub 3] to C[sub 6]) on a
variety of instruments and developed a desirable insensitivity to phase, timbre
and loudness when classifying.


KEYWORDS: ART, ARTMAP, pitch perception, pattern recognition.


1. Introduction


This paper describes a computer system that models aspects of human pitch
perception using an adaptive resonance theory (ART) artificial neural network
(ANN). ART was introduced in order to analyze how brain networks can learn about
a changing world in real time in a rapid but stable fashion. Here, ART will be
used to self-organize musical pitch by using a supervised ANN called ARTMAP
(Carpenter et al., 1991a).


Section 2 briefly describes the auditory system and outlines the various
pitch-perception theories. Section 3 describes an ART system we have developed
that is capable of determining pitch on a wide variety of musical instruments.
Section 4 presents experiments that were undertaken using this system.


2. Aspects of Musical Pitch


Sound waves are conducted via the outer and middle ears to the basilar membrane
in the inner ear (cochlea). This membrane varies in mechanical structure along
its length in such a way that a signal component of a particular frequency will
cause a particular place on it to vibrate maximally. An array of neurons along
the length of the membrane is thus able to convey a rough frequency analysis to
the brain via the auditory nerve. Historically, therefore, pitch-perception was
explained in terms of simple spectral cues by the 'place' theory (Helmholtz,
1863).


However, it was found that the pitch of complex periodic sounds corresponds to
the fundamental frequency, independently of the presence or absence of energy in
the sound spectrum at this frequency. Consequently, the 'periodicity' theory
(Schouten, 1940) was introduced, which explained perception of sounds in terms of
the timing of nerve impulses. Later still, the inadequacy of these cochlea-based
explanations for pitch perception of complex tones was revealed by musical
intelligibility tests (Houtsma & Goldstein, 1972) which demonstrated that the
pitch of complex tones made up of a random number of harmonics can be heard
equally well whether the subject is presented with them monotically (all in one
ear) or dichotically (different harmonics sent to each ear). Therefore, neither
energy at the fundamental frequency nor fundamental periodicities in the cochlea
output are necessary for a subject to determine the pitch of a periodic sound.
This implies that some pitch processing takes place at a higher level than the
cochlea. Following this discovery, three pattern-recognition theories were
published that attempted to explain how the brain learned to extract pitch from
complex sounds.


2. 1. Pattern-recognition Theories


De Boer (1956) suggested how a template model could predict the pitch of both
harmonic and inharmonic sounds. He argued that the brain, through wide contact
with harmonic stimuli, could be specifically tuned to such harmonic patterns.
Thus, an inharmonic stimulus could be classified by matching the best-fitting
harmonic template. For example, consider three components 1000, 1200 and 1400 Hz,
frequency-shifted by 50 Hz to give 1050, 1250 and 1450 Hz. The reported resulting pitch sensation for this set of components is 208 Hz. Similarly, using de Boer's harmonic template-matching scheme, the best-fitting pitch would also be 208 Hz: 1050, 1250 and 1450 Hz lie closest to the 5th, 6th and 7th harmonics of 208 Hz, i.e. 1040, 1248 and 1456 Hz.
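

As a concrete illustration, a minimal sketch of this best-fitting-template idea follows (the scoring function and search grid are our own illustrative choices, not de Boer's formulation): each candidate fundamental is scored by how far the observed components lie from its nearest harmonics.

    # Minimal sketch of harmonic template matching in the spirit of de Boer
    # (1956). The score measures how far each component lies, in units of the
    # candidate fundamental, from its nearest harmonic number.
    import numpy as np

    def best_fitting_fundamental(components_hz, f0_grid_hz):
        components = np.asarray(components_hz, dtype=float)
        best_f0, best_score = None, np.inf
        for f0 in f0_grid_hz:
            ratios = components / f0
            score = np.sum(np.abs(ratios - np.round(ratios)))
            if score < best_score:
                best_f0, best_score = f0, score
        return best_f0

    components = [1050.0, 1250.0, 1450.0]          # the frequency-shifted triad
    print(best_fitting_fundamental(components, np.arange(100.0, 400.0)))
    # -> 208.0: the components lie nearest the 5th, 6th and 7th harmonics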


The optimum-processor theory (Goldstein, 1973), the virtual-pitch theory
(Terhardt, 1972) and the pattern-transformation theory (Wightman, 1973) are all
quantified elaborations on de Boer's ideas. Perhaps the most closely related
theory is the optimum-processor theory which explicitly incorporates these ideas.
Goldstein includes a hypothetical neural network called the 'optimum-processor',
which finds the best-fitting harmonic template for the spectral patterns supplied
by its peripheral frequency analyzer. The fundamental frequency is obtained in a
maximum-likelihood way by calculating the number of harmonics which match the
stored harmonic template for each pitch and then choosing the winner. The winning
harmonic template corresponds to the perceived pitch.


Terhardt, in his virtual-pitch theory, distinguishes between two kinds of pitch
mode: spectral pitch (the pitch of a pure tone) and virtual pitch (the pitch of a
complex tone). The pitch percept governed by these two modes is described,
respectively, by the spectral-pitch pattern and the virtual-pitch pattern. The
spectral-pitch pattern is constructed by spectral analysis, extraction of tonal
components, evaluation of masking effects and weighting according to the
principle of spectral dominance. The virtual-pitch pattern is then obtained from
the spectral-pitch pattern by subharmonic-coincidence assessment.


Terhardt considers virtual pitch to be an attribute which is a product of
auditory Gestalt perception. The virtual pitch can be generated only if a
learning process has been undergone previously. In vision, Gestalt perception
invokes the law of closure to explain how we often supply missing information to
'close a figure' (see Figure 1). Terhardt (1974) argues that, "merely from the
fact that, in vision, 'contours' may be perceived which are not present one can
conclude that, in hearing, 'tones' may be perceptible which are not present,
either."


Wightman presents a mathematical model of human pitch perception called the
pattern-transformation model of pitch. It was inspired by what appears to be a
close similarity between pitch perception and other classic pattern-recognition
problems. In his analogy between pitch and character recognition, he observes that characters retain a recognizable identity regardless of size, orientation, type style etc. For example, the letter C, as it is seen here, has its C-ness in common with other Cs, whether written by hand, printed in a newspaper or rendered anywhere else. Even though the letter style can vary greatly, it is still recognized as the letter C. Wightman argues that in music this is also
true, e.g. middle C has the same pitch regardless of the instrument which
produces it. Hence, he concluded that the perception of pitch is a
pattern-recognition problem.


In the pattern-transformation model, pitch recognition is regarded as a sequence
of transformations, which produce different so-called 'patterns of neural
activity'. A limited-resolution spectrum (called a peripheral activity pattern),
similar to that produced in the cochlea, is created from the input stimuli. This
is then Fourier transformed to compute the autocorrelation function, resulting in
a phase-invariant pattern. The final stage of the model incorporates the pitch extractor which, although not explicitly described, is performed by a pattern-matching algorithm.
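

The phase-invariance step can be illustrated with a short sketch (our construction, not Wightman's implementation): by the Wiener-Khinchin relation, Fourier-transforming a power spectrum yields the signal's autocorrelation, so any phase differences between spectrally identical tones disappear.

    # Two tones with identical harmonic amplitudes but different component
    # phases yield the same autocorrelation, which peaks at the fundamental
    # period. Sampling parameters are illustrative; f0 sits on an exact FFT
    # bin so that spectral leakage does not blur the comparison.
    import numpy as np

    fs, n = 8000.0, 1024
    t = np.arange(n) / fs
    f0 = 250.0                                    # exactly 32 FFT bins

    x_zero_phase = sum(np.sin(2 * np.pi * k * f0 * t) for k in (1, 2, 3))
    x_shifted = sum(np.sin(2 * np.pi * k * f0 * t + 0.7 * k) for k in (1, 2, 3))

    def autocorrelation(x):
        """Circular autocorrelation via the power spectrum (Wiener-Khinchin)."""
        return np.fft.irfft(np.abs(np.fft.rfft(x)) ** 2)

    print(np.allclose(autocorrelation(x_zero_phase), autocorrelation(x_shifted)))
    print(np.argmax(autocorrelation(x_zero_phase)[1:200]) + 1)  # -> 32 = fs / f0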


The three models described here have been demonstrated to be closely related
mathematically (de Boer, 1977). In his paper, de Boer demonstrated that under
certain conditions, Goldstein's optimum-processor theory can give the same pitch
predictions as both Wightman's pattern-transformation theory and Terhardt's
virtual-pitch theory. If the spread of a single-frequency component is
substantial, Goldstein's theory is equivalent to Wightman's theory, and if the
spread is zero it predicts the same pitches as Terhardt's theory.


2.2. Implementations of Pattern-recognition Theories


Implementations have been published of Terhardt's virtual-pitch theory (Terhardt
et al., 1982) and Goldstein's central processor theory (Duifhuis et al., 1982).


Terhardt et al. (1982) present an algorithm which effectively reproduces the
virtual-pitch theory mechanisms for determining pitch. They use a fast Fourier
transform (FFT) to calculate the power spectrum of the sampled signal. This is
then analyzed in order to extract the tonal components of importance to the
pitch-determining process, which effectively cleans up the spectrum so that just
high-intensity frequency components remain. Evaluation of masking effects
follows, resulting in the discarding of irrelevant frequency components and the
frequency-shifting of others. Weighting functions are then applied which control
the extent to which tonal components contribute towards the pitch-extraction
process. The subsequent spectral-pitch pattern is then processed by a method of
subharmonic summation to extract the virtual pitch.


The implementation of Goldstein's optimum-processor theory (Duifhuis et al.,
1982) is often called the 'DWS pitch meter' (Scheffers, 1983). This has three
stages of operation. In the first stage, a 128-point FFT is computed, producing a
spectrum of the acoustic signal. A 'component-finder' algorithm then finds the
relevant frequency components and then by interpolation pinpoints the actual
frequency positions more precisely. The third stage consists of an
optimum-processor scheme which estimates the fundamental frequency whose template
optimally fits the set of resolved frequency components.


An elaboration of the DWS meter, a year later (Scheffers, 1983), included a
better approximation of auditory frequency analysis and a modification to the
'best-fitting fundamental' procedure. Scheffers reported that errors induced by
the absence of low harmonics were significantly reduced and noted that their
model produced the same results that were originally predicted by Goldstein in
1973.


Terhardt's subharmonic-summation pitch-extracting scheme has been used in a large
number of other computer implementations. Hermes (1988) presented an alternative
method of computing the summation by applying a series of logarithmic frequency
shifts to the amplitude spectrum and adding component amplitudes together to
produce the subharmonic sum spectrum. Many speech-processing algorithms also use
types of subharmonic summation, e.g. Schroeder (1968), Noll, (1970) and Martin
(1982). A number of frequency-tracking algorithms also use variations on the
subharmonic summation scheme, e.g. Piszczalski and Galler (1982) and Brown (1991,
1992).
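

The common core of these methods can be sketched compactly (a simplified toy, with our own weighting and tolerance choices rather than any of the published algorithms): each candidate fundamental is credited with the amplitude of every spectral component lying near one of its harmonics, with higher harmonics weighted less.

    # Toy subharmonic summation. The decay factor, tolerance and candidate grid
    # are illustrative choices, not Hermes' (1988) exact algorithm.
    import numpy as np

    def subharmonic_sum(freqs_hz, amps, f0_grid_hz, n_harmonics=8, decay=0.85):
        scores = np.zeros(len(f0_grid_hz))
        for i, f0 in enumerate(f0_grid_hz):
            for n in range(1, n_harmonics + 1):
                # Triangular credit for components within ~3% of the nth harmonic.
                closeness = np.abs(freqs_hz / (n * f0) - 1.0)
                weight = np.clip(1.0 - closeness / 0.03, 0.0, None)
                scores[i] += decay ** (n - 1) * np.sum(weight * amps)
        return scores

    # A 'missing fundamental' tone: harmonics 2-5 of 200 Hz, nothing at 200 Hz.
    freqs = np.array([400.0, 600.0, 800.0, 1000.0])
    amps = np.ones(4)
    grid = np.arange(50.0, 450.0, 1.0)
    print(grid[np.argmax(subharmonic_sum(freqs, amps, grid))])  # -> 200.0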


However, these implementations generally involve complex algorithms whose success
is often qualified by the need to set parameters, in an ad hoc fashion, so that
the results fit empirical data from psychoacoustic experiments. In this paper, we
propose an alternative system which uses an ART neural network called ARTMAP to
classify the pitch of an acoustic signal from a Fourier spectrum of harmonics. In
effect, such a network can fit psychoacoustic data itself by associating input
signals with desired output states.


2.3. The Use of ANNs for Pitch Classification


ANNs offer an attractive alternative approach to pitch determination, one which attempts to improve on the widely used harmonic-ideal template (i.e. the harmonic series) that is matched against the input data to find the pitch template with the closest fit. Although such methods work well in general, there are musical
instruments which do not readily fit into the harmonic-ideal category, e.g. those
which produce a much depleted or inharmonic set of spectral components. Such
spectrally ambiguous patterns may well confuse systems which use simple
comparisons of this kind. Of course, such algorithms may be extended to cater for
a greater variety of instruments which do not fit the harmonic ideal; but the
process is by no means a simple one, involving further pitch analysis and
re-coding of the computer implementations.


ANNs, on the other hand, offer an original way of constructing a more robust
harmonic template. This is achieved by training the ANN with a wide variety of
spectrally different patterns, so that the information relevant to the
pitch-determining process can be extracted. Through this interaction with a large
variety of pitches taken from different musical instruments the ANN can learn to
become insensitive to spectral shape and hence timbre when determining pitch.
There are two ways in which the conventional pitch-template pattern can be
improved:


(1) by training an ANN with various spectral examples for a single pitch and then
using this information to predict the harmonic template for any pitch;


(2) by training the ANN with various spectral examples taken from a wide variety
of different pitches.


The second approach was chosen as likely to be the more thorough. This decision was somewhat intuitive, inspired by the considerable variation in the spectral patterns of notes with different fundamental frequencies. We did not directly test the first
approach experimentally. Previous results (Taylor & Greenhough, 1993) demonstrate
that the second training scheme can indeed out-perform template matching and
subharmonic summation.


3. Outline for an Adaptive Resonance Theory System


The model can be divided into three stages as illustrated in Figure 2. The first
stage performs a Fourier transform on the sampled waveform to produce an
amplitude spectrum. The second stage maps this representation to an array of
'semitone bins'. Finally, this is presented to an ARTMAP network which learns to
extract the pitch from the signals. It is essential to the learning process that
the network acquires an insensitivity to other factors such as spectral shape
(and hence timbre) and overall amplitude (and hence loudness). All stages,
including the ANN simulation, were written by us using Objective-C on a NeXT
Workstation. The next three sections describe these stages in more detail.


3.1. Sampling and Spectrum Analysis


The NeXT Workstation's on-board sampling chip was used for the acquisition of the
pitch examples. This chip has a 12-bit dynamic range and a fixed sampling
rate of 8012.8 Hz, called the 'CODEC converter rate'. We found that this was
sufficient for the current work but a more sophisticated chip, ideally with a
sampling rate of 44.1 kHz and a 16-bit dynamic range, should be used in a working
application. The CODEC converter rate, according to Nyquist's theorem, can
resolve frequencies up to about 4000 Hz. (Although the human ear can detect
frequencies up to around 20 kHz, any spectral energy above 4000 Hz is likely to
have little effect on the pitch-determination process; see, for example, Ritsma (1962).) An FFT algorithm was then applied to 1024 samples, producing an amplitude
spectrum of the sampled data. The resulting frequency spacing of 7.825 Hz
(8012.8/1024) represents a somewhat limited resolution, but this was not a
problem in the context of these experiments.
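

This stage reduces, in effect, to a few lines; the sketch below uses the parameters just quoted (the test tone is our addition):

    # Spectrum analysis with the paper's parameters: the 8012.8 Hz CODEC rate
    # and a 1024-point FFT give a 7.825 Hz bin spacing.
    import numpy as np

    FS, N = 8012.8, 1024
    t = np.arange(N) / FS
    signal = np.sin(2 * np.pi * 440.0 * t)             # illustrative A4 tone

    amplitude_spectrum = np.abs(np.fft.rfft(signal))   # bins 0 .. N/2
    bin_spacing = FS / N                               # 7.825 Hz
    peak_hz = np.argmax(amplitude_spectrum) * bin_spacing
    print(f"bin spacing {bin_spacing:.3f} Hz, peak near {peak_hz:.1f} Hz")

Note that the 440 Hz test tone is reported at 438.2 Hz, the nearest bin: the limited resolution mentioned above.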


3.2. Mapping from the Fourier Spectrum to a 'Semitone Bin' Representation


Groups of frequency components lying within bandwidths of a semitone are mapped
to individual 'semitone bins' of a representative intensity. Thus, the Fourier
spectrum is transformed into a distribution of intensity over the semitones of
the chromatic scale. However, to do this one has to take into account that a
bandwidth of a semitone, in Hz, varies with the centre frequency. For example,
the semitone bandwidth around the frequency point G#[sub 2] is 7 Hz while that around C[sub 8] is 242 Hz. The mapping must nevertheless produce the same intensity level in these two semitone bins when the underlying activation levels are the same.


The mapping scheme used here has strongly weighted connections within a bandwidth of a semitone around each semitone's centre frequency, and weaker connections outside this area extending to the neighbouring semitones' centre frequencies (Figure 3). These weaker connections enable the network to be more
robust when presented with a mistuned note or harmonic. Figure 4 summarizes the
actual operations which are performed to produce the required mappings. The
restriction to just three octaves of a diatonic scale was a consequence of there
being a considerable variety of network architectures to investigate with only
limited processing power. A much finer-grained input mapping is possible and
would allow, for example, an exploration of the subtle pitch-shifts observed with
groups of inharmonically related frequency components.


The input layer of the ANN therefore consists of 60 nodes representing the 60 chromatic
semitone bins in the range C[sub 3] to B[sub 7], i.e. 131 Hz to 3951 Hz (see
Figure 5). The ANN's output layer is used to represent the pitch names of the
C-major scale in the range C[sub 3] to C[sub 6] (i.e. 131 Hz to 1047 Hz) and
therefore consists of 22 nodes (see Figure 6).
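

A hedged sketch of this mapping follows. The paper specifies the two-level weighting only graphically (Figure 3), so the strong and weak weight values here, and the normalization, are our assumptions:

    # Map an amplitude spectrum onto 60 chromatic semitone bins (C3..B7): full
    # weight within a quarter-tone of each bin centre, a weaker weight out to
    # the neighbouring centres. Weight values (1.0/0.3) are illustrative.
    import numpy as np

    FS, N = 8012.8, 1024
    C3 = 130.81                                    # lowest bin centre (Hz)
    centres = C3 * 2.0 ** (np.arange(60) / 12.0)   # 60 semitone bins, C3..B7

    def to_semitone_bins(amplitude_spectrum, strong=1.0, weak=0.3):
        freqs = np.arange(len(amplitude_spectrum)) * FS / N
        # Distance of each FFT bin from each semitone centre, in octaves.
        dist = np.abs(np.log2(np.maximum(freqs, 1e-9)[None, :] / centres[:, None]))
        weights = np.where(dist < 1 / 24, strong,       # within +/- 1/4-tone
                  np.where(dist < 1 / 12, weak, 0.0))   # out to the neighbours
        bins = weights @ amplitude_spectrum
        return bins / (np.max(bins) + 1e-12)            # discard overall loudness

    spectrum = np.zeros(N // 2 + 1)
    spectrum[int(round(261.6 / (FS / N)))] = 1.0        # a lone C4 component
    print(np.argmax(to_semitone_bins(spectrum)))        # -> 12 (C4, 12 bins above C3)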


Input patterns and corresponding expected output patterns are presented to the
ARTMAP network in pairs. The goal of the network is to associate each input
pattern with its appropriate output pattern by adjusting its weights (i.e. by
learning). The network in this experiment has a training and a testing phase. In
the training phase the network learns the training patterns presented to it,
while in the testing phase it is asked to classify patterns which it has not seen
before. Optimally, the ARTMAP network should learn all the training patterns and
also classify all the test patterns correctly by generalizing using knowledge it
has learned in the training phase.


3.3. The ART Architecture and Its Use in Pitch Classification


ANNs are currently the subject of much theoretical and practical interest in a
wide range of applications. One of the main motivations for the recent revival of
computational models for biological networks is the apparent ease, speed and
accuracy with which biological systems perform pattern recognition and other
tasks. An ANN is a processing device, implemented either in software or hardware,
whose design is inspired by the massively parallel structure and function of the
human brain. It attempts to simulate this highly interconnected, parallel
computational structure consisting of many relatively simple individual
processing elements. Its memory comprises nodes, which are linked together by
weighted connections. The nodes' activation levels and the strengths of the
connections may be considered to be short-term memory and long-term memory
respectively. Knowledge is not programmed into the system but is gradually
acquired by the system through interaction with its environment.


The brain has been estimated to have between 10 and 500 billion neurons, or
processing elements (Rumelhart & McClelland, 1986). According to one opinion
(Stubbs, 1988), the neurons are arranged into approximately 1000 main modules,
each having 500 neural networks consisting of 100 000 neurons. There are between
100 and several thousand axons (from other neurons), which connect to the
dendrites of each neuron. Neurons either excite or inhibit neurons to which they
are connected (Eccles law).


ANNs, on the other hand, typically consist of no more than a few hundred nodes
(neurons). The ANN pitch-determination system described in this paper consists of
just 304 neurons (60 input neurons, 22 output neurons and 200 + 22 neurons
connecting to the input layer and the output layer, respectively), with 16 884
weighted connections. It should be emphasized therefore that, generally, ANNs are
not implemented on the same scale as biological neural networks and that ANNs are
not considered to be exact descriptions of mammalian brain neural networks.
Rather, they take into account the essential features of the brain that make it
such a good information processor.


In our system, we have used an ANN architecture called ARTMAP to perform pattern
recognition on the semitone-bin distribution. The introduction of ART (Grossberg,
1976a, b) has led to the development of a number of ANN architectures including
ART1, ART2, ART3 and ARTMAP (Carpenter & Grossberg, 1987a, b, 1990; Carpenter et al., 1991b). ART networks differ in topology from most other ANNs by having a set of top-down weights as well as bottom-up ones. They also have the advantage over most others of an ability to solve the 'stability-plasticity' dilemma; that is, once the network has learned a set of arbitrary input patterns, it can learn additional patterns without destroying previous knowledge. Most other ANNs have to be retrained in order to learn additional input patterns.


ART1 self-organizes recognition codes for binary input patterns. ART2 does the
same for binary and analogue input patterns, while ART3 is based on ART2 but
includes a model of the chemical synapse that solves the memory-search problem of
ART systems. ARTMAP is a supervised ANN which links together two ART modules
(ART[sub a] and ART[sub b]) by means of an 'inter-ART associative memory', called
a map field. ART[sub a] and ART[sub b] are both unsupervised ART modules (e.g.
ART1, ART2, ART3, ART2-A etc.).


Our system consists of an ART2 and an ART1 ANN for the ART[sub a] and ART[sub b]
modules, respectively. Vector a[sub p] encodes the pitch information in the form
of semitone bins and the vector b[sub p] encodes its 'predictive consequence'
which corresponds, in this case, to the conventional name of that pitch, e.g.
A[sub 4], C[sub 3] etc. The next four sections describe the unsupervised ART[sub
a] and ART[sub b] modules which were used in our system, followed by an
experiment which demonstrates why the supervised learning scheme adopted was
needed. Finally, details of the ARTMAP system are given.


3.3.1. ART1. ART1 is an ANN which self-organizes, without supervision, a set of
binary input vectors. Vectors which are similar enough to pass the so-called
'vigilance test' are clustered together to form categories (also known as
exemplars or recognition codes). ART1 has been described as having similar
properties to those of the single-leader sequential clustering algorithm
(Lippmann, 1987). Briefly, the leader algorithm selects the first input as the
leader of the class. The next input is then compared to the first pattern and the
distance is calculated between the vectors. If this distance is below a threshold
(set beforehand), then this input is assigned to the first class, otherwise it
forms a new class and consequently becomes the leader for that class. The
algorithm continues in this fashion until all patterns have been assigned to a
class. The number of classes produced by the leader algorithm depends on both the
threshold value chosen and the distance measure used to compare input to class
leaders.
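

A compact rendering of the leader algorithm (our paraphrase of the procedure Lippmann describes; the threshold and distance measure are free choices, as noted above):

    # Single-leader sequential clustering: assign each pattern to the first
    # class whose leader is within `threshold`, else start a new class.
    import numpy as np

    def leader_cluster(patterns, threshold):
        leaders, labels = [], []
        for p in patterns:
            for k, leader in enumerate(leaders):
                if np.linalg.norm(p - leader) < threshold:
                    labels.append(k)
                    break
            else:                   # no leader close enough: p leads a new class
                leaders.append(p)
                labels.append(len(leaders) - 1)
        return labels

    data = [np.array(v, float) for v in ([0, 0], [0.2, 0.1], [5, 5], [5.1, 4.9])]
    print(leader_cluster(data, threshold=1.0))   # -> [0, 0, 1, 1]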


The leader algorithm differs from ART1 in that it does not attempt to improve on
its leading pattern for a class (i.e. the weight vector), which would make the
system more tolerant to future input patterns. Thus, the classes produced by the
leader algorithm can vary greatly depending on the input presentation order,
whereas in ART1 this is not the case.


The ART1 ANN consists of an attentional subsystem comprising an input layer (the
F[sub 1] field) and an output layer (the F[sub 2] field), fully interconnected by
a set of bottom-up and top-down adaptive weights, and an orienting subsystem
which incorporates the reset mechanism (Figure 7). The bottom-up weights
constitute ART1's long-term memory and the top-down weights store the learned
expectations for each of the categories formed in the F[sub 2] field. The
top-down weights are crucial for recognition-code self-stabilization.


In brief, a binary input pattern is presented to the F[sub 1] field. This pattern
is then multiplied by the bottom-up weights in order to compute the F[sub 2]
field activation. A winning F[sub 2] node is chosen from the F[sub 2] field by
lateral inhibition. The top-down weights from this winning node and the F[sub 1]
field pattern then take part in the vigilance test, to check whether the match
between the input pattern and the stored expectation of the winning category is
higher than the set vigilance level. If it is, then learning takes place by
adapting the bottom-up and top-down weights for the winning F[sub 2] node, in
such a way as to make the correlation between these weights and the input pattern
greater. If not, the winning F[sub 2] node is reset (i.e. it is eliminated from
competition) and the algorithm searches for another suitable category.
Uncommitted nodes (i.e. those which have not had previous learning) will always
accommodate the learning of a new pattern. The algorithm continues in this way
until all input patterns have found suitable categories.
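

The search-and-learn cycle just described can be condensed into a short fast-learning sketch (a common textbook simplification of ART1, not the exact system of Carpenter & Grossberg, 1987a):

    # Minimal fast-learning ART1 for binary inputs. Templates play the role of
    # the top-down expectations; learning is the logical AND of template and
    # input, as in the fast-learning limit.
    import numpy as np

    class ART1:
        def __init__(self, vigilance=0.8, beta=1.0):
            self.rho, self.beta = vigilance, beta
            self.templates = []                  # one binary template per category

        def present(self, x):
            x = np.asarray(x, dtype=bool)
            # Rank committed categories by the bottom-up choice function.
            order = sorted(range(len(self.templates)),
                           key=lambda j: -np.sum(x & self.templates[j]) /
                                          (self.beta + np.sum(self.templates[j])))
            for j in order:
                if np.sum(x & self.templates[j]) / np.sum(x) >= self.rho:
                    self.templates[j] &= x       # vigilance passed: fast learning
                    return j
                # vigilance failed: node is reset and the search continues
            self.templates.append(x.copy())      # recruit an uncommitted node
            return len(self.templates) - 1

    net = ART1(vigilance=0.7)
    for p in ([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1]):
        print(net.present(p))                    # -> 0, 0, 1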


3.3.2. ART2. The ART2 ANN, like ART1, is self-organizing and unsupervised. It is
the analogue counterpart of ART1, designed for the processing of analogue as well as binary input patterns. ART2 attempts to pick out and enhance similar
signals embedded in various noisy backgrounds. For this purpose, ART2's feature
representation fields F[sub 0] and F[sub 1] include several pre-processing levels
and gain-control systems (Figure 8).


ART2 is made up of two major components, common in the design of all ART systems,
namely the attentional subsystem and the orienting subsystem. The attentional
subsystem consists of two input representation fields F[sub 0] and F[sub 1] and a
category representation field F[sub 2], which are fully connected by a bottom-up
and a top-down set of weights. As in ART1, the bottom-up weights represent the
system's long-term memory and the top-down weights store the learned expectations
for each F[sub 2] category. The orienting subsystem interacts with the
attentional subsystem to carry out an internally controlled search process.


In short, an analogue input vector is passed to the F[sub 0] field. This pattern
is normalized and then a threshold function is applied which sets to zero any
part of the input pattern which falls below a set threshold value. This pattern
is then renormalized to produce the input to the F[sub 1] field. The F[sub 1] field also incorporates normalization and threshold operations which are applied before the F[sub 2] activation is calculated by a feed-forward multiplication of the
output of the F[sub 1] field and the bottom-up weights. The F[sub 2] field then
undergoes a process of lateral inhibition to find the F[sub 2] node with the
highest activity. A vigilance test then takes place to check whether the top-down
learned expectation from the winning F[sub 2] node matches the input pattern well
enough. If it does, the F[sub 1] calculations are iterated a number of times and
then learning takes place between the F[sub 1] field and the winning F[sub 2]
node. If not, the F[sub 2] node is reset and another F[sub 2] node is chosen.
The search repeats until the vigilance test is passed: either a committed node is found whose stored expectations match the input sufficiently, or an uncommitted node is chosen (to which learning can always take place).
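

The F[sub 0] pre-processing chain (normalize, threshold, renormalize) is easily sketched; the threshold value here is illustrative, and the iterated F[sub 1] loops and gain control are omitted:

    # ART2-style input conditioning: contrast enhancement by normalization,
    # noise suppression by thresholding, then renormalization.
    import numpy as np

    def art2_preprocess(x, theta=0.1):
        v = np.asarray(x, dtype=float)
        v = v / (np.linalg.norm(v) + 1e-12)      # normalize
        v[v < theta] = 0.0                       # suppress low-level components
        return v / (np.linalg.norm(v) + 1e-12)   # renormalize

    noisy = np.array([1.0, 0.05, 0.8, 0.04, 0.02, 0.9])
    print(np.round(art2_preprocess(noisy), 3))
    # -> the three strong components survive, rescaled to unit length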


3.3.3. Why supervised learning is needed. For our implementation, a supervised
learning network was needed. We found that an optimum level of vigilance could
not be found to give a satisfactory solution using the unsupervised ART networks.
This is demonstrated in Figure 9. Here, an ART2 ANN was trained with 14 patterns,
7 examples of different spectral shapes taken from 7 different musical
instruments for C[sub 3] (patterns 0-6) and the same for C[sub 4] (patterns
7-13). The instruments were, respectively, a tenor voice singing la, a tenor
voice singing me, a contralto voice singing la, a steel-string guitar played Sul Tasto (near the neck), a steel-string guitar played Sul Ponticello (near the bridge), a French horn and a piano. Figure 9 shows how these patterns were
clustered together at various levels of vigilance. For example, at vigilance
level 0.98 input patterns 0, 1, 5, 6, 7, 8, 9, 10, 11, 12 and 13 were clustered
to the F[sub 2] category 0 and patterns 2, 3 and 4 were clustered to the F[sub 2]
node 1. The optimum level of vigilance would be one which separated the patterns
from different octaves (e.g. the C[sub 3]s from the C[sub 4]s) while clustering
together as much as possible patterns which are from the same octave. This would
allow the network to distinguish between pitches of different octaves and also
give some insensitivity to timbre of input patterns having the same pitch by
picking out common features within the internal structures of such patterns.


It can be seen from Figure 9 that there was no level of vigilance which could
separate the pitch examples from the two octaves. Even at the highest level of vigilance tested (0.99995), patterns 1 and 9 were clustered together (1 being the
tenor voice example singing me of C[sub 3] and 9 being the contralto voice
example singing la of C[sub 4]) when all other patterns were associated with
different output nodes. Thus, a more advanced network was needed to control the
vigilance of the ART network so that it would cluster together only input
patterns which have the same output state (i.e. have the same pitch). ARTMAP was
able to perform such an operation. In an ARTMAP network the map field controls ART[sub a]'s vigilance parameter: it associates the pairs of vectors presented to ART[sub a] and ART[sub b], and it reduces redundancy by clustering together at ART[sub a], within a baseline vigilance, as many patterns as possible that have the same 'predictive consequence' at ART[sub b]. The mechanisms of the ARTMAP network are described below.


3.3.4. ARTMAP. ARTMAP is a self-organizing, supervised ANN which consists of two
unsupervised ART modules, ART[sub a] and ART[sub b], and an inter-ART associative
memory, called a map field (Figure 10). ART[sub a] and ART[sub b] are linked by
fully connected adaptive connections between ART[sub a]'s F[sub 2] layer and the
map field, and non-adaptive, bidirectional, one-to-one connections from the map
field to ART[sub b]'s F[sub 2] layer. The ART[sub b] network self-organizes the
'predictive consequence' or 'desired output' patterns for each input pattern
presented to ART[sub a].


Briefly, a pair of vectors a[sub p] and b[sub p] are presented to ART[sub a] and
ART[sub b] simultaneously. The ART[sub a] and the ART[sub b] networks choose
suitable output (F[sub 2]) categories for these vectors. The map field then
checks whether ART[sub a]'s choice correctly predicts the choice at ART[sub b]. If it does, then outstar learning between F[sup a][sub 2] and the map field takes place, i.e. learning takes place between the map-field node corresponding to the winning F[sup b][sub 2] node and the F[sup a][sub 2] pattern. Connections to all other F[sup b][sub 2] nodes are inhibited. If not, the map field increases ART[sub a]'s vigilance so that ART[sub a] does not choose the same F[sub 2] category again but searches on until a suitable F[sup a][sub 2] category is found. If there is no suitable category, ART[sub a] chooses an uncommitted node, to which learning can always take place.
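

The control loop can be made concrete with a self-contained toy (our schematic paraphrase of match tracking, not Carpenter et al.'s full formulation; ART[sub b] is reduced here to a supplied label and ART[sub a] to a minimal ART1-like module):

    # Toy ARTMAP illustrating match tracking on binary inputs.
    import numpy as np

    class ToyARTMAP:
        def __init__(self, base_vigilance=0.5, eps=1e-6):
            self.rho0, self.eps = base_vigilance, eps
            self.templates, self.map_field = [], []   # ART_a templates and labels

        def train(self, x, label):
            x = np.asarray(x, dtype=bool)
            rho = self.rho0                           # start at baseline vigilance
            while True:
                # ART_a search: best committed category passing the vigilance test.
                ranked = sorted(range(len(self.templates)),
                                key=lambda j: -np.sum(x & self.templates[j]))
                winner = next((j for j in ranked
                               if np.sum(x & self.templates[j]) / np.sum(x) >= rho),
                              None)
                if winner is None:                    # recruit an uncommitted node
                    self.templates.append(x.copy())
                    self.map_field.append(label)
                    return len(self.templates) - 1
                if self.map_field[winner] == label:   # map field confirms prediction
                    self.templates[winner] &= x       # learn only on a correct match
                    return winner
                # Match tracking: raise vigilance just above the winner's match
                # level, excluding it, and search on.
                rho = np.sum(x & self.templates[winner]) / np.sum(x) + self.eps

    net = ToyARTMAP()
    print(net.train([1, 1, 0, 0], 'C3'))   # -> 0
    print(net.train([1, 1, 1, 0], 'C4'))   # -> 1 (match tracking forces a new node)
    print(net.train([1, 1, 0, 0], 'C3'))   # -> 0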


4. Pitch Classification Experiments Using ARTMAP


4.1. Training the ARTMAP System


The supervised ARTMAP system consisting of an ART2, a map field and an ART1 ANN
was trained to recognize pitches from four different musical instruments.
Instruments that are bright or sharp sounding produce a spectrum with relatively
strong higher harmonics, whereas instruments with a mellower tone tend to have a
strong fundamental but weak higher harmonics. Figure 11 illustrates this by
showing the amplitude spectrum of three different sound sources, each playing (or singing) a note with a pitch of C[sub 4]. The top spectrum is that of a clarinet and contains a full series of harmonics with
varying amplitudes. The middle spectrum was taken from a vocal singing la and the
bottom spectrum is taken from a vocal singing me. It can be seen that spectra
vary greatly between different sound sources and therefore, to determine pitches
on a variety of instruments, the network was required to learn features from a
range of possible spectra for each pitch and thus become insensitive to spectral
shape (and hence timbre) when classifying.


The network was trained with a total of 194 patterns, which were examples from a
C-major scale between C[sub 3] and C[sub 6] on the four instruments. Samples were
taken at various times (determined randomly) within the duration of the note. The
training patterns were divided up as follows:


(1) 22 patterns (C[sub 3]-C[sub 6]) from a 'tubular-bells' sound on a commercial
synthesizer.


(2) 22 patterns (C[sub 3]-C[sub 6]) from a 'strings' sound on a commercial
synthesizer.


(3) 22 patterns taken from an upright piano, recorded digitally via a microphone (Toshiba EM-420) placed in front of it (roughly where the pianist's head would be).


(4) 128 patterns taken from an electric guitar. More training patterns are needed
for a guitar because the spectrum of a particular note varies with the string it
is played on, and with where and how that string is plucked. These patterns can
be divided into two sets, consisting, respectively, of examples played Sul Tasto
and Sul Ponticello. Each of these sets consists of patterns taken from every note
in the C-major scale between C[sub 3] and C[sub 6], played on all possible
strings.


Patterns were presented in ascending order of pitch for each of the first three cases. However, it was felt appropriate to present the many repeated notes from the different strings of the guitar in a quasi-random order. A slow learning rate was used to train the network with the 194 input patterns; 100 cycles were performed, which took about half an hour on a NeXT Workstation. The 194 patterns were clustered into 46 categories at the F[sup a][sub 2] layer and then mapped to 22 nodes, corresponding to the names of each pitch, in the F[sup b][sub 2] layer. Thus, the ART[sub a] network achieved a code compression of the input patterns of greater than 4 to 1.


4.2. The Test Data


The network was tested using 412 patterns. 194 of these were taken from the
original training set but were sampled at different points in time, thus
producing a spectrum which was in general rather different from the original
training pattern. In some cases, e.g. tubular bells, the difference can be quite
considerable.


The other test patterns consisted of:


(1) 108 patterns from the guitar. These included plucking the notes halfway
between Sul Tasto and Sul Ponticello positions, playing each note Sul Tasto with
vibrato, and playing each note Sul Ponticello with vibrato. Vibrato involves a
frequency modulation of the order of a percent or so at a rate of a few hertz and
is a common characteristic of musical tones. It is normally accompanied by
amplitude modulation of a similar order of magnitude. Together they present a
realistic challenge to a pitch-determining system.


(2) 22 tubular-bells patterns with each note played with clearly audible
amplitude modulation.


(3) 22 notes from the strings sound played with amplitude modulation.


(4) 22 notes played softly, and 22 notes played with the sustain pedal depressed,
on an acoustic piano.


(5) 22 notes played on the piano sound of a commercial synthesizer.


4.3. Results


Overall the ARTMAP network made mistakes in about 5% of cases. Around 70% of
these mistakes, in turn, were 'octave' mistakes, by which we mean a
classification by the network one octave above or below the correct pitch. For
example, an octave mistake that arose with a soft-piano pattern was:


Correct classification B[sub 3]:


(1) B[sub 4]:intensity 796;


(2) B[sub 3]:intensity 795;


(3) G[sub 3]:intensity 693.


where (1) is first choice, (2) is second choice, etc. and the intensity (the
measure of the network's certainty) is a score scaled to 1000. Here, it can be
seen that there is ambiguity in the network's choice. Its preference for B[sub 4]
over B[sub 3] is so slight as to be insignificant compared with statistical
variations inherent in the method. The competition is particularly close in this
example. However, it was generally the case that when the network did make a
mistake the winning intensity was rather low in itself, or very close (within 50
out of 1000) to the correct classification. In the latter case the network could
be adjusted to output a 'don't know'. This, however, would mildly impair its
estimated accuracy in other respects as it sometimes made 'correct'
classifications which could be considered ambiguous, e.g. from the strings sound:


Correct classification E[sub 3]:


(1) E[sub 3]:intensity 853;


(2) E[sub 4]:intensity 839;


(3) C[sub 3]:intensity 492.


Generally, in the ambiguous 'correct' classifications (about 4%) the runner-up
was an octave away from the winner. Most of the 'correct' classifications,
however, were clear cut:


Correct classification A[sub 3]:


(1) A[sub 3]:intensity 980;


(2) E[sub 4]:intensity 318;


(3) F[sub 5]:intensity 279.
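

The 'don't know' adjustment suggested above amounts to a simple decision rule; a sketch using the margin of 50 (out of 1000) quoted earlier (the ranked lists reproduce the examples above):

    # Withhold a classification when the runner-up is within `margin` of the
    # winner's intensity.
    def classify(ranked, margin=50):
        """ranked: (pitch, intensity) pairs sorted by descending intensity."""
        (best, s1), (_, s2) = ranked[0], ranked[1]
        return best if s1 - s2 >= margin else "don't know"

    print(classify([('B4', 796), ('B3', 795), ('G3', 693)]))  # -> don't know
    print(classify([('A3', 980), ('E4', 318), ('F5', 279)]))  # -> A3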


The network made mistakes on only two of the data sets, namely the upright piano
and the tubular-bells sets. Most of the piano mistakes were made when the note
was hit softly. Inspecting the Fourier spectrum in each of these cases revealed
that the harmonic components were of very low amplitude and partly obscured by
low-level noise. Indeed, checking the quality of the recorded notes did reveal a
rather poor signal-to-noise ratio.


The network made the remaining mistakes on the tubular-bells sound. The pitch of
such sounds is in any case more difficult for listeners to determine, because the
components are to some considerable degree inharmonic. The network classified 86%
of test tubular-bells sounds correctly, which we regard as a reasonable
performance. Increasing the number of training examples of this sound would
probably improve this figure greatly. The network was able to classify all but
one of the synthesized piano sounds correctly (over 95% accuracy). This is to be
compared with the real piano sounds where only 86% accuracy was achieved. This is
not surprising as synthesized sounds tend to vary less dramatically with how they
are played, whereas the spectrum of an acoustic piano note is quite sensitive to,
for example, the velocity of striking.


4.4. Conclusions


The ARTMAP network has proved capable of determining pitches with a high degree
of reliability over a wide range. A significant proportion of the small number of
mistakes made are readily explicable in terms of noise or spectral inharmonicity
and could be largely eliminated by straightforward improvements in the quality
and quantity of the training patterns. The 'octave' mistakes which predominate
are particularly interesting and significant. Such mistakes are common amongst
musically sophisticated human listeners and reflect the inherent similarity of
tones whose fundamental frequencies can be expressed as a simple, whole-number
ratio. Indeed, it could be argued that such similarity is the very foundation of
western harmony.


The indications are that the capability of the system may be extended to a great
variety of instruments. This would result in a generalized pitch determiner which
exhibited a desirable insensitivity to the dynamic level and spectral shape of
the input signal.


5. Recent and Future Work


The ARTMAP system described here was written as one program (object) in Objective-C on a NeXT Workstation. Recently, by using object-oriented techniques, we have created different classes for each ART network, e.g. for ART1, ART2, ART2-A (Carpenter et al., 1991b), SART (a modification of ART2-A developed by Taylor (1994)) and a map field, all of which can be dynamically created at run-time
(Taylor & Greenhough, 1993). This allows more complicated networks to be built by
simply instantiating the relevant classes needed for an application. A general
ARTMAP network that can link any combination of the above networks together has
also been constructed, allowing great flexibility when applying different ARTMAP
topologies to particular applications. This has led to the use of ARTMAP in other
music-pattern classification tasks such as coping with polyphonic acoustic
signals (Taylor et al., 1993).


Also recently, a comparison has been made of the performance of three different ANN architectures for pitch determination: back-propagation and two distinct ARTMAP architectures, whose ART[sub a] modules are ART2-A and SART respectively.


Each ANN was trained with a total of 198 examples of pitched notes from the
C-major scale on 10 different instruments in the range C[sub 3] to C[sub 6].
Instruments were chosen to cover a wide variety of spectral shapes so that the
network could pick out the characteristic features for each pitched note, and yet
acquire an insensitivity to timbre. These instruments included soprano, contralto
and tenor voices, saxophone and French horn, violin, acoustic guitar and sitar,
piano and some examples of whistling.


In order to assess the system's robustness, the majority of testing examples were
chosen from different instruments but some were taken from instruments from the
training set and sung or played with vibrato. The test set altogether consisted
of 443 patterns. The network was also tested on the 198 training patterns, which showed that it had learned all the pitch examples. Test-set instruments included
alto and bass voices, classical, 12-string and electric guitar, recorder,
clarinet, mandolin and steel drums, as well as some synthetic sounds (produced by
computer) which consisted of a harmonic series without a fundamental. Other test
examples were: soprano, contralto (x2) and tenor (x2) voices either sung with
vibrato or taken from different singers, saxophone and violin with vibrato, and
piano played softly.


For this experiment, the back-propagation network was investigated for a large
number of input-parameter combinations to optimize it fully (2500 different
conditions in all, taking the equivalent of about 12 days of Sun SPARC Station
time). Each ART network was also optimized by first setting the optimum theta
(threshold) value and then tweaking its vigilance level, which in both cases took
considerably less time than the back-propagation network (around 4 hours in the
ART2-A's case and less than 30 minutes in SART's case).


It was found that the best back-propagation network achieved an absolute-pitch
classification rate of 98.0% and a chroma-pitch classification rate (i.e.
ignoring the octave) of 99.1%. In the same trial, the two ARTMAP networks
achieved rates of 97.7% and 99.6% (for SART) and 98.2% and 99.8% (for ART2-A),
respectively. It was also found that for these cases the back-propagation network
took over 8 minutes (on a NeXT Workstation) to learn the training data. The
ART2-A network took just over 2 minutes while the SART network took less than 20
seconds to learn the training data. It was also found that all ANNs performed better than straightforward pattern-matching against stored templates, which achieved an absolute-pitch classification rate of 97.1% and a chroma-pitch classification rate of 98.2% (Taylor & Greenhough, 1993).
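

For clarity, the distinction between the two scores can be expressed as a small sketch (the example labels are illustrative): an absolute-pitch classification must match both chroma and octave, while a chroma-pitch classification ignores the octave.

    # Absolute versus chroma (octave-ignoring) classification rates.
    def accuracy(pairs, chroma=False):
        strip = (lambda n: n.rstrip('0123456789')) if chroma else (lambda n: n)
        return sum(strip(t) == strip(g) for t, g in pairs) / len(pairs)

    results = [('C4', 'C4'), ('E3', 'E4'), ('G5', 'G5'), ('A3', 'A3')]
    print(accuracy(results), accuracy(results, chroma=True))   # -> 0.75 1.0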


We conclude that although the ANN-based ARTMAP pitch-determining system has to be trained, this is well worth while, since training takes only 2 minutes in ART2-A's case (and just 20 seconds in SART's case) and results in a more robust pitch-classification ability. The benefits of the back-propagation ANN, however, would have to be offset against its training time and optimization problems if it were to be considered for use in a pitch-determining system.


Future work includes the tracking of vocal and instrumental solos in the presence
of an accompaniment. This requires the ANN to reject everything in the source
except the main melody. We believe that this can be accomplished by teaching the
ANN what a particular instrument sounds like (i.e. its spectral characteristics
and thus its timbre) for each pitch. The network can then 'listen' for
recurrences of this sound in the music, and thus classify just the pitches of the
instrument it has been trained for. Preliminary experiments with some musical
instruments have shown that one particular ANN network (ARTMAP) can indeed be
trained to perform such classifications. Such tracking systems may eventually
find application in certain contemporary music performances involving the
interaction of human players and computer systems, and perhaps in
ethnomusicological studies where it is required to transcribe previously
unnotated live or recorded music. The latter application may require the handling
of a pitch continuum, in which case the input mapping system will need to be
refined, as suggested in Section 3.2.


ILLUSTRATION: Figure 1. Visual analogy of virtual pitch. The visual system
perceives contours which are not physically present (Terhardt, 1974).


DIAGRAM: Figure 2. The adaptive resonance theory system. A Fourier transform is
performed on the input waveform to produce an amplitude spectrum. This is mapped
to a 'semitone-bin' distribution which is then fed into the ART neural network
architecture called ARTMAP, which learns to extract pitch.


DIAGRAM: Figure 3. The mapping scheme from a Fourier spectrum to a distribution of 'semitone bins'. There are strong connections for frequencies within +/- 1/4-tone (approximately +/- 1/2(F[sub s] - F[sub s-1])) of the semitone centre frequency F[sub s] and weaker ones within +/- 1/2-tone.


Figure 4. Operations used to map frequency components in the Fourier spectrum to
a semitone-bin distribution.


Figure 5. The input-layer representation for the ARTMAP ANN.


Figure 6. The output-layer representation for the ARTMAP ANN.


DIAGRAM: Figure 7. An ART1 ANN, consisting of an F[sub 1] field (input layer), an
F[sub 2] field (output layer), a reset mechanism and two sets of fully connected
weights (bottom-up and top-down).


DIAGRAM: Figure 8. A simplified version of the ART2 ANN architecture consisting
of an attentional subsystem, containing three fields, F[sub 0], F[sub 1] and
F[sub 2] and two sets of adaptive weights (bottom-up and top-down) and an
orienting subsystem incorporating the reset mechanism. Large filled circles
(gain-control nuclei) represent normalization operations carried out by the
network.


Figure 9. The category representations made by the ART2 ANN for 14 input patterns
(7 examples taken from different instruments for the note C[sub 3] and the same
for C[sub 4]) at 9 different levels of vigilance.


For each level of vigilance, the input patterns clustered to each F[sub 2] category were as follows (patterns 0-6 are the C[sub 3] examples, patterns 7-13 the C[sub 4] examples):

0.98:    category 0: {0 1 5 6 7 8 9 10 11 12 13}; category 1: {2 3 4}
0.99:    category 0: {0 1 5 6 7 8 9 10 12 13}; category 1: {2 3 4}; category 2: {11}
0.9975:  category 0: {0 1 5 6 7 12 13}; category 1: {2 3 4}; category 2: {8 9 10}; category 3: {11}
0.998:   category 0: {0 1 5 7 12}; category 1: {2 3 4}; category 2: {6 8 9 10 13}; category 3: {11}
0.999:   category 0: {0 1 7 12}; category 1: {2 3}; category 2: {4 5}; category 3: {6 8 9 10 13}; category 4: {11}
0.9995:  category 0: {0 7}; category 1: {1 5 8 9 10}; category 2: {2 3}; category 3: {4}; category 4: {6 12 13}; category 5: {11}
0.9998:  category 0: {0 7}; category 1: {1 6 8}; category 2: {2 3}; category 3: {4}; category 4: {5}; category 5: {9 10}; category 6: {11}; category 7: {12 13}
0.99991: category 0: {0}; category 1: {1 9 10}; category 2: {2 3}; category 3: {4}; category 4: {5}; category 5: {6 8}; category 6: {7}; category 7: {11}; category 8: {12}; category 9: {13}
0.99995: category 0: {0}; category 1: {1 9}; category 2: {2}; category 3: {3}; category 4: {4}; category 5: {5}; category 6: {6}; category 7: {7}; category 8: {8}; category 9: {10}; category 10: {11}; category 11: {12}; category 12: {13}




DIAGRAM: Figure 10. A predictive ART, or ARTMAP, system includes two ART modules
linked by a map field. Internal control structures actively regulate learning and
information flow. ART[sub a] generally self-organizes input data (e.g. pitch
information) and ART[sub b] self-organizes desired output states (e.g. pitch
names).


DIAGRAM: Figure 11. The amplitude spectrum of three different musical instruments
for the note C[sub 4]. The top spectrum is that of a clarinet, the middle
spectrum was taken from a vocal singing la and the bottom spectrum is taken from
a vocal singing me. It can be seen that spectra vary greatly between different
sound sources.


References


Brown, J.C. (1991) Musical frequency tracking using the methods of conventional
and 'narrowed' autocorrelation. The Journal of the Acoustical Society of America,
89, 2346-2354.


Brown, J.C. (1992) Musical fundamental frequency tracking using a pattern
recognition method. The Journal of the Acoustical Society of America, 92,
1394-1402.


Carpenter, G.A. & Grossberg, S. (1987a) A massively parallel architecture for a
self-organizing neural pattern recognition machine. Computer Vision, Graphics,
and Image Processing, 37, 54-115.


Carpenter, G.A. & Grossberg, S. (1987b) ART2: Self-organization of stable
category recognition codes for analog input patterns. Applied Optics, 26,
4919-4930.


Carpenter, G.A. & Grossberg, S. (1990) ART3: Hierarchical search using chemical
transmitters in self-organizing pattern recognition architectures. Neural
Networks, 3, 129-152.


Carpenter, G.A., Grossberg, S. & Reynolds, J.H. (1991a) ARTMAP: Supervised
real-time learning and classification of nonstationary data by a self-organizing
neural network. Neural Networks, 4, 565-588.


Carpenter, G.A., Grossberg, S. & Rosen, D.B. (1991b) ART2-A: An adaptive
resonance algorithm for rapid category learning and recognition. Neural Networks,
4, 493-504.


De Boer, E. (1956) On the Residue in Hearing. Unpublished doctoral dissertation, University of Amsterdam.


De Boer, E. (1977) Pitch theories unified. In E.F. Evans & J.P. Wilson (Eds),
Psychophysics and Physiology of Hearing, pp. 323-334. London: Academic.


Duifhuis, H., Willems, L.F. & Sluyter, R.J. (1982) Measurement of pitch in
speech: An implementation of Goldstein's theory of pitch perception. The Journal
of the Acoustical Society of America, 71, 1568-1580.


Goldstein, J.L. (1973) An optimum processor for the central formation of pitch of
complex tones. The Journal of the Acoustical Society of America, 54, 1496-1516.


Grossberg, S. (1976a) Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121-134.


Grossberg, S. (1976b) Adaptive pattern classification and universal recoding, II:
Feedback, expectation, olfaction, and illusions. Biological Cybernetics, 23,
187-202.


Helmholtz, H.L.F. von (1863) On the Sensations of Tone as a Physiological Basis for the Theory of Music. Translated by A.J. Ellis from the 4th German edition (1877). London: Longmans, 1885 (reprinted, New York: Dover, 1954).


Hermes, D.J. (1988) Measurement of pitch by subharmonic summation. The Journal of
the Acoustical Society of America, 83, 257-264.


Houtsma, A.J.M. & Goldstein, J.L. (1972) The central origin of the pitch of
complex tones: Evidence from musical interval recognition. The Journal of the
Acoustical Society of America, 51, 520-529.


Lippmann, R.P. (1987) An introduction to computing with neural nets. IEEE ASSP Magazine, 4, 4-22.


Martin, Ph. (1982) Comparison of pitch detection by cepstrum and spectral comb analysis. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-82), pp. 180-183.


Noll, A.M. (1970) Pitch determination of human speech by the harmonic product
spectrum, the harmonic sum spectrum, and a maximum likelihood estimate. In the
Microwave Institute (Eds), Symposium on Computer Processing in Communication 19,
779-797, Polytechnic University of Brooklyn, New York.


Piszczalski, M. & Galler, B.F. (1982) A computer model of music recognition. In
M. Clynes (Ed.), Music, Mind and the Brain: The Neuropsychology of Music, pp.
321-351. London: Plenum Press.


Ritsma, R.J. (1962) Existence region of the tonal residue, I. The Journal of the
Acoustical Society of America, 34, 1224-1229.


Rumelhart, D.E. & McClelland, J.L. (1986) Parallel Distributed Processing,
Explorations in the Microstructure of Cognition, Vol. 1: Foundations, Cambridge,
MA: MIT Press.


Scheffers, M.T.M. (1983) Simulation of auditory analysis of pitch: an elaboration
on the DWS pitch meter. The Journal of the Acoustical Society of America, 74,
1716-1725.


Schouten, J.F. (1940) The residue and the mechanism of hearing. Proceedings Koninkl. Ned. Akad. Wetenschap, 43, 991-999.


Schroeder, M.R. (1968) Period histogram and product spectrum: new methods for fundamental-frequency measurement. The Journal of the Acoustical Society of America, 43, 829-834.


Stubbs, D. (1988) Neurocomputers. M.D. Computing, 5, 14-24.


Taylor, I. (1994) Artificial Neural Network Types for the Determination of
Musical Pitch. PhD thesis, University of Wales, College of Cardiff.


Taylor, I. & Greenhough, M. (1993) An object-oriented ARTMAP system for
classifying pitch. Proceedings of International Computer Music Conference, pp.
244-247.


Taylor, I., Page, M. & Greenhough, M. (1993) Neural networks for processing
musical signals and structures. Acoustics Bulletin, 18, 5-9.


Terhardt, E. (1972) Zur Tonhoehenwahrnehmung von Klaengen II: ein
Funktionsschema. Acustica, 26, 187-199.


Terhardt, E. (1974) Pitch, consonance, and harmony. The Journal of the Acoustical
Society of America, 55, 1061-1069.


Terhardt, E., Stoll, G. & Seewann, M. (1982) Algorithm for extraction of pitch
and pitch salience from complex tonal signals. The Journal of the Acoustical
Society of America, 71, 679-688.


Wightman, F.L. (1973) The pattern-transformation model of pitch. The Journal of
the Acoustical Society of America, 54, 407-416.



BY IAN TAYLOR & MIKE GREENHOUGH

I. Taylor and M. Greenhough, Department of Physics and Astronomy, University of Wales College of Cardiff, PO Box 913, Cardiff CF2 3YB, UK. E-mail: ijt@cm.cf.ac.uk and greenhough@cardiff.ac.uk



From: Taylor, I. & Greenhough, M., 'Modelling pitch reception with adaptive resonance theory artificial neural networks', Connection Science, Vol. 6, 1994, p. 135.

