FINITE WORDLENGTH EFFECTS IN DIGITAL FILTERS


Abstract: If implemented with unlimited wordlength for coefficients and variables, a properly designed digital filter
behaves as expected. In this case the choice among the numerous available structures influences only the
complexity, and thus the achievable speed on a particular hardware platform. Apart from that, the performance will
always be the same. In reality, however, the situation is quite different and far more complicated. Coefficients as well
as data can be implemented with finite wordlengths only. Quantized coefficients lead to erroneous behavior
of the system, for example a very different frequency response. The deviation from the expected performance
depends on the chosen structure and its sensitivity to perturbations. Quantizing the coefficients of a
stable system can even make it unstable.
Keywords: Wordlength, Quantization, Digital Filter, Rounding Errors, Limit Cycles.
1. INTRODUCTION:
When discrete-time systems are implemented in hardware or in software, all parameters and arithmetic
operations are represented with finite-precision numbers, so their effects are unavoidable. This leads to four
types of finite wordlength effects. First, quantization of the filter coefficients perturbs the locations of the
filter poles and zeros. As a result, the actual filter response differs slightly from the ideal response. Second,
the use of finite-precision arithmetic makes it necessary to quantize filter calculations by rounding or
truncation. Roundoff noise is the error in the filter output that results from rounding or truncating
calculations within the filter. As the name implies, this error looks like low-level noise at the filter output.
Third, quantization of the filter calculations also renders the filter slightly nonlinear. For large signals this
nonlinearity is negligible and roundoff noise is the major concern. However, for recursive filters with a zero or
constant input, this nonlinearity can cause spurious oscillations called limit cycles. Fourth, with fixed-point
arithmetic it is possible for filter calculations to overflow. The term overflow oscillation, sometimes also
called adder overflow limit cycle, refers to a high-level oscillation that can exist in an otherwise stable filter
due to the nonlinearity associated with the overflow of internal filter calculations.
2. THEORETICAL FUNDAMENTALS
In practice, a value is quantized to a nearby representable level by one of two basic operations: truncation
or rounding. These operations affect the accuracy as well as the general characteristics of digital filters and
DSP operations. When a sum-of-products calculation is performed, the quantization can be performed either
after each multiplication or after all products have been summed with double-length precision.
We will examine three types of fixed-point quantization: rounding, truncation, and magnitude truncation.
If $X$ is an exact value, then the rounded value will be denoted $Q_r(X)$, the truncated value $Q_t(X)$ and
the magnitude-truncated value $Q_{mt}(X)$.
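
The two places where a sum of products can be quantized may be compared with a minimal sketch. The coefficient and sample values, the step size and the helper names below are illustrative assumptions, not taken from the experiments in this paper.

```python
import math


def q_round(x, delta):
    """Round x to the nearest multiple of delta (ties go upward)."""
    return math.floor(x / delta + 0.5) * delta


def sum_of_products(coeffs, samples, delta, quantize_each):
    """Quantize after every product, or once after the exact sum."""
    if quantize_each:
        return sum(q_round(c * s, delta) for c, s in zip(coeffs, samples))
    return q_round(sum(c * s for c, s in zip(coeffs, samples)), delta)


delta = 2.0 ** -7                      # assumed step: 7 fractional bits
coeffs = [0.3, -0.45, 0.2, 0.15]       # hypothetical filter coefficients
samples = [0.61, 0.27, -0.83, 0.44]    # hypothetical data samples

exact = sum(c * s for c, s in zip(coeffs, samples))
err_each = sum_of_products(coeffs, samples, delta, True) - exact
err_once = sum_of_products(coeffs, samples, delta, False) - exact
print("error, quantizing each product:", err_each)
print("error, quantizing the sum once:", err_once)
```

With a single quantization the error is bounded by half a step, whereas quantizing each of the N products can accumulate up to N half-steps.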

Fig. 1. The statistical model of a quantizer
Since rounding selects the quantized value nearest the unquantized value, it gives a value that is never more
than $\Delta/2$ away from the exact value. If the rounding error is denoted by

$$\epsilon_r = Q_r(X) - X \qquad (1)$$

then

$$-\frac{\Delta}{2} \le \epsilon_r \le \frac{\Delta}{2} \qquad (2)$$

where $\Delta$ is the quantization step.
Truncation simply discards the least significant bits, giving a quantized value that is always less than or
equal to the exact value, so the truncation error $\epsilon_t = Q_t(X) - X$ satisfies:

$$-\Delta < \epsilon_t \le 0 \qquad (3)$$

Magnitude truncation chooses the nearest quantized value whose magnitude is less than or equal to that of
the exact value, so the error $\epsilon_{mt} = Q_{mt}(X) - X$ satisfies:

$$-\Delta < \epsilon_{mt} < \Delta \qquad (4)$$
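
The three fixed-point quantizers and their error bounds (2) to (4) can be checked numerically. The following is a minimal sketch; the function names, the step size and the test grid are assumptions made for illustration.

```python
import math


def q_round(x, delta):
    """Rounding: nearest multiple of delta (ties go upward)."""
    return math.floor(x / delta + 0.5) * delta


def q_trunc(x, delta):
    """Truncation: largest multiple of delta not exceeding x."""
    return math.floor(x / delta) * delta


def q_magtrunc(x, delta):
    """Magnitude truncation: nearest multiple of delta toward zero."""
    return math.trunc(x / delta) * delta


delta = 2.0 ** -4                                  # assumed quantization step
xs = [i / 1000.0 for i in range(-1000, 1001)]      # test values in [-1, 1]

err_r = [q_round(x, delta) - x for x in xs]
err_t = [q_trunc(x, delta) - x for x in xs]
err_mt = [q_magtrunc(x, delta) - x for x in xs]

print("rounding error range:      ", min(err_r), max(err_r))
print("truncation error range:    ", min(err_t), max(err_t))
print("magnitude truncation range:", min(err_mt), max(err_mt))
```

Note that the truncation error is always one-sided (never positive), while the magnitude truncation error takes the sign opposite to that of the input.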

Having covered fixed-point arithmetic, we now turn to floating-point arithmetic. With floating-point
arithmetic it is necessary to quantize after both multiplications and additions. In this case only the mantissa
$M$ is affected by the quantization process. However, the number $x$ is represented by $M \cdot 2^E$, where
$E$ is the exponent. Hence the quantization errors are multiplicative and depend on the magnitude of $x$.
Therefore, the more appropriate measure of error is the relative error rather than the absolute error
($\epsilon = Q(x) - x$). The relative error is defined by:

$$\varepsilon_r \triangleq \frac{Q_r(x) - x}{x} = \frac{\epsilon_r}{x} \qquad (5)$$

If the quantized mantissa has $B$ fractional bits after the binary point, so that $\Delta = 2^{-B}$, the mantissa
rounding error satisfies $|\epsilon| \le \Delta/2$. Therefore, since $0.5 \le |M| < 1$,

$$|\varepsilon_r| < \Delta \qquad (6)$$
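
The relative error bound (6) can be illustrated by rounding only the mantissa of a number. This is a sketch under the assumption that Python's `math.frexp` normalization (mantissa magnitude in [0.5, 1)) matches the representation used above; the helper name and test values are hypothetical.

```python
import math


def quantize_mantissa(x, bits):
    """Round the mantissa of x to `bits` fractional bits; keep the exponent."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** bits
    m_q = math.floor(m * scale + 0.5) / scale
    return math.ldexp(m_q, e)


bits = 8                               # assumed mantissa length B
delta = 2.0 ** -bits
values = [0.1, 3.7, -123.456, 1.0e-5, 98765.4321]   # illustrative inputs

rel_errors = [(quantize_mantissa(x, bits) - x) / x for x in values]
for x, r in zip(values, rel_errors):
    print(f"x = {x:>12}: relative error = {r:+.3e} (bound {delta:.3e})")
```

The absolute error scales with the magnitude of x, but the relative error stays below the same bound for all inputs.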

2.1 Sensitivity of Filter Structures
We now study the finite wordlength effects on the filter responses, pole-zero locations, and stability when
the filter coefficients are quantized. Consider an IIR filter described by:

$$H(z) = \frac{\sum_{k=0}^{M} b_k z^{-k}}{1 + \sum_{k=1}^{N} a_k z^{-k}} \qquad (7)$$

where $b_k$ and $a_k$ are the filter coefficients. Now assume that these coefficients are represented by their
finite-precision values $\hat{b}_k$ and $\hat{a}_k$, so that we obtain a new system function:

$$\hat{H}(z) \triangleq \frac{\sum_{k=0}^{M} \hat{b}_k z^{-k}}{1 + \sum_{k=1}^{N} \hat{a}_k z^{-k}} \qquad (8)$$

Since this is a new filter, we would like to know how much it differs from the original one. To find out, we
compare their frequency responses, their phase responses, or the change in their pole-zero locations.

2.1.1 Effect on pole-zero location
To evaluate the movement of the poles after quantization, we consider the denominator polynomial of
$H(z)$ in (7):

$$D(z) \triangleq 1 + \sum_{k=1}^{N} a_k z^{-k} = \prod_{l=1}^{N} \left(1 - p_l z^{-1}\right) \qquad (9)$$

where $p_l$ are the poles of $H(z)$ and each pole is a function of the filter coefficients:

$$p_l = f(a_1, \ldots, a_N), \quad l = 1, \ldots, N \qquad (10)$$

To first order, the total perturbation error $\Delta p_l$ can be expressed as:

$$\Delta p_l = \sum_{k=1}^{N} \frac{\partial p_l}{\partial a_k} \Delta a_k \qquad (11)$$

This formula measures the sensitivity of the $l$th pole $p_l$ to changes in each of the coefficients $a_k$,
hence it is known as the filter sensitivity formula. For distinct poles, the partial derivatives follow from (9)
as $\partial p_l / \partial a_k = -p_l^{N-k} / \prod_{i \ne l} (p_l - p_i)$. This shows that if the coefficients
$a_k$ are such that the poles $p_l$ and $p_i$ are very close to each other for some $l \ne i$, then
$(p_l - p_i)$ is very small and, as a result, the filter is very sensitive to changes in the filter coefficients.
A very similar result can be obtained for the sensitivity of the zeros to changes in the filter
coefficients $b_k$.
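
The effect of pole clustering on the sensitivity can be sketched numerically using the standard closed form for distinct poles, $\partial p_l / \partial a_k = -p_l^{N-k} / \prod_{i \ne l}(p_l - p_i)$. The pole sets and function names below are hypothetical and chosen only to contrast clustered with widely separated poles.

```python
def pole_sensitivities(poles):
    """Return table[l][k-1] = d p_l / d a_k for distinct poles."""
    n = len(poles)
    table = []
    for l, p in enumerate(poles):
        denom = 1.0
        for i, q in enumerate(poles):
            if i != l:
                denom *= (p - q)       # product of (p_l - p_i), i != l
        table.append([-(p ** (n - k)) / denom for k in range(1, n + 1)])
    return table


def worst_case(poles):
    """Largest sensitivity magnitude over all poles and coefficients."""
    return max(abs(s) for row in pole_sensitivities(poles) for s in row)


clustered = [0.90, 0.91, 0.92]         # assumed tightly clustered poles
separated = [0.90, 0.10, -0.50]        # assumed widely separated poles

print("worst-case sensitivity, clustered:", worst_case(clustered))
print("worst-case sensitivity, separated:", worst_case(separated))
```

The clustered set yields sensitivities several orders of magnitude larger, which is the motivation for splitting high-order filters into low-order sections.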

It is well known that in a higher-order polynomial with clustered roots, the root locations are a very
sensitive function of the polynomial coefficients. Therefore, filter poles and zeros can be controlled much
more accurately if higher-order filters are realized by breaking them up into parallel or cascade connections
of first- and second-order subfilters containing widely separated poles. Each second-order section will then
have low sensitivity, in that its pole locations will be perturbed only slightly. Consequently, we expect that
the overall system function $H(z)$ will be perturbed only slightly. Thus, the cascade or the parallel forms,
when realized properly, will have low sensitivity to changes or errors in the filter coefficients.
One exception to this rule is the case of linear-phase FIR filters, in which the symmetry of the polynomial
coefficients and the spacing of the filter zeros around the unit circle usually permit an acceptable direct
realization using the convolution summation.
Given a filter structure, it is necessary to assign the ideal pole and zero locations to realizable positions.
This is generally done by simply rounding or truncating the filter coefficients to the available number of bits,
or by assigning the ideal pole and zero positions to the nearest realizable locations. A more complicated
alternative is to treat the original filter design problem as a problem in discrete optimization and choose
the realizable pole and zero locations that give the best approximation to the desired filter response.
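
Assigning an ideal pole to a realizable position by rounding the coefficients can be sketched for a single second-order section with poles at $r e^{\pm j\theta}$, whose denominator coefficients are $a_1 = -2r\cos\theta$ and $a_2 = r^2$. The radius, angle and word lengths below are assumed values for illustration only.

```python
import cmath
import math


def q_round(x, bits):
    """Round x to the nearest multiple of 2**-bits."""
    step = 2.0 ** -bits
    return math.floor(x / step + 0.5) * step


def section_pole(a1, a2):
    """Root of z**2 + a1*z + a2 with non-negative imaginary part."""
    return (-a1 + cmath.sqrt(a1 * a1 - 4.0 * a2)) / 2.0


radius, angle = 0.95, math.pi / 4      # assumed ideal pole r*exp(j*theta)
a1 = -2.0 * radius * math.cos(angle)
a2 = radius ** 2
ideal = section_pole(a1, a2)

errors = {}
for bits in (8, 16):
    pole = section_pole(q_round(a1, bits), q_round(a2, bits))
    errors[bits] = abs(pole - ideal)
    print(f"{bits:2d} fractional bits: pole = {pole:.6f}, "
          f"|pole| = {abs(pole):.6f}, displacement = {errors[bits]:.2e}")
```

For this section the pole is displaced only slightly and stays inside the unit circle at both word lengths.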
2.1.2 Effects on frequency response
The frequency response of an IIR filter is given by:

$$H(e^{j\omega}) = \frac{\sum_{k=0}^{M} b_k e^{-j\omega k}}{1 + \sum_{k=1}^{N} a_k e^{-j\omega k}} \qquad (12)$$

When the coefficients $b_k$ and $a_k$ are quantized to $\hat{b}_k$ and $\hat{a}_k$, the new frequency
response is given by:

$$\hat{H}(e^{j\omega}) = \frac{\sum_{k=0}^{M} \hat{b}_k e^{-j\omega k}}{1 + \sum_{k=1}^{N} \hat{a}_k e^{-j\omega k}} \qquad (13)$$

An analysis similar to the one for the movement of the poles and zeros can be performed to obtain the
maximum change in the magnitude response or phase response due to quantization of the filter coefficients.
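
The change in frequency response can also be evaluated directly by computing (12) and (13) on a frequency grid. The sketch below uses hypothetical second-order lowpass coefficients and assumed word lengths; it is not the filter used in the experiments.

```python
import cmath
import math


def q_round(x, bits):
    """Round x to the nearest multiple of 2**-bits."""
    step = 2.0 ** -bits
    return math.floor(x / step + 0.5) * step


def freq_response(b, a_tail, w):
    """H(e^{jw}) with numerator b[0..M] and denominator 1 + sum a_k e^{-jwk}."""
    num = sum(bk * cmath.exp(-1j * w * k) for k, bk in enumerate(b))
    den = 1.0 + sum(ak * cmath.exp(-1j * w * k)
                    for k, ak in enumerate(a_tail, start=1))
    return num / den


def max_deviation(b, a_tail, bits, points=256):
    """Largest |H - H_hat| over a grid on [0, pi] for a given word length."""
    bq = [q_round(v, bits) for v in b]
    aq = [q_round(v, bits) for v in a_tail]
    grid = [math.pi * i / (points - 1) for i in range(points)]
    return max(abs(freq_response(b, a_tail, w) - freq_response(bq, aq, w))
               for w in grid)


b = [0.0675, 0.1349, 0.0675]           # hypothetical numerator b_0..b_2
a_tail = [-1.1430, 0.4128]             # hypothetical a_1, a_2 (a_0 = 1)

for bits in (4, 12):
    print(f"{bits:2d} fractional bits: max |H - H_hat| = "
          f"{max_deviation(b, a_tail, bits):.4f}")
```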

2.2 Limit Cycles
A zero-input limit cycle is a nonzero periodic output sequence produced by nonlinear elements or
quantizers in the feedback loop of a digital filter.

There are two types of limit cycles. Granular limit cycles are due to nonlinearities in multiplication
quantization and are of low amplitude. Overflow limit cycles are a result of overflow in addition and can
have large amplitudes. Limit cycles require recursion to exist and do not occur in nonrecursive FIR filters.
An important property that we have to consider is the arithmetic overflow characteristic. We assume that
the represented numbers are in fractional two’s-complement format. In practice, two overflow characteristics
are used: two’s-complement overflow, which is a modulo (periodic) function, and saturation, which is a
limiting function.

Fig. 2. Overflow characteristics used in quantization

2.2.1 Granular Limit Cycles
A granular limit cycle, sometimes referred to as a multiplier roundoff limit cycle, is a low-level
oscillation that can exist in an otherwise stable filter as a result of the nonlinearity associated with rounding
or truncating internal filter calculations. This type of limit cycle can easily be demonstrated with a simple
rounding quantizer following a multiplication.
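
Such a demonstration can be sketched with the first-order recursion $y(n) = Q[\alpha\, y(n-1)]$ and zero input. The step size, the initial condition and the tie-breaking rule (ties rounded away from zero) are assumptions of this sketch.

```python
import math

DELTA = 2.0 ** -3                      # assumed step: 3 fractional bits


def q_round(x):
    """Round to the nearest level, ties away from zero."""
    return math.copysign(math.floor(abs(x) / DELTA + 0.5) * DELTA, x)


def q_trunc(x):
    """Two's-complement truncation (toward minus infinity)."""
    return math.floor(x / DELTA) * DELTA


def zero_input_response(alpha, y_start, quantize, length):
    """Simulate y[n] = Q(alpha * y[n-1]) with zero input."""
    out, y = [], y_start
    for _ in range(length):
        y = quantize(alpha * y)
        out.append(y)
    return out


rounded_neg = zero_input_response(-0.5, 0.875, q_round, 8)
rounded_pos = zero_input_response(0.5, 0.875, q_round, 8)
truncated = zero_input_response(-0.5, 0.875, q_trunc, 8)

print("rounding,   alpha = -0.5:", rounded_neg)
print("rounding,   alpha = +0.5:", rounded_pos)
print("truncation, alpha = -0.5:", truncated)
```

With rounding, the output never decays to zero: it alternates in sign with period two for the negative coefficient and sticks at a constant level for the positive one. With truncation, the output reaches zero for this initial condition.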
2.2.2 Overflow Limit Cycles
This type of limit cycle is also a zero-input behavior that gives an oscillatory output. It is due to overflow
in the addition, even if we ignore multiplication or product quantization in the filter implementation. It
happens when two numbers of the same sign add to give a value having magnitude greater than one. Since
numbers with magnitude greater than one are not representable, the result overflows. This is a more serious
limit cycle because the oscillations can cover the entire dynamic range of the quantizer. It can be avoided in
practice by using the saturation characteristic instead of two’s-complement overflow in the quantizer.
Like granular limit cycles, overflow limit cycles require recursion to exist and do not occur in nonrecursive
FIR filters. Overflow limit cycles also do not occur with floating-point arithmetic, due to the virtual
impossibility of overflow.
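
The difference between the two overflow characteristics can be sketched with a second-order recursion $y(n) = \mathrm{OV}\{Q[c_1 y(n-1)] + Q[c_2 y(n-2)]\}$ and zero input. The feedback gains $c_1 = 0.75$, $c_2 = -0.75$ (a stable filter), the initial conditions and the 3-bit fraction are assumed values; the gains multiply past outputs directly, so they have the opposite sign to the $a_k$ in (7).

```python
import math

DELTA = 2.0 ** -3                      # assumed step: 3 fractional bits


def q_round(x):
    """Round to the nearest level, ties away from zero."""
    return math.copysign(math.floor(abs(x) / DELTA + 0.5) * DELTA, x)


def wrap(x):
    """Two's-complement overflow: fold x into [-1, 1)."""
    return (x + 1.0) % 2.0 - 1.0


def saturate(x):
    """Saturation: clip x to the representable range [-1, 1 - DELTA]."""
    return min(max(x, -1.0), 1.0 - DELTA)


def zero_input_response(c1, c2, y1, y2, overflow, length):
    """Simulate y[n] = overflow(Q(c1*y[n-1]) + Q(c2*y[n-2])), zero input."""
    out = []
    for _ in range(length):
        y = overflow(q_round(c1 * y1) + q_round(c2 * y2))
        out.append(y)
        y1, y2 = y, y1
    return out


wrapped = zero_input_response(0.75, -0.75, 0.75, -0.75, wrap, 24)
saturated = zero_input_response(0.75, -0.75, 0.75, -0.75, saturate, 24)

print("two's-complement overflow:", wrapped[:8])
print("saturation:               ", saturated[:8])
print("late amplitude (wrap / saturate):",
      max(abs(v) for v in wrapped[-6:]),
      max(abs(v) for v in saturated[-6:]))
```

With two's-complement overflow the output keeps oscillating between -0.75 and 0.75, whereas with saturation the large oscillation dies out and only a low-level granular residue remains.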

3. EXPERIMENTAL RESULTS
3.1 Sensitivity of Filter Structures
3.1.1 Effect on Pole-Zero Location
Figure 3 shows the pole-zero plots for filters with both infinite-precision and 16-bit coefficients.
Clearly, with a 16-bit word length, the resulting direct-form filter is completely different from the
original one and is unstable.

Fig. 3. Pole-Zero plot for direct-form filter structure

From the results shown in Figure 4 we can observe that not only for the 16-bit representation but also
for the 11-bit representation, the resulting filter is essentially the same as the original one and is
stable. Clearly, the cascade-form structure has better finite wordlength properties than the direct-form
structure.

Fig. 4. Pole-Zero plot for cascade-form filter structure

3.1.2 Effect on Frequency Response

Fig. 5. Frequency response for an IIR filter

Fig. 6. Frequency response for a FIR Filter
The logarithmic magnitude responses and pole-zero locations of an IIR and an FIR filter are computed and
plotted in Figures 5 and 6 along with those of the original filter. When 16 bits are used, the resulting filter
is virtually indistinguishable from the original one. However, when 8 bits are used, the filter behavior is
severely distorted and the filter does not satisfy the design specifications.
Fig. 7. 8-bit Quantization Effects on the Cascade-Form Filter Structure
Figure 7 clearly shows that, with the cascade form, the deterioration caused by the quantization process
is now insignificant. The cascade structure avoids the shortcomings of the direct and parallel structures,
as both poles and zeros are almost unaffected by quantization. The cascade-form structure is definitely more
robust to parameter quantization.
3.2 Limit Cycles
3.2.1 Granular Limit Cycles
The resulting plots are shown in Figure 8. The output signal in the left plot has an asymptotic period of
two samples. The middle plot, for α = 0.5 (lowpass filter), shows that the limit cycle has a period of one
sample with an amplitude of 0.25. Finally, the right plot shows that the limit cycles vanish for the
truncation operation. This behavior for the truncation operation is also exhibited for lowpass filters.
The resulting plots for granular limit cycles in 2nd-order systems are shown in Figure 9. The roundoff
limit cycles have a period of six samples and an amplitude of 0.25. Unlike in the case of 1st-order filters,
the limit cycles for the 2nd-order filter exist even when truncation is used in the quantizer.

Fig. 8. Granular Limit Cycles for Rounding and Truncating

Fig. 9. Granular Limit Cycles for Rounding and Truncating in a 2nd-order BPF
3.2.2 Overflow Limit Cycles
The resulting plots are shown in Figure 10. As expected, the infinite-precision implementation has no
limit cycles whatsoever. The granular limit cycles are of smaller amplitudes. Clearly, the overflow limit
cycles have large amplitudes spanning the −1 to 1 range of the quantizers.

Fig. 10. Comparison of Limit Cycles
4. CONCLUSION
When realizing IIR filters, either a parallel or cascade connection of first- and second-order subfilters is
almost always preferable to a high-order direct-form realization. With the availability of very low-cost
floating-point digital signal processors, it is highly recommended that floating-point arithmetic be used for
IIR filters. Floating-point arithmetic simultaneously eliminates most concerns regarding scaling, granular
limit cycles and overflow limit cycle oscillations. Regardless of the arithmetic employed, a low-roundoff-noise
structure should be used for the second-order sections, because structures with low fixed-point roundoff noise
also have low floating-point roundoff noise. The use of a low-roundoff-noise structure for the second-order
sections also tends to give an implementation with low coefficient quantization sensitivity. First-order
sections are not as critical in determining the roundoff noise and coefficient sensitivity of a realization and
so can generally be implemented with a simple direct-form structure.
Linear-phase FIR digital filters can generally be implemented with acceptable quantization sensitivity
using the direct convolution sum method. When implemented in this way on a digital signal processor,
fixed-point arithmetic is not only acceptable but may actually be preferable to floating-point arithmetic.
Virtually all fixed-point digital signal processors accumulate a sum of products in a double-length
accumulator. This means that only a single quantization is necessary to compute an output. Floating-point
arithmetic, on the other hand, requires quantization after every multiplication and addition in the
convolution summation. With 32-bit floating-point arithmetic, however, these quantizations introduce a
small enough error to be insignificant for many applications.
FINITE WORDLENGTH EFFECTS IN DIGITAL FILTERS
Abstract: A digital filter behaves as expected if it is implemented with coefficients and variables of unlimited
wordlength and if the design has been done properly. In this case, the choice of one of the numerous structures
influences only the complexity, and thus the achievable speed, if a particular hardware component is used. Apart
from that, the performance will always be the same. In reality, however, the situation is quite different and far
more complicated. Coefficients as well as data can be implemented with finite wordlengths only. Quantized
coefficients lead to more erroneous behavior of the system, for example a very different frequency response. The
deviation from the expected performance depends on the chosen structure and its sensitivity to perturbations.
Quantizing the coefficients of a stable system can make it unstable.
