Unit - 5
Finite word length effects in digital filters
Q1) Assume a number is stored in a 32-bit fixed-point format which reserves 1 bit for the sign, 15 bits for the integer part, and 16 bits for the fractional part.
A1) Then -43.625 is represented as follows:
1 | 000000000101011 | 1010000000000000 (sign | integer part | fractional part)
Where 0 is used to represent + and 1 is used to represent -. 000000000101011 is the 15-bit binary value for decimal 43 and 1010000000000000 is the 16-bit binary value for the fraction 0.625.
The same format fixes the smallest and the largest positive numbers that can be stored in 32 bits. The smallest positive number is $2^{-16} \approx 0.000015$, the largest positive number is $(2^{15}-1) + (1-2^{-16}) = 2^{15} - 2^{-16} \approx 32768$, and the gap between adjacent numbers is $2^{-16}$.
The radix point stays at a fixed position; it can be moved left or right only by changing the number of bits assigned to the integer field.
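The encoding described in A1 can be sketched in Python. The function name `to_fixed_15_16` is hypothetical, and the sketch assumes sign-magnitude storage with any extra fraction bits truncated, which is one reading of the format above rather than a definitive implementation.

```python
def to_fixed_15_16(value):
    """Return (sign, 15-bit integer field, 16-bit fraction field) as bit strings."""
    sign = '1' if value < 0 else '0'          # 0 means +, 1 means -
    scaled = int(abs(value) * 2 ** 16)        # magnitude in units of 2**-16, truncated
    if scaled >= 2 ** 31:
        raise OverflowError("magnitude does not fit in 15 integer bits")
    integer_bits = format(scaled >> 16, '015b')
    fraction_bits = format(scaled & 0xFFFF, '016b')
    return sign, integer_bits, fraction_bits

fields = to_fixed_15_16(-43.625)
print(fields)

smallest_positive = 2.0 ** -16                # one unit in the last place
largest_positive = 2.0 ** 15 - 2.0 ** -16     # all 31 magnitude bits set
print(smallest_positive, largest_positive)
```

The printed fields for -43.625 should match the sign, integer and fraction patterns quoted in A1.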
Q2) Suppose the number is using the 32-bit format: the 1-bit sign bit, 8 bits for the signed exponent, and 23 bits for the fractional part. The leading bit 1 is not stored (as it is always 1 for a normalized number) and is referred to as a “hidden bit”.
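A2) Then -53.5 is normalized as $-53.5 = (-110101.1)_2 = (-1.101011)_2 \times 2^{5}$, which is represented as below,
1 | 00000101 | 10101100000000000000000 (sign | exponent | fraction)
Where 00000101 is the 8-bit binary value of the exponent +5, and the fraction field holds the bits after the radix point of 1.101011 (the leading 1 is the hidden bit).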
Note that 8-bit exponent field is used to store integer exponents -126 ≤ n ≤ 127.
The smallest normalized positive number that fits into 32 bits is $(1.00000000000000000000000)_2 \times 2^{-126} = 2^{-126} \approx 1.18 \times 10^{-38}$ and the largest normalized positive number that fits into 32 bits is $(1.11111111111111111111111)_2 \times 2^{127} = (2^{24}-1) \times 2^{104} \approx 3.40 \times 10^{38}$. These numbers are represented as below,
The precision of a floating-point format is the number of positions reserved for binary digits plus one (for the hidden bit). In the examples considered here, the precision is 23+1=24.
The gap between 1 and the next normalized floating-point number is known as machine epsilon. The gap is $(1+2^{-23})-1 = 2^{-23}$ for the above example, but this is not the same as the smallest positive floating-point number, because the spacing is non-uniform, unlike in the fixed-point scenario.
Note that non-terminating binary numbers cannot be represented exactly in floating-point representation, e.g., $1/3 = (0.010101\ldots)_2$ cannot be a floating-point number as its binary representation is non-terminating.
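As a rough illustration of the normalization in A2, the following Python sketch splits a number into sign, exponent and 23-bit fraction with the hidden bit removed. The name `to_float_fields` is hypothetical, the exponent is returned as a plain integer rather than an encoded 8-bit field, and extra fraction bits are truncated; these are assumptions of the sketch, not properties of any particular hardware format.

```python
import math

def to_float_fields(value, frac_bits=23):
    """Return (sign bit, exponent, fraction bits) for a nonzero value."""
    if value == 0:
        raise ValueError("zero has no normalized representation")
    sign = '1' if value < 0 else '0'
    mantissa, exponent = math.frexp(abs(value))   # abs(value) = mantissa * 2**exponent, 0.5 <= mantissa < 1
    mantissa *= 2                                 # rescale so that 1 <= mantissa < 2
    exponent -= 1
    fraction = int((mantissa - 1) * 2 ** frac_bits)   # drop the hidden bit, truncate the rest
    return sign, exponent, format(fraction, '0{}b'.format(frac_bits))

fields = to_float_fields(-53.5)
print(fields)

machine_epsilon = 2.0 ** -23        # gap between 1 and the next number
smallest_normalized = 2.0 ** -126   # smallest normalized positive number
print(machine_epsilon, smallest_normalized)
```

The two printed constants show how far apart machine epsilon and the smallest normalized number are.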
Q3) Compare fixed-point and floating-point arithmetic.
A3) The main difference between fixed point and floating point is that the fixed point has a specific number of digits reserved for the integer part and fractional part while the floating point does not have a specific number of digits reserved for the integer part and fractional part.
Fixed point and floating point are two ways of representing numbers. In fixed point, there is a specific number of digits to represent the integer section and the fraction section. In other words, there is a fixed number of digits for each portion even when the number is very large or very small. In floating point, on the other hand, there is no specific number of digits to represent the integer section and the fraction section. Floating-point representation can cover a much larger range of numbers than fixed point.
S.No | Fixed-point arithmetic | Floating-point arithmetic |
1. | Fast operation | Slow operation |
2. | Relatively economical | More expensive because of costlier hardware |
3. | Small dynamic range | Increased dynamic range |
4. | Round-off errors occur only for multiplication | Round-off errors can occur with both addition and multiplication |
5. | Overflow occurs in addition | Overflow does not arise |
6. | Used in small computers | Used in large general-purpose computers |
Q4) The output signal of an A/D converter is passed through a first order low pass filter, with transfer function given by H(z)= (1-a) z/(z-a) for 0<a<1. Find the steady state output noise power due to quantization at the output of the digital filter.
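A4)
The steady state output noise power is
$$\sigma_\epsilon^2 = \sigma_e^2 \, \frac{1}{2\pi j}\oint_C H(z)H(z^{-1})z^{-1}\,dz$$
Where:
$\sigma_e^2 = 2^{-2b}/12$
With $H(z) = (1-a)z/(z-a)$ and $H(z^{-1}) = (1-a)/(1-az)$,
$$H(z)H(z^{-1})z^{-1} = \frac{(1-a)^2}{(z-a)(1-az)}$$
Only the pole at $z = a$ lies inside the unit circle, and its residue is $(1-a)^2/(1-a^2)$, so
$$\sigma_\epsilon^2 = \sigma_e^2\,\frac{(1-a)^2}{1-a^2} = \frac{2^{-2b}}{12}\cdot\frac{1-a}{1+a}$$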
Q5) Find the steady state variance of the noise in the output due to quantization of input for the first order filter. y(n)= ay(n-1) + x(n)
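A5)
The impulse response for the above filter is given by $h(n) = a^n u(n)$
Taking z-transform on both sides we get
$$H(z) = \frac{1}{1-az^{-1}} = \frac{z}{z-a}$$
The steady state variance of the output noise is then
$$\sigma_\epsilon^2 = \sigma_e^2 \sum_{n=0}^{\infty} h^2(n) = \sigma_e^2 \sum_{n=0}^{\infty} a^{2n} = \frac{\sigma_e^2}{1-a^2}, \quad |a| < 1$$
where $\sigma_e^2$ is the variance of the input quantization noise.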
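Q6) The output of the A/D converter is applied to a digital filter with the system function $H(z) = 0.45z/(z-0.72)$. Find the output noise power of the digital filter when the input signal is quantized to 7 bits.
A6)
$$H(z)H(z^{-1})z^{-1} = \frac{0.45z}{z-0.72}\cdot\frac{0.45z^{-1}}{z^{-1}-0.72}\cdot z^{-1} = \frac{0.2025}{(z-0.72)(1-0.72z)}$$
The poles of $H(z)H(z^{-1})z^{-1}$ are p1 = 0.72 and p2 = 1/0.72 = 1.3889; only p1 lies inside the unit circle.
Output noise power due to input quantization
$$\sigma_\epsilon^2 = \sigma_e^2 \,\mathrm{Res}\left[H(z)H(z^{-1})z^{-1}\right]_{z=0.72} = \sigma_e^2\,\frac{0.2025}{1-0.72^2} = 0.4205\,\sigma_e^2$$
where $\sigma_e^2 = 2^{-2b}/12$ is the variance of the input quantization noise.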
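For a first-order filter $H(z) = Kz/(z-a)$ the impulse response is $h(n) = K a^n u(n)$, so the residue result equals $K^2/(1-a^2)$, the sum of $h^2(n)$. The Python sketch below checks this numerically for K = 0.45 and a = 0.72. The function name and the number of summed terms are illustrative assumptions; the result is the noise gain that multiplies the input noise variance, not a final answer for a particular word length.

```python
def noise_gain_first_order(k, a, terms=2000):
    """Approximate sum of h(n)**2 for h(n) = k * a**n, n = 0 .. terms-1."""
    return sum((k * a ** n) ** 2 for n in range(terms))

gain_numeric = noise_gain_first_order(0.45, 0.72)
gain_closed = 0.45 ** 2 / (1 - 0.72 ** 2)     # residue at the pole z = 0.72
print(gain_numeric, gain_closed)
```

Both numbers should agree to many decimal places, which is a quick way to catch algebra slips in the residue calculation.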
Q7) Consider the transfer function $H(z) = H_1(z)H_2(z)$ where $H_1(z) = 1/(1-a_1z^{-1})$ and $H_2(z) = 1/(1-a_2z^{-1})$. Find the output round off noise power. Assume $a_1 = 0.5$ and $a_2 = 0.6$ and find the output round off noise power.
A7)
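The round off noise model for $H(z) = H_1(z)H_2(z)$ has one noise source after each multiplier. From the realization we can find that the noise transfer function seen by noise source $e_1(n)$ is H(z), where
$$H(z) = \frac{1}{(1-a_1z^{-1})(1-a_2z^{-1})}$$
whereas the noise transfer function seen by $e_2(n)$ is $H_2(z) = 1/(1-a_2z^{-1})$.
The total steady state noise variance is
$$\sigma_0^2 = \sigma_{01}^2 + \sigma_{02}^2 = \sigma_e^2\,\frac{1}{2\pi j}\oint_C H(z)H(z^{-1})z^{-1}\,dz + \sigma_e^2\,\frac{1}{2\pi j}\oint_C H_2(z)H_2(z^{-1})z^{-1}\,dz$$
If $|a_1|$ and $|a_2|$ are less than 1, the poles $z = 1/a_1$ and $z = 1/a_2$ lie outside the circle |z| = 1, so they are not enclosed by the contour and contribute nothing to the integrals. Consequently, we have
$$\sigma_0^2 = \sigma_e^2\left[\frac{1+a_1a_2}{(1-a_1a_2)(1-a_1^2)(1-a_2^2)} + \frac{1}{1-a_2^2}\right]$$
The steady state noise power for $a_1 = 0.5$, $a_2 = 0.6$ is given by
$$\sigma_0^2 = \frac{2^{-2b}}{12}\,(3.8690 + 1.5625) = 5.4315\,\frac{2^{-2b}}{12}$$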
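The noise gain for the cascade can be cross-checked by summing squared impulse responses. The sketch below assumes that $e_1(n)$ sees the full cascade $1/((1-a_1z^{-1})(1-a_2z^{-1}))$ and that the second noise source sees only $H_2(z)$; the helper `impulse_response` is a hypothetical name introduced for this illustration.

```python
def impulse_response(feedback, n_terms=500):
    """Impulse response of y(n) = x(n) + sum(feedback[k-1] * y(n-k))."""
    h = []
    for n in range(n_terms):
        y = 1.0 if n == 0 else 0.0            # unit impulse input
        for k, coeff in enumerate(feedback, start=1):
            if n - k >= 0:
                y += coeff * h[n - k]
        h.append(y)
    return h

a1, a2 = 0.5, 0.6
# Cascade denominator: 1 - (a1 + a2) z^-1 + a1*a2 z^-2
h_full = impulse_response([a1 + a2, -a1 * a2])
h_second = impulse_response([a2])

gain_e1 = sum(v * v for v in h_full)      # gain seen by e1(n)
gain_e2 = sum(v * v for v in h_second)    # gain seen by e2(n)
total_gain = gain_e1 + gain_e2
print(gain_e1, gain_e2, total_gain)
```

Multiplying `total_gain` by $2^{-2b}/12$ gives the output round off noise power for a chosen word length.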
Q8) Draw the quantization noise model for a second order system H(z) = and find the steady state output noise variance.
A8)
The quantization noise model is, we know,
Both noise sources see the same transfer function
H(z) =
The impulse response of the transfer function is given by
The steady state output noise variance is given by
But we also know that
Q9) Consider a second order IIR filter with H(z) = . Find the effect of coefficient quantization on the pole locations of the given system function in direct form and in cascade form. Take b = 3 bits.
A9)
Given that
H(z) =
H(z) =
H(z) =
The roots of the denominator of H(z) are the original poles of H(z). Let the original poles of H(z) be p1 and p2. Here p1=0.5 and p2=0.45.
Direct Form I
Let us quantize the coefficients by truncation.
Let be the transfer function of the IIR system after quantizing the coefficients.
Let
On cross multiplying the above equations we get
Cascade Form
In cascade realization the system can be realized as cascade of first order sections.
Where and
Let us quantize the coefficients of and by truncation
Let and be the transfer function of the first order sections after quantizing the coefficients
Let
On cross multiplication of above equation, we get
Q10) Consider the first-order recursive (IIR) system y(n) = x(n) + a y(n-1) with x(n) = 0.875 for n = 0 and 0 otherwise. Find the limit cycle effect and the dead band. Assume b = 4 and a = 0.95.
A10)
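Dead band
$$|y(n-1)| \le \frac{2^{-b}/2}{1-|a|} = \frac{2^{-4}/2}{1-0.95} = 0.625$$
n | x(n) | y(n-1) | a y(n-1) | Q[a y(n-1)] (rounded to 4 bits) | y(n) |
0 | 0.875 | 0 | 0 | 0.0000 | 0.875 |
1 | 0 | 0.875 | 0.875×0.95 = 0.83125 | 0.8125 | 0.8125 |
2 | 0 | 0.8125 | 0.8125×0.95 = 0.771875 | 0.75 | 0.75 |
3 | 0 | 0.75 | 0.75×0.95 = 0.7125 | 0.6875 | 0.6875 |
4 | 0 | 0.6875 | 0.6875×0.95 = 0.653125 | 0.625 | 0.625 |
5 | 0 | 0.625 | 0.625×0.95 = 0.59375 | 0.625 | 0.625 |
6 | 0 | 0.625 | 0.625×0.95 = 0.59375 | 0.625 | 0.625 |
The dead band of the filter is 0.625. For n ≥ 4 the output remains constant at 0.625 instead of decaying to zero, which is the limit cycle.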
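The limit cycle can be reproduced with a short simulation. The sketch below assumes the recursion y(n) = Q[a y(n-1)] + x(n), rounding to b = 4 fractional bits with ties rounded up, and uses exact fractions so that the tie at 0.59375 is not affected by floating-point error. The function names are hypothetical.

```python
from fractions import Fraction
import math

def round_to_bits(x, b):
    """Round x to b fractional bits, ties rounded up (assumed rounding rule)."""
    scale = 2 ** b
    return Fraction(math.floor(x * scale + Fraction(1, 2)), scale)

def simulate(a, b, x0, steps):
    """Run y(n) = Q[a*y(n-1)] + x(n) with x(0) = x0 and x(n) = 0 afterwards."""
    y_prev = Fraction(0)
    outputs = []
    for n in range(steps):
        x = x0 if n == 0 else Fraction(0)
        y = round_to_bits(a * y_prev, b) + x
        outputs.append(y)
        y_prev = y
    return outputs

a = Fraction(95, 100)
y = simulate(a, b=4, x0=Fraction(7, 8), steps=8)
dead_band = Fraction(1, 2 ** 4) / 2 / (1 - abs(a))
print([float(v) for v in y], float(dead_band))
```

The output settles at the edge of the dead band and stays there, even though the input is zero after n = 0.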
Q11) For the given transfer function H(z)= . Find scaling factor so as to avoid overflow in the adder ‘1’ of the filter.
A11)
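Given $D(z) = 1 - 0.5z^{-1}$
$D(z^{-1}) = 1 - 0.5z$
$$I = \frac{1}{2\pi j}\oint_C \frac{z^{-1}\,dz}{D(z)D(z^{-1})} = \frac{1}{2\pi j}\oint_C \frac{dz}{(z-0.5)(1-0.5z)}$$
Residue of the integrand at z = 0.5, the only pole inside the unit circle, is $1/(1-0.25)$
I = 1.3333
$S_0 = 1/\sqrt{I}$
$S_0 = 1/\sqrt{1.3333} = 0.866$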
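By Parseval's theorem the integral I equals the energy of the impulse response of 1/D(z), which for $D(z) = 1 - 0.5z^{-1}$ is $(0.5)^n u(n)$. The sketch below uses that fact to check the scaling factor numerically; the function name and term count are illustrative assumptions.

```python
import math

def scaling_factor_first_order(pole, terms=200):
    """Return (I, S0) for S(z) = 1 / (1 - pole * z^-1), with |pole| < 1."""
    energy = sum((pole ** n) ** 2 for n in range(terms))   # I = sum of s(n)**2
    return energy, 1 / math.sqrt(energy)

I_value, s0 = scaling_factor_first_order(0.5)
print(I_value, s0)
```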
Q12) Consider the recursive filter shown in fig. The input x(n) has a range of values of ±100V, represented by 8 bits. Compute the variance of output due to A/D conversion process.
A12)
Given the range is ±100 V
The difference equation of the system is given by y(n) = 0.8y(n-1) + x(n), whose impulse response h(n) is
$h(n) = (0.8)^n u(n)$
Quantization step size = (range of signal)/(number of quantization levels)
$= 200/2^8$
= 0.78125
Variance of the error signal
$\sigma_e^2 = q^2/12 = (0.78125)^2/12$
= 0.05086
Variance of output
$$\sigma_\epsilon^2 = \sigma_e^2 \sum_{n=0}^{\infty} h^2(n) = \frac{0.05086}{1-(0.8)^2} = 0.1413$$
Q13) The input to the system y(n)=0.999y(n-1)+x(n) is applied to an ADC. What is the power produced by the quantization noise at the output of the filter if the input is quantized to a) 8 bits b)16 bits.
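A13)
y(n) = 0.999y(n-1) + x(n)
Taking z-transform on both sides
$$Y(z) = 0.999z^{-1}Y(z) + X(z), \qquad H(z) = \frac{Y(z)}{X(z)} = \frac{1}{1-0.999z^{-1}} = \frac{z}{z-0.999}$$
Output noise power due to input quantization
$$\sigma_\epsilon^2 = \sigma_e^2\,\frac{1}{2\pi j}\oint_C H(z)H(z^{-1})z^{-1}\,dz = \sigma_e^2 \sum_{i=1}^{N} \mathrm{Res}\left[H(z)H(z^{-1})z^{-1}\right]_{z=p_i}$$
Where $p_1, p_2, \ldots, p_N$ are the poles of $H(z)H(z^{-1})z^{-1}$ that lie inside the unit circle in the z-plane.
Here $H(z)H(z^{-1})z^{-1} = 1/((z-0.999)(1-0.999z))$, and only the pole z = 0.999 lies inside the unit circle, so
$$\sigma_\epsilon^2 = \frac{\sigma_e^2}{1-(0.999)^2} = 500.25\,\sigma_e^2, \qquad \sigma_e^2 = \frac{2^{-2b}}{12}$$
a) b+1 = 8 bits (assuming the word length includes the sign bit): $\sigma_e^2 = 2^{-14}/12 = 5.086\times10^{-6}$, so $\sigma_\epsilon^2 = 2.544\times10^{-3}$
b) b+1 = 16 bits: $\sigma_e^2 = 2^{-30}/12 = 7.761\times10^{-11}$, so $\sigma_\epsilon^2 = 3.882\times10^{-8}$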
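The two word lengths can be compared with a few lines of Python. The sketch assumes, as in the answer above, that the stated word length includes one sign bit (so b = total bits - 1) and that the filter has the single pole a = 0.999; the function name is hypothetical.

```python
def output_noise_power(total_bits, a):
    """Output noise power of y(n) = a*y(n-1) + x(n) for a quantized input."""
    b = total_bits - 1                   # assumption: one bit is the sign bit
    sigma_e2 = 2.0 ** (-2 * b) / 12      # input quantization noise variance
    return sigma_e2 / (1 - a * a)        # residue at the pole z = a

p8 = output_noise_power(8, 0.999)
p16 = output_noise_power(16, 0.999)
print(p8, p16)
```

Because the pole is very close to the unit circle, the noise gain is about 500, so the word length matters far more here than in a filter with a pole near the origin.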
Q14) Find the effect of coefficient quantization on pole locations of the given second order IIR system, when it is realized in direct form I and in cascade form. Assume a word length of 4 bits through truncation. H(z)=
A14)
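Direct form I
Let b = 4 bits including a sign bit, which leaves 3 bits for the fraction.
$(0.9)_{10} = (0.11100\ldots)_2$
After truncation we get $(0.111)_2 = (0.875)_{10}$
$(0.2)_{10} = (0.00110\ldots)_2$
After truncating we get
$(0.001)_2 = (0.125)_{10}$
The denominator of the system function after coefficient quantization is
$1 - 0.875z^{-1} + 0.125z^{-2}$
Now the pole locations are given by z1 = 0.695 and z2 = 0.180
Cascade form
The denominator of H(z) factors into first order sections as
$(1 - 0.5z^{-1})(1 - 0.4z^{-1})$
$(0.5)_{10} = (0.1000)_2$
After truncation we get
$(0.100)_2 = (0.5)_{10}$
$(0.4)_{10} = (0.01100\ldots)_2$
After truncation we get $(0.011)_2 = (0.375)_{10}$
The denominator of the system function after coefficient quantization is
$(1 - 0.5z^{-1})(1 - 0.375z^{-1})$
The pole locations are given by z1 = 0.5 and z2 = 0.375. On comparing the poles of the cascade system with the original poles (0.5 and 0.4), we can say that one of the poles is unchanged and the other pole is very close to the original pole.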
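The effect of truncating the coefficients can be checked numerically. The sketch below assumes 3 fractional bits (4 bits including sign), takes the direct-form denominator coefficients as 0.9 and 0.2 and the cascade poles as 0.5 and 0.4, values inferred from the binary expansions in this answer, and handles only the real-root case. Function names are hypothetical.

```python
import math

def truncate(value, frac_bits):
    """Truncate a non-negative value to frac_bits fractional bits."""
    scale = 2 ** frac_bits
    return math.floor(value * scale) / scale

def poles(c1, c2):
    """Real roots of z**2 - c1*z + c2 = 0."""
    root = math.sqrt(c1 * c1 - 4 * c2)
    return (c1 + root) / 2, (c1 - root) / 2

# Direct form: quantize the second-order coefficients, then find the poles.
direct = poles(truncate(0.9, 3), truncate(0.2, 3))

# Cascade form: each first-order coefficient is itself a pole.
cascade = (truncate(0.5, 3), truncate(0.4, 3))

print(direct, cascade)
```

In this example the cascade poles stay closer to the original 0.5 and 0.4 than the direct-form poles do.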
Q15) An LTI system is characterized by the difference equation y(n) = 0.68y(n-1) + 0.15x(n). The input signal x(n) has a range of -5 V to +5 V, represented by 8 bits. Find the quantization step size, the variance of the error signal and the variance of the quantization noise at the output.
A15)
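Range R = -5 V to +5 V = 5 - (-5) = 10 V
Size of binary word, B = 8 bits (including sign bit)
Quantization step size,
$q = R/2^8 = 10/2^8 = 0.0390625$
Variance of the error signal, $\sigma_e^2 = q^2/12 = 1.2716 \times 10^{-4}$
The difference equation governing the LTI system is
y(n) = 0.68y(n-1) + 0.15x(n)
On taking z transform of above equation we get
$$Y(z) = 0.68z^{-1}Y(z) + 0.15X(z), \qquad H(z) = \frac{0.15}{1-0.68z^{-1}} = \frac{0.15z}{z-0.68}$$
Now, the poles of $H(z)H(z^{-1})z^{-1} = (0.15)^2/((z-0.68)(1-0.68z))$ are p1 = 0.68 and p2 = 1.4706. Here, p1 = 0.68 is the only pole that lies inside the unit circle in the z-plane.
Variance of the input quantization noise at the output:
$$\sigma_\epsilon^2 = \sigma_e^2\,\mathrm{Res}\left[H(z)H(z^{-1})z^{-1}\right]_{z=0.68} = \sigma_e^2\,\frac{(0.15)^2}{1-(0.68)^2} = 1.2716\times10^{-4}\times 0.04185 = 5.322\times10^{-6}$$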
Q16) What are overflow oscillations? How can they be prevented?
A16)
- We know that a limit cycle oscillation can be caused by rounding the result of a multiplication.
- A limit cycle that occurs due to overflow of the adder is known as an overflow limit cycle oscillation.
- Overflow in addition can cause severe limit cycle oscillations, in which the filter output oscillates between the maximum and minimum amplitudes.
Let us consider two positive numbers n1 and n2
$n_1 = (0.111)_2 = 7/8$
$n_2 = (0.110)_2 = 6/8$
$n_1 + n_2 = (1.101)_2 = -5/8$ in sign magnitude form
The sum is wrongly interpreted as a negative number.
Overflow oscillations can be prevented by using saturation arithmetic. The transfer characteristic of a saturation adder is shown in the figure below, where n is the input to the adder and f(n) is the corresponding output.
Fig. Saturation adder transfer characteristics
From the transfer characteristics, we find that when overflow occurs, the sum of adder is set equal to the maximum value.
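The overflow example and the saturation remedy can be sketched with 4-bit sign-magnitude words (1 sign bit, 3 fraction bits, values in eighths). Both functions are hypothetical helpers that accept only non-negative magnitudes, which is enough to reproduce the 7/8 + 6/8 case above.

```python
def add_with_overflow(n1_eighths, n2_eighths):
    """Add two positive magnitudes; a carry out of the 3 fraction bits lands in the sign bit."""
    raw = (n1_eighths + n2_eighths) & 0b1111      # keep only the 4-bit word
    sign = -1 if raw & 0b1000 else 1              # sign-magnitude interpretation
    return sign * (raw & 0b0111) / 8

def add_with_saturation(n1_eighths, n2_eighths):
    """Add two positive magnitudes and clip the sum at the largest value, 7/8."""
    return min(n1_eighths + n2_eighths, 7) / 8

wrapped = add_with_overflow(7, 6)       # 0.111 + 0.110
saturated = add_with_saturation(7, 6)
print(wrapped, saturated)
```

The plain adder returns a negative value for the sum of two positive numbers, while the saturating adder holds the output at the maximum, which removes the large swing that sustains the oscillation.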
Q17) Explain how reduction of round-off errors is achieved in digital filters.
A17)
Saturation arithmetic eliminates limit cycles due to overflow, but it causes undesirable signal distortion due to the non-linearity of the clipper. In order to limit the amount of non-linear distortion, it is important to scale the input signal and the unit sample response between the input and any internal summing node in the system so that overflow is avoided.
Let us consider a second order IIR filter as shown in the above figure. Here a scale factor S0 is introduced between the input x(n) and the adder 1 to prevent overflow at the output of adder 1.
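Now the overall input-output transfer function is
$$H(z) = S_0\,\frac{N(z)}{D(z)}$$
where N(z) and D(z) are the numerator and denominator polynomials of the filter. The transfer function from the input to the output w(n) of adder 1 is
$$H'(z) = \frac{W(z)}{X(z)} = \frac{S_0}{D(z)} = S_0\,S(z)$$
Where: S(z) = 1/D(z)
so that
$$w(n) = \frac{S_0}{2\pi}\int_0^{2\pi} S(e^{j\theta})X(e^{j\theta})e^{jn\theta}\,d\theta$$
Using Schwartz inequality
$$w^2(n) \le S_0^2\left[\frac{1}{2\pi}\int_0^{2\pi}|S(e^{j\theta})|^2\,d\theta\right]\left[\frac{1}{2\pi}\int_0^{2\pi}|X(e^{j\theta})|^2\,d\theta\right]$$
Applying Parseval’s Theorem
$$w^2(n) \le S_0^2\sum_{k=0}^{\infty}x^2(k)\;\frac{1}{2\pi}\int_0^{2\pi}|S(e^{j\theta})|^2\,d\theta$$
If $z = e^{j\theta}$ then $dz = je^{j\theta}\,d\theta$
Which gives
$d\theta = dz/(jz)$
By substituting the values
$$w^2(n) \le S_0^2\sum_{k=0}^{\infty}x^2(k)\;\frac{1}{2\pi j}\oint_C S(z)S(z^{-1})z^{-1}\,dz$$
We obtain $w^2(n) \le \sum_k x^2(k)$, so that adder 1 does not overflow, when
$$S_0^2\,\frac{1}{2\pi j}\oint_C S(z)S(z^{-1})z^{-1}\,dz = 1, \quad\text{i.e.}\quad S_0^2 = \frac{1}{I}, \qquad I = \frac{1}{2\pi j}\oint_C \frac{z^{-1}\,dz}{D(z)D(z^{-1})}$$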
Q18) Explain in detail about round off effects in digital filters?
A18)
The presence of one or more quantizers in the realization of a digital filter results in a non-linear device, so a recursive digital filter may exhibit undesirable oscillations in its output. In finite-precision arithmetic, some registers may overflow if the input signal level becomes large. This overflow represents non-linear distortion leading to limit cycle oscillations. There are two types of limit cycle oscillations:
1. Zero input limit cycle oscillations (low amplitude compared to overflow limit cycle oscillations)
2. Overflow limit cycle oscillations
Zero input limit cycle oscillations: The arithmetic operations produce oscillations even when the input is zero or some non-zero constant value. Such oscillations are called zero input limit cycle oscillations.
Overflow limit cycle oscillations: A limit cycle that occurs due to overflow of the adder is known as an overflow limit cycle oscillation.
Dead Band: The limit cycle occurs as a result of the quantization effect in multiplication. The amplitude of the output during a limit cycle is confined to a range of values called the dead band of the filter.
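Consider a first order filter
y(n) = a y(n-1) + x(n); n > 0
After rounding the product
$y_q(n) = Q[a\,y(n-1)] + x(n)$
The round-off error is
$e_r(n) = Q[a\,y(n-1)] - a\,y(n-1)$
Where $e_r$ is the difference between the quantized value and the actual value.
The dead band of the filter for the limit cycle oscillations is obtained from the bound on the rounding error
$|Q[a\,y(n-1)] - a\,y(n-1)| \le 2^{-b}/2$
During a zero input limit cycle
$Q[a\,y(n-1)] = y(n-1)$, a > 0
$= -y(n-1)$, a < 0
so that
$|y(n-1)| - |a|\,|y(n-1)| \le 2^{-b}/2$
$|y(n-1)|\,(1-|a|) \le 2^{-b}/2$
Dead band of the filter
$$|y(n-1)| \le \frac{2^{-b}/2}{1-|a|}$$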
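Q19) Derive the equation for quantization noise power (or) steady state input noise power.
A19)
After quantization, the input noise power is $\sigma_e^2 = 2^{-2b}/12$. The output noise power of the system is given by
$$\sigma_\epsilon^2 = \sigma_e^2\sum_{n=0}^{\infty}h^2(n)$$
Where h(n) is the impulse response of the system.
Let E(n) be the output noise due to the quantization error e(n)
$$E(n) = e(n) * h(n) = \sum_{k=0}^{\infty}h(k)\,e(n-k)$$
The variance of the error E(n) is called the output noise power. Since e(n) is assumed to be white noise with variance $\sigma_e^2$, this variance is $\sigma_e^2\sum_{n}h^2(n)$.
The z-transform of h(n) will be
$$H(z) = \sum_{n=0}^{\infty}h(n)z^{-n}$$
By Parseval’s theorem,
$$\sum_{n=0}^{\infty}h^2(n) = \frac{1}{2\pi j}\oint_C H(z)H(z^{-1})z^{-1}\,dz$$
so that
$$\sigma_\epsilon^2 = \sigma_e^2\,\frac{1}{2\pi j}\oint_C H(z)H(z^{-1})z^{-1}\,dz = \sigma_e^2\sum_{i}\mathrm{Res}\left[H(z)H(z^{-1})z^{-1}\right]_{z=p_i}$$
The closed contour integration is evaluated using the method of residues, taking only the poles $p_i$ that lie inside the unit circle.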
Q20) Explain truncation error in sign magnitude form.
A20)
Truncation is the process of discarding all bits less significant than the LSB that is retained, i.e. the abrupt termination of a number that has a long string of bits. If we truncate the following binary numbers from 8 bits to 4 bits, we obtain
0.00110011 (8 bits) to 0.0011 (4 bits)
1.01001001 (8 bits) to 1.0100 (4 bits)
When we truncate the number, the signal value is approximated by the highest quantization level that is not greater than the signal.
Truncation error in sign magnitude form:
Consider a 5-bit number which has a value of $(0.11001)_2 = (0.78125)_{10}$
This 5-bit number is truncated to a 4-bit number
$(0.1100)_2 = (0.75)_{10}$
i.e. the 5-bit number 0.11001 has l bits
and the 4-bit number 0.1100 has b bits
Truncation error, $e_t = 0.1100 - 0.11001$
$= -(0.00001)_2 = -(0.03125)_{10}$
Here the original length is l bits (l = 5). The truncated length is b bits (b = 4).
The largest truncation error in magnitude is $e_t = -(2^{-b} - 2^{-l})$
$e_t = -(2^{-4} - 2^{-5}) = -2^{-5}$
The truncation error for a positive number is $-(2^{-b}-2^{-l}) \le e_t \le 0$
The truncation error for a negative number is $0 \le e_t \le (2^{-b}-2^{-l})$
Truncation error in two’s complement:
- For a positive number, the truncation results in a smaller number and hence the error remains the same as in the case of sign magnitude form.
- For a negative number, the truncation increases the magnitude of the number, so the error is again negative: $-(2^{-b}-2^{-l}) \le e_t \le 0$
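The sign of the truncation error in the two number formats can be checked with the 5-bit example above. The sketch below models sign-magnitude truncation as truncating the magnitude and two's complement truncation as rounding toward minus infinity; both function names are hypothetical, and b counts fraction bits only.

```python
import math

def truncate_sign_magnitude(value, b):
    """Discard bits beyond b fraction bits from the magnitude; the sign is kept."""
    scale = 2 ** b
    return math.copysign(math.floor(abs(value) * scale) / scale, value)

def truncate_twos_complement(value, b):
    """Discard bits beyond b fraction bits; the result moves toward minus infinity."""
    scale = 2 ** b
    return math.floor(value * scale) / scale

x = 0.78125                                        # (0.11001) in binary, l = 5
err_pos = truncate_sign_magnitude(x, 4) - x        # positive number
err_neg_sm = truncate_sign_magnitude(-x, 4) + x    # negative number, sign magnitude
err_neg_tc = truncate_twos_complement(-x, 4) + x   # negative number, two's complement
print(err_pos, err_neg_sm, err_neg_tc)
```

The error is negative for the positive number, positive for the negative sign-magnitude number, and negative again for the negative two's complement number, each with magnitude $2^{-4} - 2^{-5}$.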