
MEMORY-EFFICIENT CONCURRENT VLSI ARCHITECTURES FOR TWO-DIMENSIONAL DISCRETE

WAVELET TRANSFORM

Synopsis

of

Ph.D. thesis

By

Anurag Mahajan (Enrollment Number: 07P01005G)

Under the Guidance of

Prof. B. K. Mohanty

Department of Electronics and Communication Engineering
JAYPEE UNIVERSITY OF ENGINEERING AND TECHNOLOGY,
GUNA (M.P.) - INDIA

August - 2013


Preface

Discrete wavelet transform (DWT) is a mathematical technique that provides a new method for signal processing. It decomposes a signal in the time domain using dilated/contracted and translated versions of a single basis function, known as the prototype wavelet (Mallat, 1989; Daubechies, 1992; Meyer, 1993; Vetterli and Kovacevic, 1995). DWT offers a wide variety of useful features over other unitary transforms such as the discrete Fourier transform (DFT), the discrete cosine transform (DCT) and the discrete sine transform (DST). Two-dimensional (2-D) DWT has been applied in image compression, image analysis and image watermarking (Lewis and Knowles, 1992). Currently, 2-D DWT is used in the JPEG 2000 image compression standard (Skodras et al., 2001). The 2-D DWT is highly computation-intensive, and many of its applications need real-time processing to deliver better performance. The 2-D DWT is currently implemented in very large scale integration (VLSI) systems to meet the space-time requirements of various real-time applications. Several design schemes have been suggested for efficient implementation of 2-D DWT in a VLSI system.

The hardware complexity of a multilevel 2-D DWT structure is broadly divided into two parts: (i) arithmetic and (ii) memory. The arithmetic component comprises multipliers and adders, and its complexity depends on the wavelet filter size (k). The memory component comprises the line buffer and the frame buffer; its complexity depends on the image size (MN), where M and N represent the height and width of the input image. Small filters (k < 10) are used in DWT, whereas the standard image size is 512 × 512. Therefore, the complexity of a multilevel 2-D DWT structure is dominated by the complexity of its memory component. Most of the existing design strategies focus on arithmetic complexity, cycle period and throughput rate; no specific memory-centric design method has been proposed for multilevel 2-D DWT. The objective of the proposed thesis work is to explore memory-centric design approaches and to propose area-delay-power-efficient hardware designs for the implementation of multilevel 2-D DWT.
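To make the dominance of the memory component concrete, a back-of-envelope comparison can be sketched in Python. The image size and filter-length bound come from the text above; the multiplier count per filter pair is an illustrative assumption, not a figure from the thesis.

```python
# Illustrative sizing: memory scales with image size MN, arithmetic with
# filter length k. The multiplier count below is an assumed rough figure.
M = N = 512                         # standard image size
k = 9                               # typical wavelet filter length (k < 10)

frame_buffer_words = (M * N) // 4   # LL subband kept for the next DWT level
multipliers = 2 * k                 # assumed count for one low-/high-pass filter pair

print(frame_buffer_words, multipliers)
```

Even before counting line buffers, the frame buffer alone holds tens of thousands of words, while the arithmetic side needs only a few tens of multipliers.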

Objective

The thesis entitled “Memory-Efficient Concurrent VLSI Architectures for Two-Dimensional

Discrete Wavelet Transform” has the following aims and objectives:

To improve memory utilization efficiency of 2-D DWT structure.

To reduce transposition memory size.


To eliminate frame buffer.

To reduce arithmetic complexity using a low-complexity design scheme.

The summary of the thesis is given below:

Chapter 1: Introduction

In this chapter, the computation schemes of one-dimensional (1-D) and 2-D DWT are discussed. 1-D DWT can be performed using the convolution scheme or the lifting scheme proposed by Sweldens (1996). The convolution scheme involves more arithmetic resources and memory space than the lifting scheme; however, the lifting scheme is suitable for bi-orthogonal wavelet filters. The 2-D DWT computation is performed by two approaches: (i) separable and (ii) non-separable. In the non-separable approach, the row and column transforms of 2-D DWT are performed simultaneously using 2-D wavelet filters. In the separable approach, the row and column transforms are performed separately using 1-D DWT. The separable approach is more popular, as it demands less computation than the non-separable approach; however, it requires transposition memory between the row and column transforms. Multilevel 2-D DWT computation can be performed using the pyramid algorithm (PA), the recursive pyramid algorithm (RPA) of Vishwanath (1994) or the folded scheme of Wu and Chen (2001). Due to its design simplicity, 100% hardware utilization efficiency (HUE) and lower arithmetic resource requirement, the folded scheme is more popular than PA and RPA for hardware realization. Keeping this in view, several architectures based on the folded scheme have been proposed for efficient implementation of 2-D DWT.
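The separable approach can be sketched in a few lines of NumPy. The sketch below uses the Haar filter pair purely for brevity (the designs discussed in the thesis target longer filters such as the 9/7), with periodic boundary extension; the intermediate row-transform matrices correspond to the transposition memory of a hardware realization.

```python
import numpy as np

def dwt_1d(x, h, g):
    """One level of 1-D DWT: filter with low-pass h / high-pass g,
    then downsample by 2 (periodic extension for simplicity)."""
    n = len(x)
    lo = np.array([sum(h[t] * x[(2 * i + t) % n] for t in range(len(h)))
                   for i in range(n // 2)])
    hi = np.array([sum(g[t] * x[(2 * i + t) % n] for t in range(len(g)))
                   for i in range(n // 2)])
    return lo, hi

def dwt_2d_separable(img, h, g):
    """Separable 2-D DWT: row transform first, then column transform.
    Ul/Uh play the role of the transposition memory in hardware."""
    M, N = img.shape
    Ul = np.zeros((M, N // 2))
    Uh = np.zeros((M, N // 2))
    for r in range(M):                      # row-processor
        Ul[r], Uh[r] = dwt_1d(img[r], h, g)
    LL = np.zeros((M // 2, N // 2)); LH = np.zeros_like(LL)
    HL = np.zeros_like(LL);          HH = np.zeros_like(LL)
    for c in range(N // 2):                 # column-processor
        LL[:, c], LH[:, c] = dwt_1d(Ul[:, c], h, g)
        HL[:, c], HH[:, c] = dwt_1d(Uh[:, c], h, g)
    return LL, LH, HL, HH

# Orthonormal Haar filter pair
s = 1 / np.sqrt(2)
h, g = [s, s], [s, -s]
img = np.arange(16, dtype=float).reshape(4, 4)
LL, LH, HL, HH = dwt_2d_separable(img, h, g)
```

For a multilevel transform, the folded scheme would feed the LL subband back through the same routine, which is exactly why the LL output must be buffered between levels.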

Chapter 2: Hardware Complexity Analysis

Folded 2-D DWT computation is performed level by level using one separable 2-D DWT unit, referred to as the processing unit (PU), and one frame buffer. The low-low (LL) subband of the current DWT level is stored in the frame buffer to compute higher DWT levels. The PU comprises one row-processor (to perform 1-D DWT computation row-wise), one column-processor (to perform 1-D DWT computation column-wise), one transposition memory and one temporal memory. The transposition memory stores the intermediate low-pass (Ul) and high-pass (Uh) matrices, while the temporal memory is used by the column-processor to store the partial results of the column DWT. The frame memory may be either on-chip or off-chip, while the other two are usually on-chip memories.


The arithmetic complexity of the folded structure depends on the DWT computation scheme and the filter length. The frame buffer size is MN/4 words, which is independent of the data access scheme, the type of DWT computation scheme (convolution or lifting) and the length of the wavelet filter. The temporal memory size depends on the DWT computation scheme and the wavelet filter length. For convolution-based 2-D DWT, the temporal memory size is zero when a direct-form FIR structure is used for the computation of 1-D DWT. In the case of lifting-based 2-D DWT, the size of the temporal memory depends on the number of lifting steps of the bi-orthogonal wavelet filter. The transposition memory size mainly depends on the data access scheme adopted to feed the 2-D input samples and on the DWT computation scheme (convolution or lifting). In general, the sizes of the transposition memory and the temporal memory are some multiple of the image width, while the size of the frame memory is some multiple of the image size. On the other hand, the complexity of the arithmetic component depends on the size of the wavelet filter. The standard image size is 512 × 512, whereas the size of the most commonly used wavelet filters is less than 10. The hardware complexity of the folded 2-D DWT structure is therefore dominated by the complexity of the memory component.
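The memory budget described above can be tallied in a short sketch. The synopsis only states that the transposition and temporal memories are "some multiple of the image width", so the constants used here (2N for transposition, lifting_steps·N/2 for temporal) are illustrative assumptions, not design figures from the thesis.

```python
def folded_dwt_memory_words(M, N, scheme="lifting", lifting_steps=4):
    """Rough memory budget of a folded 2-D DWT structure.

    The multiples of N used below are assumed for illustration;
    real designs fix them according to the data access scheme.
    """
    frame_buffer = (M * N) // 4              # LL subband of the current level
    transposition = 2 * N                    # assumed multiple of image width
    if scheme == "convolution":
        temporal = 0                         # direct-form FIR: no partial results stored
    else:
        temporal = (lifting_steps * N) // 2  # assumed to grow with lifting steps
    return {"frame_buffer": frame_buffer,
            "transposition": transposition,
            "temporal": temporal}

budget = folded_dwt_memory_words(512, 512)
```

For a 512 × 512 image the frame buffer (65,536 words) dwarfs the line-sized on-chip memories (on the order of a thousand words each), which is the imbalance the chapter analyses.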

Several VLSI architectures have been suggested for the folded 2-D DWT in the last decade to meet the space and time requirements of real-time applications. These designs differ in arithmetic complexity, cycle period and throughput rate, but they use almost the same amount of on-chip memory words and an equal amount of frame buffer words. The arithmetic complexity (in terms of multipliers and adders) and the memory complexity (in terms of memory words) of the best available designs are estimated for the 9/7 wavelet filter and an image size of 512 × 512 (Wu et al., 2005; Xiong et al., 2006; Xiong et al., 2007; Cheng et al., 2007). It is found that the memory complexity is almost 10^3 times higher than the arithmetic complexity. Consequently, the memory words per output (MPO) of the existing designs are significantly higher than the arithmetic complexity per output. Since the logic complexities of arithmetic components and memory components are widely different, transistor count is used to estimate the arithmetic and memory complexity of the existing structures. We find that the transistor count of the memory component is, on average, almost 97% of the total transistor count of the folded designs. Therefore, the memory component of a folded design consumes most of the chip area and power. However, the existing design approaches focus on optimizing the arithmetic complexity and cycle period; no specific design has been suggested to address the memory complexity, which is the major component of the folded 2-D DWT structure.
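A transistor-count estimate of this kind can be sanity-checked with assumed per-unit costs: 6-transistor SRAM cells and rough gate counts for a 16-bit datapath. All constants below are illustrative assumptions, not measurements from the cited designs.

```python
# Assumed unit costs (illustrative only):
SRAM_T_PER_BIT = 6             # 6-transistor SRAM cell
WORD_BITS = 16                 # assumed datapath/word width
MULT_T = 16 * 16 * 30          # rough transistors in a 16x16 array multiplier
ADD_T = 16 * 30                # rough transistors in a 16-bit adder

mem_words = 512 * 512 // 4 + 6 * 512   # frame buffer + assumed 6N of line memory
mults, adds = 36, 36                   # assumed arithmetic resources of a folded PU

mem_T = mem_words * WORD_BITS * SRAM_T_PER_BIT
arith_T = mults * MULT_T + adds * ADD_T
memory_share = mem_T / (mem_T + arith_T)
print(f"memory share of transistor count: {memory_share:.1%}")
```

With these assumptions the memory share lands in the mid-90% range, consistent in order of magnitude with the ~97% average reported in the chapter.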


Chapter 3: Block-Based Architecture for Folded 2-D DWT Using Line Scanning

The folded 2-D DWT structure is memory-intensive. A few two-input/two-output and four-input/four-output designs have been suggested in Xiong et al. (2006), Xiong et al. (2007), Li et al. (2009) and Lai et al. (2009) for high-throughput implementation of folded 2-D DWT. The arithmetic complexity of these structures varies in proportion to the throughput rate, but the memory complexity is almost independent of the throughput rate. For example, the structure of Xiong et al. (2007) processes four samples per cycle and involves on-chip memory of size nearly 5.5N words, whereas the structure of Xiong et al. (2006) processes two samples per cycle and involves on-chip memory of size 5.5N words, yet both designs involve a frame buffer of size MN/4 words. In general, the on-chip and off-chip memory of a folded design is almost independent of the input block size. Therefore, block