Graphics processing units (GPUs) are gaining ground in high-performance
computing. CUDA (an extension to C) is the most widely used parallel
programming framework for general-purpose GPU computation. However,
writing an optimized CUDA program is complex even for experts. We
present a method for restructuring loops into optimized CUDA kernels
based on a three-step algorithm: loop tiling, coalesced memory
access, and maximization of machine utilization. To this end, we identify the GPU
constraints for maximum performance, such as memory usage (global
memory and shared memory), number of blocks, and number of threads per
block. In addition, we identify the condition for maximizing utilization
of the GPU resources. We also establish the relationships between the
influencing parameters and propose a method for finding possible tiling
solutions with coalesced memory access that best meet the identified
constraints. We also present a simplified algorithm for restructuring
loops and rewriting them as an efficient CUDA kernel. The execution model
of the synthesized kernel consists of uniformly distributing the kernel
threads to keep all cores busy while transferring a tailored set of local
data, which is accessed in a coalesced pattern to amortize the long
latency of the secondary memory. In the evaluation, we implement some
simple applications using the proposed restructuring strategy and
evaluate the performance in terms of execution time and GPU throughput.
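
To make the idea of searching for tiling solutions under GPU constraints concrete, the following host-side C++ sketch enumerates candidate tiles for a square n x n loop nest. It is only an illustration under stated assumptions, not the algorithm proposed in this work: the names GpuLimits, Tile, and feasible_tiles are hypothetical, the limit values are placeholders rather than properties of a particular device, and the model of one thread and one shared-memory element per tile cell is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical device limits, chosen only for illustration.
// Real values would be queried from the target GPU.
struct GpuLimits {
    int warp_size = 32;                            // threads per warp
    int max_threads_per_block = 1024;              // upper bound on block size
    std::size_t shared_mem_per_block = 48 * 1024;  // bytes
    int num_multiprocessors = 16;                  // blocks needed to occupy the device
};

struct Tile {
    int width;   // tile extent along the contiguous (row) dimension
    int height;  // tile extent along the other dimension
};

// Enumerate tiles of an n x n iteration space that satisfy the assumed constraints:
//  - width is a multiple of the warp size, so a warp reads consecutive addresses
//  - width and height divide n, so tiles cover the space exactly
//  - one thread per tile cell fits in a block
//  - one element per tile cell fits in shared memory
//  - there are at least as many blocks as multiprocessors
std::vector<Tile> feasible_tiles(int n, std::size_t elem_bytes,
                                 const GpuLimits& lim = GpuLimits{}) {
    assert(n > 0 && elem_bytes > 0);
    std::vector<Tile> out;
    for (int w = lim.warp_size; w <= n; w += lim.warp_size) {
        if (n % w != 0) continue;
        for (int h = 1; h <= n; ++h) {
            const long long threads = static_cast<long long>(w) * h;
            if (threads > lim.max_threads_per_block) break;  // only grows with h
            if (n % h != 0) continue;
            const std::size_t smem = static_cast<std::size_t>(threads) * elem_bytes;
            if (smem > lim.shared_mem_per_block) break;      // only grows with h
            const long long blocks = static_cast<long long>(n / w) * (n / h);
            if (blocks < lim.num_multiprocessors) continue;  // too few blocks
            out.push_back(Tile{w, h});
        }
    }
    return out;
}
```

A caller would pick one tile from the returned candidates, for example the one with the most threads per block, and use its width and height as the block dimensions of the generated kernel. Ranking the candidates is left open here because it depends on the relationships between the parameters that the method establishes.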