Abstract
The massive amount of fine-grained parallelism exposed by a GPU program makes it difficult to exploit shared-cache benefits even when there is good program locality. The non-deterministic order of thread execution in the bulk synchronous parallel (BSP) model makes the situation even worse. Most prior work on exploiting GPU cache sharing focuses on regular applications that have linear memory access indices. In this paper, we formulate a generic workload partitioning model and systematically analyze the complexity and approximation bound of optimal cache sharing among GPU threads. Our exploration demonstrates that it is possible to utilize the GPU cache efficiently without significant programming overhead or ad-hoc, application-specific implementations.