Speeding up Stable Diffusion

At PhotoRoom we build photo editing apps, and being able to generate what you have in mind is a superpower. Diffusion models are a recent take on this, based on iterative steps: a pipeline runs recursive operations starting from a noisy image until it generates the final high-quality image. Their quality and expressivity, starting from a user prompt, were an opportunity to improve the PhotoRoomer experience. In a previous blog post we investigated how to make Stable Diffusion faster using TensorRT at inference time; here we will investigate how to make it even faster, using Memory Efficient Attention from the xformers library.

A few words about memory efficient attention

The attention operation is at the heart of the Transformer model architecture, which became popular in the AI space over the last couple of years. It is very useful for a model to make sense of the connections between elements of a sequence, which can be sound bites, pixels or words for instance. The operation typically takes three inputs: the Query, the Key and the Value. If all three refer to the same tensor, it is known as self-attention. The operation is not restricted to Transformers though, and the latent diffusion model on which Stable Diffusion is based uses it inside the core denoising steps, notably to take various forms of guidance into account.

Its formulation is as follows, and looks fairly innocuous:

attention = softmax(QK^T).V

From a complexity standpoint, three things can be considered here: the compute cost of this operation, its memory footprint, and the I/O (input/output, i.e. memory operations) that it entails. If we put aside the batch dimension (a global multiplier) and use N for the context length and H for the head size (let's suppose Q, K and V have the same dimensions for the sake of clarity), a breakdown of this operation as executed by PyTorch is as follows:

- QK^T: NxH reads for Q and K, with the NxN result stored in main GPU memory, often referred to as the "attention matrix"
- softmax of the attention matrix: reads and writes are O(N^2), compute is also O(N^2), with the NxN result stored in main memory
- multiplication by V: O(N^2) reads and O(NH) writes, the final NxH result being stored in main memory

Even in this simplified form, which does not account for training and saving activations for instance, there are multiple takeaways. The attention operation is a lot more complicated and demanding than it looks:

- There are multiple trips to the main memory, attached to significant data sizes
- Both I/O and compute costs scale around O(N^2), where N is related to the size of the latent space in Stable Diffusion (which itself relates to the output resolution)
- Apples to oranges, but one can also remark that the I/O needs are relatively comparable (in terms of the number of elements involved) to the compute

Now consider for instance this nice blog post from Horace He, and it becomes apparent that a significant amount of time will be spent on the I/O, which will be a bottleneck for the GPU compute units. Luckily, there is existing work tackling this issue, starting for instance with Rabe et al., and more recently with Tri Dao et al. under the name of FlashAttention ("FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness").

How this works is that the above three steps can be fused into one computation, given the insights that there is no dependency across the lines of the attention matrix, and that the softmax computation can be done without materializing the full line. K and Q are read over tiles, and a running softmax formulation is used. The resulting per-tile computation is immediately used against a tile of V, only the end result being written to the main GPU memory. This formulation removes all intermediate reads and writes, which increases speed by removing an I/O bottleneck. Coincidentally, it also relieves a lot of the memory pressure, since the full attention matrix is never materialized.
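To make the tiling and running softmax idea more concrete, here is a small, unoptimized PyTorch sketch of the same computation. It is purely illustrative: the attention_tiled name, the single-head shapes and the tile size are made up for this example, and this is not how the FlashAttention or xformers kernels are actually written.

import torch

def attention_tiled(q, k, v, tile=64):
    # q, k, v: (N, H) tensors, single head, for illustration only.
    N, H = q.shape
    scale = H ** -0.5
    # Running statistics per query row: max logit, softmax denominator, weighted sum of V.
    m = torch.full((N, 1), float("-inf"))
    denom = torch.zeros(N, 1)
    acc = torch.zeros(N, H)
    for start in range(0, N, tile):
        k_tile = k[start:start + tile]                      # (T, H)
        v_tile = v[start:start + tile]                      # (T, H)
        s = (q @ k_tile.T) * scale                          # (N, T) tile of the attention matrix
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        correction = torch.exp(m - m_new)                   # rescale what was accumulated so far
        p = torch.exp(s - m_new)
        denom = denom * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ v_tile
        m = m_new
    return acc / denom

# Sanity check against the naive formulation, which materializes the full NxN matrix.
q, k, v = (torch.randn(512, 64) for _ in range(3))
naive = torch.softmax((q @ k.T) * (64 ** -0.5), dim=-1) @ v
assert torch.allclose(attention_tiled(q, k, v), naive, atol=1e-5)

The loop only ever holds an N x tile slice of the attention matrix, which is where both the I/O and the memory savings come from.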
Implementing this efficiently on a GPU is difficult, notably because these chips require a high level of parallelism to be efficient. We used the kernels developed by the xformers team, which refer to the original FlashAttention kernels in some cases but also use more optimized kernels for some configurations. OpenAI's Triton language also proposes an implementation of this method. Note that the above is a very simplified description, and that getting this to work for training is no small feat.

Code updates

In order to leverage the memory efficient attention to speed up the UNet, we only need to update the file diffusers/src/diffusers/models/attention.py and add the following two blocks:

import xformers

and

class MemoryEfficientCrossAttention(nn.Module):
    def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.0):
        context_dim = default(context_dim, query_dim)
        ...
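The class above is truncated in this copy of the post. To give a rough idea of what such a module can look like, here is a minimal sketch built around xformers.ops.memory_efficient_attention; the default helper, the head reshaping and the exact module layout are our own simplification and may differ from the code used in the original post or in diffusers.

import torch.nn as nn
import xformers.ops


def default(val, d):
    # Tiny helper mirroring the one referenced in the snippet above.
    return val if val is not None else d


class MemoryEfficientCrossAttention(nn.Module):
    def __init__(self, query_dim, context_dim=None, heads=8, dim_head=64, dropout=0.0):
        super().__init__()
        context_dim = default(context_dim, query_dim)
        inner_dim = dim_head * heads
        self.heads = heads
        self.dim_head = dim_head

        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_out = nn.Sequential(nn.Linear(inner_dim, query_dim), nn.Dropout(dropout))

    def forward(self, x, context=None, mask=None):
        context = default(context, x)
        b = x.shape[0]

        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)

        # Fold the heads into the batch dimension: (B * heads, seq_len, dim_head).
        def split_heads(t):
            return (
                t.reshape(b, t.shape[1], self.heads, self.dim_head)
                .permute(0, 2, 1, 3)
                .reshape(b * self.heads, t.shape[1], self.dim_head)
                .contiguous()
            )

        q, k, v = map(split_heads, (q, k, v))

        # The memory efficient kernel never materializes the full attention matrix.
        out = xformers.ops.memory_efficient_attention(q, k, v, attn_bias=None)

        # Undo the folding: back to (B, seq_len, heads * dim_head).
        out = (
            out.reshape(b, self.heads, -1, self.dim_head)
            .permute(0, 2, 1, 3)
            .reshape(b, -1, self.heads * self.dim_head)
        )
        return self.to_out(out)

Folding the heads into the batch dimension keeps the example on the simple 3D (batch, sequence length, head dimension) input layout; the kernel applies the usual 1/sqrt(dim_head) scaling internally, so no explicit scale is needed here.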