.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "recipes/regional_compilation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_recipes_regional_compilation.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_recipes_regional_compilation.py:


Reducing torch.compile cold start compilation time with regional compilation
================================================================================

**Author:** `Animesh Jain `_

As deep learning models get larger, the compilation time of these models also
increases. This extended compilation time can result in a large startup time in
inference services or wasted resources in large-scale training.

This recipe shows an example of how to reduce the cold start compilation time
by choosing to compile a repeated region of the model instead of the entire
model.

Prerequisites
-------------

* PyTorch 2.5 or later

Setup
-----

Before we begin, we need to install ``torch`` if it is not already available.

.. code-block:: sh

   pip install torch

.. note::

    This feature is available starting with the 2.5 release. If you are using
    version 2.4, you can enable the configuration flag
    ``torch._dynamo.config.inline_inbuilt_nn_modules=True`` to prevent
    recompilations during regional compilation. In version 2.5, this flag is
    enabled by default.

.. GENERATED FROM PYTHON SOURCE LINES 32-35

.. code-block:: default

    from time import perf_counter

.. GENERATED FROM PYTHON SOURCE LINES 36-50

Steps
-----

In this recipe, we will follow these steps:

1. Import all necessary libraries.
2. Define and initialize a neural network with repeated regions.
3. Understand the difference between the full model and the regional compilation.
4. Measure the compilation time of the full model and the regional compilation.

First, let's import the necessary libraries:

.. GENERATED FROM PYTHON SOURCE LINES 50-55

.. code-block:: default

    import torch
    import torch.nn as nn

.. GENERATED FROM PYTHON SOURCE LINES 56-64

Next, let's define and initialize a neural network with repeated regions.

Typically, neural networks are composed of repeated layers. For example, a
large language model is composed of many Transformer blocks. In this recipe,
we will create a ``Layer`` using the ``nn.Module`` class as a proxy for a
repeated region. We will then create a ``Model`` which is composed of 64
instances of this ``Layer`` class.

.. GENERATED FROM PYTHON SOURCE LINES 64-101

.. code-block:: default

    class Layer(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear1 = torch.nn.Linear(10, 10)
            self.relu1 = torch.nn.ReLU()
            self.linear2 = torch.nn.Linear(10, 10)
            self.relu2 = torch.nn.ReLU()

        def forward(self, x):
            a = self.linear1(x)
            a = self.relu1(a)
            a = torch.sigmoid(a)
            b = self.linear2(a)
            b = self.relu2(b)
            return b


    class Model(torch.nn.Module):
        def __init__(self, apply_regional_compilation):
            super().__init__()
            self.linear = torch.nn.Linear(10, 10)
            # Apply compile only to the repeated layers.
            if apply_regional_compilation:
                self.layers = torch.nn.ModuleList(
                    [torch.compile(Layer()) for _ in range(64)]
                )
            else:
                self.layers = torch.nn.ModuleList([Layer() for _ in range(64)])

        def forward(self, x):
            # In regional compilation, self.linear is outside the scope of `torch.compile`.
            x = self.linear(x)
            for layer in self.layers:
                x = layer(x)
            return x

.. GENERATED FROM PYTHON SOURCE LINES 102-111

Next, let's review the difference between the full model and the regional
compilation.

In full model compilation, the entire model is compiled as a whole. This is the
common approach most users take with ``torch.compile``. In this example, we
apply ``torch.compile`` to the ``Model`` object. This will effectively inline
the 64 layers, producing a large graph to compile. You can look at the full
graph by running this recipe with ``TORCH_LOGS=graph_code``.
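For reference, one way to dump that graph code from the command line might look
like the following. This assumes you have saved the recipe locally as
``regional_compilation.py`` (see the download link at the end of this page):

.. code-block:: sh

   TORCH_LOGS=graph_code python regional_compilation.py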
.. GENERATED FROM PYTHON SOURCE LINES 111-116

.. code-block:: default

    model = Model(apply_regional_compilation=False).cuda()
    full_compiled_model = torch.compile(model)

.. GENERATED FROM PYTHON SOURCE LINES 117-122

Regional compilation, on the other hand, compiles only a region of the model.
By strategically choosing to compile a repeated region of the model, we can
compile a much smaller graph and then reuse the compiled graph for all the
regions. In this example, ``torch.compile`` is applied only to the ``layers``
and not to the full model.

.. GENERATED FROM PYTHON SOURCE LINES 122-125

.. code-block:: default

    regional_compiled_model = Model(apply_regional_compilation=True).cuda()

.. GENERATED FROM PYTHON SOURCE LINES 126-139

Applying compilation to a repeated region, instead of the full model, leads to
large savings in compile time. Here, we compile just one layer instance and
then reuse it 64 times in the ``Model`` object.

Note that with repeated regions, some parts of the model might not be compiled.
For example, ``self.linear`` in the ``Model`` is outside the scope of regional
compilation.

Also, note that there is a tradeoff between performance speedup and compile
time. Full model compilation involves a larger graph and, theoretically, offers
more scope for optimizations. However, for practical purposes and depending on
the model, we have observed many cases with minimal speedup differences between
the full model and regional compilation.

.. GENERATED FROM PYTHON SOURCE LINES 142-148

Next, let's measure the compilation time of the full model and the regional
compilation.

``torch.compile`` is a JIT compiler, which means that it compiles on the first
invocation. In the code below, we measure the total time spent in the first
invocation. While this method is not precise, it provides a good estimate since
the majority of the time is spent in compilation.

.. GENERATED FROM PYTHON SOURCE LINES 148-170

.. code-block:: default

    def measure_latency(fn, input):
        # Reset the compiler caches to ensure no reuse between different runs
        torch.compiler.reset()
        with torch._inductor.utils.fresh_inductor_cache():
            start = perf_counter()
            fn(input)
            torch.cuda.synchronize()
            end = perf_counter()
            return end - start


    input = torch.randn(10, 10, device="cuda")
    full_model_compilation_latency = measure_latency(full_compiled_model, input)
    print(f"Full model compilation time = {full_model_compilation_latency:.2f} seconds")

    regional_compilation_latency = measure_latency(regional_compiled_model, input)
    print(f"Regional compilation time = {regional_compilation_latency:.2f} seconds")

    assert regional_compilation_latency < full_model_compilation_latency
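The timings above capture only the cold start cost. If you also want to
sanity-check the steady-state speedup tradeoff discussed earlier, a rough
sketch along the following lines can be appended to the script. Note that
``measure_warm_latency`` is a hypothetical helper added here for illustration
and is not part of the recipe:

.. code-block:: python

    def measure_warm_latency(fn, input, iters=100):
        # Warm up first so compilation is excluded from the measurement.
        for _ in range(10):
            fn(input)
        torch.cuda.synchronize()

        start = perf_counter()
        for _ in range(iters):
            fn(input)
        torch.cuda.synchronize()
        return (perf_counter() - start) / iters


    full_warm = measure_warm_latency(full_compiled_model, input)
    regional_warm = measure_warm_latency(regional_compiled_model, input)
    print(f"Full model steady-state latency     = {full_warm * 1e6:.1f} us")
    print(f"Regional model steady-state latency = {regional_warm * 1e6:.1f} us")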
.. GENERATED FROM PYTHON SOURCE LINES 171-179

Conclusion
----------

This recipe shows how to reduce the cold start compilation time if your model
has repeated regions. This approach requires user modifications to apply
``torch.compile`` to the repeated regions instead of the more commonly used
full model compilation. We are continually working on reducing cold start
compilation time.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.000 seconds)


.. _sphx_glr_download_recipes_regional_compilation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: regional_compilation.py <regional_compilation.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: regional_compilation.ipynb <regional_compilation.ipynb>`


.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_