Automatic differentiation package - torch.autograd¶
torch.autograd provides classes and functions implementing automatic
differentiation of arbitrary scalar-valued functions. It requires minimal
changes to the existing code - you only need to declare Tensors
for which gradients should be computed with the requires_grad=True keyword.
As of now, we only support autograd for floating point Tensor types
(half, float, double and bfloat16) and complex Tensor types (cfloat, cdouble).
backward | Computes the sum of gradients of given tensors with respect to graph leaves.
grad | Computes and returns the sum of gradients of outputs with respect to the inputs.
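A minimal sketch of both entry points (values are illustrative):

import torch

x = torch.randn(3, requires_grad=True)
loss = (x ** 2).sum()

# torch.Tensor.backward accumulates d(loss)/dx into x.grad
loss.backward()
print(x.grad)  # equals 2 * x

# torch.autograd.grad returns the gradients instead of accumulating them
(grad_x,) = torch.autograd.grad((x ** 2).sum(), (x,))
print(grad_x)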
Forward-mode Automatic Differentiation¶
Warning
This API is in beta. Even though the function signatures are very unlikely to change, improved operator coverage is planned before we consider this stable.
Please see the forward-mode AD tutorial for detailed steps on how to use this API.
forward_ad.dual_level | Context-manager that enables forward AD.
forward_ad.make_dual | Associates a tensor value with a forward gradient, the tangent, to create a "dual tensor", which is used to compute forward AD gradients.
forward_ad.unpack_dual | Unpacks a "dual tensor" to get both its Tensor value and its forward AD gradient.
Functional higher level API¶
Warning
This API is in beta. Even though the function signatures are very unlikely to change, major improvements to performance are planned before we consider this stable.
This section contains the higher-level API for autograd that builds on the basic API above and allows you to compute jacobians, hessians, etc.
This API works with user-provided functions that take only Tensors as input and return only Tensors.
If your function takes other arguments that are not Tensors, or Tensors that don't have requires_grad set, you can use a lambda to capture them.
For example, for a function f that takes three inputs - a Tensor for which we want the jacobian, another tensor that should be considered constant, and a boolean flag - as f(input, constant, flag=flag), you can use it as functional.jacobian(lambda x: f(x, constant, flag=flag), input).
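A minimal sketch of that pattern (f, constant and flag are placeholder names matching the example above):

import torch
from torch.autograd import functional

def f(x, constant, flag=False):
    # hypothetical function: only `x` is differentiated
    out = x * constant
    return out.relu() if flag else out

input = torch.randn(3, requires_grad=True)
constant = torch.randn(3)

jac = functional.jacobian(lambda x: f(x, constant, flag=True), input)
print(jac.shape)  # torch.Size([3, 3])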
functional.jacobian | Function that computes the Jacobian of a given function.
functional.hessian | Function that computes the Hessian of a given scalar function.
functional.vjp | Function that computes the dot product between a vector v and the Jacobian of the given function at the point given by the inputs.
functional.jvp | Function that computes the dot product between the Jacobian of the given function at the point given by the inputs and a vector v.
functional.vhp | Function that computes the dot product between a vector v and the Hessian of a given scalar function at the point given by the inputs.
functional.hvp | Function that computes the dot product between the Hessian of a given scalar function and a vector v at a point given by the inputs.
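For instance, a short sketch of functional.vjp, which computes the vector-Jacobian product without materializing the full Jacobian (shapes here are illustrative):

import torch
from torch.autograd import functional

def exp_reducer(x):
    return x.exp().sum(dim=1)

inputs = torch.rand(4, 4)
v = torch.ones(4)

# Returns (function output, vector-Jacobian product)
output, vjp_result = functional.vjp(exp_reducer, inputs, v)
print(output.shape, vjp_result.shape)  # torch.Size([4]) torch.Size([4, 4])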
Locally disabling gradient computation¶
See Locally disabling gradient computation for more information on the differences between no-grad and inference mode as well as other related mechanisms that may be confused with the two. Also see Locally disabling gradient computation for a list of functions that can be used to locally disable gradients.
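A minimal sketch of the two most common mechanisms mentioned there:

import torch

x = torch.ones(3, requires_grad=True)

with torch.no_grad():
    y = x * 2
print(y.requires_grad)  # False

with torch.inference_mode():
    z = x * 2
print(z.requires_grad)  # False; inference mode additionally skips view/version tracking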
Default gradient layouts¶
When a non-sparse param receives a non-sparse gradient during
torch.autograd.backward() or torch.Tensor.backward(), param.grad is accumulated as follows.
If param.grad is initially None:
1. If param's memory is non-overlapping and dense, .grad is created with strides matching param (thus matching param's layout).
2. Otherwise, .grad is created with rowmajor-contiguous strides.
If param already has a non-sparse .grad attribute:
3. If create_graph=False, backward() accumulates into .grad in-place, which preserves its strides.
4. If create_graph=True, backward() replaces .grad with a new tensor .grad + new grad, which attempts (but does not guarantee) matching the preexisting .grad's strides.
The default behavior (letting .grads be None before the first backward(),
such that their layout is created according to 1 or 2,
and retained over time according to 3 or 4) is recommended for best performance.
Calls to model.zero_grad() or optimizer.zero_grad() will not affect .grad layouts.
In fact, resetting all .grads to None before each accumulation phase, e.g.:
for iterations...
    ...
    for param in model.parameters():
        param.grad = None
    loss.backward()
such that they’re recreated according to 1 or 2 every time,
is a valid alternative to model.zero_grad()
or optimizer.zero_grad()
that may improve performance for some networks.
Manual gradient layouts¶
If you need manual control over .grad's strides,
assign param.grad = a zeroed tensor with desired strides
before the first backward(), and never reset it to None.
3 guarantees your layout is preserved as long as create_graph=False.
4 indicates your layout is likely preserved even if create_graph=True.
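A minimal sketch of this recipe (the layout chosen here is only for illustration):

import torch

param = torch.nn.Parameter(torch.randn(4, 3))

# Pre-assign a zeroed gradient with column-major strides before the first backward()
param.grad = torch.zeros(4, 3).t().contiguous().t()
print(param.grad.stride())  # (1, 4)

loss = (param * 2).sum()
loss.backward()  # accumulates in-place (create_graph=False), preserving the strides
print(param.grad.stride())  # still (1, 4)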
In-place operations on Tensors¶
Supporting in-place operations in autograd is a hard matter, and we discourage their use in most cases. Autograd’s aggressive buffer freeing and reuse makes it very efficient and there are very few occasions when in-place operations actually lower memory usage by any significant amount. Unless you’re operating under heavy memory pressure, you might never need to use them.
In-place correctness checks¶
All Tensors keep track of in-place operations applied to them, and
if the implementation detects that a tensor was saved for backward in one of
the functions, but it was modified in-place afterwards, an error will be raised
once the backward pass is started. This ensures that if you're using in-place
functions and not seeing any errors, you can be sure that the computed
gradients are correct.
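A minimal sketch of the error this check produces (exp() saves its output for the backward pass):

import torch

a = torch.rand(3, requires_grad=True)
b = a.exp()   # the result is saved for backward
b.add_(1)     # in-place modification of a tensor saved for backward
b.sum().backward()
# RuntimeError: one of the variables needed for gradient computation has been
# modified by an inplace operation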
Variable (deprecated)¶
Warning
The Variable API has been deprecated: Variables are no longer necessary to
use autograd with tensors. Autograd automatically supports Tensors with
requires_grad set to True. Below please find a quick guide on what
has changed:
Variable(tensor) and Variable(tensor, requires_grad) still work as expected, but they return Tensors instead of Variables.
var.data is the same thing as tensor.data.
Methods such as var.backward(), var.detach(), var.register_hook() now work on tensors with the same method names.
In addition, one can now create tensors with requires_grad=True using factory
methods such as torch.randn(), torch.zeros(), torch.ones(), and others
like the following:
autograd_tensor = torch.randn((2, 3, 4), requires_grad=True)
Tensor autograd functions¶
torch.Tensor.grad | This attribute is None by default and becomes a Tensor the first time a call to backward() computes gradients for self.
torch.Tensor.requires_grad | Is True if gradients need to be computed for this Tensor, False otherwise.
torch.Tensor.is_leaf | All Tensors that have requires_grad which is False will be leaf Tensors by convention.
torch.Tensor.backward | Computes the gradient of current tensor w.r.t. graph leaves.
torch.Tensor.detach | Returns a new Tensor, detached from the current graph.
torch.Tensor.detach_ | Detaches the Tensor from the graph that created it, making it a leaf.
torch.Tensor.register_hook | Registers a backward hook.
torch.Tensor.retain_grad | Enables this Tensor to have their grad populated during backward().
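A minimal sketch touching several of these (values are illustrative):

import torch

x = torch.randn(3, requires_grad=True)
y = (x * 2).sum()

print(x.is_leaf, y.is_leaf)                    # True False
x.register_hook(lambda g: print("grad:", g))   # called when x's gradient is computed

y.backward()
print(x.grad)                                  # populated because x is a leaf
print(x.detach().requires_grad)                # False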
Function¶
- class torch.autograd.Function(*args, **kwargs)[source]¶
Base class to create custom autograd.Function.
To create a custom autograd.Function, subclass this class and implement the forward() and backward() static methods. Then, to use your custom op in the forward pass, call the class method apply. Do not call forward() directly.
To ensure correctness and best performance, make sure you are calling the correct methods on ctx and validating your backward function using torch.autograd.gradcheck().
See Extending torch.autograd for more details on how to use this class.
Examples:
>>> class Exp(Function):
>>>     @staticmethod
>>>     def forward(ctx, i):
>>>         result = i.exp()
>>>         ctx.save_for_backward(result)
>>>         return result
>>>
>>>     @staticmethod
>>>     def backward(ctx, grad_output):
>>>         result, = ctx.saved_tensors
>>>         return grad_output * result
>>>
>>> # Use it by calling the apply method:
>>> output = Exp.apply(input)
Function.forward | This function is to be overridden by all subclasses.
Function.backward | Defines a formula for differentiating the operation with backward mode automatic differentiation (alias to the vjp function).
Function.jvp | Defines a formula for differentiating the operation with forward mode automatic differentiation.
Function.vmap | Defines a rule for the behavior of this autograd.Function underneath torch.vmap().
Context method mixins¶
When creating a new Function, the following methods are available to ctx.
function.FunctionCtx.mark_dirty | Marks given tensors as modified in an in-place operation.
function.FunctionCtx.mark_non_differentiable | Marks outputs as non-differentiable.
function.FunctionCtx.save_for_backward | Saves given tensors for a future call to backward().
function.FunctionCtx.set_materialize_grads | Sets whether to materialize grad tensors.
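A minimal sketch of a custom Function that uses a couple of these ctx methods (the op itself is only for illustration):

import torch
from torch.autograd import Function

class SquareAndSign(Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)           # needed to compute the gradient
        sign = x.sign()
        ctx.mark_non_differentiable(sign)  # the sign output carries no gradient
        return x ** 2, sign

    @staticmethod
    def backward(ctx, grad_out, grad_sign):
        (x,) = ctx.saved_tensors
        return grad_out * 2 * x

x = torch.randn(4, requires_grad=True)
y, s = SquareAndSign.apply(x)
y.sum().backward()
print(torch.allclose(x.grad, 2 * x))  # True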
Numerical gradient checking¶
gradcheck | Check gradients computed via small finite differences against analytical gradients w.r.t. tensors in inputs that are of floating point or complex type and with requires_grad=True.
gradgradcheck | Check gradients of gradients computed via small finite differences against analytical gradients w.r.t. tensors in inputs and grad_outputs that are of floating point or complex type and with requires_grad=True.
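A minimal sketch of gradcheck (double-precision inputs are recommended so that the finite differences are accurate):

import torch
from torch.autograd import gradcheck

input = torch.randn(4, dtype=torch.double, requires_grad=True)

# Raises an error if the analytical and numerical gradients disagree
print(gradcheck(torch.exp, (input,), eps=1e-6, atol=1e-4))  # True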
Profiler¶
Autograd includes a profiler that lets you inspect the cost of different
operators inside your model - both on the CPU and GPU. There are three modes
implemented at the moment: CPU-only using profile,
nvprof-based (registers both CPU and GPU activity) using emit_nvtx,
and VTune profiler-based using emit_itt.
- class torch.autograd.profiler.profile(enabled=True, *, use_cuda=False, record_shapes=False, with_flops=False, profile_memory=False, with_stack=False, with_modules=False, use_kineto=False, use_cpu=True, experimental_config=None)[source]¶
Context manager that manages autograd profiler state and holds a summary of results. Under the hood it just records events of functions being executed in C++ and exposes those events to Python. You can wrap any code into it and it will only report runtime of PyTorch functions. Note: profiler is thread local and is automatically propagated into the async tasks
- Parameters:
enabled (bool, optional) – Setting this to False makes this context manager a no-op.
use_cuda (bool, optional) – Enables timing of CUDA events as well using the cudaEvent API. Adds approximately 4us of overhead to each tensor operation.
record_shapes (bool, optional) – If shapes recording is set, information about input dimensions will be collected. This allows one to see which dimensions have been used under the hood and further group by them using prof.key_averages(group_by_input_shape=True). Please note that shape recording might skew your profiling data. It is recommended to use separate runs with and without shape recording to validate the timing. Most likely the skew will be negligible for bottom most events (in a case of nested function calls). But for higher level functions the total self cpu time might be artificially increased because of the shape collection.
with_flops (bool, optional) – If with_flops is set, the profiler will estimate the FLOPs (floating point operations) value using the operator’s input shape. This allows one to estimate the hardware performance. Currently, this option only works for the matrix multiplication and 2D convolution operators.
profile_memory (bool, optional) – track tensor memory allocation/deallocation.
with_stack (bool, optional) – record source information (file and line number) for the ops.
with_modules (bool) – record module hierarchy (including function names) corresponding to the callstack of the op. e.g. If module A's forward calls module B's forward, which contains an aten::add op, then aten::add's module hierarchy is A.B. Note that this support exists, at the moment, only for TorchScript models and not eager mode models.
use_kineto (bool, optional) – experimental, enable profiling with Kineto profiler.
use_cpu (bool, optional) – profile CPU events; setting to False requires use_kineto=True and can be used to lower the overhead for GPU-only profiling.
experimental_config (_ExperimentalConfig) – A set of experimental options used by profiler libraries like Kineto. Note, backward compatibility is not guaranteed.
Example
>>> x = torch.randn((1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
>>>     for _ in range(100):  # any normal python code, really!
>>>         y = x ** 2
>>>         y.backward()
>>> # NOTE: some columns were removed for brevity
>>> print(prof.key_averages().table(sort_by="self_cpu_time_total"))
-----------------------------------  ---------------  ---------------  ---------------
Name                                 Self CPU total   CPU time avg     Number of Calls
-----------------------------------  ---------------  ---------------  ---------------
mul                                  32.048ms         32.048ms         200
pow                                  27.041ms         27.041ms         200
PowBackward0                         9.727ms          55.483ms         100
torch::autograd::AccumulateGrad      9.148ms          9.148ms          100
torch::autograd::GraphRoot           691.816us        691.816us        100
-----------------------------------  ---------------  ---------------  ---------------
profiler.profile.export_chrome_trace | Exports an EventList as a Chrome tracing tools file.
profiler.profile.key_averages | Averages all function events over their keys.
profiler.profile.self_cpu_time_total | Returns total time spent on CPU obtained as a sum of all self times across all the events.
profiler.profile.total_average | Averages all events.
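For example, continuing from the prof object collected above, the trace can be exported for a Chrome-trace viewer (the filename is illustrative):

prof.export_chrome_trace("trace.json")  # open via chrome://tracing or Perfetto
print(prof.self_cpu_time_total)         # total self CPU time, in microseconds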
- class torch.autograd.profiler.emit_nvtx(enabled=True, record_shapes=False)[source]¶
Context manager that makes every autograd operation emit an NVTX range.
It is useful when running the program under nvprof:
nvprof --profile-from-start off -o trace_name.prof -- <regular command here>
Unfortunately, there's no way to force nvprof to flush the data it collected to disk, so for CUDA profiling one has to use this context manager to annotate nvprof traces and wait for the process to exit before inspecting them. Then, either NVIDIA Visual Profiler (nvvp) can be used to visualize the timeline, or torch.autograd.profiler.load_nvprof() can load the results for inspection, e.g. in a Python REPL.
- Parameters:
enabled (bool, optional) – Setting enabled=False makes this context manager a no-op. Default: True.
record_shapes (bool, optional) – If record_shapes=True, the nvtx range wrapping each autograd op will append information about the sizes of Tensor arguments received by that op, in the following format: [[arg0.size(0), arg0.size(1), ...], [arg1.size(0), arg1.size(1), ...], ...] Non-tensor arguments will be represented by []. Arguments will be listed in the order they are received by the backend op. Please note that this order may not match the order in which those arguments were passed on the Python side. Also note that shape recording may increase the overhead of nvtx range creation. Default: False
Example
>>> with torch.cuda.profiler.profile():
...     model(x)  # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)
Forward-backward correlation
When viewing a profile created using emit_nvtx in the Nvidia Visual Profiler, correlating each backward-pass op with the corresponding forward-pass op can be difficult. To ease this task, emit_nvtx appends sequence number information to the ranges it generates.
appends sequence number information to the ranges it generates.During the forward pass, each function range is decorated with
seq=<N>
.seq
is a running counter, incremented each time a new backward Function object is created and stashed for backward. Thus, theseq=<N>
annotation associated with each forward function range tells you that if a backward Function object is created by this forward function, the backward object will receive sequence number N. During the backward pass, the top-level range wrapping each C++ backward Function’sapply()
call is decorated withstashed seq=<M>
.M
is the sequence number that the backward object was created with. By comparingstashed seq
numbers in backward withseq
numbers in forward, you can track down which forward op created each backward Function.Any functions executed during the backward pass are also decorated with
seq=<N>
. During default backward (withcreate_graph=False
) this information is irrelevant, and in fact,N
may simply be 0 for all such functions. Only the top-level ranges associated with backward Function objects’apply()
methods are useful, as a way to correlate these Function objects with the earlier forward pass.Double-backward
If, on the other hand, a backward pass with create_graph=True is underway (in other words, if you are setting up for a double-backward), each function's execution during backward is given a nonzero, useful seq=<N>. Those functions may themselves create Function objects to be executed later during double-backward, just as the original functions in the forward pass did. The relationship between backward and double-backward is conceptually the same as the relationship between forward and backward: The functions still emit current-sequence-number-tagged ranges, the Function objects they create still stash those sequence numbers, and during the eventual double-backward, the Function objects' apply() ranges are still tagged with stashed seq numbers, which can be compared to seq numbers from the backward pass.
- class torch.autograd.profiler.emit_itt(enabled=True, record_shapes=False)[source]¶
Context manager that makes every autograd operation emit an ITT range.
It is useful when running the program under Intel(R) VTune Profiler:
vtune <--vtune-flags> <regular command here>
The Instrumentation and Tracing Technology (ITT) API enables your application to generate and control the collection of trace data during its execution across different Intel tools. This context manager is used to annotate Intel(R) VTune Profiler traces. With the help of this context manager, you will be able to see labeled ranges in the Intel(R) VTune Profiler GUI.
- Parameters:
enabled (bool, optional) – Setting enabled=False makes this context manager a no-op. Default: True.
record_shapes (bool, optional) – If record_shapes=True, the itt range wrapping each autograd op will append information about the sizes of Tensor arguments received by that op, in the following format: [[arg0.size(0), arg0.size(1), ...], [arg1.size(0), arg1.size(1), ...], ...] Non-tensor arguments will be represented by []. Arguments will be listed in the order they are received by the backend op. Please note that this order may not match the order in which those arguments were passed on the Python side. Also note that shape recording may increase the overhead of itt range creation. Default: False
Example
>>> with torch.autograd.profiler.emit_itt():
...     model(x)
profiler.load_nvprof | Opens an nvprof trace file and parses autograd annotations.
Anomaly detection¶
- class torch.autograd.detect_anomaly(check_nan=True)[source]¶
Context-manager that enables anomaly detection for the autograd engine.
This does two things:
Running the forward pass with detection enabled will allow the backward pass to print the traceback of the forward operation that created the failing backward function.
If check_nan is True, any backward computation that generates "nan" values will raise an error. Default True.
Warning
This mode should be enabled only for debugging as the different tests will slow down your program execution.
Example
>>> import torch
>>> from torch import autograd
>>> class MyFunc(autograd.Function):
...     @staticmethod
...     def forward(ctx, inp):
...         return inp.clone()
...     @staticmethod
...     def backward(ctx, gO):
...         # Error during the backward pass
...         raise RuntimeError("Some error in backward")
...         return gO.clone()
>>> def run_fn(a):
...     out = MyFunc.apply(a)
...     return out.sum()
>>> inp = torch.rand(10, 10, requires_grad=True)
>>> out = run_fn(inp)
>>> out.backward()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/your/pytorch/install/torch/_tensor.py", line 93, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph)
      File "/your/pytorch/install/torch/autograd/__init__.py", line 90, in backward
        allow_unreachable=True)  # allow_unreachable flag
      File "/your/pytorch/install/torch/autograd/function.py", line 76, in apply
        return self._forward_cls.backward(self, *args)
      File "<stdin>", line 8, in backward
    RuntimeError: Some error in backward
>>> with autograd.detect_anomaly():
...     inp = torch.rand(10, 10, requires_grad=True)
...     out = run_fn(inp)
...     out.backward()
    Traceback of forward call that caused the error:
      File "tmp.py", line 53, in <module>
        out = run_fn(inp)
      File "tmp.py", line 44, in run_fn
        out = MyFunc.apply(a)
    Traceback (most recent call last):
      File "<stdin>", line 4, in <module>
      File "/your/pytorch/install/torch/_tensor.py", line 93, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph)
      File "/your/pytorch/install/torch/autograd/__init__.py", line 90, in backward
        allow_unreachable=True)  # allow_unreachable flag
      File "/your/pytorch/install/torch/autograd/function.py", line 76, in apply
        return self._forward_cls.backward(self, *args)
      File "<stdin>", line 8, in backward
    RuntimeError: Some error in backward
- class torch.autograd.set_detect_anomaly(mode, check_nan=True)[source]¶
Context-manager that sets the anomaly detection for the autograd engine on or off.
set_detect_anomaly will enable or disable the autograd anomaly detection based on its argument mode. It can be used as a context-manager or as a function.
See detect_anomaly above for details of the anomaly detection behaviour.
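A minimal sketch of both usages:

import torch

x = torch.randn(3, requires_grad=True)

# As a context-manager: anomaly detection only inside this block
with torch.autograd.set_detect_anomaly(True):
    y = (x * 2).sum()
    y.backward()  # a nan produced during this backward would raise, with the forward traceback

# As a function: toggle it globally
torch.autograd.set_detect_anomaly(False)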
Autograd graph¶
Autograd exposes methods that allow one to inspect the graph and interpose behavior during the backward pass.
The grad_fn attribute of a torch.Tensor holds a torch.autograd.graph.Node
if the tensor is the output of an operation that was recorded by autograd (i.e., grad_mode is
enabled and at least one of the inputs required gradients), or None otherwise.
graph.Node.name | Returns the name.
graph.Node.metadata | Returns the metadata.
graph.Node.register_hook | Registers a backward hook.
graph.Node.register_prehook | Registers a backward pre-hook.
Some operations need intermediary results to be saved during the forward pass
in order to execute the backward pass.
These intermediary results are saved as attributes on the grad_fn
and can be accessed.
For example:
>>> a = torch.tensor([0., 0., 0.], requires_grad=True)
>>> b = a.exp()
>>> print(isinstance(b.grad_fn, torch.autograd.graph.Node))
True
>>> print(dir(b.grad_fn))
['__call__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_raw_saved_result', '_register_hook_dict', '_saved_result', 'metadata', 'name', 'next_functions', 'register_hook', 'register_prehook', 'requires_grad']
>>> print(torch.allclose(b.grad_fn._saved_result, b))
True
You can also define how these saved tensors should be packed / unpacked using hooks. A common application is to trade compute for memory by saving those intermediary results to disk or to CPU instead of leaving them on the GPU. This is especially useful if you notice your model fits on GPU during evaluation, but not training. Also see Hooks for saved tensors.
- class torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook)[source]¶
Context-manager that sets a pair of pack / unpack hooks for saved tensors.
Use this context-manager to define how intermediary results of an operation should be packed before saving, and unpacked on retrieval.
In that context, the pack_hook function will be called every time an operation saves a tensor for backward (this includes intermediary results saved using save_for_backward() but also those recorded by a PyTorch-defined operation). The output of pack_hook is then stored in the computation graph instead of the original tensor.
The unpack_hook is called when the saved tensor needs to be accessed, namely when executing torch.Tensor.backward() or torch.autograd.grad(). It takes as argument the packed object returned by pack_hook and should return a tensor which has the same content as the original tensor (passed as input to the corresponding pack_hook).
The hooks should have the following signatures:
pack_hook(tensor: Tensor) -> Any
unpack_hook(Any) -> Tensor
where the return value of pack_hook is a valid input to unpack_hook.
In general, you want unpack_hook(pack_hook(t)) to be equal to t in terms of value, size, dtype and device.
Example:
>>> def pack_hook(x):
...     print("Packing", x)
...     return x
>>>
>>> def unpack_hook(x):
...     print("Unpacking", x)
...     return x
>>>
>>> a = torch.ones(5, requires_grad=True)
>>> b = torch.ones(5, requires_grad=True) * 2
>>> with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
...     y = a * b
Packing tensor([1., 1., 1., 1., 1.], requires_grad=True)
Packing tensor([2., 2., 2., 2., 2.], grad_fn=<MulBackward0>)
>>> y.sum().backward()
Unpacking tensor([1., 1., 1., 1., 1.], requires_grad=True)
Unpacking tensor([2., 2., 2., 2., 2.], grad_fn=<MulBackward0>)
Warning
Performing an inplace operation on the input to either hook may lead to undefined behavior.
Warning
Only one pair of hooks is allowed at a time. When recursively nesting this context-manager, only the inner-most pair of hooks will be applied.
- class torch.autograd.graph.save_on_cpu(pin_memory=False)[source]¶
Context-manager under which tensors saved by the forward pass will be stored on CPU, then retrieved for backward.
When performing operations within this context manager, intermediary results saved in the graph during the forward pass will be moved to CPU, then copied back to the original device when needed for the backward pass. If the graph was already on CPU, no tensor copy is performed.
Use this context-manager to trade compute for GPU memory usage (e.g. when your model doesn’t fit in GPU memory during training).
- Parameters:
pin_memory (bool) – If True, tensors will be saved to CPU pinned memory during packing and copied to GPU asynchronously during unpacking. Defaults to False. Also see Use pinned memory buffers.
Example:
>>> a = torch.randn(5, requires_grad=True, device="cuda")
>>> b = torch.randn(5, requires_grad=True, device="cuda")
>>> c = torch.randn(5, requires_grad=True, device="cuda")
>>>
>>> def f(a, b, c):
...     prod_1 = a * b           # a and b are saved on GPU
...     with torch.autograd.graph.save_on_cpu():
...         prod_2 = prod_1 * c  # prod_1 and c are saved on CPU
...     y = prod_2 * a           # prod_2 and a are saved on GPU
...     return y
>>>
>>> y = f(a, b, c)
>>> del a, b, c  # for illustration only
>>> # the content of a, b, and prod_2 are still alive on GPU
>>> # the content of prod_1 and c only live on CPU
>>> y.sum().backward()  # all CPU tensors are moved back to GPU, for backward
>>> # all intermediary tensors are released (deleted) after the call to backward
- class torch.autograd.graph.disable_saved_tensors_hooks(error_message)[source]¶
Context-manager that disables the saved tensors default hooks feature.
Useful if you are creating a feature that does not work with saved tensors default hooks.
- Parameters:
error_message (str) – When saved tensors default hooks are used while they are disabled, a RuntimeError with this error message gets raised.
Example:
>>> message = "saved tensors default hooks are disabled"
>>> with torch.autograd.graph.disable_saved_tensors_hooks(message):
...     # Raises RuntimeError: saved tensors default hooks are disabled
...     with torch.autograd.graph.save_on_cpu():
...         pass