PyTorch is one of the most widely used frameworks for deep learning research and production. Using it efficiently requires understanding tensor shapes, minimizing memory movement between CPU and GPU, and keeping the hardware busy through NVIDIA CUDA acceleration.
This reference sheet covers tensor manipulation, device management, automatic mixed precision, CUDA memory debugging, and multi-GPU training with DDP.
Tensor Initialization & Manipulation
Tensors are the multi-dimensional arrays at the heart of PyTorch. Knowing which operations return a view of existing memory and which ones copy data matters for both speed and memory use.
import torch
import numpy as np
# 1. Device-aware initialization
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.zeros((3, 4), dtype=torch.float32, device=device)
# 2. Convert from NumPy (shares the underlying memory buffer; the tensor lives on the CPU)
np_array = np.ones((5, 5))  # float64 by default
tensor_from_np = torch.from_numpy(np_array)  # dtype torch.float64; modifications affect both!
# 3. Reshaping and Dimension Manipulation
y = torch.randn(2, 3, 4)
# Reshape without copying data (returns a view; requires a compatible memory layout)
y_view = y.view(6, 4)
# Permute dimensions (returns a view with reordered strides; the result is non-contiguous)
y_permuted = y.permute(2, 0, 1) # Shape becomes (4, 2, 3)
# Add/remove singleton dimensions
z = torch.randn(3, 1, 4)
z_squeezed = z.squeeze(1) # Shape (3, 4)
z_unsqueezed = z.unsqueeze(0) # Shape (1, 3, 1, 4)
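The sketch below illustrates two points from the snippet above: `torch.from_numpy` shares memory with the source array, and `view` fails on a permuted (non-contiguous) tensor where `reshape` or `contiguous()` succeeds. The variable names are illustrative only.

```python
import numpy as np
import torch

# NumPy interop: from_numpy shares the buffer, so writes are visible on both sides.
np_array = np.ones((2, 2))
shared = torch.from_numpy(np_array)
np_array[0, 0] = 5.0
shares_memory = shared[0, 0].item() == 5.0

# permute returns a view with reordered strides, so the result is non-contiguous.
y = torch.randn(2, 3, 4)
y_permuted = y.permute(2, 0, 1)  # shape (4, 2, 3)
is_contiguous = y_permuted.is_contiguous()

# view needs a compatible memory layout and raises on this tensor.
try:
    y_permuted.view(8, 3)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape copies only when it has to; contiguous() copies into the new layout here.
flat_a = y_permuted.reshape(8, 3)
flat_b = y_permuted.contiguous().view(8, 3)
```

As a rule of thumb, `reshape` is the safer default, and `view` is useful when you want an error instead of a silent copy.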
Managing CUDA Devices
Moving tensors between CPU and GPU involves communication overhead. Keep data transfers to a minimum and allocate directly on the target device whenever possible.
# Check for CUDA availability
cuda_available = torch.cuda.is_available()
device_count = torch.cuda.device_count()
if cuda_available:
    # Set default active GPU device
    torch.cuda.set_device(0)
    current_device = torch.cuda.current_device()
    device_name = torch.cuda.get_device_name(current_device)
    print(f"Active GPU: {device_name} (ID: {current_device})")
    # Pin memory on CPU for faster asynchronous transfers to GPU
    # (non_blocking=True is only asynchronous when the source tensor is pinned)
    cpu_tensor = torch.randn(1000, 1000).pin_memory()
    gpu_tensor = cpu_tensor.to("cuda", non_blocking=True)
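A minimal, device-agnostic sketch of the advice above: allocate directly on the target device when possible, and pin host memory only when the destination is a GPU. The helper `to_device` is hypothetical (it is not part of PyTorch), and the code falls back to the CPU when CUDA is unavailable.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def to_device(batch: torch.Tensor, device: torch.device) -> torch.Tensor:
    """Hypothetical helper: move a CPU batch to `device`, pinning first for GPU targets."""
    if device.type == "cuda":
        # Pinned (page-locked) host memory lets the copy run asynchronously.
        return batch.pin_memory().to(device, non_blocking=True)
    return batch.to(device)

# Preferred: allocate on the target device, so no host-to-device copy happens.
direct = torch.zeros(256, 256, device=device)

# When the data starts on the CPU (for example, loaded from disk), transfer it once.
moved = to_device(torch.ones(256, 256), device)

# .item() copies the scalar back to the host and waits for pending GPU work.
result = (direct + moved).sum().item()
```

In a training loop, passing `pin_memory=True` to `torch.utils.data.DataLoader` performs the pinning inside the loader, so only the `non_blocking=True` transfer remains in your code.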
Mixed Precision & Gradient Scaling
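Automatic Mixed Precision (AMP) runs eligible operations, such as matrix multiplications and convolutions, in half precision (FP16 or BF16), which speeds up execution and reduces GPU memory use. Model weights and numerically sensitive operations, such as softmax and reductions, stay in full precision (FP32) to preserve accuracy. Gradient scaling is needed for FP16 only; BF16 has the same exponent range as FP32 and normally trains without a scaler.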
import torch.nn as nn
import torch.optim as optim
# MyDeepLearningModel, criterion, and dataloader are placeholders for your own objects
model = MyDeepLearningModel().to("cuda")
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
# Initialize Gradient Scaler to prevent underflow in FP16 gradients
# (newer PyTorch releases spell this torch.amp.GradScaler("cuda"))
scaler = torch.cuda.amp.GradScaler()
for inputs, targets in dataloader:
    inputs, targets = inputs.to("cuda"), targets.to("cuda")
    optimizer.zero_grad()
    # Forward pass under autocast environment
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    # Backward pass with scaled loss (outside the autocast context)
    scaler.scale(loss).backward()
    # Unscale gradients and update weights; the step is skipped if gradients contain inf/NaN
    scaler.step(optimizer)
    # Update the scale factor for the next iteration
    scaler.update()
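A common extension of the loop above is gradient clipping, which has to happen after the gradients are unscaled. The following is a self-contained sketch of a single training step under stated assumptions: a toy `nn.Linear` model and random data stand in for a real model and dataloader, and AMP is simply disabled when no GPU is present so the code also runs on a CPU.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_fp16 = device.type == "cuda"  # FP16 autocast + scaling only on GPU in this sketch
amp_dtype = torch.float16 if use_fp16 else torch.bfloat16

torch.manual_seed(0)
model = nn.Linear(16, 4).to(device)  # stand-in model
criterion = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_fp16)  # no-op when disabled

inputs = torch.randn(8, 16, device=device)  # toy batch
targets = torch.randn(8, 4, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_fp16):
    loss = criterion(model(inputs), targets)

scaler.scale(loss).backward()
# Unscale first so the clipping threshold applies to the true gradient values.
scaler.unscale_(optimizer)
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# step() detects that unscale_ was already called and does not unscale twice.
scaler.step(optimizer)
scaler.update()
```

The `max_norm=1.0` threshold is an arbitrary illustrative value, not a recommendation.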
Debugging GPU Memory & OOM Faults
Out-of-memory (OOM) errors are a common cause of aborted training runs. These diagnostic commands help monitor allocations and release cached blocks that are no longer in use.
# Release unused cached blocks back to the GPU driver (memory held by live tensors is not freed)
torch.cuda.empty_cache()
# Monitor allocations
allocated_memory = torch.cuda.memory_allocated(device=None) # In bytes
reserved_memory = torch.cuda.memory_reserved(device=None) # In bytes
print(f"Allocated: {allocated_memory / 1e6:.2f} MB")
print(f"Reserved (Cached): {reserved_memory / 1e6:.2f} MB")
# Generate a detailed structural report of current memory footprint
print(torch.cuda.memory_summary(device=None, abbreviated=False))
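To locate the step that pushes a run over the limit, it often helps to track peak usage as well as current usage. The sketch below wraps the calls above together with `torch.cuda.max_memory_allocated` in a hypothetical helper, `cuda_memory_report`, which returns zeros on machines without a GPU so the snippet runs anywhere.

```python
import torch

def cuda_memory_report() -> dict:
    """Hypothetical helper: current, reserved, and peak CUDA memory in MB (zeros without a GPU)."""
    if not torch.cuda.is_available():
        return {"allocated_mb": 0.0, "reserved_mb": 0.0, "peak_mb": 0.0}
    return {
        "allocated_mb": torch.cuda.memory_allocated() / 1e6,
        "reserved_mb": torch.cuda.memory_reserved() / 1e6,
        "peak_mb": torch.cuda.max_memory_allocated() / 1e6,
    }

has_cuda = torch.cuda.is_available()
if has_cuda:
    torch.cuda.reset_peak_memory_stats()  # start a fresh peak measurement

device = "cuda" if has_cuda else "cpu"
activations = torch.randn(1024, 1024, device=device)  # about 4 MB of FP32
before = cuda_memory_report()

del activations  # drop the last reference; the block returns to PyTorch's cache
after_del = cuda_memory_report()

if has_cuda:
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
after_empty = cuda_memory_report()
```

On a GPU, `allocated_mb` drops after `del`, while `reserved_mb` typically drops only after `empty_cache()`, which is why the two numbers are reported separately.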
Advanced Model Distributed Operations
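When scaling training across multiple GPUs, use DistributedDataParallel (DDP). DDP runs one process per GPU, keeps a full model replica in each process, and averages gradients across processes with all-reduce during the backward pass, so every replica applies the same update.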
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
def setup_distributed(rank, world_size):
    # Initialize the process group
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:29500",
        world_size=world_size,
        rank=rank
    )
def cleanup_distributed():
    dist.destroy_process_group()
# Wrap your neural network inside DDP for automated gradient syncing
# ddp_model = DDP(model, device_ids=[rank])
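Because each DDP process holds its own model replica, each rank should also read a different slice of the dataset. A common companion to DDP for this is `torch.utils.data.DistributedSampler`. The sketch below is an illustration under assumptions: a toy ten-element dataset and a world size of 2. Passing `num_replicas` and `rank` explicitly means no process group has to be initialized, so it runs in a single process.

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(10))  # toy dataset: samples 0..9
world_size = 2  # assumed number of DDP processes

def indices_for_rank(rank: int) -> list:
    # Explicit num_replicas and rank: no initialized process group is required here.
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=False
    )
    return list(sampler)

rank0 = indices_for_rank(0)  # every second sample, starting at 0
rank1 = indices_for_rank(1)  # every second sample, starting at 1

# In a DDP worker, the sampler is passed to the DataLoader instead of shuffle=True.
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=0, shuffle=False)
loader = DataLoader(dataset, batch_size=5, sampler=sampler)
first_batch = next(iter(loader))[0].tolist()
```

When shuffling is enabled, call `sampler.set_epoch(epoch)` at the start of every epoch so that each epoch uses a different ordering.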