DeviceStatsMonitor¶
- class lightning.pytorch.callbacks.DeviceStatsMonitor(cpu_stats=None)[source]¶
Bases: Callback

Automatically monitors and logs device stats during the training, validation, and testing stages.

DeviceStatsMonitor is a special callback, as it requires a logger to be passed as an argument to the Trainer.

Logged Metrics
Logs device statistics with keys prefixed as DeviceStatsMonitor.{hook_name}/{base_metric_name}. The actual metrics depend on the active accelerator and the cpu_stats flag. Below is an overview of the possible metrics and their meaning.

CPU (via psutil)

- cpu_percent — System-wide CPU utilization (%)
- cpu_vm_percent — System-wide virtual memory (RAM) utilization (%)
- cpu_swap_percent — System-wide swap memory utilization (%)
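For orientation, the CPU metrics above correspond to system-wide psutil readings. A minimal sketch of gathering equivalent values (an illustration only, not the callback's exact implementation):

import psutil

# System-wide readings roughly matching the metric names above
# (assumption: the callback reports system-wide, not per-process, values).
cpu_stats = {
    "cpu_percent": psutil.cpu_percent(),                # CPU utilization (%)
    "cpu_vm_percent": psutil.virtual_memory().percent,  # RAM utilization (%)
    "cpu_swap_percent": psutil.swap_memory().percent,   # swap utilization (%)
}
print(cpu_stats)

When logged by the callback, each value appears under a key following the prefix scheme above, for example DeviceStatsMonitor.on_train_batch_end/cpu_percent.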
CUDA GPU (via torch.cuda.memory_stats)

Logs memory statistics from the PyTorch caching allocator (all in bytes). GPU compute utilization is not logged by default.

General Memory Usage:

- allocated_bytes.all.current — Current allocated GPU memory
- allocated_bytes.all.peak — Peak allocated GPU memory
- reserved_bytes.all.current — Current reserved GPU memory (allocated + cached)
- reserved_bytes.all.peak — Peak reserved GPU memory
- active_bytes.all.current — Current GPU memory in active use
- active_bytes.all.peak — Peak GPU memory in active use
- inactive_split_bytes.all.current — Memory in inactive, splittable blocks

Allocator Pool Statistics (for small_pool and large_pool):

- allocated_bytes.{pool_type}.current / allocated_bytes.{pool_type}.peak
- reserved_bytes.{pool_type}.current / reserved_bytes.{pool_type}.peak
- active_bytes.{pool_type}.current / active_bytes.{pool_type}.peak

Allocator Events:

- num_ooms — Cumulative out-of-memory errors
- num_alloc_retries — Number of allocation retries
- num_device_alloc — Number of device allocations
- num_device_free — Number of device deallocations

For a full list of CUDA memory stats, see the PyTorch documentation.
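For illustration, these allocator counters can also be read directly from PyTorch; a minimal sketch (assuming a CUDA device is available):

import torch

if torch.cuda.is_available():
    stats = torch.cuda.memory_stats(device=0)  # dict of allocator counters, in bytes
    # A few of the keys listed above:
    print(stats["allocated_bytes.all.current"])  # currently allocated GPU memory
    print(stats["reserved_bytes.all.peak"])      # peak reserved GPU memory
    print(stats["num_ooms"])                     # cumulative out-of-memory errors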
TPU (via torch_xla)

Memory Metrics (per device, e.g., xla:0):

- memory.free.xla:0 — Free HBM memory (MB)
- memory.used.xla:0 — Used HBM memory (MB)
- memory.percent.xla:0 — Percentage of HBM memory used (%)

XLA Operation Counters:

- CachedCompile.xla
- CreateXlaTensor.xla
- DeviceDataCacheMiss.xla
- UncachedCompile.xla
- xla::add.xla, xla::addmm.xla, etc.

These counters can be retrieved using:

torch_xla.debug.metrics.counter_names()
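A minimal sketch of inspecting these counters directly (assuming torch_xla is installed and an XLA device is in use):

import torch_xla.debug.metrics as met

# List every available XLA counter and its current value.
for name in met.counter_names():
    print(name, met.counter_value(name))

# A full human-readable report is also available via met.metrics_report().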
- Parameters:
  cpu_stats¶ (Optional[bool]) – if None, it will log CPU stats only if the accelerator is CPU. If True, it will log CPU stats regardless of the accelerator. If False, it will not log CPU stats regardless of the accelerator.

- Raises:
  MisconfigurationException – If Trainer has no logger.
  ModuleNotFoundError – If psutil is not installed and CPU stats are monitored.
Example:
from lightning import Trainer
from lightning.pytorch.callbacks import DeviceStatsMonitor

device_stats = DeviceStatsMonitor()
trainer = Trainer(callbacks=[device_stats])
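A slightly fuller sketch, assuming you want CPU stats even on a non-CPU accelerator and an explicit logger (CSVLogger is used here purely for illustration; any Lightning logger works):

from lightning import Trainer
from lightning.pytorch.callbacks import DeviceStatsMonitor
from lightning.pytorch.loggers import CSVLogger

# cpu_stats=True forces CPU stats to be logged regardless of the accelerator.
device_stats = DeviceStatsMonitor(cpu_stats=True)

# DeviceStatsMonitor requires the Trainer to have a logger.
trainer = Trainer(callbacks=[device_stats], logger=CSVLogger("logs"))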
- on_test_batch_end(trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0)[source]¶
Called when the test batch ends.
- Return type:
  None
- on_test_batch_start(trainer, pl_module, batch, batch_idx, dataloader_idx=0)[source]¶
Called when the test batch begins.
- Return type:
  None
- on_train_batch_end(trainer, pl_module, outputs, batch, batch_idx)[source]¶
Called when the train batch ends.
- Return type:
  None

Note

The value outputs["loss"] here will be the normalized value w.r.t. accumulate_grad_batches of the loss returned from training_step.
- on_train_batch_start(trainer, pl_module, batch, batch_idx)[source]¶
Called when the train batch begins.
- Return type:
  None
- on_validation_batch_end(trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=0)[source]¶
Called when the validation batch ends.
- Return type:
  None