
Working with GPU

GPU Monitoring

Here is how to poll the status of your GPU(s) in a variety of ways from your terminal:

Watch the processes using GPU(s) and the current state of your GPU(s):
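A common way to do this is to re-run nvidia-smi every second with the watch utility (assuming it is installed):

```shell
watch -n 1 nvidia-smi
```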

Watch the usage stats as they change:
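One way to do this is with nvidia-smi's query mode in a loop (the exact field list here is illustrative; adjust it to taste):

```shell
nvidia-smi --query-gpu=timestamp,pstate,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1
```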

This way is useful as you can see the trace of changes, rather than just the current state shown by nvidia-smi executed without any arguments.

To see what other options you can query, run: nvidia-smi --help-query-gpu .

-l 1 will update every 1 sec ( --loop ). You can increase that number to update less frequently.

-f filename will log into a file, but you won’t be able to see the output. So it’s better to use nvidia-smi ... | tee filename instead, which will show the output and log the results as well.

If you’d like the program to stop logging after running for 3600 seconds, run it as: timeout 3600 nvidia-smi ... (GNU coreutils timeout takes the duration as its first argument).

Most likely you will just want to track the memory usage, so this is probably sufficient:
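A likely form of that command (field names per nvidia-smi --help-query-gpu ):

```shell
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1
```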

Similar to the above, but show the stats as percentages:
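The command here was most likely nvidia-smi's device-monitoring mode restricted to utilization (a guess consistent with the next sentence):

```shell
nvidia-smi dmon -s u
```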

which shows the essentials (usage and memory). If you would like all of the stats, run it without arguments: nvidia-smi dmon . To find out the other options, use: nvidia-smi dmon -h .

nvtop stands for NVidia TOP, an (h)top-like task monitor for NVIDIA GPUs. It can handle multiple GPUs and print information about them in an htop-familiar way.

It shows the processes, and also visually displays the memory and gpu stats.

This application requires building it from source (needing gcc , make , et al), but the instructions are easy to follow and it is quick to build.

gpustat is an nvidia-smi -like monitor, but a compact one. It relies on pynvml to talk to the nvml layer.

Installation: pip3 install gpustat .

And here is a usage example:
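For instance, to watch per-process GPU usage refreshing every second (the -c and -p flags show the command name and pid, -i sets the refresh interval):

```shell
gpustat -cp -i 1
```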

Accessing NVIDIA GPU Info Programmatically

While watching nvidia-smi running in your terminal is handy, sometimes you want to do more than that, and that’s where API access comes in. The following tools provide it.


nvidia-ml-py3 provides Python 3 bindings for nvml c-lib (NVIDIA Management Library), which allows you to query the library directly, without needing to go through nvidia-smi . Therefore this module is much faster than the wrappers around nvidia-smi .

The bindings are implemented with ctypes , so this module is noarch: it’s just pure python.

This library is now a fastai dependency, so you can use it directly.

Print the memory stats for the first GPU card:
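Using the pynvml module that nvidia-ml-py3 provides:

```python
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)  # first GPU card
info = nvmlDeviceGetMemoryInfo(handle)
print(f"total: {info.total}, free: {info.free}, used: {info.used}")  # bytes
```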

List the available GPU devices:
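For example, iterating over all cards reported by nvml:

```python
from pynvml import (nvmlInit, nvmlDeviceGetCount,
                    nvmlDeviceGetHandleByIndex, nvmlDeviceGetName)

nvmlInit()
for i in range(nvmlDeviceGetCount()):
    handle = nvmlDeviceGetHandleByIndex(i)
    print(i, nvmlDeviceGetName(handle))  # the name may be returned as bytes
```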

And here is a usage example via a sample module nvidia_smi :
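A sketch using the nvidia_smi sample module, which re-exports the nvml calls (card id 0 is hardcoded here; there is also a call to list all available card ids):

```python
import nvidia_smi

nvidia_smi.nvmlInit()
handle = nvidia_smi.nvmlDeviceGetHandleByIndex(0)
res = nvidia_smi.nvmlDeviceGetUtilizationRates(handle)
print(f"gpu: {res.gpu}%, gpu-mem: {res.memory}%")
```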


py3nvml is another fork of nvidia-ml-py3 , supplementing it with extra useful utils.

note: there is no py3nvml conda package in its main channel, but it is available on pypi.


GPUtil is a wrapper around nvidia-smi , and requires the latter to be installed and working before it can be used.

Installation: pip3 install gputil .

And here is a usage example:
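For example, printing load and memory for every visible card:

```python
import GPUtil

for gpu in GPUtil.getGPUs():
    print(f"GPU {gpu.id} ({gpu.name}): "
          f"load {gpu.load * 100:.0f}%, "
          f"memory {gpu.memoryUsed:.0f}/{gpu.memoryTotal:.0f}MB")
```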

For more details see: https://github.com/anderskm/gputil

For more details see: https://github.com/nicolargo/nvidia-ml-py3

GPU Memory Notes

Unusable GPU RAM per process

As soon as you start using CUDA, your GPU loses some 300-500MB RAM per process. The exact size seems to depend on the card and CUDA version. For example, on GeForce GTX 1070 Ti (8GB), the following code, running on CUDA 10.0, consumes 0.5GB GPU RAM:
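This one-liner is enough, since moving any tensor to the GPU triggers CUDA context creation:

```python
import torch

torch.ones((1, 1)).cuda()  # tiny tensor, but the CUDA context costs ~0.5GB
```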

This GPU memory is not accessible to your program’s needs and it’s not re-usable between processes. If you run two processes, each executing code on cuda , each will consume 0.5GB GPU RAM from the get-go.

This fixed chunk of memory is used by the CUDA context.

Cached Memory

pytorch normally caches GPU RAM it previously used to re-use it at a later time. So the output from nvidia-smi could be incorrect in that you may have more GPU RAM available than it reports. You can reclaim this cache with:
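The call in question:

```python
import torch

torch.cuda.empty_cache()  # releases cached, unoccupied GPU memory back to the driver
```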

If you have more than one process using the same GPU, the cached memory from one process is not accessible to the other. The above code executed by the first process will solve this issue and make the freed GPU RAM available to the other process.

It also might be helpful to note that torch.cuda.memory_cached() doesn’t show how much memory pytorch has free in the cache; it indicates how much memory it currently has allocated for the cache, some of which is in use and some of which may be free. To measure how much free memory is available in the cache, do: torch.cuda.memory_cached() - torch.cuda.memory_allocated() .

Reusing GPU RAM

How can we do a lot of experimentation in a given jupyter notebook w/o needing to restart the kernel all the time? You can delete the variables that hold the memory, call import gc; gc.collect() to reclaim memory held by deleted objects with circular references, and optionally (if you have just one process) call torch.cuda.empty_cache() ; you can then re-use the GPU memory inside the same kernel.
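Put together, the sequence might look like this ( model and learn are hypothetical variables holding references to GPU tensors):

```python
import gc

import torch

del model, learn          # drop the references holding GPU memory (hypothetical names)
gc.collect()              # reclaim objects with circular references
torch.cuda.empty_cache()  # only worthwhile if no other process shares this GPU
```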


To automate this process, and get various stats on memory consumption, you can use IPyExperiments. Other than helping you to reclaim general and GPU RAM, it is also helpful with efficiently tuning up your notebook parameters to avoid CUDA: out of memory errors and detecting various other memory leaks.

And also make sure you read the tutorial on learn.purge and its friends here, which provide an even better solution.

GPU RAM Fragmentation

If you encounter an error similar to the following:
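This is pytorch’s CUDA OOM error; a typical instance looks like the following (the numbers here are reconstructed to match the discussion below and are purely illustrative):

```
RuntimeError: CUDA out of memory. Tried to allocate 350.00 MiB
(GPU 0; 7.93 GiB total capacity; 5.73 GiB already allocated;
0.32 GiB free; 1.34 GiB cached)
```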

You may ask yourself, if there is 0.32 GB free and 1.34 GB cached (i.e. 1.66 GB total of unused memory), how can it not allocate 350 MB? This happens because of memory fragmentation.

For the sake of this example let’s assume that you have a function that allocates as many GBs of GPU RAM as its argument specifies:
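A minimal sketch of such a helper (2**28 float32 elements occupy 1GB, so n rows of them occupy n GB):

```python
import torch

def allocate(n):
    """Allocate roughly n GB of GPU RAM and keep a reference to it (illustrative)."""
    return torch.ones((n, 2**28)).cuda()
```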

And suppose you have an 8GB GPU card and no process is using it, so the process that starts is the first one to use it.

If you do the following sequence of GPU RAM allocations:
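With an allocate helper as just described (each call grabbing n GB), the failing sequence might be (a sketch; as explained next, the real allocator behaves better than this):

```python
x1 = allocate(2)  # [0-2GB] used
x2 = allocate(2)  # [2-4GB] used
x3 = allocate(2)  # [4-6GB] used
x4 = allocate(2)  # [6-8GB] used
del x1, x3        # 4GB free in total, but in two non-adjacent 2GB holes
x5 = allocate(3)  # fails: no 3GB contiguous block available
```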

despite having a total of 4GB of free GPU RAM (cached and free), the last command will fail, because it can’t get 3GB of contiguous memory.

Except, this example isn’t quite valid, because under the hood CUDA relocates physical pages and makes them appear to pytorch as contiguous memory. So in the example above it’ll reuse most or all of those fragments as long as there is nothing else occupying those memory pages.

So for this example to be applicable to the CUDA memory fragmentation situation, it needs to allocate fractions of a memory page, which is currently 2MB for most CUDA cards. If allocations smaller than 2MB occur in the same scenario as this example, fragmentation will occur.

Given that GPU RAM is a scarce resource, it helps to always try to free up anything that’s on CUDA as soon as you’re done using it, and only then move new objects to CUDA. Normally a simple del obj does the trick. However, if your object has circular references in it, it will not be freed despite the del call until python runs gc.collect() . And until the latter happens, it’ll still hold the allocated GPU RAM! That also means that in some situations you may want to call gc.collect() yourself.

If you want to educate yourself on how and when the python garbage collector gets automatically invoked see gc and this.

Peak Memory Usage

If you were to run a GPU memory profiler on a function like Learner fit() you would notice that on the very first epoch it will cause a very large GPU RAM usage spike and then stabilize at a much lower memory usage pattern. This happens because the pytorch memory allocator tries to build the computational graph and gradients for the loaded model in the most efficient way. Luckily, you don’t need to worry about this spike, since the allocator is smart enough to recognize when the memory is tight and it will be able to do the same with much less memory, just not as efficiently. Typically, continuing with the fit() example, the allocator needs to have at least as much memory as the 2nd and subsequent epochs require for the normal run. You can read an excellent thread on this topic here.

pytorch Tensor Memory Tracking

Show all the currently allocated Tensors:
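A commonly used snippet for this walks every object the garbage collector knows about and prints the tensors among them:

```python
import gc

import torch

for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except Exception:
        pass  # some objects raise on attribute access; skip them
```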

Note that gc will not contain some tensors that consume memory inside autograd.

Here is a good discussion on this topic with more related code snippets.

GPU Reset

If for some reason after exiting the python process the GPU doesn’t free the memory, you can try to reset it (change 0 to the desired GPU ID):
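The reset command (it typically requires root and only works when no process is using the GPU; newer drivers also accept the short form -r ):

```shell
sudo nvidia-smi --gpu-reset -i 0
```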

When using multiprocessing, sometimes some of the client processes get stuck and go zombie and won’t release the GPU memory. They also may become invisible to nvidia-smi , so that it reports no memory used, but the card is unusable and fails with OOM even when trying to create a tiny tensor on that card. In such a case locate the relevant processes with fuser -v /dev/nvidia* and kill them with kill -9 .

This blog post suggests the following trick to arrange for the processes to cleanly exit on demand:
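A minimal sketch of the trick: check for a sentinel file at each iteration and exit cleanly when it appears (the filename kill.me is purely illustrative):

```python
import os
import sys

def check_kill_switch(path="kill.me"):
    """Exit cleanly if the kill-switch file appears (filename is illustrative)."""
    if os.path.exists(path):
        sys.exit(0)

# call this once per training iteration:
check_kill_switch()
```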

After you add this code to the training iteration, once you want to stop it, just cd into the directory of the training program and run
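assuming the kill-switch checks for a file named kill.me (an illustrative name):

```shell
touch kill.me
```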


Order of GPUs

When you have multiple GPUs you may discover that pytorch and nvidia-smi don’t order them in the same way, so what nvidia-smi reports as gpu0 could be assigned to gpu1 by pytorch . pytorch uses CUDA GPU ordering, which orders by compute capability (higher compute power GPUs first).

If you want pytorch to use the PCI bus device order, to match nvidia-smi , set:
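```shell
export CUDA_DEVICE_ORDER=PCI_BUS_ID
```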

before starting your program (or put it in your shell startup file, e.g. ~/.bashrc ).

If you just want to run on a specific gpu ID, you can use the CUDA_VISIBLE_DEVICES environment variable. It can be set to a single GPU ID or a list:
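For example:

```shell
export CUDA_VISIBLE_DEVICES=0     # only the first GPU is visible
export CUDA_VISIBLE_DEVICES=1,3   # only the second and fourth GPUs are visible
```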



Advanced GPU configuration (Optional)

GPU Partitioning

Compute workloads can benefit from using separate GPU partitions. The flexibility of GPU partitioning allows a single GPU to be shared and used by small, medium, and large-sized workloads. GPU partitions can be a valid option for executing Deep Learning workloads. An example is Deep Learning training and inferencing workflows, which utilize smaller datasets but are highly dependent on the size of the data/model, and users may need to decrease batch sizes.

The following graphic illustrates a GPU partitioning use case where multiple tenants and users share a single A100 (40GB). In this use case, a single A100 can serve multiple workloads such as Deep Learning training, fine-tuning, inference, Jupyter Notebooks, profiling, debugging, etc.

Using two different NVIDIA GPU technologies, GPUs are partitioned using either NVIDIA AI Enterprise software temporal partitioning or Multi-Instance GPU (MIG) spatial partitioning. Please refer to the GPU Partitioning technical brief to understand the differences.

NVIDIA AI Enterprise Software Partitioning

Using NVIDIA AI Enterprise software partitioning, profiles assign custom amounts of dedicated GPU memory for each user. NVIDIA AI Enterprise Host Software sets the correct amount of memory to meet the specific needs within the workflow for said user. Every virtual machine has dedicated GPU memory and must be assigned accordingly, ensuring that it has the resources needed to handle the expected compute load.

NVIDIA AI Enterprise Host Software allows up to eight users to share each physical GPU by assigning the graphics resources of the available GPUs to virtual machines using a balanced approach. Depending on the number of GPUs within each line card, there can be multiple user types assigned.

Profiles for NVIDIA AI Enterprise

The profiles represent a very flexible deployment option of virtual GPUs, varying in size of GPU memory. The division of GPU memory defines the number of vGPUs that are possible per GPU.

C-series vGPU types are optimized for compute-intensive workloads. As a result, they support only a single display head at a maximum resolution of 4096×2160 and do not provide NVIDIA RTX graphics acceleration.

It is essential to consider which vGPU profile will be used within a deployment since this will ultimately determine how many vGPU backed VMs can be deployed. All VMs using the shared GPU resource must be assigned the same fractionalized vGPU profile. This means you cannot mix vGPU profiles on a single GPU using NVIDIA AI Enterprise software.

In the image below, the right side illustrates valid configurations in green, where VMs share a single GPU resource (GPU 1) on a T4 GPU, and all VMs are assigned homogeneous profiles, such as 8GB, 4GB, or 16GB C profiles. Since there are two GPUs installed in the server, the other T4 (GPU 0) can be partitioned/fractionalized differently than GPU 1. An invalid configuration is shown in red, where a single GPU is shared using 8C and 4C profiles. Heterogeneous profiles are not supported on vGPU, and VMs will not successfully power on.

Scheduling Policies

NVIDIA AI Enterprise provides three GPU scheduling options to accommodate a variety of QoS requirements of customers. However, since AI Enterprise workloads are typically long-running operations, it is recommended to implement the Fixed Share or Equal Share scheduler for optimal performance.

Fixed share scheduling always guarantees the same dedicated quality of service. The fixed share scheduling policies guarantee equal GPU performance across all vGPUs sharing the same physical GPU.

Equal share scheduling provides equal GPU resources to each running VM. As vGPUs are added or removed, the share of GPU processing cycles allocated changes accordingly, so performance increases when utilization is low and decreases when utilization is high.

Best effort scheduling provides consistent performance at a higher scale and therefore reduces the TCO per user. The best effort scheduler leverages a round-robin scheduling algorithm that shares GPU resources based on actual demand, resulting in optimal utilization of resources. This results in consistent performance with optimized user density. The best effort scheduling policy best utilizes the GPU during idle and not fully utilized times, allowing for optimized density and a good QoS.

Additional information regarding GPU scheduling can be found here.

RmPVMRL Registry Key

The RmPVMRL registry key sets the scheduling policy for NVIDIA vGPUs.

You can change the vGPU scheduling policy only on GPUs based on the Pascal, Volta, Turing, and Ampere architectures.



Value        Scheduling Policy
0x00         Best effort scheduler (the default)
0x01         Equal share scheduler with the default time slice length
0x00TT0001   Equal share scheduler with a user-defined time slice length TT
0x011        Fixed share scheduler with the default time slice length
0x00TT0011   Fixed share scheduler with a user-defined time slice length TT


The default time slice length depends on the maximum number of vGPUs per physical GPU allowed for the vGPU type.

Maximum Number of vGPUs    Default Time Slice Length
Less than or equal to 8    2 ms
Greater than 8             1 ms


The two hexadecimal digits TT, in the range 01 to 1E, set the time slice length in milliseconds (ms) for the equal share and fixed share schedulers. The minimum length is 1 ms, and the maximum length is 30 ms.

If TT is 00, the length is set to the default length for the vGPU type.

If TT is greater than 1E, the length is set to 30 ms.


This example sets the vGPU scheduler to equal share scheduler with the default time slice length.
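A plausible vSphere form of this example (following the esxcli usage described later in this section):

```shell
esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x01"
```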


This example sets the vGPU scheduler to equal share scheduler with a time slice that is 3 ms long.
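On vSphere this might look like:

```shell
esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x00030001"
```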

This example sets the vGPU scheduler to a fixed share scheduler with the default time slice length.
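On vSphere this might look like:

```shell
esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x011"
```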

This example sets the vGPU scheduler to a fixed share scheduler with a time slice 24 (0x18) ms long.
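On vSphere this might look like:

```shell
esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x00180011"
```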

Changing the vGPU Scheduling Policy for All GPUs

Perform this task in your hypervisor command shell.

Open a command shell as the root user on your hypervisor host machine. On all supported hypervisors, you can use a secure shell (SSH) for this purpose. Set the RmPVMRL registry key to the value that sets the GPU scheduling policy needed.

In the VMware vSphere SSH CLI, use the esxcli set command.
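The general form (with <value> as a placeholder for the RmPVMRL setting):

```shell
esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=<value>"
```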

where <value> is the value that sets the vGPU scheduling policy you want, for example:

0x01 — Equal Share Scheduler with the default time slice length

0x00030001 — Equal Share Scheduler with a time slice of 3 ms

0x011 — Fixed Share Scheduler with the default time slice length

0x00180011 — Fixed Share Scheduler with a time slice of 24 ms (0x18)

The default time slice length depends on the maximum number of vGPUs per physical GPU allowed for the vGPU type.

Maximum Number of vGPUs    Default Time Slice Length
Less than or equal to 8    2 ms
Greater than 8             1 ms

Reboot your hypervisor host machine.

Changing the vGPU Scheduling Policy for Select GPUs

Perform this task in your hypervisor command shell:

Open a command shell as the root user on your hypervisor host machine. On all supported hypervisors, you can use a secure shell (SSH) for this purpose.

Use the lspci command to obtain the PCI domain and bus/device/function (BDF) of each GPU for which you want to change the scheduling behavior.

Pipe the output of lspci to the grep command to display information only for NVIDIA GPUs.
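For example (the output shown is illustrative, matching the BDFs discussed next):

```shell
lspci | grep NVIDIA
# 85:00.0 3D controller: NVIDIA Corporation ...
# 86:00.0 3D controller: NVIDIA Corporation ...
```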

The NVIDIA GPUs listed in this example have the PCI domain 0000 and BDFs 85:00.0 and 86:00.0.

Use the module parameter NVreg_RegistryDwordsPerDevice to set the pci and RmPVMRL registry keys for each GPU.

Use the esxcli set command:
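A sketch for the two GPUs in the example above, setting the fixed share scheduler with the default time slice on each (adjust the domain, BDFs, and value as needed):

```shell
esxcli system module parameters set -m nvidia \
    -p "NVreg_RegistryDwordsPerDevice=pci=0000:85:00.0;RmPVMRL=0x011;pci=0000:86:00.0;RmPVMRL=0x011"
```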

For each GPU, provide the following information:


The PCI domain of the GPU.


The PCI device BDF of the GPU.


0x01 — Sets the vGPU scheduling policy to Equal Share Scheduler with the default time slice length.

0x00030001 — Sets the vGPU scheduling policy to Equal Share Scheduler with a time slice that is 3ms long.

0x011 — Sets the vGPU scheduling policy to Fixed Share Scheduler with the default time slice length.

0x00180011 — Sets the vGPU scheduling policy to Fixed Share Scheduler with a time slice of 24 ms (0x18) long.

For all supported values, see RmPVMRL Registry Key.

Reboot your hypervisor host machine.

Restoring Default vGPU Scheduler Settings

Perform this task in your hypervisor command shell.

Open a command shell as the root user on your hypervisor host machine. On all supported hypervisors, you can use a secure shell (SSH) for this purpose.

Unset the RmPVMRL registry key by setting the module parameter to an empty string.


The module parameter to set, which depends on whether the scheduling behavior was changed for all GPUs or select GPUs:

For all GPUs, set the NVreg_RegistryDwords module parameter.

For select GPUs, set the NVreg_RegistryDwordsPerDevice module parameter.

For example, to restore default vGPU scheduler settings after they were changed for all GPUs, enter this command:
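On vSphere this might look like (an empty string unsets the key):

```shell
esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords="
```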

Reboot your hypervisor host machine.

NVIDIA Multi-Instance GPU Configuration for vSphere

The NVIDIA A100 Tensor Core GPU is based upon the NVIDIA Ampere architecture and accelerates compute workloads such as AI, data analytics, and HPC in the data center. MIG support on vGPUs began with the NVIDIA AI Enterprise Software 12 release, and gives users the flexibility to use the NVIDIA A100 in MIG mode or non-MIG mode. When the NVIDIA A100 is in non-MIG mode, NVIDIA vGPU software uses temporal partitioning and GPU time slice scheduling. MIG mode spatially partitions GPU hardware so that each MIG instance is fully isolated, with its own streaming multiprocessors (SMs) and high-bandwidth memory. MIG can partition available GPU compute resources as well.

Each instance’s processors have separate and isolated paths through the entire memory system. The on-chip crossbar ports, L2 cache banks, memory controllers, and DRAM address buses are assigned uniquely to an individual instance. This ensures that a particular user’s workload can run with predictable throughput and latency, using the same L2 cache allocation and DRAM bandwidth, even if other tasks thrash their caches or saturate their DRAM interfaces.

A single NVIDIA A100-40GB has eight usable GPU memory slices, each with 5 GB of memory but only seven usable SM slices. There are seven SM slices, not eight, because some SMs cover operational overhead when MIG mode is enabled. MIG mode is configured (or reconfigured) using nvidia-smi and has profiles that you can choose to meet the needs of HPC, deep learning, or accelerated computing workloads.

In summary, MIG spatially partitions the NVIDIA GPU into separate GPU instances but provides benefits of reduced latency over vGPU temporal partitioning for compute workloads. The following tables summarize similarities and differences between A100 MIG capabilities and NVIDIA AI Enterprise software while also highlighting the additional flexibility when combined.

NVIDIA A100 MIG-Enabled (40GB) Virtual GPU Types

NVIDIA A100 with MIG-Disabled (40GB) Virtual GPU Types