NCP-AII Dumps Questions - Effective Way to Get Certified

Post Date:

September 17, 2025

If you work with NVIDIA technology, you know how important it is to stay up to date with the latest knowledge and skills. One way to do that is to earn the NVIDIA-Certified Professional credential by passing the NCP-AII exam. While preparing for the exam, you might consider using NCP-AII dumps to familiarize yourself with its format and content. These NCP-AII exam dumps questions can be an effective way to gauge your knowledge and identify areas where you need additional study. Study the free online NCP-AII exam dumps below.

1. A server with four installed NVIDIA GPUs is experiencing intermittent crashes during heavy AI training workloads. You suspect a power issue. You have monitored the power consumption and found that the GPUs are briefly exceeding the rated power capacity of the PSU during peak loads.

What are TWO effective mitigation strategies you can implement? (Select TWO)

Underclock the GPUs to reduce their power consumption.

Replace the PSU with a higher-wattage PSU.

Disable one of the GPUs to reduce the total power draw.

Increase the server room temperature.

Re-seat the GPUs in their respective slots.
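The power budget described in question 1 comes down to simple arithmetic. The Python sketch below is illustrative only: the PSU rating, base system load, and per-GPU peak figures are hypothetical, and the helper names are made up for this example rather than taken from any NVIDIA tool.

```python
def exceeds_psu(psu_watts, base_system_watts, gpu_peak_watts):
    """True if the base load plus all GPU peaks is above the PSU rating."""
    return base_system_watts + sum(gpu_peak_watts) > psu_watts


def per_gpu_power_cap(psu_watts, base_system_watts, gpu_count, headroom_watts=0):
    """Largest equal per-GPU power limit (watts) that fits the PSU rating.

    headroom_watts is capacity deliberately left unused for transient spikes.
    """
    gpu_budget = psu_watts - headroom_watts - base_system_watts
    if gpu_budget <= 0:
        raise ValueError("PSU cannot cover the non-GPU load")
    return gpu_budget / gpu_count


# Hypothetical figures: 2000 W PSU, 400 W for CPUs, fans, and drives,
# and four GPUs that each peak at 450 W.
gpu_peaks = [450, 450, 450, 450]
over_budget = exceeds_psu(2000, 400, gpu_peaks)  # 400 + 1800 = 2200 W
cap = per_gpu_power_cap(2000, 400, len(gpu_peaks), headroom_watts=200)
print(over_budget, cap)
```

A cap computed this way could then be applied with the power-limit option of ‘nvidia-smi’ (for example ‘nvidia-smi -i 0 -pl 350’), provided the value lies within the range the GPU reports as supported.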

2. Which of the following are crucial considerations when validating the hardware operation of an NVIDIA-Certified Professional AI Infrastructure server before deploying a production AI workload? (Select all that apply)

Ensuring the BIOS and BMC firmware are up to date.

Verifying that the server’s power supply units (PSUs) are operating within their specified voltage and current ranges under full load.

Confirming the correct installation and functionality of all network interfaces (e.g., Ethernet, InfiniBand) relevant to the workload.

Only testing the GPUs with a synthetic benchmark like ‘linpack’.

Checking if the server’s case fans are spinning.

3. You encounter a situation where a container running with GPU support is experiencing significant performance degradation compared to running the same application directly on the host. You have already verified that the NVIDIA drivers are correctly installed and the NVIDIA Container Toolkit is properly configured.

Which of the following could be contributing factors to this performance difference? (Select all that apply)

The container is using a significantly older version of the CUDA runtime compared to the host.

CPU pinning or NUMA affinity is not properly configured for the container, leading to inefficient memory access.

The ‘--ipc=host’ flag is not used when running the container, causing inter-process communication overhead.

The kernel version within the container is significantly different from the host kernel, leading to driver compatibility issues.

Insufficient bandwidth between CPU and GPU

4. You are deploying an NVIDIA GPU-accelerated application in a virtualized environment using vGPU.

How does vGPU technology impact power and cooling considerations compared to a bare-metal deployment, and what specific monitoring metrics become crucial?

vGPU deployments typically require less power and cooling than bare-metal, as resources are shared. The host server’s overall power consumption becomes the primary monitoring metric.

vGPU deployments have no significant impact on power and cooling requirements compared to bare-metal. Standard GPU temperature and power draw metrics are sufficient.

vGPU deployments can lead to higher overall power consumption and concentrated heat generation on the host server due to resource consolidation. Monitoring metrics like GPU utilization per VM, vGPU frame rate, and host server thermal headroom become crucial.

vGPU deployments eliminate the need for GPU monitoring, as the virtualization layer handles all power and cooling management.

vGPU deployments require specialized cooling solutions that are not needed for bare metal setups.

5. A DGX A100 server with dual power supplies reports a critical power event in the BMC logs. One PSU shows a ‘degraded’ status, while the other appears normal.

What immediate actions should you take to ensure continued operation and prevent data loss?

Immediately shut down the server gracefully to prevent further damage to the faulty PSU.

Hot-swap the degraded PSU with a replacement unit.

Monitor the remaining PSU’s load and temperature closely; if stable, continue operation until a scheduled maintenance window.

Reduce the GPU power limit using ‘nvidia-smi’ to decrease the overall power consumption of the server.

Migrate all workloads to other servers in the cluster to minimize the impact of a potential complete PSU failure.

6. You are observing high latency in your GPU-accelerated inference service deployed on Kubernetes. You suspect that GPU resource contention might be the cause.

What steps can you take to diagnose and mitigate this issue within the Kubernetes environment? (Multiple Answers)

Monitor GPU utilization metrics (e.g., GPU utilization, memory usage) using tools like ‘nvidia-smi’ or Prometheus and Grafana.

Implement resource quotas to limit the GPU resources that each namespace can consume.

Utilize MIG (Multi-Instance GPU) to partition GPUs and isolate workloads.

Increase the number of replicas in the deployment to distribute the load across more GPUs.

Restarting the cluster periodically.
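For the monitoring step in question 6, per-GPU utilization and memory figures can be collected in machine-readable form. The sketch below parses the CSV that ‘nvidia-smi’ emits for the query options shown in QUERY_CMD. The sample text, the thresholds, and the function names are assumptions made for illustration, and the parser assumes every field is numeric (a field reported as unavailable would need extra handling).

```python
import csv
import io

# nvidia-smi query options; run this on a GPU node to produce the CSV text.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse the CSV produced by QUERY_CMD into a list of dicts."""
    stats = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        index, util, mem_used, mem_total = (int(field) for field in record)
        stats.append({
            "index": index,
            "util_pct": util,
            "mem_used_mib": mem_used,
            "mem_total_mib": mem_total,
        })
    return stats


def contended(stats, util_threshold=90, mem_fraction=0.9):
    """Indices of GPUs that look saturated by utilization or by memory."""
    return [
        gpu["index"]
        for gpu in stats
        if gpu["util_pct"] >= util_threshold
        or gpu["mem_used_mib"] >= mem_fraction * gpu["mem_total_mib"]
    ]


# Hypothetical sample output, not captured from a real system.
sample = "0, 97, 38000, 40960\n1, 35, 12000, 40960\n2, 60, 39500, 40960\n"
print(contended(parse_gpu_stats(sample)))
```

In a cluster, the same metrics would more commonly be scraped by Prometheus and charted in Grafana, as the first option suggests; a script like this is mainly useful for a quick check on a single node.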

7. You are deploying an NVIDIA-Certified AI server. The documentation specifies a minimum airflow requirement for the GPUs.

How would you BEST monitor the GPU temperatures and ensure the airflow is adequate during a stress test?

Use ‘nvidia-smi’ to monitor GPU temperature and visually inspect the fans.

Use IPMI sensors to monitor GPU temperature and fan speeds.

Use a thermal camera to measure the GPU heatsink temperature.

Use a software utility like ‘psensor’ to monitor GPU temperature.

Measure the ambient temperature around the server.

8. You’re troubleshooting a DGX-1 server exhibiting performance degradation during a large-scale distributed training job. ‘nvidia-smi’ shows all GPUs are detected, but one GPU consistently reports significantly lower utilization than the others. Attempts to reschedule workloads to that GPU frequently result in CUDA errors.

Which of the following is the MOST likely cause and the BEST initial troubleshooting step?

A driver issue affecting only one GPU; reinstall NVIDIA drivers completely.

A software bug in the training script utilizing that specific GPU’s resources inefficiently; debug the training script.

A hardware fault with the GPU, potentially thermal throttling or memory issues; run ‘nvidia-smi -i -q’ to check temperatures, power limits, and error counts.

Insufficient cooling in the server rack; verify adequate airflow and cooling capacity for the rack.

Power supply unit (PSU) overload, causing reduced power delivery to that GPU; monitor PSU load and check PSU specifications.
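The symptom in question 8, one GPU lagging well behind its peers, can be spotted automatically by comparing each GPU against the median. This is a minimal sketch with hypothetical utilization numbers; the 50% ratio is an arbitrary assumption, and the function name is invented for this example.

```python
from statistics import median


def lagging_gpus(util_by_gpu, ratio=0.5):
    """Return GPU indices whose utilization is below ratio * median."""
    typical = median(util_by_gpu.values())
    return sorted(i for i, u in util_by_gpu.items() if u < ratio * typical)


# Hypothetical utilization percentages for an 8-GPU server.
sample = {0: 96, 1: 94, 2: 95, 3: 31, 4: 97, 5: 93, 6: 95, 7: 96}
print(lagging_gpus(sample))
```

A GPU flagged this way is a candidate for the per-device ‘nvidia-smi -q’ inspection of temperatures, power limits, and error counts mentioned in the options.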

9. You are designing a storage solution for a cluster used for both training and inference. Training requires high throughput, while inference requires low latency.

How should you architect the storage to meet both requirements efficiently?

Use a single storage tier optimized for training (high throughput)

Use a single storage tier optimized for inference (low latency)

Use a tiered storage system with a fast tier (e.g., NVMe SSDs) for inference and a slower, cheaper tier (e.g., HDDs) for training data storage

Use a tiered storage system with a fast tier (e.g., NVMe SSDs) for inference and a separate high-throughput parallel file system for training

Use cloud storage for both training and inference

10. An AI server equipped with multiple NVIDIA GPUs experiences frequent reboots during peak workload periods. The system event logs indicate ‘Uncorrectable Machine Check Exception’ errors. You suspect a power delivery issue.

Besides checking the PSUs, what other hardware component(s) should be thoroughly inspected to identify potential causes?

The CPU and system memory.

The motherboard VRMs (Voltage Regulator Modules) responsible for supplying power to the GPUs.

The network interface cards (NICs).

The storage drives (SSDs/HDDs).

The server’s CMOS battery.

11. After physically installing a new NVIDIA GPU in a server, you boot the system. You notice that the GPU is not recognized by the operating system. You’ve verified the card is properly seated and powered.

What are the MOST LIKELY causes and solutions? (Select TWO)

The incorrect GPU drivers are installed or no drivers are installed at all. Solution: Download and install the latest drivers from the NVIDIA website.

The motherboard BIOS/UEFI does not support the GPU. Solution: Update the motherboard BIOS/UEFI to the latest version.

The PCIe slot is faulty. Solution: Try installing the GPU in a different PCIe slot.

The GPU is not compatible with the operating system. Solution: Reinstall the operating system.

The GPU is defective. Solution: Return the GPU to the manufacturer.

12. You’re monitoring the storage I/O for an AI training workload and observe high disk utilization but relatively low CPU utilization.

Which of the following actions is LEAST likely to improve the performance of the training job?

Switching from HDDs to NVMe SSDs.

Implementing data prefetching to load data into memory before it’s needed.

Increasing the batch size of the training job.

Adding more RAM to the system.

Reducing the number of parallel data loading threads.

13. A data center is designed for AI training with a high degree of east-west traffic. Considering cost and performance, which network topology is generally the most suitable?

Spine-Leaf

Three-Tier

Ring

Bus

Mesh

14. Consider a scenario where you want to reset your NVIDIA A100 GPU back to a non-MIG mode state after having previously configured MIG.

Which of the following steps are required?

Run ‘nvidia-smi --set-mig-mode=disable -i 0’ and then reboot the system.

Run ‘nvidia-smi --destroy-mig-config -i 0’, then run ‘nvidia-smi --set-mig-mode=disable -i 0’, and finally reboot.

Run ‘nvidia-smi --set-mig-mode=disable -i 0’ followed by ‘nvidia-smi --reset-default-mig-mode -i

Run ‘nvidia-smi --set-mig-mode=disable -i, then power off the system and physically remove and re-install the GPU.

Run ‘nvidia-smi --set-mig-mode=disable -i 0’, then run ‘nvidia-smi -i 0 -migrr 0’, and finally reboot.

15. You are building a Docker image for a deep learning application that requires an NVIDIA GPU.

Which of the following instructions is the most efficient way to ensure the NVIDIA drivers are available within the container, assuming you are using the ‘nvidia/cuda’ base image and want to minimize the image size?

COPY /usr/lib/nvidia /usr/local/nvidia/

RUN apt-get update && apt-get install -y nvidia-driver-470

FROM nvidia/cuda:11.4.2-base-ubuntu20.04 AS builder RUN apt-get update && apt-get install -y --no-install-recommends software-properties-common RUN add-apt-repository ppa:graphics-drivers/ppa RUN apt-get update && apt-get install -y --no-install-recommends nvidia-driver-470 FROM ubuntu:20.04 COPY --from=builder /usr/lib/nvidia /usr/lib/ COPY --from=builder /usr/local/nvidia /usr/local/

FROM nvidia/cuda:11.4.2-base-ubuntu20.04 RUN apt-get update && apt-get install -y --no-install-recommends nvidia-driver-470

Using ‘nvidia/cuda’ base image, the drivers are already included, so no further action is needed.

16. You suspect a faulty NVIDIA ConnectX-6 network adapter in a server used for RDMA-based distributed training.

Which commands or tools can you use to diagnose potential issues with the adapter’s hardware and connectivity?

lspci -v to verify the adapter is detected and its resources are allocated correctly.

ibstat to check the adapter’s status, link speed, and active ports.

ethtool to examine the adapter’s Ethernet settings and statistics.

ping to test basic network connectivity.

nvsmimonitord to monitor GPU metrics and detect anomalies.

17. A large language model (LLM) training job is running across multiple NVIDIA A100 GPUs in a cluster. You observe that the GPUs within a single server are communicating efficiently via NVLink, but the inter-server communication over Ethernet is becoming a bottleneck.

Which of the following strategies, focusing on cable and transceiver selection, would MOST effectively address this inter-server communication bottleneck? (Choose TWO)

Replace existing Cat6 Ethernet cables with Cat8 cables.

Upgrade inter-server connections to the highest available Ethernet speed (e.g., from 100GbE to 400GbE) using appropriate transceivers and fiber optic cables.

Implement InfiniBand as the interconnect technology for inter-server communication, utilizing appropriate InfiniBand cables and transceivers.

Reduce the batch size of the LLM training job.

Replace all existing transceivers with Active Optical Cables (AOCs).

18. You encounter an error during MIG instance creation using ‘nvidia-smi’ stating ‘Insufficient GPU resources’.

Which of the following could be the cause? (Select all that apply)

The requested MIG configuration exceeds the GPU’s available resources (e.g., compute or memory).

The NVIDIA driver version is outdated and does not support the requested MIG configuration.

The GPU is already fully utilized by other MIG instances or processes.

The GPU is in a bad state and needs to be reset.

There is no error; MIG always creates instances regardless of resources.

19. You are deploying a multi-tenant AI infrastructure where different users or groups have isolated network environments using VXLAN.

Which of the following is the MOST important consideration when configuring the VTEPs (VXLAN Tunnel Endpoints) on the hosts to ensure proper network isolation and performance?

Using the default MTU size of 1500 bytes for VXLAN traffic.

Ensuring that each tenant has a unique VXLAN Network Identifier (VNI) to isolate their traffic.

Using the same IP address for all VTEPs to simplify routing.

Disabling multicast routing to prevent broadcast traffic.

Using the same VNI for all tenants to maximize network utilization.
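Two numbers sit behind the options in question 19: the encapsulation overhead that makes a 1500-byte underlay MTU too small for 1500-byte tenant packets, and the size of the VNI space. The following sketch assumes an IPv4 outer header and no VLAN tag on the inner frame; the tenant names and VNI values are hypothetical.

```python
# Bytes added around each tenant IP packet when the outer header is IPv4.
INNER_ETHERNET = 14  # inner Ethernet header carried inside the tunnel
VXLAN_HEADER = 8
OUTER_UDP = 8
OUTER_IPV4 = 20
VXLAN_OVERHEAD = INNER_ETHERNET + VXLAN_HEADER + OUTER_UDP + OUTER_IPV4

VNI_BITS = 24      # width of the VXLAN Network Identifier
VLAN_ID_BITS = 12  # width of an 802.1Q VLAN ID, for comparison


def required_underlay_mtu(tenant_mtu):
    """Underlay IP MTU needed to carry tenant packets without fragmentation."""
    return tenant_mtu + VXLAN_OVERHEAD


def duplicate_vnis(tenant_to_vni):
    """Return every VNI that is assigned to more than one tenant."""
    seen = {}
    for tenant, vni in tenant_to_vni.items():
        seen.setdefault(vni, []).append(tenant)
    return {vni: sorted(names) for vni, names in seen.items() if len(names) > 1}


# Hypothetical tenant plan with one accidental collision.
tenants = {"team-a": 10010, "team-b": 10020, "team-c": 10010}
print(required_underlay_mtu(1500), 2 ** VNI_BITS, 2 ** VLAN_ID_BITS)
print(duplicate_vnis(tenants))
```

A check like duplicate_vnis is the kind of validation that could run before VTEP configuration is pushed to hosts, since two tenants sharing a VNI would share a Layer 2 segment.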

20. You are tasked with selecting transceivers for a new NVIDIA Quantum-2 InfiniBand switch deployment. The primary requirement is to minimize power consumption while maintaining 400Gbps bandwidth over short distances (up to 50 meters).

Which transceiver type would offer the BEST power efficiency in this scenario?

QSFP-DD LR8

QSFP-DD DR4

QSFP-DD SR8

QSFP-DD AOC

QSFP-DD SR4

21. After installing the NGC CLI using pip, you encounter an ‘ngc’ command not found error even though pip reported a successful install.

What can be the cause?

The Python executable directory where the NGC CLI was installed is not in the system PATH.

The NGC CLI installation was corrupted. Run ‘pip install --force-reinstall nvidia-cli’.

The shell needs to be reloaded or a new terminal session initiated for PATH changes to take effect.

NGC CLI only works inside Docker containers.

The host’s operating system is not supported by the NGC CLI.
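The PATH-related causes in question 21 can be checked from Python itself. This sketch uses only the standard library; it assumes the interpreter running it is the same one pip installed into, and dir_on_path is a helper written for this example. For a ‘pip install --user’ install the scripts directory differs and would need to be looked up separately.

```python
import os
import shutil
import sysconfig


def dir_on_path(directory, path_env):
    """True if directory appears as an entry in a PATH-style string."""
    wanted = os.path.normpath(directory)
    entries = (p for p in path_env.split(os.pathsep) if p)
    return any(os.path.normpath(entry) == wanted for entry in entries)


# Directory where this interpreter's pip places console scripts.
scripts_dir = sysconfig.get_path("scripts")

print("ngc resolved to:", shutil.which("ngc"))  # None when not found on PATH
print("scripts directory:", scripts_dir)
print("on PATH:", dir_on_path(scripts_dir, os.environ.get("PATH", "")))
```

If the scripts directory is reported as missing from PATH, adding it to the shell profile and opening a new terminal session is the usual remedy.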

22. You are implementing a security policy on a BlueField-2 DPU to filter traffic based on specific application signatures.

Which technology, supported by BlueField, allows you to achieve deep packet inspection (DPI) and apply security rules based on the detected application?

TC (Traffic Control) with ‘iptables’ rules.

eBPF (extended Berkeley Packet Filter) with XDP (eXpress Data Path).

OVS (Open vSwitch) with OpenFlow rules.

IPsec (Internet Protocol Security) tunnels.

Netfilter with connection tracking.

23. An AI infrastructure uses a combination of air-cooled and liquid-cooled NVIDIA GPUs. You want to optimize cooling performance based on the specific thermal characteristics of each GPU type and their location within the server rack.

How can you achieve granular cooling control and monitoring to address these heterogeneous cooling requirements effectively? (Select TWO)

Implement rack-level airflow management solutions, such as blanking panels and cable management, to improve overall airflow uniformity.

Use a centralized monitoring system to track GPU temperatures and power consumption, but apply the same cooling profile to all GPUs regardless of type.

Deploy per-server cooling solutions with independent fan control for each server node, allowing for tailored airflow adjustments.

Employ liquid cooling only for the highest TDP GPUs and rely on ambient air cooling for all other components.

Implement dynamic fan speed control based on individual GPU temperatures, leveraging tools like ‘nvidia-smi’ and custom scripts, for air-cooled GPUs.

24. You are designing a storage system using BeeGFS for an AI cluster. The cluster consists of 10 client nodes, each with 2 NVIDIA A100 GPUs, and 4 storage servers. Each storage server has 10 NVMe SSDs. The training dataset is 100TB. You want to ensure high availability and performance.

Which of the following BeeGFS configurations would be MOST appropriate?

A single metadata server (MDS) and four storage targets (OSTs), each spanning all 10 NVMe SSDs on a storage server.

Four MDS servers (one per storage server) and 40 OSTs (one per NVMe SSD).

One MDS server and 10 OSTs, splitting each NVMe SSD into smaller virtual OSTs.

Two MDS servers in a high-availability configuration and 40 OSTs (one per NVMe SSD).

10 MDS servers (one per client node) and a single OST.

25. You’ve installed a server with multiple NVIDIA A100 GPUs intended for use with Kubernetes and NVIDIA’s GPU Operator. After installing the GPU Operator, you notice that the GPUs are not being properly detected and managed by Kubernetes.

Which of the following are potential causes and troubleshooting steps you should take?

The NVIDIA drivers are not properly installed on the host operating system before installing the GPU Operator. Verify the driver installation using ‘nvidia-smi’.

The Kubernetes nodes are not labeled correctly to indicate the presence of NVIDIA GPUs. Use ‘kubectl label node nvidia.com/gpu.present=true’.

The NVIDIA Container Toolkit is not installed on the Kubernetes nodes. Install the toolkit according to NVIDIA’s documentation.

The GPU Operator’s configuration is incorrect, preventing it from properly discovering and managing the GPUs. Check the GPU Operator’s logs and configuration files.

The ‘nvidia-docker2’ runtime is not set as the default runtime in ‘/etc/docker/daemon.json’. Change the default runtime to ‘nvidia’ and restart the Docker daemon.

26. You’re configuring a BlueField-3 DPU-based server for high-performance storage. You want to utilize NVMe-oF (NVMe over Fabrics) to access remote NVMe SSDs.

What is the primary benefit of using a BlueField DPU in this NVMe-oF setup compared to a traditional server with a standard NIC?

BlueField DPU automatically configures the NVMe-oF target without any manual intervention.

BlueField DPU offloads the NVMe-oF protocol processing, reducing CPU overhead on the host server.

BlueField DPU eliminates the need for a separate NVMe-oF target server.

BlueField DPU provides built-in hardware encryption for all NVMe-oF traffic.

BlueField DPU allows hot-swapping of NVMe SSDs without interrupting the NVMe-oF connection.

27. You are tasked with troubleshooting a performance bottleneck in a multi-node, multi-GPU deep learning training job utilizing Horovod.

The training loss is decreasing, but the overall training time is significantly longer than expected.

Which of the following monitoring approaches would provide the most insight into the cause of the bottleneck?

Using ‘nvidia-smi’ on each node to monitor GPU utilization and memory usage.

Enabling Horovod’s timeline and profiling features to visualize the communication patterns and identify synchronization bottlenecks.

Monitoring network bandwidth utilization on each node using ‘iftop’ or ‘iperf3’

Analyzing the training loss curve to identify potential issues with the model architecture or hyperparameters.

Using ‘htop’ to monitor CPU utilization on each node.

28. Which of the following statements are true regarding the use of Congestion Management (CM) and Congestion Avoidance (CA) techniques within an InfiniBand fabric using NVIDIA technology? (Select TWO)

CM/CA mechanisms are primarily implemented at the IP layer and are independent of the InfiniBand transport layer.

CM aims to reduce the severity of congestion once it has already occurred, while CA aims to prevent congestion from happening in the first place.

InfiniBand’s Explicit Congestion Notification (ECN) is a CA mechanism that allows switches to signal congestion to endpoints before packet loss occurs.

CM/CA are not relevant in InfiniBand fabrics because InfiniBand’s lossless nature guarantees that no packets will ever be dropped due to congestion.

CM can include techniques like rate limiting to throttle traffic flows when congestion is detected.

29. After replacing a faulty NVIDIA GPU, the system boots, and ‘nvidia-smi’ detects the new card. However, when you run a CUDA program, it fails with the error ‘no CUDA-capable device is detected’. You’ve confirmed the correct drivers are installed and the GPU is properly seated.

What’s the most probable cause of this issue?

The new GPU is incompatible with the existing system BIOS.

The CUDA toolkit is not properly configured to use the new GPU.

The ‘LD_LIBRARY_PATH’ environment variable is not set correctly.

The user running the CUDA program does not have the necessary permissions to access the GPU.

The GPU is not properly initialized by the system due to a missing or incorrect ACPI configuration.

30. After installing the NGC CLI, you attempt to run ‘ngc config set’ and encounter the following error: ‘Error: API key is invalid or missing’.

What are the most likely causes of this issue and how can you resolve them?

The NGC CLI is not properly installed. Reinstall the package using ‘pip install --upgrade nvidia-cli’.

The NGC API key is incorrect or has expired. Verify the API key in your NVIDIA account and update the configuration using ‘ngc config set’.

The NGC CLI configuration file is corrupted. Delete the file (‘ngc/config.json’) and reconfigure the CLI.

The NGC service is down. Check the NVIDIA NGC status page for any known outages.

The host does not have network access to NGC.

31. Consider the following ‘ibroute’ command used on an InfiniBand host: ‘ibroute add dest 0x1a dev ib0’.

What is the MOST likely purpose of this command?

To add a default route for all traffic destined outside the InfiniBand subnet.

To create a static route for traffic destined to LID 0x1a, using the InfiniBand interface ib0.

To configure the MTU size on the ib0 interface to 0x1a bytes.

To disable routing on the ib0 interface.

To configure a static route for traffic destined to IP address 0x1a, using the InfiniBand interface ib0.

32. Which configuration file should be modified to blacklist the ‘nouveau’ driver on a system running systemd to prevent conflicts with the NVIDIA driver?

/etc/modprobe.d/blacklist-nouveau.conf

/etc/modules

/boot/grub/grub.cfg

/etc/systemd/system.conf

/etc/default/grub
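For reference, blacklisting a kernel module through a modprobe configuration file takes only two directives. The sketch below generates that text in Python; the helper names are invented for this example, and writing under ‘/etc’ assumes the script runs with root privileges.

```python
BLACKLIST_PATH = "/etc/modprobe.d/blacklist-nouveau.conf"


def nouveau_blacklist_text():
    """Directives that stop the nouveau module from loading."""
    return "blacklist nouveau\noptions nouveau modeset=0\n"


def write_blacklist(path=BLACKLIST_PATH):
    """Write the directives to path; needs root when path is under /etc."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(nouveau_blacklist_text())


# Print the content instead of writing it, so the sketch is safe to run anywhere.
print(nouveau_blacklist_text(), end="")
```

After such a file is created, the initramfs typically has to be regenerated (for example with ‘update-initramfs -u’ on Debian or Ubuntu, or ‘dracut --force’ on RHEL-family systems) and the machine rebooted before the change takes effect.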

33. Which of the following statements regarding VXLAN (Virtual Extensible LAN) is MOST accurate in the context of data center networking for AI/ML workloads?

VXLAN provides Layer 2 connectivity across Layer 3 networks, enabling virtual machine mobility.

VXLAN primarily improves network security by encrypting all traffic.

VXLAN is only suitable for small-scale networks due to its limited scalability.

VXLAN reduces network overhead compared to traditional VLANs.

VXLAN requires specialized hardware and cannot be implemented in software.

34. You are using GPU Direct RDMA to enable fast data transfer between GPUs across multiple servers. You are experiencing performance degradation and suspect RDMA is not working correctly.

How can you verify that GPU Direct RDMA is properly enabled and functioning?

Check the output of ‘nvidia-smi topo -m’ to ensure that the GPUs are connected via NVLink and have RDMA enabled.

Examine the ‘dmesg’ output for any errors related to RDMA or InfiniBand drivers.

Use the ‘ibstat’ command to verify that the InfiniBand interfaces are active and connected.

Run a bandwidth benchmark using a tool like or to measure the RDMA throughput.

Ping the other servers to ensure network connectivity.

35. You are evaluating different parallel file systems for an AI training cluster. You need a file system that supports POSIX compliance and offers high bandwidth and low latency.

Which of the following options are viable candidates?

BeeGFS

GlusterFS

Ceph

Lustre

NFS

36. You’re deploying BlueField OS to an Arm-based SmartNIC. After flashing the image, the system fails to boot and you observe a kernel panic related to device tree loading.

Which of the following is the most likely cause?

Incorrect bootloader configuration (e.g., incorrect bootargs). The bootloader might not be pointing to the correct device tree blob (dtb) or root filesystem.

Insufficient memory allocated to the initrd image. This can lead to failures during initial system setup.

The BlueField OS image is corrupted. A fresh download and re-flash should resolve the problem.

The secure boot configuration is incorrectly set up. Disabling secure boot in the BIOS or bootloader might resolve the issue.

The flashed image is not intended for your specific BlueField card revision. Ensure that image corresponds to hardware version.

37. You’ve replaced a faulty NVIDIA Quadro RTX 8000 GPU with an identical model in a workstation. The system boots, and ‘nvidia-smi’ recognizes the new GPU. However, when rendering complex 3D scenes in Maya, you observe significantly lower performance compared to before the replacement. Profiling with the NVIDIA Nsight Graphics debugger shows that the GPU is only utilizing a small fraction of its available memory bandwidth.

What are the TWO most likely contributing factors?

The new GPU’s PCIe link speed is operating at a lower generation (e.g., Gen3 instead of Gen4).

The NVIDIA OptiX denoiser is not properly configured or enabled.

The workstation’s power plan is set to ‘Power Saver,’ limiting GPU performance.

The Maya scene file contains corrupted or inefficient geometry.

The newly installed GPU’s VBIOS has not been properly flashed, causing an incompatibility issue.
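To put the PCIe generation option in question 37 into numbers, theoretical link bandwidth follows from the per-lane transfer rate, the line encoding, and the lane count. This is a rough sketch that ignores packet and protocol overhead, so real throughput is lower; the function name is made up for the example.

```python
# Per generation: (transfer rate in GT/s per lane, payload bits, encoded bits).
PCIE_GENERATIONS = {
    1: (2.5, 8, 10),      # 8b/10b encoding
    2: (5.0, 8, 10),      # 8b/10b encoding
    3: (8.0, 128, 130),   # 128b/130b encoding
    4: (16.0, 128, 130),  # 128b/130b encoding
}


def pcie_bandwidth_gb_s(generation, lanes=16):
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    rate, payload_bits, encoded_bits = PCIE_GENERATIONS[generation]
    return rate * payload_bits / encoded_bits * lanes / 8


for gen in (3, 4):
    print(f"Gen{gen} x16: {pcie_bandwidth_gb_s(gen):.2f} GB/s")
```

The negotiated link can be read with ‘nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv’. Because GPUs may drop to a lower link generation when idle, the reading is most meaningful while a workload is running.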

38. During the physical installation of an NVIDIA GPU, you accidentally touch the gold connector pins on the card.

What is the recommended course of action BEFORE inserting the GPU into the PCIe slot?

Blow on the pins to remove any dust or debris.

Wipe the pins with a dry cloth.

Clean the pins with isopropyl alcohol and a lint-free swab, ensuring they are completely dry before installation.

Use compressed air to clean the pins.

It is okay to insert it directly as is.

39. You have a Kubernetes cluster with nodes running different versions of the NVIDIA driver. You need to ensure that your containerized AI applications are always compatible with the specific driver version running on the node where they are scheduled.

How can you achieve this driver version compatibility in a cloud-native way?

Manually create different container images for each driver version and use node selectors to schedule the correct image on the appropriate nodes.

Use the NVIDIA driver capabilities to detect the driver version at runtime and dynamically load the correct libraries.

Use the NVIDIA Operator to automatically manage driver installations and updates on the nodes, ensuring a consistent driver version across the cluster.

Implement a webhook that inspects the node labels and injects the appropriate NVIDIA libraries into the pod at runtime.

Use a shared volume to mount drivers into a container.

40. You’re optimizing an Intel Xeon server with 4 NVIDIA A100 GPUs for a computer vision application that uses CUDA. You notice that the GPU utilization is fluctuating significantly, and performance is inconsistent. Using ‘nvprof’, you identify that there are frequent stalls in the CUDA kernels due to thread divergence.

What are possible causes and solutions?

The input data is not properly aligned in memory. Ensure that data is aligned to 128-byte boundaries using aligned memory allocation techniques.

The CUDA code contains conditional branches that lead to different execution paths for different threads within the same warp. Rewrite the CUDA code to minimize branching and favor uniform execution paths within warps.

The GPUs are overheating, causing thermal throttling. Improve the server’s cooling.

The CUDA compiler is generating suboptimal code. Try using different compiler optimization flags (e.g., ‘-O3’ or ‘-ftz=true’).

The CUDA driver version is incompatible with the CUDA toolkit version. Update the CUDA driver to a compatible version.
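Thread divergence, the symptom named in question 40, occurs when threads of the same 32-thread warp evaluate a branch differently, so the hardware has to execute both paths one after the other. The Python model below is a simplified illustration, not CUDA code: it only counts how many warps would see mixed branch outcomes for a given predicate on the thread index.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def divergent_warps(num_threads, predicate):
    """Count warps in which threads disagree on the branch predicate."""
    count = 0
    for start in range(0, num_threads, WARP_SIZE):
        end = min(start + WARP_SIZE, num_threads)
        outcomes = {bool(predicate(tid)) for tid in range(start, end)}
        if len(outcomes) > 1:
            count += 1
    return count


# Branching on odd/even thread index splits every warp.
per_thread = divergent_warps(128, lambda tid: tid % 2 == 0)
# Branching on the warp index keeps each warp on a single path.
per_warp = divergent_warps(128, lambda tid: (tid // WARP_SIZE) % 2 == 0)
print(per_thread, per_warp)
```

The second predicate shows the general idea behind restructuring a kernel: when the condition is uniform across a warp, the branch still exists but no warp has to run both sides.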

41. You are managing a Kubernetes cluster with NVIDIA GPUs and want to automatically scale your AI inference deployments based on GPU utilization.

Which of the following tools and configurations would you use to implement Horizontal Pod Autoscaling (HPA) based on GPU metrics?

Using the standard Kubernetes HPA with CPU utilization as the scaling metric, assuming GPU utilization is correlated with CPU usage.

Using the NVIDIA Data Center GPU Manager (DCGM) exporter to expose GPU metrics to Prometheus, and configuring the HPA to scale based on these metrics.

Using the Kubernetes Resource Metrics API to directly access GPU utilization metrics and configuring the HPA accordingly.

Manually adjusting the number of replicas in the deployment based on observed GPU utilization.

Configuring the HPA based on memory utilization.

42. You notice that one of the fans in your GPU server is running at a significantly higher RPM than the others, even under minimal load. The ‘ipmitool sensor’ output shows a normal temperature for that GPU.

What could be the potential causes?

The fan’s PWM control signal is malfunctioning, causing it to run at full speed.

The fan bearing is wearing out, causing increased friction and requiring higher RPM to maintain airflow.

The fan is attempting to compensate for restricted airflow due to dust buildup.

The server’s BMC (Baseboard Management Controller) has a faulty temperature sensor reading, causing it to overcompensate.

A network connectivity issue is causing higher CPU utilization, leading to increased system-wide heat.

43. You are configuring a server with NVIDIA GPUs for optimal power efficiency. You want to leverage NVIDIA’s power management features to minimize energy consumption during idle periods.

Which of the following actions would be the MOST effective in achieving this goal, without significantly impacting performance during active workloads?

Reduce the GPU’s clock speeds to the lowest possible setting, regardless of workload.

Enable NVIDIA’s Adaptive Clocking and Power Limiting features, allowing the GPU to dynamically adjust its clock speeds and power consumption based on the workload.

Disable all GPU power management features to ensure maximum performance at all times.

Remove one or more GPUs from the server to reduce overall power consumption.

Set a very low static power limit for the GPUs, significantly restricting their performance even during active workloads.

44. You are configuring a BlueField-2 DPU for link aggregation (bonding) with two 25GbE ports. After configuring the bond interface, you notice that traffic is not being distributed across both links.

What are the two most likely causes of this issue? (Select TWO)

The bonding mode is set to ‘balance-alb’ and the ARP monitoring interval is too high.

The switch connected to the DPU does not support the bonding mode configured on the DPU.

The firewall on the DPU is blocking outgoing traffic on one of the interfaces.

The MTU size is different on the bond interface and the physical interfaces.

The physical interfaces are not configured with the same speed and duplex settings.

45. You’re working with a large dataset of microscopy images stored as individual TIFF files. The images are accessed randomly during a training job. The current storage solution is a single HDD. You’re tasked with improving data loading performance.

Which of the following storage optimizations would provide the GREATEST performance improvement in this specific scenario?

Implementing data deduplication on the storage volume.

Migrating the data to a large, sequential HDD.

Replacing the HDD with a RAID 5 array of HDDs.

Replacing the HDD with a single NVMe SSD.

Compressing the TIFF files using a lossless compression algorithm.

46. You’ve successfully deployed BlueField OS to your SmartNIC. You need to verify that the Mellanox Ethernet driver (mlx5) is loaded and functioning correctly.

What command would you use to confirm this?

A )

B )

C )

D )

E )

Option A

Option B

Option C

Option D

Option E

47. You’re setting up a cluster with 8 NVIDIA A100 GPUs. Each GPU needs to read 4GB/s from storage to keep it fully utilized. The network connecting the storage and compute nodes has a bandwidth of 25GB/s.

What is the maximum number of GPUs that can be simultaneously saturated with data without exceeding the network bandwidth?
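The arithmetic behind this question can be checked with a short sketch. The function name `max_saturated_gpus` is made up for illustration; the figures are the ones given in the question.

```python
def max_saturated_gpus(network_gb_per_s: float, per_gpu_gb_per_s: float, gpu_count: int) -> int:
    """Return how many GPUs can be fed at their full rate within the network budget."""
    # Only whole GPUs count: a GPU that receives less than its required rate is not saturated.
    by_bandwidth = int(network_gb_per_s // per_gpu_gb_per_s)
    # The cluster cannot saturate more GPUs than it physically has.
    return min(gpu_count, by_bandwidth)


print(max_saturated_gpus(25, 4, 8))  # 25 // 4 = 6
```

Six GPUs consume 24 GB/s, which fits in the 25 GB/s link; a seventh would need 28 GB/s.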

48. After upgrading the NGC CLI using ‘pip install --upgrade nvidia-cli’, some commands are no longer working as expected, producing errors related to missing modules.

What is the most likely reason for this issue and how can you resolve it?

The upgrade process might have corrupted the NGC CLI installation. Reinstall the package using ‘pip install --force-reinstall nvidia-cli’.

The NGC CLI upgrade introduced breaking changes. Review the NGC CLI release notes and update your scripts accordingly.

The Python environment used by the NGC CLI might be broken or inconsistent. Create a new virtual environment and reinstall the NGC CLI in the new environment.

The system’s PATH variable has not been updated to reflect the new NGC CLI installation location. Update the PATH variable accordingly.

The host’s operating system must be re-imaged.

49. You are tasked with implementing a monitoring solution for power consumption and thermal performance in an NVIDIA-powered AI cluster. You want to collect data from the Baseboard Management Controllers (BMCs) of the servers using Redfish.

Which of the following Python code snippets demonstrates the correct approach for authenticating with the BMC and retrieving power and temperature readings?

A )

B )

C )

D ) None of the above. Redfish requires specialized hardware and cannot be accessed directly via Python.

E )

Option A

Option B

Option C

Option D

Option E
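The option snippets for this question did not survive on the page, so the following is only a minimal sketch of one common pattern, not a reconstruction of any option. It assumes the BMC accepts HTTP Basic authentication and exposes the Redfish `Power` and `Thermal` resources under a chassis. The host, credentials, chassis ID `1`, and all sample readings are placeholders; only the standard library is used.

```python
import base64
import json
import ssl
import urllib.request


def fetch_redfish(bmc_host: str, path: str, user: str, password: str) -> dict:
    """GET one Redfish resource over HTTPS with HTTP Basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"https://{bmc_host}{path}",
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    # The default context verifies the BMC certificate.
    context = ssl.create_default_context()
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)


def summarize_power(power: dict) -> list:
    """Collect PowerConsumedWatts from each PowerControl entry."""
    return [entry.get("PowerConsumedWatts") for entry in power.get("PowerControl", [])]


def summarize_temps(thermal: dict) -> dict:
    """Map each sensor name to its ReadingCelsius value."""
    return {t.get("Name"): t.get("ReadingCelsius") for t in thermal.get("Temperatures", [])}


# Hypothetical payloads shaped like Redfish Power and Thermal resources.
SAMPLE_POWER = {"PowerControl": [{"Name": "System Power", "PowerConsumedWatts": 1450.0}]}
SAMPLE_THERMAL = {
    "Temperatures": [
        {"Name": "GPU0 Temp", "ReadingCelsius": 61},
        {"Name": "Inlet Temp", "ReadingCelsius": 24},
    ]
}

print(summarize_power(SAMPLE_POWER))
print(summarize_temps(SAMPLE_THERMAL))
```

On a live system the two payloads would come from calls such as `fetch_redfish(host, "/redfish/v1/Chassis/1/Power", user, password)`; the exact chassis path varies by vendor and should be discovered from `/redfish/v1/Chassis`.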

50. Which command is used to enable MIG mode on an NVIDIA Ampere architecture GPU?

nvidia-smi -i 0 -mig 1

nvidia-smi --gpu 0 -i MIG=1

nvidia-smi -i 0 --enable-mig

nvidia-smi --set-mig-mode=enable -i 0

nvidia-smi --format=csv,noheader=1 -i 0 --enable-mig
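For automation, the documented form `nvidia-smi -i <index> -mig 1` can be wrapped in a small helper. This is a sketch only: the helper names are invented here, and the default is a dry run, because enabling MIG normally needs root privileges and may require a GPU reset before it takes effect.

```python
import shutil
import subprocess


def mig_enable_command(gpu_index: int) -> list:
    """Build the argument list for enabling MIG mode on one GPU."""
    return ["nvidia-smi", "-i", str(gpu_index), "-mig", "1"]


def enable_mig(gpu_index: int, dry_run: bool = True) -> list:
    """Return the command, and execute it only when dry_run is False."""
    command = mig_enable_command(gpu_index)
    if not dry_run:
        if shutil.which("nvidia-smi") is None:
            raise RuntimeError("nvidia-smi not found on PATH")
        subprocess.run(command, check=True)
    return command


print(" ".join(enable_mig(0)))
```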

51. You have an NVIDIA A100 GPU and need to configure it for optimal performance across two distinct AI workloads: a large language model (LLM) training job and a computer vision inference service. The LLM benefits from maximum memory bandwidth, while the inference service requires low latency and high throughput.

Which MIG configuration would best suit this scenario?

Create two 7g.80gb MIG instances, one for each workload.

Create one 14g.160gb MIG instance for the LLM and use CUDA MPS to multiplex the inference service.

Create a single full-GPU instance and use Kubernetes resource quotas to isolate the workloads.

Create one 10g.120gb instance for the LLM and one 4g.40gb instance for inference.

Utilize Time-Slicing on a single full-GPU instance, allocating specific time slots to each workload using NVIDIA vGPU technology.

52. Consider a distributed training job running across multiple nodes, each with local NVMe storage. You want to minimize network traffic and maximize I/O performance.

Which data loading strategy would be MOST effective?

Centralized data loading from a single NFS server

Distributing the dataset across the local NVMe drives of each node and using a distributed data loader

Loading the entire dataset into the memory of a single node and then distributing it to the other nodes

Using object storage (e.g., S3) as the primary data source and loading data on demand

Using rsync to copy data between nodes before each epoch

53. You are using NVIDIA Spectrum-X switches in your AI infrastructure. You observe high latency between two GPU servers during a large distributed training job. After analyzing the switch telemetry, you suspect a suboptimal routing path is contributing to the problem.

Which of the following methods offers the MOST granular control for influencing traffic flow within the Spectrum-X fabric to mitigate this?

Adjust the Equal-Cost Multi-Path (ECMP) hashing algorithm globally on all switches.

Configure QoS (Quality of Service) policies to prioritize traffic from the high-latency GPU servers.

Implement Adaptive Routing (AR) or Dynamic Load Balancing (DLB) features available on the Spectrum-X switches to dynamically adjust paths based on network conditions.

Manually configure static routes on the Spectrum-X switches to force traffic between the GPU servers along a specific path.

Disable IPv6 to simplify routing decisions.

54. You need to uninstall all NVIDIA drivers and associated packages from a Linux system cleanly.

Which command sequence is the most reliable for achieving this after stopping the display manager (e.g., ‘sudo systemctl stop gdm3’)?

‘sudo apt purge nvidia-*’ (on Debian/Ubuntu-based systems)

Running the ‘.run’ installer with the ‘--uninstall’ option (if the driver was installed this way)

‘sudo yum remove nvidia-*’ (on RHEL/CentOS-based systems)

Deleting the ‘/usr/lib/nvidia’ and ‘/usr/share/nvidia’ directories.

‘sudo dnf remove nvidia-*’ (on Fedora-based systems)

55. Your deep learning training job that utilizes NCCL (NVIDIA Collective Communications Library) for multi-GPU communication is failing with "NCCL internal error, unhandled system error" after a recent CUDA update. The error occurs during the ‘all reduce’ operation.

What is the most likely root cause and how would you address it?

Incompatible NCCL version with the new CUDA version. Update NCCL to a version compatible with the installed CUDA version.

Insufficient shared memory allocated to the CUDA context. Increase the shared memory limit using ‘cudaDeviceSetLimit(cudaLimitSharedMemory, new_limit)’.

Firewall rules blocking inter-GPU communication. Configure the firewall to allow communication on the NCCL-defined ports (typically 8000-8010).

Faulty network cables used for inter-node communication (if the training job spans multiple servers). Replace the network cables with certified high-speed cables.

GPU Direct RDMA is not properly configured. Check ‘dmesg’ for errors and ensure RDMA is enabled.

56. You are tasked with installing a BlueField-2 DPU on a server. After physical installation, the DPU is not recognized by the host OS (Linux). You’ve verified the power and connection.

What is the most likely first step you should take to troubleshoot the issue?

Immediately reflash the BlueField-2 DPU firmware with the latest version.

Check the system BIOS settings to ensure that IOMMU (Input/Output Memory Management Unit) is enabled and properly configured.

Replace the BlueField-2 DPU, assuming it’s faulty hardware.

Install the latest NVIDIA drivers on the host OS, specifically the BlueField-related drivers.

Check the UEFI settings to ensure that the PCIe slot where the DPU is installed is enabled and configured correctly.

57. You are configuring an NVIDIA BlueField-3 DPU to offload network processing. The DPU is connected to a server via PCIe Gen5.

Which cable type is essential for connecting the DPU to a 200GbE switch, ensuring optimal performance and signal integrity, and which describes its use?

Passive Copper Cable; Cost-effective for very short reach connections within the same rack.

Active Optical Cable (AOC); Best for long-distance connections exceeding 10 meters in data centers.

Active Copper Cable (ACC); Suitable for mid-range connections with improved signal quality compared to passive copper.

Direct Attach Copper (DAC) cable; Ideal for short-range, high-bandwidth connections between devices in close proximity.

Single-mode fiber optic cable; Used for long-distance connections where electrical interference is a concern.

58. Which of the following are key benefits of using NVIDIA NVLink Switch in a multi-GPU server setup for AI and deep learning workloads?

Increased GPU-to-GPU communication bandwidth.

Reduced latency in inter-GPU data transfers.

Simplified GPU resource management.

Support for larger GPU memory pools than a single server can physically accommodate.

Enhanced security features compared to PCIe-based interconnections.

59. You are configuring a RoCEv2 (RDMA over Converged Ethernet) network using BlueField-2 DPUs. You are observing packet loss and performance degradation. You suspect that Congestion Control is not working correctly.

What configuration parameter most directly impacts RoCEv2 congestion control behavior?

MTU size on the RoCEv2 interfaces.

PFC (Priority Flow Control) configuration on the switch ports.

ECN (Explicit Congestion Notification) configuration on the switch ports and DPU interfaces.

The number of RDMA queues configured on the DPU.

The IOMMU configuration for the DPU.

60. You are tasked with optimizing an Intel Xeon Scalable processor-based server running a TensorFlow model with multiple NVIDIA GPUs.

You observe that the CPU utilization is low, but the GPU utilization is also not optimal. The profiler shows significant time spent in ‘tf.data’ operations.

Which of the following actions would MOST likely improve performance?

Increase the number of threads used for CPU-bound operations in TensorFlow using ‘tf.config.threading.set_intra_op_parallelism_threads()’.

Enable XLA (Accelerated Linear Algebra) compilation in TensorFlow.

Use ‘tf.data.AUTOTUNE’ to allow TensorFlow to dynamically optimize the data pipeline.

Reduce the global batch size to improve memory utilization.

Upgrade the server’s network adapter to a faster interface, such as 100GbE.

61. You are tasked with optimizing storage performance for a deep learning training job on an NVIDIA DGX server. The training data consists of millions of small image files.

Which of the following storage optimization techniques would be MOST effective in reducing I/O bottlenecks?

Implementing RAID 0 across all storage devices.

Using a distributed file system with data striping across multiple storage nodes.

Enabling data compression on the storage volume.

Increasing the block size of the file system to the maximum supported value.

Implementing a tiered storage system with NVMe drives for frequently accessed data and HDDs for less frequently accessed data.

62. You are observing that the memory bandwidth being achieved by your CUDA application on an NVIDIA A100 GPU is significantly lower than the theoretical peak bandwidth.

Which of the following could be potential causes for this, and what actions can you take to validate or mitigate them? (Select all that apply)

The application is using uncoalesced memory access patterns. Refactor the code to ensure contiguous memory access by threads within a warp.

The application is using a small transfer size per kernel launch. Increase the amount of data processed per kernel launch to amortize the overhead of kernel launch and data transfer.

The GPU is being limited by power capping. Increase the power limit using ‘nvidia-smi -pl’ (if permitted) to allow the GPU to operate at higher clock speeds.

The application is using single precision floating-point operations. Switch to double precision to increase memory bandwidth utilization.

The system memory is fully occupied. Deallocate some memory.

63. You encounter an error during the BlueField OS flashing process using ‘bfboot’: ‘ERROR: Could not detect a BlueField device’.

Which of the following steps is MOST likely to resolve the issue?

Ensure the BlueField device is powered on and properly connected to the host system via PCIe.

Update the ‘bfboot’ utility to the latest version. Older versions may have compatibility issues.

Install the Mellanox OFED drivers on the host system. These drivers are required for ‘bfboot’ to function correctly.

Verify that the correct PCIe slot is being used. Some systems may have specific slots designated for SmartNICs.

Check the system’s BIOS/UEFI to confirm that SR-IOV is enabled.

64. Which of the following are valid methods for verifying the health and connectivity of InfiniBand links in an NCP-AII environment? (Select TWO)

Using ‘ping’ to test basic IP connectivity over the InfiniBand interface.

Using ‘ibstat’ to check the link state, physical state, and other relevant parameters of InfiniBand ports.

Using ‘netstat’ to check TCP connections.

Using ‘sminfo’ to query the Subnet Manager for network topology and status information.

Checking the system logs (‘/var/log/messages’ or equivalent) for any InfiniBand-related error messages.
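Checking link state and physical state by eye does not scale past a few nodes, so the check can be scripted. The sketch below parses text shaped like typical `ibstat` output and flags ports that are not `Active` / `LinkUp`. The sample text is illustrative rather than captured from a real system, and the function names are invented for this example.

```python
import re

# Illustrative text modelled on ibstat output; device names and values are made up.
SAMPLE_IBSTAT = """\
CA 'mlx5_0'
    CA type: MT4123
    Number of ports: 1
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 200
CA 'mlx5_1'
    CA type: MT4123
    Number of ports: 1
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 10
"""


def parse_ibstat(text: str) -> list:
    """Return one dict per port with its adapter name, state, physical state and rate."""
    ports = []
    adapter = None
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        adapter_match = re.match(r"CA '(.+)'", line)
        if adapter_match:
            adapter = adapter_match.group(1)
            continue
        port_match = re.match(r"Port (\d+):", line)
        if port_match:
            current = {"ca": adapter, "port": int(port_match.group(1))}
            ports.append(current)
            continue
        if current is not None and ":" in line:
            key, _, value = line.partition(":")
            if key in ("State", "Physical state", "Rate"):
                current[key] = value.strip()
    return ports


def unhealthy_ports(ports: list) -> list:
    """Ports whose logical state is not Active or whose physical state is not LinkUp."""
    return [
        p for p in ports
        if p.get("State") != "Active" or p.get("Physical state") != "LinkUp"
    ]


for port in unhealthy_ports(parse_ibstat(SAMPLE_IBSTAT)):
    print(port["ca"], port["port"], port["State"], port["Physical state"])
```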

65. You are configuring a server with multiple GPUs for CUDA-aware MPI.

Which environment variable is critical for ensuring proper GPU affinity, so that each MPI process uses the correct GPU?

CUDA_VISIBLE_DEVICES

CUDA_DEVICE_ORDER

LD_LIBRARY_PATH

MPI_GPU_SUPPORT

CUDA_LAUNCH_BLOCKING=1
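One common way to pin each MPI process to its own GPU is to derive the device index from the process's local rank and export it before CUDA is initialized. The sketch below assumes an Open MPI launch, which exports `OMPI_COMM_WORLD_LOCAL_RANK`; other launchers use different variable names. The helper name and the figure of four GPUs per node are illustrative.

```python
import os


def gpu_env_for_local_rank(local_rank: int, gpus_per_node: int, base_env=None) -> dict:
    """Return a copy of the environment that exposes exactly one GPU to this rank."""
    env = dict(os.environ if base_env is None else base_env)
    # Round-robin mapping: local ranks 0..N-1 map to GPUs 0..N-1, then wrap around.
    env["CUDA_VISIBLE_DEVICES"] = str(local_rank % gpus_per_node)
    return env


# Falls back to rank 0 when not launched under Open MPI.
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
env = gpu_env_for_local_rank(local_rank, gpus_per_node=4)
print(env["CUDA_VISIBLE_DEVICES"])
```

The variable has to be set before the process makes its first CUDA call; changing it afterwards has no effect on an already initialized context.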

66. You have a large dataset stored on a BeeGFS file system. The training job is single node and uses data augmentation to generate more data on the fly. The data augmentation process is CPU-bound, but you notice that the GPU is underutilized due to the training data not being fed to the GPU fast enough.

How can you reduce the load on the CPU and improve the overall training throughput?

Move the training data to a local NVMe drive on the training node.

Increase the number of BeeGFS metadata servers (MDSs) to improve metadata performance.

Implement asynchronous I/O in the data loading pipeline using a library like NVIDIA DALI to offload data processing tasks from the CPU to the GPU.

Decrease the batch size of the training job to reduce the amount of data being processed at each iteration.

Enable data compression on the BeeGFS file system to reduce the amount of data being transferred over the network.

67. You are troubleshooting a network performance issue in your NVIDIA Spectrum-X based AI cluster. You suspect that the Equal-Cost Multi-Path (ECMP) hashing algorithm is not distributing traffic evenly across available paths, leading to congestion on some links.

Which of the following methods would be MOST effective for verifying and addressing this issue?

Use ‘ping’ or ‘traceroute’ to analyze the paths taken by packets between the affected nodes. If they always take the same path, ECMP is likely not working correctly.

Use switch telemetry tools (e.g., NVIDIA What’s Up Gold, Mellanox NEO, or similar) to monitor link utilization across all available paths between the nodes. Look for significant imbalances in traffic volume.

Restart the switches to force the ECMP hashing algorithm to recalculate paths.

Disable ECMP entirely and rely solely on static routing.

Reduce the TCP window size.

68. Consider the following Python code snippet which attempts to extract Digital Optical Monitoring (DOM) data from a transceiver using a hypothetical library ‘transceiver_utils’. The transceiver is connected to port ‘eth0’. However, the code consistently throws a ‘TransceiverError: Invalid port’ exception.

What is the MOST likely cause of this error?

The ‘transceiver_utils’ library is outdated and does not support DOM data extraction.

The transceiver does not support DOM functionality.

The port ‘eth0’ does not exist or is not correctly associated with the transceiver.

The Python code requires root privileges to access transceiver data.

The fiber cable connected to the transceiver is damaged.

69. Consider the following Dockerfile snippet:

This Dockerfile is used to build a deep learning application. After building and running a container from this image, you observe that the application is not detecting the GPU. You have verified that the NVIDIA Container Toolkit is installed and configured correctly on the host.

What is the most likely reason for this issue?

The base image ‘nvidia/cuda:11.6.2-base-ubuntu20.04’ does not include the necessary NVIDIA Container Toolkit components.

The application code in ‘app.py’ is not explicitly requesting GPU resources.

The ‘docker run’ command is missing the ‘--gpus all’ flag.

The ‘requirements.txt’ file is missing the ‘nvidia-pyindex’ package.

The CUDA version on the host is different than the one specified in the Dockerfile.

70. You’ve installed a new NVIDIA GPU in your AI server. After the installation and driver setup, you notice that while ‘nvidia-smi’ recognizes the GPU, the available memory reported is significantly lower than the GPU’s specifications.

What are the potential root causes and how would you systematically troubleshoot this?

The GPU is faulty and needs to be replaced.

The system BIOS is incorrectly configured, limiting GPU memory allocation.

The integrated graphics is using a significant amount of system memory, reducing what’s available to the GPU. Disable the integrated graphics in the BIOS.

The driver is not correctly installed. Reinstall the latest NVIDIA driver.

The reported memory is the currently allocated memory, not the total available. Run a CUDA program to allocate more memory and observe the change.

71. Which of the following statements regarding the benefits of using a BlueField DPU for network offload are TRUE? (Select TWO)

Reduced CPU utilization on the host server for network-related tasks.

Simplified network configuration compared to traditional NICs.

Increased network throughput due to hardware acceleration.

Automatic compatibility with all existing network protocols without requiring software updates.

Elimination of the need for a dedicated network interface card (NIC).

72. You are installing four NVIDIA A100 GPUs in a server, and after installation, you observe that the PCIe link speed for one of the GPUs is running at x8 instead of the expected x16.

What could be the POSSIBLE causes for this reduced PCIe link speed?

The GPU is faulty.

The CPU does not have enough PCIe lanes to support all GPUs at x16.

The PCIe slot is only wired for x8 speed.

The BIOS/UEFI is configured to limit the PCIe link speed for that slot.

All of the above
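A downgraded link can be confirmed by comparing the negotiated link (`LnkSta`) with the device capability (`LnkCap`) in `lspci -vv` output; note that x8 and x16 are strictly the link width, reported next to the speed in GT/s. The sketch below parses those two lines. The sample text is illustrative, and the function names are invented for this example.

```python
import re

# Illustrative excerpt shaped like lspci -vv output for one device.
SAMPLE_LSPCI = """\
        LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM not supported
        LnkSta: Speed 16GT/s (ok), Width x8 (downgraded)
"""

# Matches only the LnkCap and LnkSta lines, not LnkCap2, LnkSta2 or LnkCtl.
LINK_RE = re.compile(r"Lnk(Cap|Sta):.*?Speed ([\d.]+)GT/s.*?Width x(\d+)")


def parse_link(lspci_text: str) -> dict:
    """Extract speed and width for the capability (Cap) and status (Sta) lines."""
    link = {}
    for kind, speed, width in LINK_RE.findall(lspci_text):
        link[kind] = {"speed_gt_s": float(speed), "width": int(width)}
    return link


def is_degraded(link: dict) -> bool:
    """True when the negotiated link is narrower or slower than the device supports."""
    cap, sta = link["Cap"], link["Sta"]
    return sta["width"] < cap["width"] or sta["speed_gt_s"] < cap["speed_gt_s"]


link = parse_link(SAMPLE_LSPCI)
print(link, is_degraded(link))
```

A degraded result only says that the link trained below the device's capability; the slot wiring, lane allocation, and firmware settings listed in the options still have to be checked separately.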

73. You encounter a situation where an NVIDIA driver installation fails with the error message ‘ERROR: Unable to load the kernel module ‘nvidia.ko’. This may be because it was built for another kernel...’.

Assuming the kernel headers are correctly installed, what is the most likely cause and solution?

The NVIDIA driver version is incompatible with the current kernel version. Solution: Install a compatible driver version.

Secure Boot is enabled, and the NVIDIA kernel module is not signed. Solution: Sign the NVIDIA kernel module with a machine owner key (MOK).

The ‘nouveau’ driver is still loaded. Solution: Blacklist ‘nouveau’ and reboot.

The DKMS module build failed. Solution: Manually rebuild the DKMS module using ‘dkms autoinstall’.

The system is running out of disk space in ‘/tmp’. Solution: Free up space or remount ‘/tmp’ with more space.

74. You are configuring a BlueField DPU to run a custom packet processing application. You want to ensure that the application has exclusive access to certain CPU cores on the DPU.

Which mechanism is best suited for isolating CPU cores for your application on the BlueField DPU?

Using ‘taskset’ command to pin the application’s processes to specific cores.

Modifying the DPU’s bootloader configuration to disable the cores you want to reserve.

Using CPU affinity settings within the application code itself.

Utilizing cgroups (control groups) to create a dedicated cgroup for the application and limit its CPU usage to specific cores.

Adjusting the kernel’s scheduler parameters to prioritize the application’s threads on the desired cores.

75. You are tasked with deploying a cluster of NVIDIA A100 GPUs in a high-density server environment. The server chassis has a limited power budget and cooling capacity.

Which of the following strategies is MOST effective in validating that the power and cooling infrastructure can adequately support the GPU workload during peak performance, minimizing the risk of thermal throttling and system instability?

Rely solely on the GPU manufacturer’s stated Thermal Design Power (TDP) specifications and allocate power based on these values.

Monitor GPU temperature using ‘nvidia-smi’ during a sustained compute-intensive workload and compare it to the GPU’s thermal threshold. If the temperature remains below the threshold, the cooling is adequate.

Employ a power monitoring tool (e.g., IPMI, Redfish) to measure the actual power consumption of the server during a stress test that mimics the intended AI workload. Cross-reference this with the power supply unit’s (PSU) rating and the cooling system’s capacity.

Simulate the AI workload with a synthetic benchmark (e.g., Linpack) and extrapolate power consumption based on the benchmark’s performance metrics.

Observe the GPU clock speeds during a workload. If the clock speeds are at the maximum rated speed, the power and cooling are sufficient.

76. You are using a custom container runtime other than Docker (e.g., containerd) and need to integrate it with the NVIDIA Container Toolkit.

What command would you use to configure the NVIDIA Container Toolkit for this runtime? (Assume your runtime configuration file is located at ‘/etc/containerd/config.toml’)

‘nvidia-ctk runtime configure’

‘nvidia-ctk runtime configure --runtime=custom --config=/etc/containerd/config.toml’

‘nvidia-ctk runtime install --runtime=containerd’

‘nvidia-ctk runtime config --runtime=containerd --set-default’

‘nvidia-docker runtime configure --runtime=containerd’

77. Consider a scenario where you are setting up a high-performance computing cluster with several GPU-accelerated nodes using Slurm as the resource manager. You want to ensure that jobs requesting GPUs are only scheduled on nodes with the appropriate NVIDIA drivers and CUDA toolkit installed.

How can you achieve this within Slurm?

Use Slurm’s ‘GresTypes’ configuration option in ‘slurm.conf’ to define a generic resource type called ‘gpu’ and then configure each node to advertise the available GPUs. Slurm will automatically ensure that jobs requesting GPUs are only scheduled on nodes with the ‘gpu’ resource.

Create a custom Slurm script that checks for the presence of the NVIDIA driver and CUDA toolkit before submitting a job to a node. If the requirements are not met, the job is rejected.

Use Slurm’s node features to tag nodes with the ‘Feature=’ keyword in ‘slurm.conf’. For example, tag nodes with GPUs as ‘Feature=gpu’. Jobs can then request nodes with the ‘gpu’ feature using the option.

Install the NVIDIA Data Center GPU Manager (DCGM) on each node and configure Slurm to query DCGM for GPU availability and health. Slurm will then only schedule jobs on healthy and available GPUs.

Utilize Slurm’s Prolog and Epilog scripts to dynamically install the necessary NVIDIA drivers and CUDA toolkit on each node before and after a job runs. This ensures that the required software is always available.

78. You are tasked with installing the latest NVIDIA driver on a server running Ubuntu 22.04 for AI workloads. You have downloaded the driver package ‘NVIDIA-Linux-x86_64-535.104.05.run’.

Before installation, what is the most critical step to ensure a smooth process, assuming Secure Boot is enabled?

Simply run the ‘.run’ file using ‘sudo ./NVIDIA-Linux-x86_64-535.104.05.run’.

Disable Secure Boot in the BIOS before installing the driver.

Create a DKMS module and sign the driver with a machine owner key (MOK) for Secure Boot compatibility.

Install the driver using ‘apt install nvidia-driver-535’ and let the system handle Secure Boot automatically.

Ensure the ‘nouveau’ driver is blacklisted by adding ‘blacklist nouveau’ to ‘/etc/modprobe.d/blacklist-nouveau.conf’.

79. You’re optimizing a deep learning model for deployment on NVIDIA Tensor Cores. The model uses a mix of FP32 and FP16 precision. During profiling with NVIDIA Nsight Systems, you observe that the Tensor Cores are underutilized.

Which of the following strategies would MOST effectively improve Tensor Core utilization?

Increase the batch size to fully utilize the available GPU memory.

Ensure that all matrix multiplications are performed using FP16 precision.

Pad the input tensors to dimensions that are multiples of 8 for optimal Tensor Core alignment.

Enable CUDA graph capture to reduce kernel launch overhead.

Decrease the learning rate to improve training stability and reduce the need for gradient clipping.
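The padding option can be made concrete with a small helper that rounds a dimension up to the next multiple of 8, the alignment NVIDIA's performance guidance commonly cites for FP16 Tensor Core math. This is a sketch with invented names and example sizes; whether padding actually helps depends on the GPU generation and library versions in use.

```python
def pad_to_multiple(size: int, multiple: int = 8) -> int:
    """Round size up to the nearest multiple (sizes already aligned are unchanged)."""
    return ((size + multiple - 1) // multiple) * multiple


def padded_shape(shape: tuple, multiple: int = 8) -> tuple:
    """Apply the rounding to every dimension of a shape."""
    return tuple(pad_to_multiple(dim, multiple) for dim in shape)


# Hypothetical vocabulary size and hidden size for a language model layer.
print(pad_to_multiple(33278))      # 33280
print(padded_shape((1001, 1024)))  # (1008, 1024)
```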

80. What is the role of GPUDirect RDMA in an NVLink Switch-based system, and how does it improve performance?

It allows GPUs to directly access each other’s memory without involving the CPU, reducing latency and CPU overhead.

It provides a mechanism for GPUs to offload compute-intensive tasks to the CPU, improving overall system throughput.

It enables direct communication between GPUs and storage devices, bypassing the network interface.

It facilitates the virtualization of GPUs, allowing multiple virtual machines to share a single physical GPU.

It encrypts data transmitted between GPUs, enhancing security.

81. You’ve installed the NGC CLI and successfully configured your API key. However, when running ‘ngc registry model download’, you receive a ‘Permission denied’ error despite having valid credentials.

What are possible causes and solutions?

The user account running the ‘ngc’ command does not have write permissions to the destination directory specified for the download.

Your API key lacks the necessary permissions to download models from the NGC registry. Contact your NVIDIA organization administrator.

The model you are trying to download requires acceptance of a separate end-user license agreement (EULA) that you haven’t yet accepted.

There is an issue with network connectivity or firewall rules preventing access to the NGC registry. Verify network connectivity and firewall rules.

Models can only be downloaded to a specific location.

82. You’re debugging performance issues in a distributed training job. ‘nvidia-smi’ shows consistently high GPU utilization across all nodes, but the training speed isn’t increasing linearly with the number of GPUs. Network bandwidth is sufficient.

What is the most likely bottleneck?

Inefficient data loading and preprocessing pipeline, causing GPUs to wait for data.

NCCL is not configured optimally for the network topology, leading to high communication overhead.

The learning rate is not adjusted appropriately for the increased batch size across multiple GPUs.

The global batch size has exceeded the optimal point for the model, reducing per-sample accuracy and slowing convergence.

CUDA Graphs is not being utilized.

83. You’ve installed the NGC CLI, but when you run ‘ngc registry model list’ you get an error indicating authentication failure. You’re sure your API key is correct.

What could be the cause, and how would you diagnose this?

The NGC CLI version is outdated. Upgrade to the latest version using ‘pip install --upgrade nvidia-cli’.

The environment variables ‘NGC_API_KEY’ or ‘NGC_CLI_API_KEY’ are set incorrectly or not set at all. Verify and set them correctly.

Your organization might be behind a proxy that is blocking the NGC CLI from accessing the internet. Configure the proxy settings for the NGC CLI.

Your account lacks the necessary permissions to access the NGC registry. Contact your NVIDIA administrator.

The host machine’s clock is not synchronized, causing authentication issues. Synchronize the clock using ‘ntpd’ or ‘chronyd’.

84. You are troubleshooting slow I/O performance in a deep learning training environment utilizing the BeeGFS parallel file system. You suspect the metadata operations are bottlenecking the training process.

How can you optimize metadata handling in BeeGFS to potentially improve performance?

Increase the number of storage targets (OSTs) to distribute the data across more devices.

Implement data striping across multiple OSTs.

Increase the number of metadata servers (MDSs) and distribute the metadata load across them.

Enable client-side caching of metadata on the training nodes.

Configure BeeGFS to use a different network protocol with lower overhead.

85. You are replacing a faulty NVIDIA Tesla V100 GPU in a server. After physically installing the new GPU, the system fails to recognize it. You’ve verified the power connections and seating of the card.

Which of the following steps should you take next to troubleshoot the issue?

Immediately RMA the new GPU as it is likely defective.

Update the system BIOS and BMC firmware to the latest versions.

Reinstall the operating system to ensure proper driver installation.

Check if the new GPU requires a different driver version than the currently installed one and update if needed.

Disable and re-enable the GPU slot in the system BIOS.

86. You’re deploying a new cluster with multiple NVIDIA A100 GPUs per node. You want to ensure optimal inter-GPU communication performance using NVLink.

Which of the following configurations are critical for achieving maximum NVLink bandwidth?

All GPUs within a node must be the same model and have identical firmware versions.

The motherboard must support PCIe Gen5 to maximize NVLink bandwidth.

GPUs should be physically installed in slots that maximize direct NVLink connections based on the server’s architecture.

The NVIDIA driver must be configured to enable NVLink; it is disabled by default.

The server must use a specific CPU model to leverage NVLink capabilities.

87. You’re profiling the performance of a PyTorch model running on an AMD server with multiple NVIDIA GPUs. You notice significant overhead in the data loading pipeline.

Which of the following strategies can help optimize data loading and improve GPU utilization? Select all that apply.

Using the ‘torch.utils.data.DataLoader’ with multiple worker processes.

Loading the entire dataset into RAM before training.

Implementing asynchronous data prefetching using ‘torch.Generator’.

Using a faster storage system (e.g., NVMe SSD instead of HDD).

Reducing the batch size to decrease the amount of data loaded per iteration.

88. A server with eight NVIDIA A100 GPUs experiences frequent CUDA errors during large model training. ‘nvidia-smi’ reports seemingly normal temperatures for all GPUs. However, upon closer inspection using IPMI, the inlet temperature for GPUs 3 and 4 is significantly higher than others.

What is the MOST likely cause and the immediate action to take?

A driver issue is causing incorrect temperature reporting; reinstall the NVIDIA driver.

The temperature sensors on GPUs 3 and 4 are faulty; replace the GPUs immediately.

There is a localized airflow problem affecting GPUs 3 and 4; check fan speeds and airflow obstructions.

The power supply is failing to provide sufficient power to GPUs 3 and 4; replace the power supply.

A software bug in the CUDA toolkit is causing the errors; downgrade to an earlier version.
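Spotting a localized hot zone from inlet readings comes down to comparing each sensor against its peers. The sketch below flags sensors that sit more than a chosen margin above the median. The readings and the 5 °C margin are invented for illustration, and collecting the values (for example from IPMI sensor data) is left out.

```python
from statistics import median


def hot_inlets(inlet_celsius: dict, margin_celsius: float = 5.0) -> list:
    """Return sensor names whose reading exceeds the median by more than the margin."""
    baseline = median(inlet_celsius.values())
    return sorted(
        name for name, reading in inlet_celsius.items()
        if reading - baseline > margin_celsius
    )


# Hypothetical per-GPU inlet temperatures in degrees Celsius.
readings = {
    "GPU0": 27, "GPU1": 28, "GPU2": 28, "GPU3": 38,
    "GPU4": 39, "GPU5": 27, "GPU6": 28, "GPU7": 29,
}
print(hot_inlets(readings))  # ['GPU3', 'GPU4']
```

Using the median rather than the mean keeps the baseline from being pulled upward by the hot sensors themselves.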

89. You are installing four NVIDIA A100 GPUs into a server designed for AI training. The server motherboard has multiple PCIe Gen4 x16 slots. However, the server’s power supply unit (PSU) only has three 8-pin PCIe power connectors available.

What is the BEST course of action to ensure all GPUs receive adequate power?

Use a PCIe power splitter cable on one of the 8-pin connectors to power two GPUs.

Install only three GPUs and leave the fourth unpowered.

Replace the existing PSU with a higher wattage PSU that has at least four 8-pin PCIe power connectors.

Connect the GPUs using the motherboard’s internal SATA power connectors.

Underclock the GPUs significantly to reduce their power consumption below the available PSU capacity.

90. An AI infrastructure relies on a liquid cooling system to dissipate heat from multiple NVIDIA GPUs. After a recent software update, users report intermittent performance degradation and system crashes. You suspect a cooling issue.

Which TWO of the following checks are the MOST critical in diagnosing the root cause?

Verify the pump speed and coolant flow rate within the liquid cooling system.

Check the CPU temperature using the ‘sensors’ command.

Analyze the system logs for GPU-related errors, specifically those indicating thermal throttling or power capping.

Examine the ambient temperature in the data center.

Run a memory test on the host system.


91. Consider a scenario where you are using NCCL (NVIDIA Collective Communications Library) for multi-GPU training across multiple servers connected via NVLink switches.

Which NCCL environment variable would you use to specify the network interface to be used for communication?

NCCL_PORT

NCCL_SOCKET_IFNAME

NCCL_NET_INTERFACE

NCCL_IB_HCA

NCCL_COMM_ID
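
As a rough illustration of how NCCL environment variables are typically supplied to a training process, the sketch below builds an environment dictionary for a launcher. `NCCL_SOCKET_IFNAME` and `NCCL_IB_HCA` are documented NCCL variables; the helper name `nccl_env` and the interface names `eth0` and `mlx5_0` are placeholders assumed for this example.

```python
import os


def nccl_env(socket_ifname, ib_hca=None, base=None):
    """Return a copy of an environment with NCCL interface settings applied.

    nccl_env is a hypothetical helper. `base` defaults to the current
    process environment; the original mapping is never modified.
    """
    env = dict(os.environ if base is None else base)
    # IP interface(s) NCCL should use for socket-based communication.
    env["NCCL_SOCKET_IFNAME"] = socket_ifname
    if ib_hca is not None:
        # RDMA (InfiniBand/RoCE) adapter(s) NCCL should use.
        env["NCCL_IB_HCA"] = ib_hca
    return env


# Placeholder names; substitute the interfaces present on the actual hosts.
env = nccl_env("eth0", ib_hca="mlx5_0", base={})
print(env)  # {'NCCL_SOCKET_IFNAME': 'eth0', 'NCCL_IB_HCA': 'mlx5_0'}
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the training command, so the settings apply only to that process.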

92. You are tasked with automating the BlueField OS deployment process across a large number of SmartNICs.

Which of the following methods is MOST suitable for this task?

Manually flashing each SmartNIC using the ‘bfboot’ utility on a workstation.

Using a network boot (PXE) server to deploy the BlueField OS image over the network. This allows centralized management and scalability.

Creating a custom ISO image with the BlueField OS and booting each SmartNIC from a USB drive.

Utilizing the ‘dd’ command to directly copy the image to each SmartNIC’s flash memory.

Utilizing a custom-built Python script, controlled from a central server, to flash each individual card. This method supports parallel flashing.


