Workshop Setup

Deploy an LLM

20 minutes

In this section, we’ll use the NVIDIA NIM Operator to deploy two models to our OpenShift cluster: a Large Language Model (LLM) and an embeddings model.

Create a Namespace

bash
oc create namespace nim-service

Add Secrets with NGC API Key

Add a Docker registry secret for downloading container images from NVIDIA NGC:

bash
oc create secret -n nim-service docker-registry ngc-secret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="$NGC_API_KEY"

Add a generic secret that model puller containers use to download the model from NVIDIA NGC:

bash
oc create secret -n nim-service generic ngc-api-secret \
    --from-literal=NGC_API_KEY="$NGC_API_KEY"
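
Both secrets above read NGC_API_KEY from the current shell. An unset variable expands to an empty string, so the secrets may be created with an empty key, or the command may fail with a less obvious error. A minimal guard to run before creating the secrets could look like the sketch below; the function name check_ngc_key is hypothetical and not part of the workshop files.

```shell
# Hypothetical helper: fail early when NGC_API_KEY is unset or empty.
check_ngc_key() {
  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "error: NGC_API_KEY is not set" >&2
    return 1
  fi
}
```

For example, running check_ngc_key && echo "key found" before the two oc create secret commands shows whether the variable is set in the shell you are using.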

Deploy an LLM

Run the following command to create the NIMCache and NIMService:

bash
oc apply -n nim-service -f nvidia-llm.yaml

Confirm that the Persistent Volume was created and that the Persistent Volume Claim was successfully bound to it:

Note: this can take several minutes.

bash
oc get pv,pvc -n nim-service
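
If you would rather check the claim status from a script than read the table, the phase of each claim can be printed with a jsonpath template and then filtered. This is only a sketch: the function names pvc_phases and all_bound are hypothetical, and it treats an empty list of claims as "not yet bound".

```shell
# Print one "name<TAB>phase" line per PVC in the nim-service namespace.
pvc_phases() {
  oc get pvc -n nim-service \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}

# Read "name<TAB>phase" lines on stdin; succeed only if there is at least
# one line and every phase is Bound. Unbound claims are listed on stdout.
all_bound() {
  awk -F'\t' '
    $2 != "Bound" { bad = 1; print "not bound: " $1 }
    END { exit (bad || NR == 0) }
  '
}
```

Usage: pvc_phases | all_bound && echo "all PVCs are bound"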

Confirm that the NIMCache is Ready:

bash
oc get nimcache.apps.nvidia.com -n nim-service

Confirm that the NIMService is Ready:

bash
oc get nimservices.apps.nvidia.com -n nim-service
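
Because the cache download and service start-up can take a while, you may prefer to poll instead of re-running the commands above by hand. The sketch below rests on two assumptions: that the operator reports the value shown in the STATUS column under .status.state, and that the NIMService is named meta-llama-3-2-1b-instruct like the service hostname used later in this section. The helper names are hypothetical.

```shell
# Hypothetical helper: poll a command until it prints the expected state.
# Usage: wait_for_state <expected> <max_attempts> <sleep_seconds> <command...>
wait_for_state() {
  expected="$1"; attempts="$2"; delay="$3"
  shift 3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Run the remaining arguments as a command and capture its output.
    state="$("$@" 2>/dev/null)"
    if [ "$state" = "$expected" ]; then
      echo "state is $expected"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $expected (last state: ${state:-unknown})" >&2
  return 1
}

# Assumption: readiness is exposed in .status.state of the NIMService.
nimservice_state() {
  oc get nimservices.apps.nvidia.com "$1" -n nim-service \
    -o jsonpath='{.status.state}'
}
```

Usage: wait_for_state Ready 60 10 nimservice_state meta-llama-3-2-1b-instruct checks every 10 seconds for up to 60 attempts.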

Test the LLM

Let’s ensure the LLM is working as expected.

Start a pod that has access to the curl command:

bash
oc run --rm -it -n default curl --image=curlimages/curl:latest -- sh

Then run the following command to send a prompt to the LLM:

bash
curl -X "POST" \
 'http://meta-llama-3-2-1b-instruct.nim-service:8000/v1/chat/completions' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "meta/llama-3.2-1b-instruct",
        "messages": [
        {
          "content":"What is the capital of Canada?",
          "role": "user"
        }],
        "top_p": 1,
        "n": 1,
        "max_tokens": 1024,
        "stream": false,
        "frequency_penalty": 0.0,
        "stop": ["STOP"]
      }'
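
Several fields in the request above are optional sampling parameters. Assuming the endpoint follows the OpenAI-compatible chat completions schema, a smaller request with only the model, the messages, and a token limit should also work. The helpers below are a hypothetical sketch for sending other prompts from the same curl pod; they do no JSON escaping, so the prompt must not contain double quotes or backslashes.

```shell
# Hypothetical helper: build a minimal chat request body.
# $1 = prompt (no double quotes or backslashes), $2 = max_tokens (default 256)
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":%s}' \
    "meta/llama-3.2-1b-instruct" "$1" "${2:-256}"
}

# Send the prompt to the in-cluster endpoint used in this section.
ask_llm() {
  curl -s -X POST \
    'http://meta-llama-3-2-1b-instruct.nim-service:8000/v1/chat/completions' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d "$(chat_payload "$1")"
}
```

Usage: paste both functions into the pod's shell, then run ask_llm "What is the capital of Canada?"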

Deploy an Embeddings Model

We’re also going to deploy an embeddings model in our cluster, which will be used later in the workshop to implement Retrieval Augmented Generation (RAG).

Run the following command to deploy the embeddings model:

bash
oc apply -n nim-service -f nvidia-embeddings.yaml

Confirm that the NIMService is Ready:

bash
oc get nimservices.apps.nvidia.com llama-32-nv-embedqa-1b-v2 -n nim-service

Test the Embeddings Model

Let’s ensure the embeddings model is working as expected.

Start a pod that has access to the curl command:

bash
oc run --rm -it -n default curl --image=curlimages/curl:latest -- sh

Then run the following command to send a request to the embeddings model:

bash
curl -X POST 'http://llama-32-nv-embedqa-1b-v2.nim-service:8000/v1/embeddings' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["What is the capital of France?"],
    "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    "input_type": "query",
    "encoding_format": "float",
    "truncate": "NONE"
  }'
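
The request above embeds a question, which is why input_type is set to query. For retrieval, the documents being indexed are usually embedded separately; this sketch assumes the model also accepts input_type passage for that purpose. The helper name embed_payload is hypothetical, and it does no JSON escaping, so the text must not contain double quotes or backslashes.

```shell
# Hypothetical helper: build an embeddings request body.
# $1 = input_type (query or passage), $2 = text to embed
embed_payload() {
  case "$1" in
    query|passage) ;;
    *) echo "error: input_type must be query or passage" >&2; return 1 ;;
  esac
  printf '{"input":["%s"],"model":"%s","input_type":"%s","encoding_format":"float","truncate":"NONE"}' \
    "$2" "nvidia/llama-3.2-nv-embedqa-1b-v2" "$1"
}

# Example (run inside the curl pod):
# curl -s -X POST 'http://llama-32-nv-embedqa-1b-v2.nim-service:8000/v1/embeddings' \
#   -H 'Content-Type: application/json' \
#   -d "$(embed_payload passage 'Ottawa is the capital of Canada.')"
```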