Workshop

Review the LLM Application

15 minutes

In the final step of the workshop, we’ll deploy an application that uses the instruct and embeddings models to our OpenShift cluster.

What is LangChain?

Like most applications that interact with LLMs, our application is written in Python. It also uses LangChain, an open-source orchestration framework that simplifies the development of applications powered by LLMs.

Application Overview

Connect to the LLMs

Our application starts by connecting to the two models it uses:

python
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

# Connect to an LLM NIM at the given endpoint and select a specific model
llm = ChatNVIDIA(base_url=INSTRUCT_MODEL_URL, model="meta/llama-3.2-1b-instruct")

# Initialize and connect to a NeMo Retriever Text Embedding NIM (nvidia/llama-3.2-nv-embedqa-1b-v2)
embeddings_model = NVIDIAEmbeddings(model="nvidia/llama-3.2-nv-embedqa-1b-v2",
                                   base_url=EMBEDDINGS_MODEL_URL)

Why are there two models? Here’s a helpful analogy:

  • The Embedding model is the “Librarian” (it helps find the right books).
  • The Instruct model is the “Writer” (it reads the books and writes the answer).
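
To make the Librarian role a little more concrete, the sketch below ranks a few documents against a question by cosine similarity. It is only a toy: `toy_embed` is a hypothetical word-count stand-in for the real embedding NIM, and the sample documents are made up for illustration.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_closest(question: str, documents: list[str]) -> str:
    """Return the document whose toy embedding is most similar to the question's."""
    question_vector = toy_embed(question)
    return max(documents, key=lambda doc: cosine_similarity(question_vector, toy_embed(doc)))


# Made-up documents, for illustration only.
documents = [
    "gpu memory bandwidth and capacity",
    "power supply and cooling requirements",
    "network interface options",
]
closest = find_closest("how much gpu memory", documents)
print(closest)
```

A real embedding model captures meaning rather than exact word overlap, but the retrieval step follows the same shape: embed the question, compare it with the stored vectors, and hand the closest matches to the Writer.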

Define the Prompt Template

The application then defines a prompt template that will be used in interactions with the meta/llama-3.2-1b-instruct LLM:

python
from langchain_core.prompts import ChatPromptTemplate

# Adjacent string literals are joined as-is, so each sentence ends with a space.
prompt = ChatPromptTemplate.from_messages([
    ("system",
        "You are a helpful and friendly AI! "
        "Your responses should be concise and no longer than two sentences. "
        "Do not hallucinate. Say you don't know if you don't have this information. "
        "Answer the question using only the context."
        "\n\nQuestion: {question}\n\nContext: {context}"
    ),
    ("user", "{question}")
])

Note that the prompt explicitly instructs the LLM to say it doesn’t know when it lacks the information, which helps reduce hallucinations. The {context} placeholder is where the application supplies the retrieved text that the LLM should use to answer the question.
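
One detail worth checking in a template written this way: Python joins adjacent string literals with nothing in between, so each fragment needs its own trailing space or newline. The sketch below uses plain `str.format` rather than LangChain to show that effect and to show how named placeholders get filled. The sample context and question are placeholders made up for illustration.

```python
# Adjacent string literals are concatenated exactly as written.
without_spaces = (
    "You are a helpful and friendly AI!"
    "Do not hallucinate."
)
with_spaces = (
    "You are a helpful and friendly AI! "
    "Do not hallucinate."
)

# Named placeholders are replaced by the values passed in.
template = "Answer using only the context.\n\nContext: {context}\n\nQuestion: {question}"
filled = template.format(
    context="(retrieved text goes here)",
    question="(user question goes here)",
)

print(without_spaces)  # the two sentences run together
print(with_spaces)     # the two sentences are separated by a space
print(filled)
```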

Connect to the Vector Database

The application then connects to the vector database that was pre-populated with NVIDIA data sheet documents:

python
    weaviate_client = weaviate.connect_to_custom(
        http_host=os.getenv('WEAVIATE_HTTP_HOST'),
        http_port=os.getenv('WEAVIATE_HTTP_PORT'),
        http_secure=False,
        grpc_host=os.getenv('WEAVIATE_GRPC_HOST'),
        grpc_port=os.getenv('WEAVIATE_GRPC_PORT'),
        grpc_secure=False
    )
        
    vector_store = WeaviateVectorStore(
        client=weaviate_client,
        embedding=embeddings_model,
        index_name="CustomDocs",
        text_key="page_content"
    )
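
Environment variables always arrive as strings, so it can help to check that they are set and convert the port values before handing them to the client. The helper below is a hypothetical sketch, not part of the workshop application: `WeaviateSettings` and `read_weaviate_settings` are made-up names, and only the four environment variable names come from the code above.

```python
import os
from dataclasses import dataclass


@dataclass
class WeaviateSettings:
    """Hypothetical container for the connection values."""
    http_host: str
    http_port: int
    grpc_host: str
    grpc_port: int


def read_weaviate_settings(env=os.environ) -> WeaviateSettings:
    """Read the four WEAVIATE_* variables, failing early with a clear message."""
    names = [
        "WEAVIATE_HTTP_HOST",
        "WEAVIATE_HTTP_PORT",
        "WEAVIATE_GRPC_HOST",
        "WEAVIATE_GRPC_PORT",
    ]
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return WeaviateSettings(
        http_host=env["WEAVIATE_HTTP_HOST"],
        http_port=int(env["WEAVIATE_HTTP_PORT"]),  # convert "8080" -> 8080
        grpc_host=env["WEAVIATE_GRPC_HOST"],
        grpc_port=int(env["WEAVIATE_GRPC_PORT"]),
    )
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out without touching the real environment.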

Define the Chain

The application uses LCEL (LangChain Expression Language) to define the chain. The | (pipe) symbol works like an assembly line; the output of one step becomes the input for the next.

python
    chain = (
        {
            "context": vector_store.as_retriever(),
            "question": RunnablePassthrough()
        }
        | prompt
        | llm
        | StrOutputParser()
    )

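Let’s break this down step by step:

  • The dictionary runs first. The end user’s question is sent to the retriever, which returns the most relevant documents from the vector database as "context", while RunnablePassthrough() forwards the question unchanged as "question".
  • prompt fills the {context} and {question} placeholders in the prompt template.
  • llm sends the completed prompt to the instruct model and returns its response message.
  • StrOutputParser() extracts the text of that response as a plain string.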
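
To see why the pipe reads like an assembly line, here is a minimal pure-Python imitation of the idea. It is not LangChain's implementation: the `Step` class and the four toy steps are hypothetical, and only the use of `|` to mean "feed this output into the next step" mirrors LCEL.

```python
class Step:
    """A hypothetical wrapper that lets plain functions be chained with |."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # Build a new step that runs self first, then passes the result to other.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# Toy stand-ins for the retriever/passthrough, the prompt, the LLM, and the parser.
gather = Step(lambda question: {"question": question, "context": "toy context"})
fill_prompt = Step(
    lambda inputs: f"Context: {inputs['context']}\nQuestion: {inputs['question']}"
)
fake_llm = Step(lambda prompt_text: {"content": f"(answer based on) {prompt_text}"})
parse = Step(lambda message: message["content"])

toy_chain = gather | fill_prompt | fake_llm | parse
result = toy_chain.invoke("What is a NIM?")
print(result)
```

Each `|` simply wraps the two neighbouring steps into a bigger step, which is why the whole chain can still be started with a single `invoke` call.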

Invoke the Chain

Finally, the application invokes the chain by passing the end user’s question in as input:

python
    response = chain.invoke(question)

This is the “Start” button. You drop the end user’s question into the beginning of the pipeline, and it flows through the retriever, the prompt, and the LLM until the answer comes out the other side.
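
In practice it can be useful to wrap the call so that empty input and backend failures are handled in one place. The sketch below is an assumption about how that might look, not code from the workshop application: `answer_question` and `EchoChain` are hypothetical names, and the only thing taken from the application is that the chain exposes an `invoke` method.

```python
def answer_question(chain, question: str) -> str:
    """Invoke the chain, guarding against empty input and runtime failures."""
    cleaned = question.strip()
    if not cleaned:
        return "Please enter a question."
    try:
        return chain.invoke(cleaned)
    except Exception as error:  # deliberately broad for a demo
        return f"Sorry, something went wrong: {error}"


class EchoChain:
    """Hypothetical stand-in for the real chain, used only to try the wrapper."""

    def invoke(self, question: str) -> str:
        return f"You asked: {question}"


print(answer_question(EchoChain(), "  What is a NIM?  "))
print(answer_question(EchoChain(), "   "))
```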
