ChatLlamaCpp

This notebook provides a quick overview of getting started with a chat model integrated with llama-cpp-python.

The example below demonstrates how to use it with the open-source Llama 3 Instruct 8B model.

Instantiation

Now we can instantiate our model object and generate chat completions:

import multiprocessing

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_community.chat_models.llamacpp import ChatLlamaCpp

llm = ChatLlamaCpp(
    temperature=0.3,
    model_path="./SanctumAI-meta-llama-3-8b-instruct.Q8_0.gguf",
    n_ctx=10000,
    n_gpu_layers=4,
    n_batch=200,  # Should be between 1 and n_ctx; consider the amount of VRAM in your GPU.
    max_tokens=512,
    n_threads=multiprocessing.cpu_count() - 1,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),  # Callbacks support token-wise streaming.
    streaming=True,
    repeat_penalty=1.5,
    top_p=0.5,
    stop=["<|end_of_text|>", "<|eot_id|>"],
    verbose=True,
)

API Reference: CallbackManager | StreamingStdOutCallbackHandler

llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/tni5hc/Documents/langchain_llamacpp/SanctumAI-meta-llama-3-8b-instruct.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A2000 12GB, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 4 repeating layers to GPU
llm_load_tensors: offloaded 4/33 layers to GPU
llm_load_tensors:        CPU buffer size =  8137.64 MiB
llm_load_tensors:      CUDA0 buffer size =   884.12 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 10016
llama_new_context_with_model: n_batch    = 200
llama_new_context_with_model: n_ubatch   = 200
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  1095.50 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   156.50 MiB
llama_new_context_with_model: KV self size  = 1252.00 MiB, K (f16):  626.00 MiB, V (f16):  626.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   633.29 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    10.77 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 312
AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.chat_template': "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}", 'tokenizer.ggml.eos_token_id': '128009', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'general.architecture': 'llama', 'llama.rope.freq_base': '500000.000000', 'tokenizer.ggml.pre': 'llama-bpe', 'llama.context_length': '8192', 'general.name': 'Meta-Llama-3-8B-Instruct', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.file_type': '7', 'llama.vocab_size': '128256', 'llama.rope.dimension_count': '128'}
Using gguf chat template: {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}
Using chat eos_token: <|eot_id|>
Using chat bos_token: <|begin_of_text|>
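
The KV cache figures in the loader output above can be reproduced from the model metadata, which helps when choosing n_ctx and n_gpu_layers for a given amount of memory. The sketch below assumes one f16 key vector and one f16 value vector of width n_embd_k_gqa per layer and per context position, and that the cache is split across devices in proportion to the offloaded layers. The helper name kv_cache_mib is illustrative and is not part of llama.cpp or LangChain.

```python
def kv_cache_mib(n_ctx, n_layer, n_embd_kv, bytes_per_element=2):
    """Return (MiB for K alone, MiB for K and V together)."""
    one_side_bytes = n_ctx * n_layer * n_embd_kv * bytes_per_element
    one_side_mib = one_side_bytes / (1024 * 1024)
    return one_side_mib, 2 * one_side_mib


# Values taken from the loader output:
# n_ctx = 10016, n_layer = 32, n_embd_k_gqa = 1024, f16 = 2 bytes per element.
k_mib, total_mib = kv_cache_mib(n_ctx=10016, n_layer=32, n_embd_kv=1024)

# Assumption: the cache is divided per layer, so 4 offloaded layers out of 32
# hold 4/32 of the total on the GPU.
gpu_mib = total_mib * 4 / 32
host_mib = total_mib - gpu_mib

print(f"K: {k_mib:.2f} MiB, K+V: {total_mib:.2f} MiB")
print(f"GPU: {gpu_mib:.2f} MiB, host: {host_mib:.2f} MiB")
```

Under these assumptions the numbers agree with the log: 626.00 MiB for K, 1252.00 MiB in total, 156.50 MiB on CUDA0 and 1095.50 MiB on the host. The estimate scales linearly with n_ctx, so halving the context window roughly halves the cache.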

Invocation

messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]

ai_msg = llm.invoke(messages)
ai_msg

Je adore le programmation.

(Note: "programmation" is used instead of just saying you like coding, as it's more formal and accurate in this context.)

llama_print_timings:        load time =    1178.78 ms
llama_print_timings:      sample time =      19.60 ms /    35 runs   (    0.56 ms per token,  1785.71 tokens per second)
llama_print_timings: prompt eval time =    1178.70 ms /    37 tokens (   31.86 ms per token,    31.39 tokens per second)
llama_print_timings:        eval time =    5458.55 ms /    34 runs   (  160.55 ms per token,     6.23 tokens per second)
llama_print_timings:       total time =    6837.98 ms /    71 tokens

AIMessage(content='Je adore le programmation.\n\n(Note: "programmation" is used instead of just saying you like coding, as it\'s more formal and accurate in this context.)', response_metadata={'finish_reason': 'stop'}, id='run-2922f6b5-eedb-40c0-a57d-6c3cc99f35e8-0')

print(ai_msg.content)

Je adore le programmation.

(Note: "programmation" is used instead of just saying you like coding, as it's more formal and accurate in this context.)
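
The chat template printed in the loader output determines how a message list is flattened into a single prompt string before generation. The following is a minimal pure-Python sketch of that template, not the code llama-cpp-python actually runs. It assumes that LangChain's "human" and "ai" roles are sent to the template as "user" and "assistant", and that the generation prompt is appended; render_llama3_prompt and ROLE_MAP are hypothetical names.

```python
BOS_TOKEN = "<|begin_of_text|>"

# Assumption: LangChain role names are mapped to these template role names.
ROLE_MAP = {"system": "system", "human": "user", "ai": "assistant"}


def render_llama3_prompt(messages, add_generation_prompt=True):
    """Mirror the GGUF chat template shown in the loader output."""
    parts = []
    for index, (role, content) in enumerate(messages):
        block = (
            "<|start_header_id|>" + ROLE_MAP.get(role, role) + "<|end_header_id|>\n\n"
            + content.strip()  # the template applies `trim` to the content
            + "<|eot_id|>"
        )
        if index == 0:
            # The BOS token is prepended to the first message only.
            block = BOS_TOKEN + block
        parts.append(block)
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    ("system", "You are a helpful assistant that translates English to French. Translate the user sentence."),
    ("human", "I love programming."),
]
prompt = render_llama3_prompt(messages)
print(prompt)
```

Each turn ends with <|eot_id|>, which is why that token appears in the stop list passed to the constructor.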

Chaining

We can chain our model with a prompt template like so:

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that translates {input_language} to {output_language}.",
        ),
        ("human", "{input}"),
    ]
)

chain = prompt | llm
chain.invoke(
    {
        "input_language": "English",
        "output_language": "German",
        "input": "I love programming.",
    }
)

API Reference: ChatPromptTemplate

Llama.generate: prefix-match hit
Ich liebe auch Programmieren! (Translation: I also like coding!) What kind of programs do you enjoy working on?

llama_print_timings:        load time =    1178.78 ms
llama_print_timings:      sample time =      13.53 ms /    25 runs   (    0.54 ms per token,  1847.34 tokens per second)
llama_print_timings: prompt eval time =     977.30 ms /    17 tokens (   57.49 ms per token,    17.39 tokens per second)
llama_print_timings:        eval time =    5317.98 ms /    24 runs   (  221.58 ms per token,     4.51 tokens per second)
llama_print_timings:       total time =    6424.94 ms /    41 tokens

AIMessage(content='Ich liebe auch Programmieren! (Translation: I also like coding!) What kind of programs do you enjoy working on?', response_metadata={'finish_reason': 'stop'}, id='run-c89365e7-93e8-4231-aea3-fab76ffc0cb4-0')
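
Conceptually, the | operator in the chain above feeds the output of the prompt template into the model. The plain-Python sketch below illustrates that idea only; it is not LangChain's implementation. Step, fill_prompt, and fake_model are hypothetical stand-ins, and fake_model replaces the real LLM call so the example runs without a model file.

```python
class Step:
    """Hypothetical stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # Composing two steps yields a step that runs them in sequence.
        return Step(lambda value: other.invoke(self.invoke(value)))


SYSTEM_TEMPLATE = "You are a helpful assistant that translates {input_language} to {output_language}."
HUMAN_TEMPLATE = "{input}"


def fill_prompt(variables):
    """Substitute the template variables and return (role, content) messages."""
    return [
        ("system", SYSTEM_TEMPLATE.format(**variables)),
        ("human", HUMAN_TEMPLATE.format(**variables)),
    ]


def fake_model(messages):
    """Placeholder for the LLM call; only reports what it received."""
    return f"model received {len(messages)} messages"


variables = {
    "input_language": "English",
    "output_language": "German",
    "input": "I love programming.",
}
chain = Step(fill_prompt) | Step(fake_model)
result = chain.invoke(variables)
print(result)
```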

Tool calling

Tool calling with ChatLlamaCpp works mostly the same way as OpenAI function calling.

OpenAI has a tool calling (we use "tool calling" and "function calling" interchangeably here) API that lets you describe tools and their arguments, and have the model return a JSON object with a tool to invoke and the inputs to that tool. Tool calling is extremely useful for building tool-using chains and agents, and for getting structured outputs from models more generally.

With ChatLlamaCpp.bind_tools, we can easily pass in Pydantic classes, dict schemas, LangChain tools, or even functions as tools to the model. Under the hood these are converted to OpenAI tool schemas, which look like:

{
    "name": "...",
    "description": "...",
    "parameters": {...}  # JSONSchema
}

and passed in every model invocation.
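
To make the schema shape concrete, here is a rough standard-library sketch of how a plain Python function could be turned into such a dictionary. It is not the converter LangChain uses: function_to_tool_schema and PYTHON_TO_JSON_TYPE are hypothetical names, and only a few simple annotation types are handled.

```python
import inspect

# Assumption: only these simple annotations are mapped; anything else falls back to "string".
PYTHON_TO_JSON_TYPE = {str: "string", int: "integer", float: "number", bool: "boolean"}


def function_to_tool_schema(func):
    """Build a name/description/parameters dict from a function signature."""
    properties = {}
    required = []
    for name, param in inspect.signature(func).parameters.items():
        properties[name] = {"type": PYTHON_TO_JSON_TYPE.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            # Parameters without a default value must be supplied by the model.
            required.append(name)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


def get_current_weather(location: str, unit: str = "celsius"):
    """Get the current weather in a given location"""
    return f"Now the weather in {location} is 22 {unit}"


schema = function_to_tool_schema(get_current_weather)
print(schema)
```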

However, the model cannot trigger a function/tool automatically; we need to force it by specifying the tool_choice parameter. This parameter is typically formatted as follows:

{"type": "function", "function": {"name": <<tool_name>>}}
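
As a small illustration of that format, the sketch below fills in the tool name and checks that it refers to a tool that is actually bound. make_tool_choice is a hypothetical helper, not a LangChain function.

```python
def make_tool_choice(tool_name, bound_tool_names):
    """Return a tool_choice dict that forces a call to `tool_name`."""
    if tool_name not in bound_tool_names:
        raise ValueError(f"Unknown tool {tool_name!r}; expected one of {sorted(bound_tool_names)}")
    return {"type": "function", "function": {"name": tool_name}}


tool_choice = make_tool_choice("get_current_weather", {"get_current_weather"})
print(tool_choice)
```

The name must match the tool's registered name, which can differ from the Python function name when a custom name is given to the tool.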

from langchain.tools import tool
from langchain_core.pydantic_v1 import BaseModel, Field


class WeatherInput(BaseModel):
    location: str = Field(description="The city and state, e.g. San Francisco, CA")
    unit: str = Field(enum=["celsius", "fahrenheit"])


@tool("get_current_weather", args_schema=WeatherInput)
def get_weather(location: str, unit: str):
    """Get the current weather in a given location"""
    return f"Now the weather in {location} is 22 {unit}"


# Force the model to call get_current_weather through tool_choice.
llm_with_tools = llm.bind_tools(
    tools=[get_weather],
    tool_choice={"type": "function", "function": {"name": "get_current_weather"}},
)

API Reference: tool

ai_msg = llm_with_tools.invoke(
    "what is the weather like in HCMC in celsius",
)
ai_msg

Llama.generate: prefix-match hit

llama_print_timings:        load time =    1178.78 ms
llama_print_timings:      sample time =     845.44 ms /    20 runs   (   42.27 ms per token,    23.66 tokens per second)
llama_print_timings: prompt eval time =     624.10 ms /    11 tokens (   56.74 ms per token,    17.63 tokens per second)
llama_print_timings:        eval time =    2785.06 ms /    19 runs   (  146.58 ms per token,     6.82 tokens per second)
llama_print_timings:       total time =    4400.99 ms /    30 tokens

AIMessage(content='', additional_kwargs={'function_call': {'name': 'get_current_weather', 'arguments': '{ "location": "Ho Chi Minh City", "unit" : "celsius"}'}, 'tool_calls': [{'index': 0, 'id': 'call__0_get_current_weather_cmpl-a6d145c7-358b-4e68-8c9e-49f3ee6802f6', 'type': 'function', 'function': {'name': 'get_current_weather', 'arguments': '{ "location": "Ho Chi Minh City", "unit" : "celsius"}'}}]}, response_metadata={'finish_reason': 'tool_calls'}, id='run-e6ec1cf7-299b-430c-bfa7-ac67322da3ec-0', tool_calls=[{'name': 'get_current_weather', 'args': {'location': 'Ho Chi Minh City', 'unit': 'celsius'}, 'id': 'call__0_get_current_weather_cmpl-a6d145c7-358b-4e68-8c9e-49f3ee6802f6'}])

ai_msg.tool_calls

[{'name': 'get_current_weather',
  'args': {'location': 'Ho Chi Minh City', 'unit': 'celsius'},
  'id': 'call__0_get_current_weather_cmpl-a6d145c7-358b-4e68-8c9e-49f3ee6802f6'}]
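
The model only returns a description of the call; running the tool is left to the caller. The sketch below shows one way to dispatch the parsed tool_calls entries to local functions. TOOL_REGISTRY and run_tool_calls are hypothetical names, the weather function is a plain-Python copy of the tool body above, and the tool_calls list is copied from the output shown rather than produced by a live model.

```python
import json


def get_current_weather(location: str, unit: str) -> str:
    """Plain-Python copy of the tool body used in this notebook."""
    return f"Now the weather in {location} is 22 {unit}"


# Map tool names, as the model reports them, to local callables.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}

# Copied from the ai_msg.tool_calls output above.
tool_calls = [
    {
        "name": "get_current_weather",
        "args": {"location": "Ho Chi Minh City", "unit": "celsius"},
        "id": "call__0_get_current_weather_cmpl-a6d145c7-358b-4e68-8c9e-49f3ee6802f6",
    }
]


def run_tool_calls(calls, registry):
    """Execute each requested tool and pair the result with its call id."""
    results = []
    for call in calls:
        func = registry[call["name"]]
        results.append({"tool_call_id": call["id"], "content": func(**call["args"])})
    return results


results = run_tool_calls(tool_calls, TOOL_REGISTRY)
print(results[0]["content"])

# In additional_kwargs the arguments arrive as a JSON string and need parsing first.
raw_arguments = '{ "location": "Ho Chi Minh City", "unit" : "celsius"}'
parsed_arguments = json.loads(raw_arguments)
```

Keeping the call id next to each result makes it possible to send the output back to the model as a tool message in a follow-up turn.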
