Note that modelPath is the only required parameter. For testing, you can set it in the LLAMA_PATH environment variable.

interface LlamaCppInputs {
    batchSize?: number;
    cache?: boolean | BaseCache<Generation[]>;
    callbackManager?: CallbackManager;
    callbacks?: Callbacks;
    contextSize?: number;
    disableStreaming?: boolean;
    embedding?: boolean;
    f16Kv?: boolean;
    gbnf?: string;
    gpuLayers?: number;
    jsonSchema?: object;
    logitsAll?: boolean;
    maxConcurrency?: number;
    maxRetries?: number;
    maxTokens?: number;
    metadata?: Record<string, unknown>;
    modelPath: string;
    onFailedAttempt?: FailedAttemptHandler;
    prependBos?: boolean;
    seed?: null | number;
    tags?: string[];
    temperature?: number;
    threads?: number;
    topK?: number;
    topP?: number;
    trimWhitespaceSuffix?: boolean;
    useMlock?: boolean;
    useMmap?: boolean;
    verbose?: boolean;
    vocabOnly?: boolean;
}
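As a minimal sketch of how these options might be assembled, the following builds an options object whose modelPath comes from LLAMA_PATH. The `LlamaOptionsSketch` type and the `optionsFromEnv` helper are hypothetical stand-ins declared locally so the example is self-contained; they are not part of the library, and the path and option values are placeholders.

```typescript
// Local stand-in for a few LlamaCppInputs fields (hypothetical; not imported from the library).
interface LlamaOptionsSketch {
  modelPath: string;
  contextSize?: number;
  gpuLayers?: number;
  temperature?: number;
}

// Hypothetical helper: reads the required modelPath from an environment-like record
// and merges in any optional overrides.
function optionsFromEnv(
  env: Record<string, string | undefined>,
  overrides: Omit<Partial<LlamaOptionsSketch>, "modelPath"> = {}
): LlamaOptionsSketch {
  const modelPath = env["LLAMA_PATH"];
  if (!modelPath) {
    throw new Error("LLAMA_PATH is not set; modelPath is required.");
  }
  return { ...overrides, modelPath };
}

// In Node.js the first argument would typically be process.env.
const exampleOptions = optionsFromEnv(
  { LLAMA_PATH: "/models/example.gguf" }, // placeholder path
  { contextSize: 2048, temperature: 0.8 }
);
```

Every field other than modelPath can be left out, in which case the defaults apply.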

Hierarchy

  • LlamaBaseCppInputs
  • BaseChatModelParams
    • LlamaCppInputs

Properties

batchSize?: number

Prompt processing batch size.

cache?: boolean | BaseCache<Generation[]>
callbackManager?: CallbackManager

Deprecated. Use callbacks instead.

callbacks?: Callbacks
contextSize?: number

Text context size.

disableStreaming?: boolean

Whether to disable streaming.

If streaming is bypassed, then stream() will defer to invoke().

  • If true, will always bypass streaming case.
  • If false (default), will always use streaming case if available.
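
The rule in the two bullets above can be sketched as a small decision function. `resolveStreamPath` is a hypothetical helper written for illustration only; it is not how the library is implemented, and the `streamingAvailable` flag is an assumed input.

```typescript
// Which code path a stream() call would take under the rule described above.
type StreamPath = "stream" | "invoke";

// Hypothetical helper: not a library function.
function resolveStreamPath(
  disableStreaming: boolean | undefined,
  streamingAvailable: boolean
): StreamPath {
  // true always bypasses streaming, so stream() defers to invoke().
  if (disableStreaming === true) {
    return "invoke";
  }
  // false (or unset) uses streaming whenever it is available.
  return streamingAvailable ? "stream" : "invoke";
}
```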
embedding?: boolean

Embedding mode only.

f16Kv?: boolean

Use fp16 for KV cache.

gbnf?: string

GBNF string to be used to format output. Also known as grammar.
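
For illustration, a small GBNF grammar that restricts output to "yes" or "no" might look like the sketch below. The grammar, the variable names, and the model path are made up for this example; only the `gbnf` and `modelPath` field names come from this page.

```typescript
// A two-rule GBNF grammar: the root rule expands to a single yes/no answer.
const yesNoGrammar: string = [
  'root ::= answer',
  'answer ::= "yes" | "no"',
].join("\n");

// The grammar string would be supplied as the gbnf field alongside the required modelPath.
const grammarOptions = {
  modelPath: "/models/example.gguf", // placeholder path
  gbnf: yesNoGrammar,
};
```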

gpuLayers?: number

Number of layers to store in VRAM.

jsonSchema?: object

JSON schema to be used to format output. Also known as grammar.
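
A minimal sketch of a schema that could be passed here, assuming the output should be an object with a name and an optional age. The schema contents and the model path are illustrative, and which JSON Schema keywords are actually honored is not stated on this page.

```typescript
// Illustrative schema: an object with a required string and an optional number.
const personSchema: Record<string, unknown> = {
  type: "object",
  properties: {
    name: { type: "string" },
    age: { type: "number" },
  },
  required: ["name"],
};

// The schema would be supplied as the jsonSchema field alongside the required modelPath.
const schemaOptions = {
  modelPath: "/models/example.gguf", // placeholder path
  jsonSchema: personSchema,
};
```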

logitsAll?: boolean

The llama_eval() call computes all logits, not just the last one.

maxConcurrency?: number

The maximum number of concurrent calls that can be made. Defaults to Infinity, which means no limit.

maxRetries?: number

The maximum number of retries that can be made for a single call, with an exponential backoff between each attempt. Defaults to 6.
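
To make "exponential backoff" concrete, the sketch below computes a delay schedule in which each wait is a fixed multiple of the previous one. The 1000 ms base delay and the factor of 2 are assumptions chosen for illustration; this page does not state the timings the library actually uses, and `backoffDelaysMs` is a hypothetical helper.

```typescript
// Hypothetical helper: returns the wait before each retry, in milliseconds.
function backoffDelaysMs(maxRetries: number, baseMs = 1000, factor = 2): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    // Delay grows geometrically: baseMs, baseMs * factor, baseMs * factor^2, ...
    delays.push(baseMs * Math.pow(factor, attempt));
  }
  return delays;
}

// With the default of 6 retries and the assumed base and factor:
const defaultSchedule = backoffDelaysMs(6);
// [1000, 2000, 4000, 8000, 16000, 32000]
```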

maxTokens?: number
metadata?: Record<string, unknown>
modelPath: string

Path to the model on the filesystem.

onFailedAttempt?: FailedAttemptHandler

Custom handler to handle failed attempts. Takes the originally thrown error object as input, and should itself throw an error if the input error is not retryable.
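
A handler following this contract might look like the sketch below. The policy it applies (treating TypeError and RangeError as non-retryable) is an assumption made for the example, not a recommendation from the library.

```typescript
// Example handler: rethrows errors judged non-retryable so that retrying stops,
// and returns normally for everything else so that the next attempt proceeds.
function exampleFailedAttemptHandler(error: unknown): void {
  // Assumed policy: programming errors will not be fixed by retrying.
  if (error instanceof TypeError || error instanceof RangeError) {
    throw error;
  }
}
```

The function would be supplied as the onFailedAttempt field of the options object.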

prependBos?: boolean

Add the beginning-of-sentence token.

seed?: null | number

If null, a random seed will be used.

tags?: string[]
temperature?: number

Controls the randomness of the responses, e.g. 0.1 is close to deterministic, 0.8 is balanced, and 1.5 is creative. 0 disables it.
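
The effect of temperature can be illustrated with a softmax over temperature-scaled logits. This is a generic sketch of the technique, not the library's sampling code; treating 0 as greedy selection of the most likely token is an assumption about what "disables" means here.

```typescript
// Converts raw logits into probabilities after dividing by the temperature.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (temperature <= 0) {
    // Assumed behavior for 0: all probability mass goes to the most likely token.
    const maxIndex = logits.indexOf(Math.max(...logits));
    return logits.map((_, i) => (i === maxIndex ? 1 : 0));
  }
  const scaled = logits.map((logit) => logit / temperature);
  // Subtract the maximum before exponentiating for numerical stability.
  const maxScaled = Math.max(...scaled);
  const exps = scaled.map((s) => Math.exp(s - maxScaled));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return exps.map((e) => e / total);
}
```

Lower temperatures concentrate probability on the highest logit, while higher temperatures flatten the distribution.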

threads?: number

Number of threads to use to evaluate tokens.

topK?: number

Consider only the n most likely tokens, where n ranges from 1 to the vocabulary size. 0 disables the limit (uses the full vocabulary). Note: only applies when temperature > 0.

topP?: number

Selects the smallest token set whose cumulative probability exceeds P, where P is between 0 and 1. 1 disables it. Note: only applies when temperature > 0.
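
The topK and topP descriptions can be sketched as two filters over a list of candidate tokens. This is a generic illustration of top-k and nucleus (top-p) filtering, not the library's code; the `TokenProb` type and both function names are hypothetical, and the sketch stops once the cumulative probability reaches or exceeds P.

```typescript
// Hypothetical candidate representation: a token and its probability.
interface TokenProb {
  token: string;
  prob: number;
}

// Returns a copy of the candidates sorted from most to least likely.
function sortByProbDesc(candidates: TokenProb[]): TokenProb[] {
  return candidates.slice().sort((a, b) => b.prob - a.prob);
}

// Keeps the k most likely tokens; k <= 0 keeps the full vocabulary.
function topKFilter(candidates: TokenProb[], k: number): TokenProb[] {
  const sorted = sortByProbDesc(candidates);
  return k <= 0 ? sorted : sorted.slice(0, k);
}

// Keeps the smallest prefix whose cumulative probability reaches p; p >= 1 keeps everything.
function topPFilter(candidates: TokenProb[], p: number): TokenProb[] {
  const sorted = sortByProbDesc(candidates);
  if (p >= 1) {
    return sorted;
  }
  const kept: TokenProb[] = [];
  let cumulative = 0;
  for (const candidate of sorted) {
    kept.push(candidate);
    cumulative += candidate.prob;
    if (cumulative >= p) {
      break;
    }
  }
  return kept;
}

const sampleCandidates: TokenProb[] = [
  { token: "c", prob: 0.15 },
  { token: "a", prob: 0.5 },
  { token: "d", prob: 0.05 },
  { token: "b", prob: 0.3 },
];
```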

trimWhitespaceSuffix?: boolean

Trim whitespace from the end of the generated text. Disabled by default.

useMlock?: boolean

Force system to keep model in RAM.

useMmap?: boolean

Use mmap if possible.

verbose?: boolean
vocabOnly?: boolean

Only load the vocabulary, no weights.