llama.cpp
llama.cpp is an inference engine for large language models. It is written primarily in C++ and is composed of a server, a CLI, and an execution engine, ggml. Here are some notes about the project.
app/llama.cpp contains the entry point of the general llama program. The way they split the commands is very clean:
struct command {
    const char * name;
    const char * desc;
    std::vector<std::string> aliases;
    bool hidden;
    int (*func)(int, char **);
};

static const command cmds[] = {
    {"serve",  "HTTP API server",                    {"server"}, false, llama_server },
    {"cli",    "Command-line interactive interface", {"client"}, false, llama_cli    },
    {"update", "Update llama to the latest release", {},         false, llama_update },
    // ...
};

// ...

static int help(int argc, char ** argv) {
    // ...
}

static bool matches(const std::string & arg, const command & cmd) {
    if (arg == cmd.name) {
        return true;
    }
    for (const auto & alias : cmd.aliases) {
        if (arg == alias) {
            return true;
        }
    }
    return false;
}

int main(int argc, char ** argv) {
    progname = argv[0];
    const std::string arg = argc >= 2 ? argv[1] : "help";
    for (const auto & cmd : cmds) {
        if (matches(arg, cmd)) {
            // keep cmd.name so the router's child processes re-invoke correctly
#ifdef _WIN32
            _putenv_s("LLAMA_APP_CMD", cmd.name);
#else
            setenv("LLAMA_APP_CMD", cmd.name, 1);
#endif
            return cmd.func(argc - 1, argv + 1);
        }
    }
    fprintf(stderr, "error: unknown command '%s'\n", arg.c_str());
    return 1;
}
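One detail of the dispatch worth spelling out: each handler is called with argc - 1 and argv + 1, so the subcommand finds its own name in argv[0] and can parse options as if it were a standalone main. Here is a minimal sketch of just that behaviour. The names echo_cmd, demo_cmds, dispatch, seen_argc and seen_argv0 are hypothetical stand-ins and not part of the project.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-ins, not llama.cpp code: a one-entry command table
// with the same shape as the command struct in the excerpt.
struct command {
    const char * name;
    std::vector<std::string> aliases;
    int (*func)(int, char **);
};

// Record what the handler received so the argv shift is observable.
static std::string seen_argv0;
static int         seen_argc = 0;

static int echo_cmd(int argc, char ** argv) {
    seen_argc  = argc;
    seen_argv0 = argc > 0 ? argv[0] : "";
    return 0;
}

static const command demo_cmds[] = {
    {"echo", {"say"}, echo_cmd},
};

// Same lookup-then-shift logic as the excerpt's main(), minus the
// environment variable.
static int dispatch(int argc, char ** argv) {
    assert(argv != nullptr);
    const std::string arg = argc >= 2 ? argv[1] : "help";
    for (const auto & cmd : demo_cmds) {
        bool hit = arg == cmd.name;
        for (const auto & alias : cmd.aliases) {
            hit = hit || arg == alias;
        }
        if (hit) {
            // Drop the program name: the handler's argv[0] is the subcommand.
            return cmd.func(argc - 1, argv + 1);
        }
    }
    std::fprintf(stderr, "error: unknown command '%s'\n", arg.c_str());
    return 1;
}
```

When the user types an alias, the handler sees the alias in argv[0] rather than the canonical name. That is presumably why the original also exports cmd.name through LLAMA_APP_CMD, although the notes do not confirm this.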
All execution engines are under ggml/. There is a general abstraction of a backend in ggml/include/ggml-backend.h, where the key types are the device (ggml_backend_dev_t), the memory (ggml_backend_buffer_t), and the actual execution unit, such as a Vulkan implementation (ggml_backend_t).
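To make the relationship between these three roles concrete, here is a toy model of it. It is not the ggml API: toy_device, toy_buffer, toy_backend and toy_backend_init are hypothetical names, and the real types expose far more than this. The sketch only assumes what is stated here: a device describes a piece of hardware, a buffer is memory that belongs to a device, and a backend is the unit that actually runs work against buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy types, loosely mirroring the roles of
// ggml_backend_dev_t / ggml_backend_buffer_t / ggml_backend_t.

struct toy_device {
    std::string name;       // e.g. "CPU" or "Vulkan0"
    size_t      free_bytes; // memory still available on the device
};

struct toy_buffer {
    toy_device *       dev;  // the device that owns this memory
    std::vector<float> data; // host vector standing in for device memory
};

struct toy_backend {
    toy_device * dev; // the device this execution unit runs on

    // Allocate n floats on the backend's device.
    toy_buffer alloc(size_t n) {
        const size_t bytes = n * sizeof(float);
        assert(bytes <= dev->free_bytes && "device out of memory");
        dev->free_bytes -= bytes;
        return toy_buffer{dev, std::vector<float>(n, 0.0f)};
    }

    // Run a trivial "kernel": multiply every element by s, in place.
    void scale(toy_buffer & buf, float s) {
        assert(buf.dev == dev && "buffer lives on another device");
        for (float & x : buf.data) {
            x *= s;
        }
    }
};

// Create an execution unit for a device.
static toy_backend toy_backend_init(toy_device & dev) {
    return toy_backend{&dev};
}
```

The point of the separation is that the same calling code can be handed a different device and still work, as long as buffers are only ever used with a backend on the device that owns them.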
There is a general model abstraction, struct llama_model_base.
Since different models and different accelerators require different logic, execution is split into a graph-building phase and a graph-execution phase. The building itself consists of two passes: a model-dependent pass and a backend-dependent pass.
Multimodal support is implemented in tools/mtmd/. The cool thing is that you can use the same backends for all kinds of models; you just need to write the graph-building part.
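The build/execute split, and the reason new model types only need a new graph builder, can be illustrated with a toy sketch. Everything in it is hypothetical (node, graph, build_affine, build_square_sum, assign_backends, execute) and far simpler than the real code. In particular, the backend-dependent pass is reduced to tagging each node with a backend name; what the real second pass does is not covered in these notes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class op { input, add, mul };

struct node {
    op          kind;
    int         lhs;     // index of the first operand, -1 for inputs
    int         rhs;     // index of the second operand, -1 for inputs
    float       value;   // input value, or the result after execution
    std::string backend; // filled in by the backend-dependent pass
};

using graph = std::vector<node>;

// Pass 1 (model-dependent): each "model" only knows how to emit its graph.
// Builders append operands before their users, so nodes stay in
// topological order.
static graph build_affine(float x, float w, float b) { // x * w + b
    return {
        {op::input, -1, -1, x, ""},
        {op::input, -1, -1, w, ""},
        {op::input, -1, -1, b, ""},
        {op::mul, 0, 1, 0.0f, ""},
        {op::add, 3, 2, 0.0f, ""},
    };
}

static graph build_square_sum(float x, float y) { // x * x + y * y
    return {
        {op::input, -1, -1, x, ""},
        {op::input, -1, -1, y, ""},
        {op::mul, 0, 0, 0.0f, ""},
        {op::mul, 1, 1, 0.0f, ""},
        {op::add, 2, 3, 0.0f, ""},
    };
}

// Pass 2 (backend-dependent): decide where each node runs. Toy rule: the
// accelerator always supports mul, supports add only if accel_has_add is
// set, and everything else falls back to "cpu".
static void assign_backends(graph & g, const std::string & accel, bool accel_has_add) {
    for (node & n : g) {
        const bool on_accel = n.kind == op::mul || (n.kind == op::add && accel_has_add);
        n.backend = on_accel ? accel : "cpu";
    }
}

// Execution phase: knows nothing about which model produced the graph.
static float execute(graph & g) {
    assert(!g.empty());
    for (node & n : g) {
        if (n.kind == op::input) {
            continue;
        }
        assert(!n.backend.empty() && "run the backend pass first");
        const float a = g[static_cast<size_t>(n.lhs)].value;
        const float b = g[static_cast<size_t>(n.rhs)].value;
        n.value = n.kind == op::add ? a + b : a * b;
    }
    return g.back().value;
}
```

Both builders go through the same assign_backends and execute functions unchanged. That is the property the notes highlight: a new kind of model contributes a builder, and the rest of the pipeline is reused.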