llama.cpp
=========

llama.cpp is an inference engine for large language models. It is written primarily in C++ and is composed of a server, a CLI, and an execution engine, `ggml`.

Here are some notes about the project.

* `app/llama.cpp` contains the entry point of the general `llama` program. The way the commands are split is very clean: a static table maps each command name (and its aliases) to a function, and `main` simply dispatches on the first argument:

  .. code-block:: cpp

     struct command {
         const char * name;
         const char * desc;
         std::vector<std::string> aliases;
         bool hidden;
         int (*func)(int, char **);
     };

     static const command cmds[] = {
         {"serve",  "HTTP API server",                    {"server"}, false, llama_server },
         {"cli",    "Command-line interactive interface", {"client"}, false, llama_cli    },
         {"update", "Update llama to the latest release", {},         false, llama_update },
         // ...
     };

     // ...

     static int help(int argc, char ** argv) {
         // ...
     }

     // true if arg is the command's name or one of its aliases
     static bool matches(const std::string & arg, const command & cmd) {
         if (arg == cmd.name) {
             return true;
         }
         for (const auto & alias : cmd.aliases) {
             if (arg == alias) {
                 return true;
             }
         }
         return false;
     }

     int main(int argc, char ** argv) {
         progname = argv[0];

         // no subcommand given: fall back to "help"
         const std::string arg = argc >= 2 ? argv[1] : "help";

         for (const auto & cmd : cmds) {
             if (matches(arg, cmd)) {
                 // keep cmd.name so the router's child processes re-invoke correctly
     #ifdef _WIN32
                 _putenv_s("LLAMA_APP_CMD", cmd.name);
     #else
                 setenv("LLAMA_APP_CMD", cmd.name, 1);
     #endif
                 // drop the program name so the subcommand sees itself as argv[0]
                 return cmd.func(argc - 1, argv + 1);
             }
         }

         fprintf(stderr, "error: unknown command '%s'\n", arg.c_str());
         return 1;
     }

* All execution engines are under `ggml/`. There is a general abstraction of a backend in `ggml/include/ggml-backend.h`, where the key types are the device `ggml_backend_dev_t`, the memory `ggml_backend_buffer_t`, and the actual execution unit (like a Vulkan implementation) `ggml_backend_t`.
* There is a general model abstraction, `struct llama_model_base`.
* Since different models and different accelerators require different logic, the execution is split between a graph building phase and a graph execution phase. The building itself is composed of two passes: a model-dependent pass and a backend-dependent pass.
* Multimodal support is implemented in `tools/mtmd/`. The cool thing is that you can use the same backends for all kinds of models; you just need to write the graph building part.
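
As a rough illustration of the build-then-execute split mentioned above, here is a toy sketch. It is not llama.cpp or ggml code: `graph`, `node`, `toy_backend`, `build_model_graph`, `assign_backends` and `execute` are all hypothetical names made up for this note, and reducing the backend-dependent pass to "tag each node with a device name" is a simplifying assumption.

.. code-block:: cpp

   #include <cstddef>
   #include <string>
   #include <vector>

   // Hypothetical toy types; none of these are ggml or llama.cpp identifiers.
   enum class op { input, add, mul };

   struct node {
       op          kind;
       int         lhs;    // index of the first operand, -1 for inputs
       int         rhs;    // index of the second operand, -1 for inputs
       std::string device; // filled in by the backend-dependent pass
   };

   struct graph {
       std::vector<node> nodes;

       int add_node(op kind, int lhs = -1, int rhs = -1) {
           nodes.push_back({kind, lhs, rhs, ""});
           return (int) nodes.size() - 1;
       }
   };

   // Build pass 1 (model-dependent): describe y = (a + b) * c, compute nothing.
   static graph build_model_graph() {
       graph g;
       const int a   = g.add_node(op::input);
       const int b   = g.add_node(op::input);
       const int c   = g.add_node(op::input);
       const int sum = g.add_node(op::add, a, b);
       g.add_node(op::mul, sum, c);
       return g;
   }

   struct toy_backend {
       std::string     name;
       std::vector<op> supported;

       bool supports(op kind) const {
           for (op s : supported) {
               if (s == kind) {
                   return true;
               }
           }
           return false;
       }
   };

   // Build pass 2 (backend-dependent): give each node to the first backend
   // in the list that supports its operation.
   static void assign_backends(graph & g, const std::vector<toy_backend> & backends) {
       for (node & n : g.nodes) {
           for (const toy_backend & b : backends) {
               if (b.supports(n.kind)) {
                   n.device = b.name;
                   break;
               }
           }
       }
   }

   // Execution phase: walk the nodes in order and evaluate them.
   // Inputs are consumed in the order the input nodes were created.
   static float execute(const graph & g, const std::vector<float> & inputs) {
       std::vector<float> values(g.nodes.size(), 0.0f);
       std::size_t next_input = 0;
       for (std::size_t i = 0; i < g.nodes.size(); ++i) {
           const node & n = g.nodes[i];
           switch (n.kind) {
               case op::input: values[i] = inputs.at(next_input++);        break;
               case op::add:   values[i] = values[n.lhs] + values[n.rhs];  break;
               case op::mul:   values[i] = values[n.lhs] * values[n.rhs];  break;
           }
       }
       return values.empty() ? 0.0f : values.back();
   }

With the backends listed as `gpu` (supporting only `mul`) followed by `cpu` (supporting everything), the `add` node lands on `cpu` and the `mul` node on `gpu`, and `execute(g, {1.0f, 2.0f, 3.0f})` returns 9. Supporting another model in this toy setup only means writing another `build_model_graph`; the other two functions stay untouched, which is the point made in the last bullet.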