1. Download#
First, download the model from Hugging Face. You can download the original model or one of its variants. Here, I will download a Llama 3 70B variant, which is approximately 130GB in size.
huggingface-cli download cognitivecomputations/dolphin-2.9.1-llama-3-70b --cache-dir ./model
If you are downloading from a network in China, you can use the Hugging Face mirror https://hf-mirror.com/ instead.
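One common way to do that (a sketch; it assumes the huggingface_hub-based CLI from above, which reads the HF_ENDPOINT environment variable) is to point the CLI at the mirror before downloading:
# Point the CLI at the mirror for this shell session
export HF_ENDPOINT=https://hf-mirror.com
# Then run the same download command as before
huggingface-cli download cognitivecomputations/dolphin-2.9.1-llama-3-70b --cache-dir ./model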
2. Install llama.cpp#
Download and install from GitHub: https://github.com/ggerganov/llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
Installation is complete.
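As a quick sanity check (a sketch, assuming an older llama.cpp Makefile build that still produces the main and quantize binaries used later in this guide), you can confirm the build output and, optionally, rebuild with GPU offloading enabled:
# Confirm the binaries used in the following steps were built
ls -lh ./main ./quantize
# Optional: rebuild with CUDA offloading (newer Makefiles use LLAMA_CUDA=1, older ones LLAMA_CUBLAS=1)
make clean && make LLAMA_CUDA=1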
3. Convert the model to ggml format#
Stay in the llama.cpp directory.
python convert.py <path to the Hugging Face model directory> \
--outfile <output model name>.gguf \
--outtype f16 --vocab-type bpe
# Example
python convert.py ./model/models--cognitivecomputations--dolphin-2.9.1-llama-3-70b/snapshots/3f2d2fae186870be37ac83af1030d00a17766929 \
--outfile ./GGUF/dolphin-2.9.1-llama-3-70b-f16.gguf \
--outtype f16 --vocab-type bpe
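If you are unsure which snapshot directory to pass to convert.py, you can list the cache that huggingface-cli created in step 1 (a sketch, assuming the ./model cache directory used above):
# The hashed directory under snapshots/ is the path to pass to convert.py
ls ./model/models--cognitivecomputations--dolphin-2.9.1-llama-3-70b/snapshots/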
This process may take some time. Once it completes, you will get the dolphin-2.9.1-llama-3-70b-f16.gguf file, which is still around 130GB. You could run it as is, but it would require more than 140GB of GPU memory, which is generally not feasible. Therefore, we will quantize the file to reduce its size while sacrificing a little quality.
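Before moving on, it is worth confirming the conversion produced the expected file (a sketch; the path matches the example above):
# The f16 GGUF should be roughly the same size as the original weights (~130GB)
ls -lh ./GGUF/dolphin-2.9.1-llama-3-70b-f16.gguf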
4. Quantize the GGUF model#
First, let's list the available quantization options:
- q2_k: Specific tensors are set to higher precision, while others remain at the base level.
- q3_k_l, q3_k_m, q3_k_s: These variants use different levels of precision on different tensors to achieve a balance between performance and efficiency.
- q4_0: This is the original quantization scheme, using 4-bit precision.
- q4_1 and q4_k_m, q4_k_s: These provide different levels of accuracy and inference speed, suitable for scenarios that require a balance of resource usage.
- q5_0, q5_1, q5_k_m, q5_k_s: These versions ensure higher accuracy but use more resources and have slower inference speed.
- q6_k and q8_0: These provide the highest precision but may not be suitable for all users due to high resource consumption and slow speed.
We will use the Q4_K_M scheme.
Still in the llama.cpp directory: after compiling with make, there will be an executable file called quantize. If it doesn't exist, run make again and make sure it has execute permission.
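The list above is only a summary; the tool itself can print the full set of quantization types supported by your build (a sketch, assuming the older quantize binary name):
# Running quantize with no arguments prints its usage text, including the allowed quantization types (Q2_K, Q4_K_M, ...)
./quantize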
./quantize ./GGUF/dolphin-2.9.1-llama-3-70b-f16.gguf ./GGUF/dolphin-2.9.1-llama-3-70b-Q4_K_M.gguf Q4_K_M
After quantization, the file is around 40GB, so you can now run the model with 48GB of GPU memory, roughly halving the hardware cost.
5. Run inference#
You can run inference with llama.cpp itself, or with Ollama, which handles GGUF models in a more user-friendly way. For the specific code, see the official GitHub repository: https://github.com/ollama/ollama
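For example, a quick smoke test with llama.cpp's own main binary, followed by the Ollama route via a Modelfile (a minimal sketch; the model name my-dolphin and the prompt are placeholders, and -ngl sets how many layers to offload to the GPU):
# llama.cpp: load the quantized model, offload all layers to the GPU, and answer one prompt
./main -m ./GGUF/dolphin-2.9.1-llama-3-70b-Q4_K_M.gguf -ngl 99 -c 4096 -p "Hello, who are you?"
# Ollama: wrap the GGUF in a Modelfile, register it, then chat with it
echo "FROM ./GGUF/dolphin-2.9.1-llama-3-70b-Q4_K_M.gguf" > Modelfile
ollama create my-dolphin -f Modelfile
ollama run my-dolphin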
Summary#
This concludes the first part. If you encounter any issues, please discuss them in the comments.