billwanhua
Boss, try the original DeepSeek model. Someone compressed it from 720GB down to just 131GB, and you only need more than 80GB of combined CPU + GPU memory to run it smoothly. This is really fun: everyone can run the world's top AI themselves:
https://www.reddit.com/r/selfhosted/comments/1ic8zil/yes_you_can_run_deepseekr1_locally_on_your_device/
- We shrank R1, the 671B-parameter model, from 720GB to just 131GB (an 80% size reduction) while keeping it fully functional and great
- No, the dynamic GGUFs do not work directly with Ollama, but they do work on llama.cpp, which supports sharded GGUFs and disk mmap offloading (see the sketch after this list). For Ollama, you will need to merge the GGUFs manually using llama.cpp.
- Minimum requirements: a CPU with 20GB of RAM (but it will be slow) and 140GB of disk space (to download the model weights)
- Optimal requirements: the sum of your VRAM + RAM should be 80GB+ (this will run reasonably well)
- No, you do not need hundreds of GB of RAM + VRAM, but if you have that much, you can get 140 tokens/s of throughput and 14 tokens/s for single-user inference with 2x H100s
- Our open-source GitHub repo: github.com/unslothai/unsloth
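
For anyone who wants to try this, here is a minimal sketch of loading one of these dynamic GGUFs through the llama-cpp-python bindings. The shard file name and the n_gpu_layers value below are illustrative assumptions, not taken from the thread; set n_gpu_layers to whatever your VRAM can hold (0 for CPU only), and mmap keeps the rest of the 131GB on disk:

```python
# Minimal sketch: run a dynamic DeepSeek-R1 quant locally via llama-cpp-python.
# Assumes `pip install llama-cpp-python` (built with GPU support) and that the
# sharded GGUF has already been downloaded; the file name is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",  # point at the first shard; llama.cpp loads the rest
    n_gpu_layers=20,  # number of transformer layers to offload to VRAM; tune to your GPU
    n_ctx=4096,       # context window; larger contexts need more memory
    use_mmap=True,    # mmap the weights from disk instead of reading everything into RAM
)

out = llm("Why is the sky blue?", max_tokens=256)
print(out["choices"][0]["text"])
```

The mmap flag is what makes the "minimum requirements" bullet possible: weight pages are faulted in from disk on demand, trading speed for RAM. For Ollama, the shards would first need to be merged into a single file; if I remember the CLI correctly, llama.cpp ships a gguf-split tool with a merge mode for this.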