Generate text with distributed [LLaMA 2](https://ai.meta.com/llama/) ([70B](https://huggingface.co/meta-llama/Llama-2-70b-hf), [70B-Chat](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)), [LLaMA-65B](https://github.com/facebookresearch/llama/blob/llama_v1/MODEL_CARD.md), [Guanaco-65B](https://huggingface.co/timdettmers/guanaco-65b), or [BLOOM-176B](https://huggingface.co/bigscience/bloom) and fine-tune them for your own tasks — right from your desktop computer or Google Colab:

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "enoch/llama-65b-hf"
# You can also use "meta-llama/Llama-2-70b-hf", "meta-llama/Llama-2-70b-chat-hf",
# "bigscience/bloom", or "bigscience/bloomz"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)
# Embeddings & prompts are on your device, transformer blocks are distributed across the Internet

inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))  # A cat sat on a mat...
```

<p align="center">
🚀 <b><a href="https://colab.research.google.com/drive/1uCphNY7gfAUkdDrTx21dZZwCOUDCMPw8?usp=sharing">Try now in Colab</a></b>
</p>

📋 Make sure you follow the model's terms of use (see [LLaMA 2](https://bit.ly/llama2-license), [LLaMA](https://bit.ly/llama-license), and [BLOOM](https://bit.ly/bloom-license) licenses).

🔏 Your data will be processed by other people in the public swarm. Learn more about privacy [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety). For sensitive data, you can set up a [private swarm](https://github.com/bigscience-workshop/petals/wiki/Launch-your-own-swarm) among people you trust.
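
For illustration, here is a minimal sketch of connecting a client to such a private swarm; the bootstrap multiaddr below is a placeholder for your own peer's address, and the `initial_peers` argument follows the client API described in the wiki page linked above:

```python
from petals import AutoDistributedModelForCausalLM

# Placeholder multiaddr of a bootstrap peer in your private swarm
INITIAL_PEERS = ["/ip4/10.0.0.1/tcp/31337/p2p/QmExamplePeerId"]

# Point the client at your swarm instead of the public one
model = AutoDistributedModelForCausalLM.from_pretrained(
    "enoch/llama-65b-hf", initial_peers=INITIAL_PEERS
)
```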

Run a Petals server in an [Anaconda](https://www.anaconda.com) env (requires Linux).
This will host a part of LLaMA-65B with optional [Guanaco](https://huggingface.co/timdettmers/guanaco-65b) adapters on your machine. You can also host `meta-llama/Llama-2-70b-hf`, `meta-llama/Llama-2-70b-chat-hf`, `bigscience/bloom`, `bigscience/bloomz`, and other compatible models from 🤗 [Model Hub](https://huggingface.co/models), or [add support](https://github.com/bigscience-workshop/petals/wiki/Run-a-custom-model-with-Petals) for new model architectures.
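
As an illustration, here is a rough sketch of invoking the server entry point from Python (equivalent to running the same command in a terminal); the model name and the `--adapters` flag mirror the paragraph above, but treat the exact flags as assumptions that may vary across Petals versions:

```python
import subprocess

# Launch the Petals server CLI from your Anaconda env; it downloads and
# serves a slice of the model's transformer blocks to the swarm.
subprocess.run([
    "python", "-m", "petals.cli.run_server",
    "enoch/llama-65b-hf",                     # model to host a part of
    "--adapters", "timdettmers/guanaco-65b",  # optional adapters (assumed flag)
])
```
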
🔒 Hosting a server does not allow others to run custom code on your computer. Learn more about security [here](https://github.com/bigscience-workshop/petals/wiki/Security,-privacy,-and-AI-safety).
## How does it work?

- Petals runs large language models like [LLaMA](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) and [BLOOM](https://huggingface.co/bigscience/bloom) **collaboratively** — you load a small part of the model, then team up with people serving the other parts to run inference or fine-tuning.
- Single-batch inference runs at up to 6 steps/sec for LLaMA 2 (70B) and ≈ 1 step/sec for BLOOM-176B. This is [up to 10x faster](https://github.com/bigscience-workshop/petals#benchmarks) than offloading, enough for [chatbots](http://chat.petals.ml) and other interactive apps. Parallel inference reaches hundreds of tokens/sec.
- Beyond classic language model APIs — you can employ any fine-tuning and sampling methods, execute custom paths through the model, or see its hidden states. You get the comforts of an API with the flexibility of PyTorch.
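
As one concrete illustration of the last point, here is a hedged sketch of Petals' interactive inference sessions, which keep the servers' attention caches alive across `generate()` calls so each turn only sends the new tokens (model names follow the example above; the exact API may differ between Petals versions):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "enoch/llama-65b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# One session spans several generate() calls; servers keep the attention
# caches, so each call transmits only the newly added tokens.
with model.inference_session(max_length=256) as session:
    for turn in ["A cat sat on", " It then looked at"]:
        inputs = tokenizer(turn, return_tensors="pt")["input_ids"]
        outputs = model.generate(inputs, max_new_tokens=5, session=session)
        print(tokenizer.decode(outputs[0]))
```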

Petals can be installed in an [Anaconda](https://www.anaconda.com/products/distribution) environment.
If you don't use Anaconda, you can install PyTorch in [any other way](https://pytorch.org/get-started/locally/). If you want to run models with 8-bit weights, please install PyTorch with CUDA 11.x or newer for compatibility with [bitsandbytes](https://github.com/timDettmers/bitsandbytes).
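
A quick, illustrative way to check whether your PyTorch build meets this requirement:

```python
import torch

print(torch.__version__)          # PyTorch version
print(torch.version.cuda)         # should be 11.x or newer for 8-bit weights
print(torch.cuda.is_available())  # True if your GPU is visible to PyTorch
```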