Commits
42 commits
e29fac3
Add README for LF LLM demo
Deeksha-20-99 Sep 16, 2025
053b8d7
Adding work in progress code files for an llm example. Files: llm.py,…
Deeksha-20-99 Sep 16, 2025
473c81f
changed the file name of the file to be included in agent_llm.lf
Deeksha-20-99 Sep 16, 2025
46522a1
Added a quiz game. It is a game between two LLM models answering user…
Deeksha-20-99 Sep 19, 2025
9d9ee26
Updated the README.md for instructions to run the quiz game
Deeksha-20-99 Sep 19, 2025
fe1f605
Removing the older version of the file agent_llm.lf
Deeksha-20-99 Sep 19, 2025
b020664
Modified comments to the program
Deeksha-20-99 Sep 22, 2025
cc0a08a
created the files for quiz game between two llm models using main re…
Deeksha-20-99 Sep 23, 2025
632dc8e
Adding the git ignore file
Deeksha-20-99 Sep 23, 2025
6c8117d
Fixed the issue for the judge federate to receive the signal that mod…
Deeksha-20-99 Sep 25, 2025
2f1a884
Added the version of files for running on different devices
Deeksha-20-99 Sep 25, 2025
1958fbb
Adding a python script for llama 3.2 1B for jetson orin
Deeksha-20-99 Oct 9, 2025
60f642d
commented the code for testing
Deeksha-20-99 Oct 9, 2025
6a26cab
Testing Jetson
Deeksha-20-99 Oct 9, 2025
aef0ac9
Changed the file names in base class
Deeksha-20-99 Oct 9, 2025
c4c6353
Changed the RTI to jetson
Deeksha-20-99 Oct 9, 2025
9d503d5
corrected the ip for jetson orin
Deeksha-20-99 Oct 9, 2025
9a1730b
Add requirements.txt
hokeun Oct 14, 2025
ea20703
Move requirements.txt to top dir
hokeun Oct 14, 2025
e16438a
Adding the organized folders and README.md
Deeksha-20-99 Oct 15, 2025
cd83f0a
Updated the correct links for federated_execution and requirements in…
Deeksha-20-99 Oct 15, 2025
6b8c458
Updated the requirements.txt for README.md
Deeksha-20-99 Oct 15, 2025
abd32ed
changed the llm_b import statement
Deeksha-20-99 Oct 15, 2025
27d3561
Rename directories and remove unnecessary files
hokeun Oct 15, 2025
04f195a
Added more instruction on how to execute this demo README.md
Deeksha-20-99 Oct 16, 2025
15075fb
changed the path file names for the python files
Deeksha-20-99 Oct 16, 2025
105cecf
Added the images folder for README.md
Deeksha-20-99 Oct 16, 2025
35eefa9
Updated the image position on the README.md
Deeksha-20-99 Oct 16, 2025
5f3b61c
Revise README for LLM Demo overview and structure
hokeun Oct 16, 2025
66da8ce
corrected the spelling of environment README.md
Deeksha-20-99 Oct 16, 2025
67cf0bf
corrected the spelling README.md
Deeksha-20-99 Oct 16, 2025
18a8548
Changed the comments and removed the Hugging face token and it will b…
Deeksha-20-99 Oct 17, 2025
ec73fce
Updated the README.md for federated execution
Deeksha-20-99 Oct 17, 2025
03a1007
Corrected the path of the python files
Deeksha-20-99 Oct 17, 2025
050fe9f
Corrected the paths of the images in the README.md
Deeksha-20-99 Oct 17, 2025
8634b49
added the contributors name README.md
Deeksha-20-99 Oct 17, 2025
08f6ed6
Merge branch 'llm' of github.com:lf-lang/lf-demos into llm
Deeksha-20-99 Oct 17, 2025
2e73975
Removed torch and torchvision since they are dependent on the device
Deeksha-20-99 Oct 17, 2025
3ccb0f2
corrected few things on the README regarding the different reactors
Deeksha-20-99 Oct 17, 2025
ae28863
Updated the required python version in the README.md
Deeksha-20-99 Oct 17, 2025
b09a9c3
Added a command to check if requirements are installed README.md
Deeksha-20-99 Oct 17, 2025
042317f
added the common environment name README.md
Deeksha-20-99 Oct 17, 2025
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
llm/fed-gen/
llm/src-gen/
llm/include/
llm/bin
**__pycache__**
llm/=**
135 changes: 135 additions & 0 deletions llm/README.md
@@ -0,0 +1,135 @@

# LLM Demo Overview
This is a quiz-style game between two LLM agents. For each question the user types at the keyboard, both agents answer in parallel. The Judge announces whichever answer arrives first (or reports a timeout if neither responds within 60 seconds) and prints the elapsed logical and physical times for each question.
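The arbitration itself is done by the Judge reactor in Lingua Franca; the plain-Python sketch below (the names `run_quiz_round`, `agent_a`, and `agent_b` are illustrative and not part of the demo) captures the "first answer wins, otherwise time out" idea:

```
import concurrent.futures as cf

def run_quiz_round(question, agent_a, agent_b, timeout_s=60):
    """Return (winner, answer) for whichever agent answers first,
    or (None, None) if neither answers within timeout_s seconds."""
    pool = cf.ThreadPoolExecutor(max_workers=2)
    futures = {
        pool.submit(agent_a, question): "LLM-A",
        pool.submit(agent_b, question): "LLM-B",
    }
    # Block until the first answer arrives or the timeout expires.
    done, _ = cf.wait(futures, timeout=timeout_s, return_when=cf.FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the slower agent
    if not done:
        return None, None  # timeout: neither agent responded in time
    first = next(iter(done))
    return futures[first], first.result()
```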

# Directory Structure
- [federated](src/federated/) - Directory for federated versions of LLM demos.
- [agents](src/agents/) - Directory for Python files for various LLM agents.

# Prerequisites

You need Python >= 3.10 installed.

## Library Dependencies
The dependencies required to run this project are listed in the [requirements.txt](requirements.txt) file. The models used in this repository are quantized to 4-bit precision (bnb_4bit) and rely on bitsandbytes for efficient matrix operations and memory optimization, so specific versions of bitsandbytes, torch, and torchvision are required for compatibility.
While newer versions of other dependencies may work, the specific versions listed below have been tested and are recommended for optimal performance.
It is highly recommended to create a Python virtual environment or a Conda environment to manage dependencies. \
To create a virtual environment, follow the steps below.

### Step 1: Creating environment
```
python3 -m venv llm
source llm/bin/activate
```
To activate the environment again later, run `source llm/bin/activate`.

Alternatively, create and activate a Conda environment:
```
conda create -n llm
conda activate llm
```
### Step 2: Installing the required packages
Check if pip is installed:
```
pip --version
```
If it is missing or outdated:
```
python -m pip install --upgrade pip
```
Run this command to install the packages from the [requirements.txt](requirements.txt) file: \
**Note**: Since this demo uses LLMs with 7B and 70B parameters, a device with GPU support is recommended.
```
pip install -r requirements.txt
```
To check if all the requirements are installed, run:
```
pip list | grep -E "transformers|accelerate|tokenizers|bitsandbytes"
```
To install torch:

1. For devices without GPU
```
pip install torch torchvision
```
2. For devices with GPU
To check the CUDA version, run this command:
```
nvidia-smi
```
Look for the line "CUDA Version" as shown in the image: \
<img src="img/cudaversion.png" width="400" height="300">

With the correct version identified, install PyTorch from [PyTorch](https://pytorch.org/get-started/locally/) by selecting the correct OS and compute platform, as shown in the image below for a Linux system with CUDA version 12.8 (a verification snippet follows the image): \
<img src="img/pytorch.png" width="400" height="300">
### Step 3: Model Dependencies
- **Pre-trained models used in agents/llm.py**: [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) \
**Note:** Follow the steps below to obtain access and an authentication token for the Hugging Face models.
1. Create a user access token by following the steps in the official documentation: [User access tokens](https://huggingface.co/docs/hub/en/security-tokens)
2. Log in using the Hugging Face CLI by running `huggingface-cli login`. Refer to the official documentation for step-by-step instructions: [HuggingFace CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli). A Python alternative is sketched after this list.
3. If you are using the Llama models for the first time, you must request access. Open these links and apply for access to the models: [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf)
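If you prefer to authenticate from Python rather than the CLI, `huggingface_hub` provides an equivalent login helper (a minimal sketch; the `hf_...` token is a placeholder and should never be hard-coded in committed source):
```
from huggingface_hub import login

# Prompts for the access token interactively; alternatively, pass login(token="hf_...")
login()
```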

## System Requirements

For optimal performance, the following hardware and software setup was used. \
**Note:** To replicate this demo, you can use any equivalent hardware that meets the computational requirements.

### Hardware Requirements
The demo was tested with the following hardware setup.
- **GPU**: NVIDIA RTX A6000

### Software Requirements
- **OS**: Linux
- **Python**: >= 3.10
- **CUDA Version**: 12.8

Make sure the environment is properly configured to use CUDA for optimal GPU acceleration.

# Files and directories in this repository
- **`llm_base_class.lf`** - Contains the base reactors LlmA, LlmB, and Judge.
- **`llm_quiz_game.lf`** - Lingua Franca program that defines the quiz game reactors (LLM agent A, LLM agent B, and Judge); a quick sanity check of the underlying Python agents is sketched below.
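As a standalone sanity check, one of the Python agents in [agents](src/agents/) can be invoked directly (a hypothetical usage sketch; it assumes the Hugging Face access steps above are complete and that it is run from the `llm/src` directory on a CUDA-capable machine):

```
from agents.llm_a import agent1  # importing loads the 7B model, which may take a while

print(agent1("What is the capital of South Korea?"))
```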

# Execution Workflow

### Step 1: Compile the Lingua Franca program
Compile **`llm_quiz_game.lf`** using the Lingua Franca compiler (`lfc`).

**Note:**
- Ensure that you specify the correct file paths

Run the following command:

```
lfc src/llm_quiz_game.lf
```

### Step 2: Run the binary file and input the quiz question
Run the following command:

```
./bin/llm_quiz_game
```

The program will prompt you to enter a quiz question from the keyboard.

Example output printed on the terminal:

<pre>

--------------------------------------------------
---- System clock resolution: 1 nsec
---- Start execution on Fri Sep 19 10:46:31 2025 ---- plus 772215861 nanoseconds
Enter the quiz question
What is the capital of South Korea?
Query: What is the capital of South Korea?

waiting...

Winner: LLM-B | logical 1184 ms | physical 1184 ms
Answer: Seoul.
--------------------------------------------------

</pre>

# Contributors
- Deeksha Prahlad ([email protected]), Ph.D. student at Arizona State University
- Hokeun Kim ([email protected], https://hokeun.github.io/), Assistant Professor at Arizona State University
Binary file added llm/img/cudaversion.png
Binary file added llm/img/pytorch.png
5 changes: 5 additions & 0 deletions llm/requirements.txt
@@ -0,0 +1,5 @@
accelerate
transformers
tokenizers
bitsandbytes>=0.43.0

89 changes: 89 additions & 0 deletions llm/src/agents/llm.py
@@ -0,0 +1,89 @@
### Import Libraries
import transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from torch import cuda, bfloat16


### Models used as the two agents
model_id = "meta-llama/Llama-2-7b-chat-hf"
model_id_2 = "meta-llama/Llama-2-70b-chat-hf"

### Check whether a GPU is available and choose the compute dtype accordingly
has_cuda = torch.cuda.is_available()
dtype = torch.bfloat16 if has_cuda else torch.float32

### 4-bit quantization configuration
bnb_config = None
### If CUDA is available, load the models with 4-bit quantization
if has_cuda:
    try:
        import bitsandbytes as bnb
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=dtype,
        )
    except Exception:
        bnb_config = None

### Load the pre-trained tokenizers
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
tokenizer_2 = AutoTokenizer.from_pretrained(model_id_2, use_fast=True)
for tok in (tokenizer, tokenizer_2):
    if tok.pad_token_id is None:
        tok.pad_token = tok.eos_token

### Both models share the same device map and (if available) 4-bit quantization settings
common = dict(
    device_map="auto" if has_cuda else None,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)
if bnb_config is not None:
    common["quantization_config"] = bnb_config

### Load the pre-trained models
model = AutoModelForCausalLM.from_pretrained(model_id, **common)
model_2 = AutoModelForCausalLM.from_pretrained(model_id_2, **common)
model.eval(); model_2.eval()


### Generation arguments for both models
GEN_A = dict(max_new_tokens=24, do_sample=False, temperature=0.1,
             eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id)
GEN_B = dict(max_new_tokens=24, do_sample=False, temperature=0.1,
             eos_token_id=tokenizer_2.eos_token_id, pad_token_id=tokenizer_2.pad_token_id)

### Trim the generated text down to a one-line answer
def postprocess(text: str) -> str:
    t = text.strip()
    for sep in ["\n", ". ", " "]:
        idx = t.find(sep)
        if idx > 0:
            t = t[:idx]
            break
    return t.strip().strip(":").strip()

### agent1 is called from the .lf code
def agent1(q: str) -> str:
    prompt = f"You are a concise Q&A assistant.\n\n{q}\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    if has_cuda:
        inputs = {k: v.to("cuda") for k, v in inputs.items()}
    with torch.no_grad():
        out = model.generate(**inputs, **GEN_A)
    prompt_len = inputs["input_ids"].shape[1]
    result = tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True)
    return postprocess(result)

### agent2 is called from the .lf code
def agent2(q: str) -> str:
    prompt = f"You are a concise Q&A assistant.\n\n{q}\n"
    inputs = tokenizer_2(prompt, return_tensors="pt")
    if has_cuda:
        inputs = {k: v.to("cuda") for k, v in inputs.items()}
    with torch.no_grad():
        out = model_2.generate(**inputs, **GEN_B)
    prompt_len = inputs["input_ids"].shape[1]
    result = tokenizer_2.decode(out[0][prompt_len:], skip_special_tokens=True)
    return postprocess(result)
78 changes: 78 additions & 0 deletions llm/src/agents/llm_a.py
@@ -0,0 +1,78 @@
# llm_a.py

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

#Model
model_id = "meta-llama/Llama-2-7b-chat-hf"


has_cuda = torch.cuda.is_available()
if not has_cuda:
    raise RuntimeError("CUDA GPU required for this configuration.")
dtype = torch.bfloat16 if has_cuda else torch.float32

# 4-bit quantization
bnb_config = None
if has_cuda:
    try:
        import bitsandbytes as bnb
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_compute_dtype=dtype,
        )
    except Exception:
        bnb_config = None

# Tokenizer; the Hugging Face token is used automatically if logged in via the CLI
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token


common = dict(
    device_map="auto" if has_cuda else None,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
)

if bnb_config is not None:
    common["quantization_config"] = bnb_config

#model
model = AutoModelForCausalLM.from_pretrained(model_id, **common)
model.eval()

# Generation
GEN_A = dict(
    max_new_tokens=24,
    do_sample=False,
    temperature=0.1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id
)

# Post-processing: trim the generated text down to a one-line answer
def postprocess(text: str) -> str:
    t = text.strip()
    for sep in ["\n", ". ", " "]:
        idx = t.find(sep)
        if idx > 0:
            t = t[:idx]
            break
    return t.strip().strip(":").strip()

# Agent 1
def agent1(q: str) -> str:
    prompt = f"You are a concise Q&A assistant.\n\n{q}\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    if has_cuda:
        inputs = {k: v.to("cuda") for k, v in inputs.items()}
    with torch.no_grad():
        out = model.generate(**inputs, **GEN_A)
    prompt_len = inputs["input_ids"].shape[1]
    result = tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True)
    print(result)
    return postprocess(result)