Closed
Description
GPTQ models work with exllama v1.
$ python test_inference.py -m ~/models/Synthia-13B-exl2 -p "Once upon a time,"
Successfully preprocessed all matching files.
-- Model: /home/llama/models/Synthia-13B-exl2
-- Options: ['rope_scale 1.0', 'rope_alpha 1.0']
-- Loading model...
-- Loading tokenizer...
-- Warmup...
-- Generating (greedy sampling)...
Once upon a time,ttt...............................................................................................................tttttttttttttt
Prompt processed in 0.10 seconds, 5 tokens, 51.99 tokens/second
Response generated in 3.96 seconds, 128 tokens, 32.29 tokens/second
$ python examples/inference.py
Successfully preprocessed all matching files.
Loading model: /home/llama/models/Synthia-13B-GPTQ/
Our story begins in the Scottish town of Auchtermuchty, where onceu at on/'s
m .'. p the. .tth from and and at f. bet1 hn
: a4. [[t and in thet cd'
research (Ft-t and e
\({\f 701 346
s w56782 91, ,· The08 " 710 and...6 1501020s 29
@a70'27,[
// 052
¡204; The
%
4 this
{5 it is just the s by some .
Response generated in 3.94 seconds, 150 tokens, 38.09 tokens/second
$ python examples/inference.py
Successfully preprocessed all matching files.
Loading model: /home/llama/models/Synthia-13B-exl2/
Our story begins in the Scottish town of Auchtermuchty, where onceo andt\\una
2t anddAt t.th[t'ms
<,-d... , and03.0. - ./,:
|m ont1. t605 thet7.th1 fy s to repv ag
.... The (p8628th.{{ 2l5-e.Zygt1t94hs0m.
| 57- f-n3, [[.[^-667. t8 and*1
Zyg7. | 3675, [[rF0th
Response generated in 5.25 seconds, 150 tokens, 28.59 tokens/second
GPU: AMD Instinct MI50
Name in OS: AMD ATI Radeon VII
Arch: gfx906
$ rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
...
*******
Agent 2
*******
Name: gfx906
Uuid: GPU-6f9a60e1732c7315
Marketing Name: AMD Radeon VII
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 26287(0x66af)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1801
BDFID: 1280
Internal Node ID: 1
Compute Unit: 60
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16760832(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
pytorch-lightning 1.9.4
pytorch-triton-rocm 2.1.0+34f8189eae
torch 2.2.0.dev20230912+rocm5.6
torchaudio 2.2.0.dev20230912+rocm5.6
torchdiffeq 0.2.3
torchmetrics 1.1.2
torchsde 0.2.5
torchvision 0.17.0.dev20230912+rocm5.6