[Misc] GPTQ Activation Ordering #8135
Conversation
LGTM. Two open questions + can you confirm the act order models worked fine for 2 and 4 GPUs?
/ready
Confirmed that group activation ordering works with tp=2,4
I might be mistaken, but the naming feels a bit confusing. I think we should try to create a better separation of concerns, i.e. a better separation between how the model was quantized and how the model needs to be run for inference; otherwise it's just more things/complexity kernel authors need to familiarize themselves with, only to realize (in this case) it has no impact.
I agree that there are benefits to separating "actorder" and "quantorder" as two orthogonal arguments, namely that vLLM would only have to check for the "actorder=True" case. I think one large downside to separating "actorder" from "quant order" is that we're essentially redefining "actorder" as "activation ordering groups". This means that an llm-compressor user might turn on "actorder", expecting it to also do quantization ordering like it does in GPTQ, then get a model that has additional latency but no accuracy gain. This is confusing for users, not only because they have to redefine the concept of actorder, but because there's a potential pitfall case now added. There probably exist better names to explain the ideas of "non-sequential grouping" and "quantization ordering", but folding these cases into the "actorder" argument allows us to leverage users' existing understanding of activation ordering.
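For concreteness, the distinction under discussion can be sketched as follows. This is illustrative numpy only, not vLLM or compressed-tensors code; the group size and activation scores are made up:

```python
# Conceptual sketch (illustrative only) of how "group" and "weight"
# activation ordering differ at inference time.
import numpy as np

num_cols, group_size = 8, 4
act_magnitudes = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6])

# Quantize columns in order of decreasing activation magnitude.
perm = np.argsort(-act_magnitudes)

# actorder="group": group membership follows activation order, so groups
# are non-contiguous in the original column order and the kernel needs g_idx.
g_idx_group = np.empty(num_cols, dtype=int)
g_idx_group[perm] = np.arange(num_cols) // group_size

# actorder="weight": columns are still *quantized* in activation order, but
# group membership keeps the original contiguous layout, so no g_idx (and no
# inference-time reordering) is needed.
g_idx_weight = np.arange(num_cols) // group_size

print(g_idx_group)   # [1 0 1 0 1 0 1 0] -- non-contiguous groups
print(g_idx_weight)  # [0 0 0 0 1 1 1 1] -- contiguous groups
```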
In a follow-up PR, we could keep […]
I think the goal here would be to make it so vLLM doesn't have to maintain an enum of […]. In offline conversations with @kylesayrs, it seems like a good alternative solution would be to have a […]. This way vLLM can completely ignore […]; ideally, though, we wouldn't have compressed-tensors checkpoints in the wild with […].
@LucasWilkinson A decision has been made that we're not going to support the more generalized checkpoint config in this PR, and will instead continue with a list which specifies which activation orderings come with non-contiguous groupings. I would characterize the problem as a difference in requirements between recipe configs and checkpoint configs, as well as a problem of coupling between llm-compressor and vllm-CT. While it is a problem that will likely come up again, it's considered out of scope for this PR.

There are some options to make the checkpoint config more extensible/decoupled from llm-compressor in the future. The two that I have proposed are (1) separating the recipe/compression config from the checkpoint config or (2) adding an […] field. Option (1) would likely involve some legacy support for backwards compatibility. Option (2) would be backwards compatible. Option (3) is to maintain a coupling between llm-compressor and vllm-CT. Hopefully these options are palatable enough to defer to a separate PR.
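For concreteness, option (1) might serialize something like the following. This is a rough sketch; every field name below is hypothetical, not an agreed-upon schema:

```python
# Hypothetical sketch of option (1): splitting the serialized config into a
# checkpoint section (what vLLM must understand to run the model) and a
# recipe section (how the model was compressed, which serving can ignore).
# All field names are illustrative only.
quantization_config = {
    # Checkpoint config: consumed by vLLM / compressed-tensors at load time.
    "checkpoint": {
        "num_bits": 4,
        "strategy": "group",
        "group_size": 128,
        "has_g_idx": True,  # the only fact the kernel actually needs
    },
    # Recipe config: provenance for accuracy debugging and record keeping;
    # vLLM can pass this through untouched.
    "recipe": {
        "algorithm": "GPTQ",
        "actorder": "group",
        "calibration_samples": 156,
    },
}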
Cool, ya, was just wanting to make my voice heard that I think we should be striving for separation of concerns in our software design, especially across repos, to avoid versioning headaches in the future (and having to coordinate PRs across many repos). I'm not a huge fan of option 2, as it is just a continuation of conflating activation ordering and group assignment (i.e. mixing something that is fairly GPTQ-specific with something that could be viewed as a more general group quantization thing). Option 1 seems like what we should be striving for, as it will give us more flexibility going forward, even outside of this specific actorder case.

I'm not sure what you mean by "checkpoint config", but I assume you mean the fields validated by compressed-tensors. I would imagine for option 1 you would still want to save recipe information for accuracy debugging and general record keeping, but stored in a separate structure or using keys that vLLM and compressed-tensors can just ignore / pass through.
```diff
@@ -232,7 +232,8 @@ def _get_scheme_from_parts(
         return CompressedTensorsWNA16(
             num_bits=weight_quant.num_bits,
             strategy=weight_quant.strategy,
-            group_size=weight_quant.group_size)
+            group_size=weight_quant.group_size,
+            actorder=weight_quant.actorder)
```
Could we just add the condition here?
`actorder=weight_quant.actorder == ActivationOrdering.GROUP`
We could, but it feels more logical to me to keep argument processing within the `CompressedTensorsWNA16.__init__` function. This separates responsibilities and makes clear that the job of `_get_scheme_from_parts` is to decide which compression scheme applies, not to process the arguments once the scheme is decided.
We'd also have to rename the `actorder` argument of `CompressedTensorsWNA16.__init__`, otherwise it would be a misnomer.
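A minimal sketch of the shape being argued for here (simplified; the real `CompressedTensorsWNA16.__init__` takes additional arguments and does more work):

```python
# Simplified sketch of keeping actorder interpretation inside the scheme's
# __init__ rather than in _get_scheme_from_parts.
from typing import Optional

from compressed_tensors.quantization import ActivationOrdering


class CompressedTensorsWNA16:

    def __init__(self,
                 strategy: str,
                 num_bits: int,
                 group_size: Optional[int] = None,
                 actorder: Optional[ActivationOrdering] = None):
        # _get_scheme_from_parts only decides *which* scheme applies and
        # forwards the raw QuantizationArgs fields; interpreting them
        # (e.g. "does this ordering require g_idx?") happens here.
        self.has_g_idx = actorder == ActivationOrdering.GROUP
```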
@LucasWilkinson A recipe config means fields that are important to compression algorithms; a checkpoint config means fields relevant to vLLM and serving. I agree that checkpoint configs should have some info about how they were compressed, but those fields wouldn't be a requirement.
Thanks for the careful work and discussion! This is good to land for now.
Signed-off-by: Alvant <alvasian@yandex.ru>
Signed-off-by: Amit Garg <mitgarg17495@gmail.com>
Signed-off-by: LeiWang1999 <leiwang1999@outlook.com>
Activation Ordering

Changes
- Extended `QuantizationArgs` to support the `actorder` argument
- Added a `weight_g_idx` parameter loader which defaults to all `-1`s
- If `weight_g_idx` is loaded with valid values, then the parameter is passed to the kernel
- If `weight_g_idx` is not loaded, then no column reordering is performed (a condensed sketch follows this list)
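The sketch below condenses that loading behavior. It is simplified and approximate, not the actual scheme code; the function names are illustrative:

```python
# Condensed sketch (approximate, not the actual scheme code) of the
# weight_g_idx handling described in the list above.
from typing import Optional

import torch


def create_g_idx_param(input_size: int) -> torch.Tensor:
    # The loader defaults weight_g_idx to all -1s, a sentinel meaning
    # "no activation reordering was saved in the checkpoint".
    return torch.full((input_size,), -1, dtype=torch.int32)


def g_idx_for_kernel(weight_g_idx: torch.Tensor) -> Optional[torch.Tensor]:
    # If valid (non-negative) values were loaded from the checkpoint, the
    # permutation is passed to the kernel; otherwise no column reordering
    # is performed.
    if bool(torch.any(weight_g_idx >= 0)):
        return weight_g_idx
    return None
```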
Testing
Inference Script: `infer_actorder.py`
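The script body is collapsed in the PR; a minimal sketch of what such a script might look like, assuming the standard vLLM offline API (the model path is a placeholder):

```python
# Minimal sketch of an actorder inference script; the actual
# infer_actorder.py is attached to the PR. The model path is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/llama-3-8b-w4a16-actorder-group")  # placeholder
sampling = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["The future of AI is"], sampling)
print(outputs[0].outputs[0].text)
```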
Actorder=Group Evaluation
These results have discrepancies not related to activation ordering; more precise results will be posted at a later date.
Actorder=Weight Evaluation
Accuracy
Accuracy evaluations were performed using compressed Meta-Llama-3-8B-Instruct as a base model. For reference, Meta-Llama-3-8B-Instruct-quantized.w4a16 was compressed using AutoGPTQ with `desc_act=True` and achieved `72.25%` on GSM-8K (5-shot, strict-match). The following models were quantized using `llm-compressor` with the same quantization configuration but `156` calibration samples as opposed to `256`.
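An evaluation along these lines could be reproduced with lm-evaluation-harness; a sketch using its Python API with the vLLM backend (the model path is a placeholder, and exact arguments may vary by version):

```python
# Sketch of a GSM-8K (5-shot, strict-match) evaluation via
# lm-evaluation-harness with the vLLM backend; the model path is a
# placeholder, not a checkpoint from this PR.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="vllm",
    model_args="pretrained=/path/to/quantized-model",  # placeholder path
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```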
Group Activation Ordering
Weight Activation Ordering
Latency
Group Activation Ordering
Weight Activation Ordering