Unlock the Potential of AI
Deploy Your AI with Maximum Efficiency
Yetter, the GenAI API service: AI Optimization, Out of the Box
Meet 'Yetter': the generative AI API service built for speed, efficiency, and scalability. Powered by our optimized inference engine, it delivers reliable image and video services today, with LLM services to come, at a fraction of the cost.
Guided Decoding Performance on vLLM and SGLang
Your guide to LLM guided decoding! This deep-dive benchmark compares XGrammar and LLGuidance on vLLM and SGLang to help you find the best setup for generating structured output in your use case.
Disaggregated Inference on Apple Silicon: NPU Prefill and GPU Decode
In this article, we show how to run LLMs efficiently on Apple Silicon with disaggregated inference: prefill on the NPU and decode on the GPU.
Vocabulary Trimming: An Easy and Effective Method for SLM Acceleration
Trimming the large multilingual vocabulary of a Small Language Model (SLM) is a simple, low-risk way to boost efficiency. It significantly accelerates inference while leaving accuracy almost unchanged.