
The Real World Performance of Large Language Models

In the race to adopt AI, it’s easy to focus on what a Large Language Model (LLM) can do. But for your business, your users, and your bottom line, the question of how fast it can do it is just as critical. LLM performance is one of the most significant—and misunderstood—factors in a successful AI implementation.

Poor performance can frustrate users, cripple productivity, and turn a promising AI tool into an expensive bottleneck. This guide cuts through the noise to give you a practical understanding of what performance really means, with real-world benchmarks and a look at the trade-offs you need to consider.

Understanding Performance: What are Tokens per Second (tps)?

The primary metric for LLM performance is tokens per second (tps). A "token" is a unit of text, roughly equivalent to ¾ of a word. A higher tps means a faster, more fluid stream of text generation.
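To make the idea of tokens concrete, here is a short Python sketch using the tiktoken library. It assumes the cl100k_base encoding as a stand-in for whichever tokenizer your chosen model actually uses, so exact counts will vary slightly by model.

```python
# A minimal sketch: counting tokens in a piece of text and estimating how
# long it would take to generate at different speeds. Encoding is assumed.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Large Language Models generate their output one token at a time."
tokens = enc.encode(text)
print(f"{len(text.split())} words -> {len(tokens)} tokens")

# Back-of-envelope generation times for a 500-token answer:
for tps in (3, 30, 300):
    print(f"At {tps} tps, a 500-token answer takes ~{500 / tps:.0f} seconds")
```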

However, not all tps are created equal, and focusing solely on this metric can be misleading.

Latency vs. Throughput:   

  • Latency is the time it takes from submitting a prompt to receiving the first token of output (often called "time to first token" or TTFT). Low latency is crucial for chat and real-time applications, where users expect an immediate response 
  • Throughput is the number of tokens processed per second, especially when handling multiple requests or generating long outputs. High throughput is essential for batch processing, summarization of large documents, or serving many users in parallel

These two factors are often in tension: maximising throughput (by processing many requests in parallel) can increase latency for individual users, and vice versa. The right balance depends on your use case; the sketch below shows one way to measure both for a single request.
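As a rough illustration, here is a minimal Python sketch for measuring both metrics against a streaming endpoint. It assumes the OpenAI Python SDK and an OpenAI-compatible API; the model name is illustrative, and the token count is approximated from the word count rather than taken from the tokenizer.

```python
# A minimal sketch: measuring time-to-first-token (latency) and tokens/sec
# (throughput) for one streamed request. The model name is illustrative.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

start = time.perf_counter()
first_token_at = None
pieces = []

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarise the benefits of local LLMs."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        pieces.append(delta)

elapsed = time.perf_counter() - start
ttft = first_token_at - start
# Rough estimate: a word is ~1.33 tokens (the inverse of "a token is ~3/4 of a word").
approx_tokens = int(len("".join(pieces).split()) * 1.33)

print(f"Time to first token: {ttft:.2f}s")
print(f"Approx. throughput: {approx_tokens / elapsed:.1f} tps over {elapsed:.2f}s")
```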

What Do TPS Figures Actually Mean?

LLM providers often report tps numbers, but these can refer to different things:

  • API Performance: This is the raw speed at which a cloud service (like Azure AI, OpenAI API or Groq) can process requests. This can be extremely high—Groq, for example, reports over 1,200 tps for Llama 3 8B, and NVIDIA H100 servers can reach up to 19,000 tps in optimal conditions.  
  • User Interface (UI) Performance: The speed you experience in a web interface like the consumer version of ChatGPT. This is often deliberately slower and isn't a technical limitation; it’s a design choice to make the interaction feel more natural and readable, preventing a giant wall of text from appearing instantly.

 For most business applications that we build, we care about the raw API performance, as this determines the true speed of the underlying workflow. 

Real-World Benchmarks: What to Expect from Your Hardware

The promise of running LLMs locally for privacy is appealing, but performance hinges entirely on the hardware. Here’s a practical look at what different setups can realistically achieve when running a popular open-source model like Meta's Llama 3 (8B parameter version).

All figures below refer to Meta’s Llama 3 8B model, unless otherwise noted. 

| System Type | Output Tokens per Second (tps) | Time to Process 100KB File* | Time to Process 1MB File* | User Experience |
|---|---|---|---|---|
| Standard Office PC (CPU only) | 1–3 | 2–6 hours | 21–65 hours | Extremely slow; not suitable for interactive use. |
| Modern Small PC (Apple M4, fast desktop CPU) | 5–20 | 19–76 min | 3–13 hours | Noticeably slow, but usable for non-urgent, single-user tasks. Apple M-series chips are notably faster than typical CPUs. |
| PC with High-End Consumer GPU (e.g., RTX 4070) | 30–60 | 6–13 min | 1–2 hours | Fluid and responsive; suitable for productive local use. |
| On-Device (e.g., Samsung Galaxy S25, latest flagship phones) | 15–30 | 13–26 min | 2–4 hours | Surprisingly capable for mobile; speed is the main limitation, not content length. |
| Enterprise-Grade Server (NVIDIA H100, multi-GPU) | 100–1,000+ (can reach 19,000+ with advanced setup) | Seconds to 2 min | 4–20 min | Gold standard; can serve many users with both high throughput and low latency. |

*Assumes 100KB ≈ 22,755 tokens and 1MB ≈ 233,016 tokens.
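To show where these figures come from, here is a back-of-envelope sketch using the token estimates from the footnote above. It assumes processing time is simply tokens divided by tps, which ignores prompt-processing (prefill) speed, so treat the results as rough orders of magnitude.

```python
# A back-of-envelope sketch: deriving rough processing times from the
# footnote's token estimates (100KB ≈ 22,755 tokens; 1MB ≈ 233,016 tokens).
TOKENS_100KB = 22_755
TOKENS_1MB = 233_016

systems = {
    "Standard Office PC (CPU only)": (1, 3),
    "High-End Consumer GPU (e.g., RTX 4070)": (30, 60),
    "Enterprise-Grade Server (H100)": (100, 1_000),
}

def fmt(seconds: float) -> str:
    if seconds < 120:
        return f"{seconds:.0f} sec"
    if seconds < 7_200:
        return f"{seconds / 60:.0f} min"
    return f"{seconds / 3_600:.1f} hours"

for name, (low_tps, high_tps) in systems.items():
    print(f"{name}: 100KB in {fmt(TOKENS_100KB / high_tps)} to {fmt(TOKENS_100KB / low_tps)}, "
          f"1MB in {fmt(TOKENS_1MB / high_tps)} to {fmt(TOKENS_1MB / low_tps)}")
```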

Key Takeaways from Benchmarks

  • Apple M-series chips (e.g., M4) offer much better LLM performance than typical small form-factor PCs, thanks to their integrated Neural Engine. They should not be grouped with generic "NUCs" or low-end desktops. 
  • Mobile devices can process long content; the only limitation is how long the user is willing to wait, not any technical cutoff on content length. 
  • Enterprise servers achieve high throughput by rapidly switching between user requests, not by processing all users' tokens simultaneously. 
  • Processing large files: For example, a standard office PC would take 2–6 hours to process a 100KB file, while an enterprise server could process it in seconds to a couple of minutes. 

The Optimisation Trade-Off: Quantization

To improve performance on local hardware, models are often "quantized."

In simple terms, quantization reduces the precision of the mathematical calculations within the model, making it smaller and faster. It’s like saving a high-resolution photo as a lower-quality JPEG to reduce the file size.

  • The Impact: Quantization can dramatically increase tps, often doubling performance or more. 
  • The Cost: This speed comes at a real cost to accuracy. For many general tasks, the drop in quality is modest, but not negligible—typically in the 5–10% range for benchmark accuracy, and potentially more for complex reasoning or nuanced tasks. If the loss were negligible, quantized models would always be used in production, but for high-stakes applications, full-precision models are still preferred.

Choosing the right level of quantization is a technical decision that requires balancing your specific needs for speed versus accuracy; the sketch below shows what running a quantized model locally looks like in practice.
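As an illustration, here is a minimal sketch of running a 4-bit quantized Llama 3 8B model locally with the llama-cpp-python library. The model file path and quantization level (Q4_K_M) are assumptions for the example; any GGUF-format quantized model would work the same way.

```python
# A minimal sketch: running a quantized (GGUF) model locally via llama-cpp-python.
# The model path and Q4_K_M quantization level are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf",  # 4-bit quantized weights
    n_ctx=8192,        # context window in tokens
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

result = llm(
    "Summarise the trade-off of quantization in one sentence.",
    max_tokens=128,
)
print(result["choices"][0]["text"])
```

The same model at full precision would need far more memory and would run noticeably slower on the same consumer hardware, which is exactly the trade-off described above.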

The Road Ahead

The performance gap between local and cloud models will shrink but is unlikely to disappear. We expect to see rapid improvements in model efficiency and even more specialised hardware (like Apple's Neural Engine) designed to run these tasks.

However, for the foreseeable future, a fundamental trade-off remains: the ultimate performance and power of flagship commercial models will reside in the cloud. The key is to have an expert partner who can help you benchmark your needs and design a solution with the right performance profile for your users and your budget.
