Article | WTW Research Network Newsletter

Reshaping the GenAI Landscape: Part 2 – LLM Effectiveness at Scale

By Anas Alfarra, Swetha Garimalla, Carlos Loarte, Crystal McKinney, Sonal Madhok and Omar Samhan | June 20, 2025

DeepSeek offers top-tier LLM performance at low cost, combining advanced reasoning capabilities and challenging big AI’s expensive models with efficient, open-source solutions.

DeepSeek: A Fading Trend or Trendsetter?

On January 20, 2025, DeepSeek, an AI lab backed by the Chinese hedge fund and quantitative firm High-Flyer, released its reasoning model DeepSeek-R1 as open source under an MIT license. In the technical report for the chatbot’s predecessor, model V3, DeepSeek says V3 was trained on only 2,048 Nvidia H800 GPUs over a two-month period, with pre-training, context extension, and post-training costs totaling less than $6 million.[1] By comparison, OpenAI’s ChatGPT utilized over 20,000 Nvidia A100 chips, with each chip costing between $10,000 and $15,000[2], while Meta’s Llama 3 required 24,000 of Nvidia’s H100 chips at a total cost of $72 million.[3] This efficiency, at a fraction of the usual training cost, has reopened a question that has plagued the industry since the widescale rollout of chatbots: how to maintain an innovation edge in advancing artificial general intelligence while minimizing the large carbon footprint of large language models (LLMs).

To help WTW better understand the forces influencing this evolving AI market, the WTW Research Network (WRN) partnered with the University of Pennsylvania’s Wharton School and its Mack Institute’s Collaborative Innovation Program (CIP). Building on our previous work with the CIP and their Executive MBA students, Green Algorithms – AI and Sustainability, the WRN has sought to further examine the LLM competitive landscape including new disruptions and opportunities for optimization and efficiency. Part 1 looks at GenAI’s impact on risk management frameworks while this piece examines DeepSeek’s initial impact on the LLM market. Part 3 rounds out the series, providing a look at The Future of Hardware Computing and examining the implications for the market going forward as the industry moves from the training to the inference phase.

As the AI revolution has picked up pace, Nvidia’s rise has been concomitant with the rollout of chatbots by hyperscaler companies such as Alphabet, Amazon, Meta, and Microsoft. OpenAI’s ChatGPT largely set the trend of widespread use of LLMs. An unavoidable reality of the industry, however, was the necessity of deploying tens of thousands of chips to power these LLMs, whose compute-intensive functions require elaborate data center cooling systems. DeepSeek-V3’s training costs and GPU usage have forced a rethinking of that business model.

LLM Benchmarks: An Intro

Presented as a Mixture-of-Experts (MoE) language model with 671 billion parameters, DeepSeek claims its R1 model performs better than OpenAI’s o1 on key benchmarks such as AIME, MATH-500, and SWE-bench Verified[4], all key metrics designed to evaluate language models’ capabilities in math, physics, reasoning, science, and realistic software engineering scenarios.

To support these claims, DeepSeek asserted that its benchmark results exceeded those of industry competitors. Such evaluations and benchmarks are now a staple of LLM development, used to determine the accuracy and relative standing of chatbots. Evaluations are often broken down into model evaluations versus system evaluations: model evaluations focus on the internal capabilities of the LLM, while system evaluations focus on the LLM’s performance within a larger application.[5]

One of the most widely used and comprehensive collections of benchmarks for LLMs is Hugging Face’s Big Benchmarks Collection, specifically its Open LLM Leaderboard.[6] Common evaluation metrics include precision, recall, exact match, and perplexity. An LLM benchmark consists of sample data, a testing regimen for that data, an evaluation metric, and scores based on the model’s outputs. AI researchers must constantly curate the data to avoid hallucinations while preserving the fidelity of the outputs; this is done through extensive training, re-training, and measurement of precision and recall. Running models trained on billions or trillions of parameters through the same benchmarks allows their designers to compare trustworthiness, accuracy, and recall with other LLMs, providing a standard method of comparison across chatbots. The sketch below illustrates two of these metrics.
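As a rough illustration, the following Python sketch computes two of the metrics named above, exact match and perplexity, on invented toy data; the question-answer pairs and token probabilities are hypothetical, not drawn from any real benchmark:

```python
import math

# Toy evaluation sketch: exact match over question-answer pairs, and
# perplexity from per-token probabilities. All data here is invented.

predictions = ["Paris", "4", "H2O"]   # hypothetical model outputs
references  = ["Paris", "4", "CO2"]   # hypothetical gold answers

# Exact match: fraction of predictions identical to the reference answer.
exact_match = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(f"Exact match: {exact_match:.2f}")   # 2 of 3 correct -> 0.67

# Perplexity: exponential of the average negative log-likelihood the model
# assigned to each correct next token. Lower is better.
token_probs = [0.4, 0.25, 0.9, 0.6]        # hypothetical per-token probabilities
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
print(f"Perplexity: {math.exp(nll):.2f}")
```

Real leaderboards apply the same logic at scale, across thousands of curated examples per task.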

AI Distillation, Reinforcement Learning, and Mixture of Experts: What are they?

DeepSeek’s impressive reasoning abilities at a fraction of the cost threw into doubt the prevailing notion that tech companies must commit massive capital expenditures to power language models for training and inference in their quest to advance artificial general intelligence.

In its technical report, DeepSeek attributed the model’s impressive performance to a number of factors, key amongst them reinforcement learning and a Mixture-of-Experts architecture. While some, such as Elon Musk and Dario Amodei of Anthropic, have cast doubt on aspects of DeepSeek’s cost structure and abilities, many believe that DeepSeek was able to achieve its remarkable benchmarks through “knowledge distillation.” This technique transfers the learnings of a larger pre-trained model (the teacher model) to a smaller model (the student model), enabling the student model to achieve efficient results at a smaller scale.[7]

An analogy, as Ali Ghodsi, CEO of data management company Databricks, puts it, is conducting a two-hour interview with Albert Einstein and coming away with his level of knowledge in physics.[8] While distillation has been used for years, DeepSeek’s advances have opened the door for start-ups, developers, and companies without large budgets to utilize distillation and access the capabilities of existing large language and foundation models such as OpenAI’s ChatGPT, Meta’s Llama, Alibaba’s Qwen, and others. This low cost of adaptation and refinement stands in stark contrast to hyperscalers such as Microsoft, which used GPT-4 to distill its small language models, Phi, after investing $14 billion in OpenAI.[9] A minimal sketch of the technique follows.
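For intuition only, here is a hypothetical distillation loop in the classic teacher-student style; the tiny linear “teacher” and “student”, the random data, and the hyperparameters are stand-ins for illustration, not DeepSeek’s actual recipe:

```python
import torch
import torch.nn.functional as F

# Minimal knowledge-distillation sketch (hypothetical models and data):
# the student is trained to match the teacher's softened output
# distribution rather than hard labels.

teacher = torch.nn.Linear(16, 8)   # stand-in for a large pre-trained model
student = torch.nn.Linear(16, 8)   # smaller model being distilled
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0  # temperature: softens the teacher's distribution

for step in range(100):
    x = torch.randn(32, 16)                  # dummy input batch
    with torch.no_grad():
        teacher_logits = teacher(x)          # teacher's predictions (frozen)
    student_logits = student(x)

    # KL divergence between softened teacher and student distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The key design choice is the temperature-softened teacher distribution, which carries richer signal about relative likelihoods than hard labels alone.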

DeepSeek’s use of reinforcement learning (RL) highlights an emerging avenue for building LLMs with advanced reasoning capabilities. DeepSeek improved its reasoning through chain-of-thought (CoT) prompting, which encourages the model to break problems down into smaller, step-by-step reasoning, particularly for subjects known to have definitive solutions. RL is a type of machine learning that learns by trial and error – in other words, reinforcement learning replicates the learning process that humans undertake to achieve their goals. Unlike supervised learning (prediction tasks) and unsupervised learning (uncovering patterns in unlabeled data), RL doesn’t explicitly tell a model what it should output. Instead, the model starts out behaving randomly and discovers desired behaviors by earning rewards for its actions, as the toy example below illustrates.
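The following toy example, a multi-armed bandit with invented reward probabilities, is far simpler than the RL used to train an LLM, but it shows the same core loop: random behavior gradually converging on whatever earns the most reward.

```python
import random

# Toy reinforcement-learning sketch (invented setup, not DeepSeek's
# pipeline): the agent starts by acting randomly and gradually learns
# to prefer the action with the highest average reward.

true_rewards = {"A": 0.2, "B": 0.8, "C": 0.5}   # hidden reward probabilities
estimates = {a: 0.0 for a in true_rewards}      # agent's reward estimates
counts = {a: 0 for a in true_rewards}
epsilon = 0.1                                   # exploration rate

for step in range(5000):
    if random.random() < epsilon:               # explore: act randomly
        action = random.choice(list(true_rewards))
    else:                                       # exploit: best estimate so far
        action = max(estimates, key=estimates.get)
    reward = 1.0 if random.random() < true_rewards[action] else 0.0
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # the estimate for "B" should approach 0.8
```

In LLM training the “actions” are generated responses and the reward might be, for example, a correctness check on a math answer, but the principle is the same.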

Another method DeepSeek employed is the Mixture-of-Experts (MoE) approach, whereby an AI model is divided into separate sub-networks (or “experts”), each specializing in a subset of the input data, that jointly perform a task. MoE architectures enable large-scale models, even those comprising many billions of parameters, to greatly reduce computation costs during pre-training and achieve faster performance at inference time.[10] This is attained via a gating network that delegates each input to the most relevant experts, allowing a more efficient use of the neural network. Because only a fraction of the experts are activated for any given input, an MoE model increases capacity relative to a comparable dense base model while using only its subnetwork of experts per input.[11] The architecture combines higher computational efficiency, reduced energy consumption, improved performance, more seamless scalability, and lower training and operating costs: it lessens compute requirements while simultaneously retaining model quality. A simplified sketch of such a layer appears below.
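As a simplified sketch of the routing idea (the layer sizes, top-2 routing, and gating here are illustrative, nowhere near the scale or sophistication of DeepSeek’s production architecture), consider:

```python
import torch
import torch.nn.functional as F

# Minimal Mixture-of-Experts layer sketch: a gating network routes each
# token to its top-2 experts, so only a fraction of the parameters run
# per input. Sizes and routing are illustrative only.

class TinyMoE(torch.nn.Module):
    def __init__(self, dim=32, n_experts=8, top_k=2):
        super().__init__()
        self.experts = torch.nn.ModuleList(
            torch.nn.Linear(dim, dim) for _ in range(n_experts)
        )
        self.gate = torch.nn.Linear(dim, n_experts)  # router
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, dim)
        scores = self.gate(x)                   # router score per expert
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):  # run each expert only on its tokens
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out

moe = TinyMoE()
tokens = torch.randn(4, 32)
print(moe(tokens).shape)  # torch.Size([4, 32])
```

The efficiency gain comes from the sparsity: with top-2 routing over eight experts, each token touches only a quarter of the expert parameters, while total model capacity grows with the number of experts.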

Conclusion

The release of DeepSeek-V3 and its successor R1 caused immediate ripples in the tech world due to the models’ open-source licensing, cost-effectiveness, and lighter computational requirements. DeepSeek accomplished this by shifting the prevailing paradigm from sheer hardware deployment to software-driven resource optimization. Due to the US government’s export control regime covering high-end AI chips, Nvidia was unable to export its most sophisticated chips, the A100 and H100, to Chinese customers. To comply with these restrictions, Nvidia developed pared-down versions of each chip, the A800 and H800, respectively, with reduced computing capabilities, which allowed the company to continue selling to Chinese cloud computing firms such as Alibaba and Tencent.[12] DeepSeek’s use of only 2,048 H800 chips was instrumental in the company employing its MoE and Multi-head Latent Attention (MLA) architectures for effective training and efficient inference.

While many of DeepSeek’s claims about server infrastructure and running costs have yet to be independently verified (due to trade and IP secrecy), its computational advances and the demonstrated quality of its outputs have reinforced the expectation that efficiency improvements will continue apace. Since most businesses do not need massive models to run their products, models such as V3 and R1 may prove more suitable: they are less expensive to create and require less data center infrastructure, providing a cost-effective alternative to larger, more expensive models.

References

  1. DeepSeek-V3 Technical Report.
  2. ChatGPT Will Command More Than 30,000 Nvidia GPUs: Report.
  3. Ten Gifts Nvidia Gave Its Investors.
  4. DeepSeek claims its ‘reasoning’ model beats OpenAI’s o1 on certain benchmarks.
  5. Understanding LLM Evaluation and Benchmarks: A Complete Guide.
  6. The Big Benchmarks Collection.
  7. What is knowledge distillation?
  8. The Wall Street Journal.
  9. AI companies race to use ‘distillation’ to produce cheaper models.
  10. What is mixture of experts?
  11. Applying Mixture of Experts in LLM Architectures.
  12. Nvidia tweaks flagship H100 chip for export to China as H800.

Authors


Anas Alfarra
MBA Student, The Wharton School, University of Pennsylvania, USA

Swetha Garimalla
MBA Student, The Wharton School, University of Pennsylvania, USA

Carlos Loarte
MBA Student, The Wharton School, University of Pennsylvania, USA

Crystal McKinney
MBA Student, The Wharton School, University of Pennsylvania, USA

Sonal Madhok
Technology Risks Analyst

Omar Samhan
Technology and People Risks Analyst