Pruna AI is a European startup focusing on AI model compression, making its optimization framework open-source.
The startup has built a framework that applies several efficiency methods to a given AI model, including caching, pruning, quantization, and distillation.
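To illustrate one of these methods in isolation, here is a minimal quantization sketch using PyTorch's built-in dynamic quantization. This is not Pruna's API, and the toy model sizes are assumptions; it only shows the kind of transformation such a framework bundles.

```python
import torch
import torch.nn as nn

# A toy model standing in for a real network.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# Convert Linear weights to int8; activations are quantized on the fly.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
out = quantized(x)  # smaller weights, same interface as the original model
```

The quantized model is a drop-in replacement: same inputs and outputs, but the weight matrices are stored in 8 bits instead of 32.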
According to the CTO and Pruna AI co-founder, John Rachwan, “We also standardize saving and loading the compressed models, applying combinations of these compression methods, and evaluating your compressed model after you compress it.”
Pruna AI’s framework can also evaluate whether compression causes significant quality loss, along with the performance gains you get in return.
“If I were to use a metaphor, we are similar to how Hugging Face standardized transformers and diffusers—how to call them, save them, load them, etc. We are doing the same, but for efficiency methods,” Rachwan added.
Big AI labs have already been using various AI model compression methods. For instance, OpenAI has relied on distillation to create faster versions of its flagship models.
This is likely how OpenAI developed GPT-4 Turbo, a faster version of GPT-4. Similarly, the Flux.1-schnell image generation model is a distilled version of the Flux.1 model from Black Forest Labs.
Distillation is a technique for extracting knowledge from a large AI model using a “teacher-student” setup. Developers send requests to a teacher model and record its outputs. These answers are sometimes compared against a dataset to check their accuracy. The recorded outputs are then used to train the student model, which learns to approximate the teacher’s behavior.
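The teacher-student loop described above can be sketched in a few lines of PyTorch. Everything here is illustrative: the tiny models, the temperature, and the fixed batch are assumptions, not anyone's production recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Large "teacher" and small "student" stand-ins.
teacher = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
student = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4))

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
T = 2.0  # temperature softens the teacher's output distribution

x = torch.randn(32, 16)  # stand-in for real requests/inputs
with torch.no_grad():
    teacher_logits = teacher(x)  # record the teacher's outputs once

losses = []
for _ in range(100):
    student_logits = student(x)
    # Train the student to match the teacher's soft output distribution.
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

The student ends up much smaller than the teacher while mimicking its outputs, which is the trade that makes distilled models like GPT-4 Turbo or Flux.1-schnell faster to serve.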
“For big companies, what they usually do is build this stuff in-house. And what you can find in the open source world is usually based on single methods. For example, let’s say one quantization method for LLMs, or one caching method for diffusion models,” Rachwan said. “But you cannot find a tool that aggregates all of them, makes them all easy to use, and combines them. And this is the big value that Pruna is bringing right now.”
Pruna AI’s framework supports any kind of model, from large language models to diffusion models, speech-to-text models, and computer vision models. The company also focuses more specifically on image and video generation models.
Some of Pruna AI’s existing users reportedly include Scenario and PhotoRoom. In addition to the open-source edition, Pruna AI has an enterprise offering with advanced optimization features, including an optimization agent.
“The most exciting feature we will release soon will be an AI model compression agent,” Rachwan said. “You give it your model; you say: ‘I want more speed, but don’t drop my accuracy by more than 2%.’ And then the agent will do its magic. It will find the best combination for you and return it for you. You don’t have to do anything as a developer.”
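The agent's goal, as described, amounts to a constrained search: among candidate compression configurations, pick the fastest one whose accuracy drop stays within the user's budget. The toy sketch below illustrates that selection logic only; the candidate configurations and their numbers are made up for illustration and have nothing to do with Pruna's internals.

```python
# Hypothetical candidates: each pairs a method combination with its
# measured speedup and accuracy drop (all values invented here).
candidates = [
    {"methods": ["quantization"],                 "speedup": 2.1, "accuracy_drop": 0.8},
    {"methods": ["quantization", "pruning"],      "speedup": 3.4, "accuracy_drop": 1.9},
    {"methods": ["quantization", "distillation"], "speedup": 5.0, "accuracy_drop": 3.5},
]

def pick_best(candidates, max_drop=2.0):
    """Return the fastest configuration within the accuracy budget."""
    allowed = [c for c in candidates if c["accuracy_drop"] <= max_drop]
    return max(allowed, key=lambda c: c["speedup"]) if allowed else None

best = pick_best(candidates, max_drop=2.0)
print(best["methods"])  # ['quantization', 'pruning']
```

With a 2% budget, the 5.0x configuration is rejected for dropping 3.5% accuracy, and the search settles on the 3.4x combination, which is the kind of trade-off the quoted agent would automate.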
Pruna AI charges by the hour for its pro version. “It’s similar to how you would think of a GPU when you rent a GPU on AWS or any cloud service,” Rachwan said.
Pruna AI has made a Llama model eight times smaller using its compression framework without significant quality loss. Pruna AI hopes its customers will consider its compression framework an investment that pays for itself: if your model is a critical part of your AI infrastructure, you’ll save a lot of money on inference with the optimized version.
Pruna AI raised $6.5 million in a seed funding round a few months ago. Investors in the startup include EQT Ventures, Daphni, Motier Ventures, and Kima Ventures.