To Deploy or Not to Deploy: LLMs in Healthcare and Beyond

July 7, 2023

Technical Experts — Shoptalk Series: Array Insights is fortunate to have a team of dedicated and passionate data scientists and engineers working to advance patient-centric AI research. Our ‘Technical Experts – Shoptalk Series’ is a place for our technical experts to share their thoughts about emerging technologies, interesting use-cases, and other topics that intersect with the mission of Array Insights, without necessarily translating all the nitty-gritty details and acronyms for a non-technical audience.

Author: Jaap Osternbroek, Lead Data Scientist at Array Insights

Large language models (LLMs) are the coolest-looking sneakers in the playground right now. Their potential for disrupting business is touted from every high tree and tall tower.

However, despite their buzz on the digital playground, it does not make sense for everyone to use them. Microsoft has cleverly launched a very-much-for-profit “safe” alternative, and other parties will soon follow suit. Some companies will forgo these options in order to keep more control over how and when they use LLMs, but there are several barriers folks should consider before running their own.

Since Array Insights is currently benchmarking our own deployments of LLMs, as well as the APIs available via Azure, we wanted to walk through some of the considerations behind our decision in the hope that other security-focused organizations might find it informative.

The Hurdles of LLM Implementation

On the most basic level, know that the first L in LLM – large – is no joke. Previous generations of large computer vision models ranged from roughly 8GB of GPU memory for YOLOv8 or Stable Diffusion to maybe ten times that for more ambitious implementations, and the vast majority of them fit comfortably on a single consumer-level GPU. To run the smallest of the current generation of language models you need at least 16GB of GPU memory; full-size ones will casually exceed 100GB of GPU RAM. GPT-4 is rumored to have around 1 trillion parameters, likely consuming in excess of a terabyte of VRAM.

Where would you get many dozens of gigabytes of GPU memory in this day and age? As a fully remote, security-focused company, we need to be extremely discerning about where we rent our silicon. On paper, Azure stocks the following GPUs on demand: M60 (8GB), V100 (16GB), T4 (16GB), A10 (24GB), and A100 (80GB) in various configurations. However, access to those cards is costly. Once you see what is being charged for them, suddenly that RTX 3090 your friendly neighborhood corner shop is stocking starts to look mighty appealing.

Even if you secure the necessary GPU memory at a reasonable price, the next step – configuration – can be daunting. Everything needs to be compatible, so installing your drivers, CUDA, Python, and PyTorch can be intimidating the first few times; pinning down a working combination of versions is a challenge even with the helpful compatibility lists on websites and forums. Depending on the hardware you are using, you might also have to compile your own builds of some of these packages against your driver version, though it is nice to know that this is not nearly as painful as it once was.
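Once everything is installed, a quick sanity check goes a long way before you start downloading weights. A minimal sketch, assuming a CUDA-enabled build of PyTorch is already in your environment:

```python
# Quick sanity check that the driver / CUDA / PyTorch stack lines up.
# Assumes a CUDA-enabled build of PyTorch is installed.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
print("CUDA build:     ", torch.version.cuda)

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")
```

If the CUDA build reported by PyTorch does not match what your driver supports, this is usually where you find out, rather than halfway through loading a model.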

Assuming you have traversed these fiery rings, there is one more obstacle to clear. Unless you managed to get your hands on one of those bigger A100 or A10 cards (or an equivalent multi-GPU configuration), models like LLaMA or Alpaca are still too big to fit in your GPU memory. The smallest editions are 13B parameters. A standard 32-bit float takes 4 bytes per parameter, which puts these models in the 52GB memory range. Luckily for us, these models can run at 16-bit floating-point precision, halving the footprint, but even 26GB is too big for our poor V100 card.
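The arithmetic behind those numbers is simple enough to put on the back of an envelope. A small sketch of the weights-only footprint of a 13B-parameter model at different precisions (ignoring activations, the KV cache, and other runtime overhead):

```python
# Back-of-the-envelope VRAM estimate for a 13B-parameter model.
# Counts the weights only; activations and KV cache add more on top.
params = 13e9

for label, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{label}: ~{gb:.1f} GB for the weights alone")
```

That prints roughly 52GB at full precision and 26GB at half precision – the figures above – and hints at why the lower-precision rows are so attractive.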

Quantization: An LLM Solution for Everyone? 

A solution to this problem is quantization: the process of reducing the precision of a neural network’s parameters in order to shrink and speed up the network overall. The process is more involved than just down-casting a bunch of floating-point numbers to their lower-precision peers; some of the math inside the network itself also has to be rewritten, and there are real benefits to training or fine-tuning the network in a quantization-aware way. Fortunately for us, most of the bootleg LLMs out there have 8-bit and even 4-bit quantization available off the shelf. How one would represent a floating-point number in 4 bits is a complete mystery to me, but once you have made it this far, the moment for asking questions such as these is probably over.
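In practice, picking up off-the-shelf quantization looks roughly like the following. This is a minimal sketch, assuming the Hugging Face transformers, accelerate, and bitsandbytes packages are installed; the model id is a placeholder, not a real repository.

```python
# Minimal sketch: loading a causal LM with off-the-shelf 4-bit quantization.
# Assumes transformers, accelerate, and bitsandbytes are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-13b-model"  # placeholder: whichever open LLM you have access to

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the weights in 4-bit precision
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                     # spread layers across available GPUs
)
```

With 4-bit weights, a 13B-parameter model drops to roughly 7GB of VRAM, which finally fits on that 16GB V100.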

After all of this, we finally arrive at that magic moment where we get to prompt our model. “Who builds the pyramid?” I asked, but the model was not quite sure. That leads me to the more fundamental question in this whole exercise: although running your own LLM is certainly feasible and gives you a way around ChatGPT’s ever-increasing alignment problems, it is foolish to expect the same quality of results. As it stands, OpenAI is probably losing money on every API call, even from its paying users.
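For completeness, the prompting step itself is only a few more lines once the model is loaded. A sketch continuing from the loading example above, under the same assumptions:

```python
# Prompting the quantized model loaded in the previous sketch.
prompt = "Who builds the pyramid?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```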

Does it make sense for companies whose core business is not AI to run these kinds of models? The answer can only be found by formulating clear objectives around the use of these models and measuring the quality of the results. For some, the added reliability and control of their own fine-tuned LLMs will be worth it; for the vast majority, it will not.

Array Insights and LLMs

For our part, we continue to benchmark both our own deployments of LLMs and the APIs available via Azure, and we have yet to make the call in regard to which direction we will take. That being said, LLMs are here to stay, and we see huge potential for their use in healthcare due to the abundance of unstructured datasets holding a wealth of key patient-benefitting insights. It will be interesting to see how companies large and small make their decisions over the coming years.

Follow us on LinkedIn to learn more about how Array Insights is using LLMs to increase accessibility and representation in health research.