Have you ever thought if you have a private ChatGPT for enterprise and you ask the chatbot for specific content? Imagine a scenario where a simple query to your digital companion yields precise content or even documents brimming with sought-after information. The potential is, an on-point data retrieval application come true. However, as you traverse from the straightforward to the nuanced, the narrative takes a twist. When tasked with gauging departmental performance against a six-month plan, the AI seems to lose its footing. The situational comprehension required to provide a holistic view of performance metrics appears to elude our digital comrade.
This highlights a fundamental problem with large language models, or "LLMs" for short: they are rather knowledgeable about common, public domain knowledge, such as prime number theory. They are, however, in the dark about the specifics of proprietary or private information, such as how your team fared in the previous quarter. And let's face it, most enterprise-level workflows depend heavily on this kind of insider information. While having a model that is adept at analyzing the large ocean of publicly available data is endearing, in its unaltered state, it falls short for most enterprises.
I've had the good fortune to work with a variety of organizations over the past twelve months, delving into the use of Large Language Models (LLMs) for enterprise scenarios. This article clarifies important concepts and factors that should be on your radar if you're entering this field, along with some spicily insightful thoughts on how I see LLMs evolving over time and what that means for ML product plans. This guide is designed for product managers, creatives, coders, and other professionals who have little to no knowledge of the LLM mechanics but are curious enough to get the essence without becoming bogged down in the technical details.
Fundamental
RAG: Retrieval-Augmented Generation
At its core, guiding an LLM to navigate through proprietary data can be as straightforward as feeding the said data into the model's prompt and they'd nail it.
Navigating the real world, where documents range from thousands to millions, is tricky. Figuring out which bits of information belong in the context is the challenge here, especially when every word has a cost. This is where embeddings' magic comes into play. Imagine embeddings as a clever approach that turns language into numerical vectors, grouping related text into close-knit vectors in an N-dimensional space. This text may come from documents, websites, or perhaps a sizable amount of digital safe havens like Atlassian, OneDrive, Workspaces, or SharePoint. Now, whenever a user prompt knocks, we jazz it up into an embedded form, search through our text corpus, and fish out the relevant information.
If you need to find out more about how to build RAG, you can refer to my substack
Some starting point is
then how to use Llama’s index, how to use storage with LlamaIndex, choose the right embedding model and finally deploy in production
Fine-Tuning
The LLM need the whole context to accompany each rally it participates in, which is the Achilles' heel of embeddings. It seems that the LLM consistently fails to grasp even the most fundamental enterprise-centric ideas. The charges could quickly spiral out of control given that a sizable percentage of cloud-hosted LLM providers employ a per prompt token billing meter.
Now, this is when fine-tuning comes into play and changes the game. It enables an LLM to grasp enterprise-centric concepts without having to include them in each challenge. Imagine building a model from scratch. This genius already has a vast store of general knowledge stored in billions of learned knobs. The next step is to slightly turn these knobs in order to reflect particular business acumen while still clinging to the fundamentals of universal knowledge. And presto! When we send this well-tuned smarty pants racing, we may enjoy the luxury of business insights without spending any additional money.
Fine-tuning flirts more with the old spirit of machine learning, when creating models from scratch was the standard for ML squads, as opposed to the enigmatic black box scenario in embeddings or prompt engineering. Fine-tuning, however, is no simple task; it requires a filling meal of labelled observations from a training dataset. Furthermore, it is very particular about the quality and quantity of the training data. The configuration options, such as choosing the number of epochs and the learning rate, also plunge us into the deep end. More layers are added to the mix by managing these marathon training gigs and monitoring the model versions. Several core model providers offer us a lifeline by providing APIs that hide part of this convolution.
While inferences may be cheaper with fine-tuned models, that can be outweighed by costly training jobs.
Evaluation
A new set of difficulties arises when navigating the LLM environment, with assessing the quality of complicated outputs emerging as a key issue. Old-school ML teams have a variety of tried-and-true metrics at their disposal to assess the precision of simple outputs, like numerical forecasts or categorizations. However, LLMs are typically thrown into the deep end when the business ring rings, hammering up responses that range from a few to a boatload of words. Additionally, there are numerous ways to describe a subject once it requires more than ten words to express. Imagine having a "expert" response that has been verified by a human by your side. A model may be undervalued if its response is pitted against it using a strict exact string match test, which would diminish the model's response quality. The narrative intricacy of LLM outputs encourages us to think outside of the box when it comes to accuracy measurements and to dance to a beat that celebrates the complicated and verbose nature of the generated responses.
How about the malicious user-like hacker which is now prompt hacking? The type and length of the inputs and outputs the system is willing to accept could be restrained by a good old-fashioned starting point. When the field is open to external users, these safeguards transform into a non-negotiable shield, even though they fit snugly when you're playing ball with internal users. It's a future where input-output limitations, which protect the system from potential prompt-hacking chaos and keep the business machinery running smoothly, could be a stitch in time that saves nine.
Remember this story
Want to know more about the Evaluation process in RAG, you can find details in this article
What are the challenges?
Or perhaps concerns?