This is what I'd do if I had to learn how to build LLM applications from scratch
What if I started from absolute zero, knowing what I know now? Where would I begin? How would I tackle each challenge?
Okay, so if this is your first time reading my blog, let's imagine for a second. You know those mind-blowing AI tools that can chat with you, write stories, and even help you finish your sentences? Those are powered by things called Large Language Models, or LLMs for short. It's like having a chat with a super-knowledgeable friend who's read, well, almost everything.
The thing is, these LLMs aren't just nifty tech novelties. They hold the promise to transform various sectors, be it healthcare, finance, education, and perhaps even poverty alleviation 🤔. Yet, for this transformation to happen, we desperately need more hands on deck: more folks who are adept at crafting and rolling out LLM applications.
And if you've read my Substack before, well, you already know.
Now, let’s get real for a second. Every day, I dive into the world of Large Language Models (LLMs) — learning, writing, and trying my hand at building one. And seriously, it’s been a rollercoaster. Just when I think I’ve got a grip on it, a technical glitch or some wild theory throws me for a loop. Countless times I’ve had to backtrack, pore over research, and experiment with proofs of concept just to move an inch forward. It’s like trying to assemble a jigsaw puzzle with pieces that keep changing shapes!
All this trial and error, the late nights spent debugging and theorizing, really got the gears in my head turning:
What if I started from absolute zero, knowing what I know now? Where would I begin? How would I tackle each challenge?
That's where this post comes in. It will cover the value of learning how to create your own LLM application and offer a path to becoming a large language model developer.
So, why should you consider becoming an LLM developer?
It's a super hot, sexy job with a higher salary in the future (I really hope so).
You will be able to create your own AI. Think about having your own Jarvis: you'll be another version of Iron Man.
You can enhance the overall performance of your current applications using LLMs.
It positions you ahead of the curve in your professional career.
It lets you play an active role in the evolving landscape of AI innovation.
“You got me at Jarvis and Iron Man, so I'm in. How do I learn it from scratch?”
Roadmap
A Quick Note on Our Roadmap’s Audience:
This roadmap is tailored specifically for those with a foundational footing in the tech world, be it as software engineers, data scientists, or data engineers. If you’re familiar with coding and the basics of software engineering, you’re in the right place! However, if you’re an absolute beginner just starting to dip your toes into the vast ocean of tech, this might be a bit advanced. I’d recommend gaining some basic knowledge first before diving into this roadmap.
Also, I know this is not everyone's cup of tea, and I'm not claiming to be an expert educator who can craft an entire course. My intention is to share the essential knowledge I believe is required to be an LLM developer.
Just like almost everything else, this roadmap is subject to change. I will keep it updated regularly with new materials and more content, especially as the technology changes every day.
Now, let’s get into it.
1. Introduction to the Foundations of LLMs
A quick overview of foundation models, then a closer look at transformers and attention, the architecture behind all modern LLMs.
Attention Is All You Need (the paper that gave birth to all modern LLMs)
2. Deep Dive: Transformers and Attention Mechanism
The attention mechanism lets LLMs weigh the crucial segments of the input while generating text. Incorporated within transformers, a type of neural network, it elevates performance in tasks like machine translation (by emphasizing essential parts of the source text), question answering (by concentrating on the vital elements of the question), and text summarization (by focusing on the main aspects of the text). A minimal code sketch follows the links below.
Attention mechanism and transformer models.
Transformer models (I guess the detail in #1 is enough)
Transformer networks: Tokenization, embedding, positional encoding, and transformer block
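To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation from the paper above. It's a toy version (no masking, no multiple heads), meant only to show the mechanics:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices of query, key, and value vectors."""
    d_k = Q.shape[-1]
    # Compare every query against every key; dividing by sqrt(d_k) keeps
    # the dot products from growing with the vector dimension.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    # Each output position is a weighted mix of the value vectors.
    return weights @ V, weights

# Toy example: 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, attn = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(attn.round(2))  # row i shows how much token i attends to every token
```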
That is a very quick glimpse, and I hope it provides enough theory for someone to review and learn a bit about the vast world of LLM. Of course, the more you know, the better. However, I don’t want to overwhelm you with excessive theory and a multitude of research papers. You already know that I always aim to build something tangible and drive towards a solution. If I believe there are theories that need to be included in the future, I will update this post.
3. Deep Dive: Embeddings
Embeddings encode words or phrases as vectors, enabling LLMs to grasp context. They're pivotal in LLM applications like machine translation, where they help interpret text across languages; question answering, to align queries with responses; and text summarization, to extract the key points for concise summaries. A short sketch follows the list below.
Traditional approach:
Semantic encoding approach:
Text embeddings and text similarity measures:
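As a small illustration of the last item, here is a sketch that embeds a few sentences and compares them with cosine similarity. It assumes the sentence-transformers library is installed, and the model name is just one common choice, not the only option:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # an assumed, commonly used model

sentences = [
    "How do I reset my password?",
    "I forgot my login credentials.",
    "The weather is nice today.",
]
emb = model.encode(sentences)  # one vector per sentence

def cosine(a, b):
    # Cosine similarity: 1.0 means identical direction, ~0 means unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb[0], emb[1]))  # related sentences score higher...
print(cosine(emb[0], emb[2]))  # ...than unrelated ones
```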
4. Deep Dive: Vector Database
Vector databases, designed to store data as vectors, streamline the way LLMs access and manage information. These databases play a pivotal role across various LLM applications. For instance, in machine learning, they house the foundational training data. In natural language processing, they're repositories for essential vocabulary and grammar rules. And for recommendation systems, they serve as reservoirs of users' specific product and service preferences. A toy retrieval sketch follows the lists below.
Overview:
Indexing techniques:
Product quantization (PQ), Locality-sensitive hashing (LSH), and Hierarchical navigable small world (HNSW)
Retrieval techniques:
Cosine similarity
Nearest neighbor search
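To show what retrieval boils down to, here is a toy brute-force nearest-neighbor search over random vectors using cosine similarity. A real vector database uses indexing techniques like PQ, LSH, or HNSW to approximate this full scan at scale:

```python
import numpy as np

# Toy "vector database": each row is the embedding of a stored document.
rng = np.random.default_rng(0)
db = rng.normal(size=(10_000, 384))
db /= np.linalg.norm(db, axis=1, keepdims=True)  # normalize once, at index time

def top_k(query, k=5):
    q = query / np.linalg.norm(query)
    scores = db @ q  # cosine similarity reduces to a dot product here
    idx = np.argpartition(-scores, k)[:k]  # the k best matches, unordered
    return idx[np.argsort(-scores[idx])]   # ...then sorted by score

print(top_k(rng.normal(size=384)))  # indices of the 5 nearest documents
```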
Bonus
5. Emerging Architectures in LLM Applications and Groundbreaking Use-Cases
Several innovative architectures, including Transformer-based models, graph neural networks, and Bayesian models, are shaping the future of LLM applications. These are being applied across domains like natural language processing, machine translation, and healthcare. For instance, Transformer models enhance machine translation accuracy, graph neural networks bolster fraud detection, and Bayesian models improve medical diagnosis precision.
Embeddings and vector database
Use-cases and PoC:
6. Deep Dive: Semantic Search (Optional)
Semantic search goes beyond keywords to understand query meaning and user intent, yielding more accurate results. Powered by LLMs, it's a game-changer for e-commerce, helping users find products without knowing their exact names; customer service, enabling faster and more precise responses; and research, streamlining the discovery of pertinent papers and datasets. A tiny lexical-vs-semantic comparison follows the lists below.
Basic concepts:
What is semantic search?
Distinguishing semantic search from lexical search
Semantic search using text embeddings
Advanced concepts:
Multilingual search
Limitations of embeddings and similarity in semantic search
Improving semantic search beyond embeddings and similarity
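Here is a tiny comparison of why semantic search wins when the query and the document share no keywords. It again assumes sentence-transformers, with an illustrative model choice:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # an assumed model choice

docs = [
    "Low-cost airline tickets to Europe",
    "Recipes for cheap weeknight dinners",
]
query = "inexpensive flights"

# Lexical search: token overlap scores both documents zero, since the
# query shares no words with either of them.
print([len(set(query.split()) & set(d.lower().split())) for d in docs])

# Semantic search: embeddings place "inexpensive flights" close to
# "low-cost airline tickets" despite the zero word overlap.
print(util.cos_sim(model.encode(query), model.encode(docs)))
```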
7. Prompt Engineering
Prompt engineering is the art of designing specific cues or prompts to steer LLMs toward generating text that aligns with a user's objectives. It plays a pivotal role across diverse LLM applications. In creative writing, it helps LLMs craft varied outputs, be it poems, scripts, code, or even emails. In machine translation, it sharpens the accuracy of translating between languages, and in question answering, it ensures LLMs pinpoint the most accurate and relevant response.
Prompting by instruction
Prompting by example
Prompting for creative vs. predictive output
Template prompting (a small sketch follows this list):
- Summarizing
- Inferring
- Transforming
- Expanding
- Generating a product with a pitch
- Simplifying concept
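Here is a minimal sketch of what template prompting looks like in practice: one reusable skeleton per task, with the input text slotted in. The templates themselves are illustrative, not canonical:

```python
# Illustrative templates; the wording is a starting point, not a standard.
TEMPLATES = {
    "summarizing":  "Summarize the text between <text> tags in one sentence.\n<text>{text}</text>",
    "inferring":    "What is the sentiment of the review between <text> tags?\n<text>{text}</text>",
    "transforming": "Translate the text between <text> tags into French.\n<text>{text}</text>",
    "expanding":    "Write a polite customer-service reply to the review between <text> tags.\n<text>{text}</text>",
}

def build_prompt(task: str, text: str) -> str:
    return TEMPLATES[task].format(text=text)

prompt = build_prompt("summarizing", "The phone has a great camera but poor battery life.")
print(prompt)  # this string is what you send to the model, via whichever API you use
```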
You can find a short free course here:
8. Fine-tuning of foundation models
Foundation models stand as a cornerstone in the realm of large language models, having been meticulously pre-trained on vast and diverse datasets. These comprehensive models, once set, aren't static; they can be further refined and tailored to excel in specialized tasks. This customization is achieved through a process called fine-tuning, wherein the model's parameters are subtly adjusted to enhance its performance. This technique is especially crucial when we aim to harness the prowess of LLMs for diverse challenges, whether that's ensuring accurate translation between niche languages, responding to intricate queries on specialized topics, or crafting text summaries that carry a distinct stylistic signature.
Rationale for fine-tuning
Limitations of fine-tuning
Parameter-efficient fine-tuning
I use LlamaIndex’s approach for fine-tuning
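As one concrete illustration of parameter-efficient fine-tuning, here is a minimal LoRA sketch using Hugging Face's peft library. The base model and hyperparameters below are placeholder assumptions, not recommendations:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model

config = LoraConfig(
    r=8,             # rank of the low-rank update matrices
    lora_alpha=16,   # scaling factor applied to the update
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)

# Only the small LoRA adapters are trainable; the base weights stay frozen,
# which is the whole point of parameter-efficient fine-tuning.
model.print_trainable_parameters()
```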
9. Basics of LlamaIndex or Langchain
I personally prefer LlamaIndex over Langchain for anything text retrieval-related.
Official Document:
LlamaIndex: https://gpt-index.readthedocs.io/en/stable/
Langchain: https://docs.langchain.com/docs/
My Substack, where I cover a lot of the latest LLM developments
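For a first taste of LlamaIndex, here is the quickstart pattern from the docs linked above. The library's API changes quickly, so treat this as a sketch and check the current docs; it also assumes an OpenAI API key in your environment:

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # ./data holds your files
index = VectorStoreIndex.from_documents(documents)     # chunk, embed, and index them

query_engine = index.as_query_engine()
response = query_engine.query("What does the document say about refunds?")
print(response)
```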
10. Autonomous Agents
Autonomous agents represent a class of software programs designed to operate independently with a clear goal in mind. With the integration of Large Language Models (LLMs), these agents can be supercharged to handle an array of tasks more efficiently. From streamlining customer service interactions and detecting fraudulent activities with heightened accuracy to aiding in precise medical diagnoses, the union of LLMs and autonomous agents is ushering in a new era of automated capabilities.
I'm biased here, as I lean more toward LlamaIndex for implementation.
Agents and tools
Agent types:
- Conversational Agents
- OpenAI functions agents
- ReAct agents
- Plan and execute agents
LlamaIndex’s agent: https://gpt-index.readthedocs.io/en/latest/end_to_end_tutorials/agents.html
Langchain’s agent: https://docs.langchain.com/docs/components/agents/
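To make the ReAct idea concrete, here is a minimal agent sketch mirroring LangChain's quickstart from the docs linked above. The API has been evolving, so treat it as a snapshot; it assumes an OpenAI API key in your environment:

```python
from langchain.llms import OpenAI
from langchain.agents import AgentType, initialize_agent, load_tools

llm = OpenAI(temperature=0)
tools = load_tools(["llm-math"], llm=llm)  # a calculator tool the agent may call

agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,  # ReAct: reason, act, observe, repeat
    verbose=True,  # print the thought/action/observation loop as it runs
)
agent.run("What is 15% of 240?")
```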
11. LLM Ops
We still haven't figured out the full job description for MLOps, and now we have another one: LLM Ops.
What is LLM Ops (something to start with)
Recommended PoC or small projects
There are several recommended projects for developers who are interested in learning more about LLMs. These projects include:
PDF chatbots: LLMs can be used to create chatbots that hold natural conversations grounded in your own documents. A typical proof of concept loads a PDF, embeds its chunks into a vector database, retrieves the relevant passages for each question, and lets the LLM answer from them. This is useful for customer service, education, and internal knowledge bases; the LlamaIndex sketch in section 9 is essentially this project pointed at a folder of PDFs.
Code generation: LLMs can be used to generate code, such as Python scripts and Java classes. This can be used for a variety of purposes, such as software development and automation. For example, the GitHub Copilot tool uses LLMs to help developers write code more quickly and easily.
Creative writing: LLMs can be used to generate creative text formats, such as poems, scripts, musical pieces, emails, and letters. This can serve a variety of purposes, such as entertainment, education, and marketing; Google's Bard, for example, can produce all of these formats. Think about QuillBot and how you would replicate that site.
Summary
In my opinion, the materials in this blog will keep you engaged for a while, covering the basic theory behind LLM technology and the development of LLM applications. However, for those with a curious mind who wish to delve deeper into theory or practical aspects, this might not be sufficient. I recommend using this blog as a starting point and broadening your understanding through extensive self-research. I will regularly update this post with more content.
If you have any questions or believe I’ve overlooked an essential topic that should be included in the roadmap, please leave a comment or connect with me on LinkedIn. I’m eager to receive feedback and insights from everyone. My goal is to refine this roadmap, ensuring it serves as a reliable starting point for aspiring LLM developers.
❤️ If you found this post helpful, I'd greatly appreciate your support by giving it a heart. It not only means a lot to me, but it also indicates that my work is resonating with readers. Additionally, if you'd like to further support my efforts, you can consider a small contribution through a paid subscription.
See you in the next posts.