2024 is officially the year of Generative AI, and in a way it has felt like living through a Renaissance. No one can deny the rapid progress made in the space. You can’t attend a single panel discussion without Gen AI or Large Language Models (LLMs) coming up. But one question I keep hearing is: “What is reality, and what is just hype?”
There is an appropriate level of skepticism from many business leaders out there. How can LLMs possibly live up to the hype? How do you make a practical business application out of it? Moreover, what are the risks in going down this road, and how can you avoid oversights that others have made?
This post is not meant to be an in-depth or exhaustive discussion of the sub-topics around Generative AI. Instead, it acknowledges the progress made in the space, discusses the surrounding hype while addressing present-day limitations, and covers some pitfalls in developing applications with Generative AI and how to avoid them.
The Progress:
Without question, there have been substantial advancements in Gen AI and seemingly ubiquitous adoption of the technology (not just in the tech community). Widespread adoption, in large part, has allowed Gen AI technologies to improve, because with more usage, there’s more data, and the “it” in AI models is the dataset (not the architecture, hyperparameters, or optimizer choices).
I’ve personally seen Generative AI replace traditional ML solutions I’ve built. For example, two years ago I was tasked with classifying unstructured medical passages scraped from a website. Back then, I needed to store medical ontologies, do pre-processing, parse the text, send it to a pre-trained NLP model for the task, and finally wrap up with post-processing. With LLMs, I can now do the same thing with more reliable quality and a drastically smaller codebase, and get it running in a fraction of the time. Another project I completed around then was summarizing domain-specific articles. Following a similar development cycle to the previous example, I developed extractive and abstractive summarization solutions in a few weeks of hard work and collaboration with experts. Nowadays, an LLM could summarize these articles with the same quality, or maybe even better (again, with less code and in a fraction of the development time).
Over time, vendors of these LLMs needed to catch up to their popularity. OpenAI’s APIs are much more robust today than they were even six months ago (salute to everyone who dealt with exception handling in the early days). GPT-4 is an incredible asset, and the differences are staggering compared to GPT-3. Word on the street is that GPT-5 will be as different from GPT-4 as GPT-4 was from GPT-3. I’d be remiss not to mention other amazing LLM alternatives out there, such as Meta’s Llama 3, Anthropic’s Claude 3 family, Eagle 7B, and Google’s Gemini 1.5, to name a few. There have been recent landmark contributions to Generative AI in audio, video, and multi-modal fields as well (e.g. Microsoft’s VASA, OpenAI’s Sora and GPT-4o). I would expect each new iteration of these models to be smarter than the last, which seems far-fetched until it isn’t.
As an applied AI engineer, I’m even more excited by the use cases being built in concert with LLMs. For example, vector databases and Retrieval Augmented Generation (RAG) extend LLMs beyond the data they were trained on by retrieving relevant documents at query time and supplying them to the model as context. This helps with tasks such as question answering, information extraction, fact-checking, and even reasoning! I’m a big fan of the work the folks at LangChain are doing to build a community and tools for Generative AI developers, and I’d recommend that others check them out as well.
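To make the RAG idea concrete, here is a minimal sketch of the retrieval step. It is a toy, not how a production system works: it assumes simple word-count vectors and cosine similarity in place of learned embeddings and a vector database, and the function names (`retrieve`, `build_prompt`) and the prompt wording are my own placeholders. The resulting prompt would then be sent to whichever LLM you use.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return up to k documents most similar to the question.

    Assumption: word counts stand in for real embeddings here.
    Documents with zero similarity are dropped.
    """
    q_vec = Counter(tokenize(question))
    scored = [(cosine(q_vec, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(question: str, documents: list[str], k: int = 2) -> str:
    """Assemble a prompt that grounds the model in the retrieved text."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The point of the design is that the model only sees the handful of passages most related to the question, which is what lets it work with data it was never trained on.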
The future is so exciting, and the rate of change is astonishing.
The Hype (Good and Bad):
That being said… your nephew won’t stop talking about ChatGPT. You are correct to proceed with caution. This level of hype resembles the blaring horns of blockchain/NFT/Web3 that we all endured. The main difference is that even some of the biggest blockchain advocates failed to provide a practical example of it, whereas Generative AI has produced countless examples of value-adding applications for high-tech companies and non-technical hobbyists alike.
We know that LLMs are amazing at language tasks (e.g. translation, summarization, comprehension, etc.), but there are shortcomings, especially as the length of the input text increases. New iterations of Deep Learning models typically advance by increasing the number of parameters or the size of the training dataset. Similarly, the trend for LLMs is to increase the size of the training dataset and the context window, i.e. the amount of text the model can work with at once. It’s just a matter of time before prompts can routinely ingest an entire book instead of just a paragraph. For now, there are tricks to reduce the context: brute-force chunking, recursive summarization, using something like RAG to work only with the relevant chunks, etc. It’s certainly a relevant limitation of today’s technology to consider.
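As an illustration of the chunking and recursive summarization tricks mentioned above, here is a rough sketch. It makes a few assumptions that are mine, not part of any particular library: word counts stand in for token counts, `summarize` is whatever function calls your LLM, and `fake_summarize` is only a stand-in so the example runs without an API key.

```python
from typing import Callable


def chunk_words(text: str, max_words: int) -> list[str]:
    """Brute-force chunking: split text into pieces of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def recursive_summarize(
    text: str,
    summarize: Callable[[str], str],
    max_words: int,
) -> str:
    """Summarize text that may be too long for a single call.

    Each chunk is summarized separately, the partial summaries are joined,
    and the process repeats until the text fits within max_words.
    """
    if len(text.split()) <= max_words:
        return summarize(text)
    partials = [summarize(chunk) for chunk in chunk_words(text, max_words)]
    combined = " ".join(partials)
    # Guard against a summarizer that does not shrink its input,
    # which would otherwise recurse forever.
    if len(combined.split()) >= len(text.split()):
        raise ValueError("summarize() must return text shorter than its input")
    return recursive_summarize(combined, summarize, max_words)


def fake_summarize(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep the first five words."""
    return " ".join(text.split()[:5])
```

In practice you would replace `fake_summarize` with a real model call and count tokens with the tokenizer that matches your model. Note that each round of summarization loses detail, which is one reason RAG is often preferred when only part of the document is relevant.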
Another well-documented problem regarding the quality of responses is “hallucinations.” So much so that “hallucinate” (in the artificial intelligence sense) was Dictionary.com’s 2023 Word of the Year. Hallucinations typically happen when a generative AI model is prompted to handle information beyond its training data or more current than its training cutoff, or when the model “guesses” because the prompt is insufficiently detailed. The result is fabricated, non-factual, or nonsensical information being returned to the user. This poses serious risks for businesses that would like to integrate an LLM into their products, such as returning incorrect information, spreading dangerous/toxic/unsafe information, and even exposure to financial, legal, and security risks. There are some ways to mitigate hallucinations, such as prompt engineering, adding a validation layer to “fact-check” the model’s responses, limiting the response length, lowering the model “temperature”, and model fine-tuning, to name a few. Even these methods aren’t bullet-proof (e.g. LLMs struggle to learn new factual information through unsupervised fine-tuning). The takeaway is to be sensitive to hallucinations and to your ability to ensure quality, as hallucinations can pose serious risks for your business.
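One way to picture a validation layer is a check that flags sentences in the model’s answer that are not supported by the source text you gave it. The sketch below is deliberately crude and rests on my own assumptions: it uses plain word overlap as a proxy for “supported,” the stopword list and the 0.6 threshold are arbitrary, and the function names are hypothetical. A real system might use an entailment model or a second LLM call instead.

```python
import re

# Small, arbitrary stopword list for illustration only.
STOPWORDS = {
    "the", "a", "an", "of", "for", "in", "on", "and", "or", "to",
    "was", "is", "were", "are", "had", "has", "also", "with",
}


def content_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def unsupported_sentences(answer: str, source: str, min_overlap: float = 0.6) -> list[str]:
    """Return the sentences in answer whose content words are mostly missing from source.

    A sentence is flagged when fewer than min_overlap of its content words
    appear anywhere in the source text.
    """
    source_words = content_words(source)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if not words:
            continue
        overlap = len(words & source_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged
```

Flagged sentences could be dropped, sent back to the model for revision, or routed to a human reviewer. Word overlap will miss paraphrases and can be fooled by a sentence that reuses the right words to say the wrong thing, so treat it as a first filter rather than a guarantee.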
On the other hand, there are just as many doomsday-ers in the Generative AI conversation. “AI is going to take all of our jobs!” Certainly, some tasks are primed to be replaced by AI, but this isn’t new; ML/AI models were automating work long before LLMs were introduced. I won’t get into the economics of job replacement with AI. I mention it because this hype is negative, and it too should be addressed objectively. As mentioned, the tech itself has already shown its limitations, and quality issues are prevalent (although these should improve over time). There’s also another big factor that limits the effectiveness of Gen AI: humans. After all, we are the ones who have authored the models, compiled the datasets, written the prompts, etc. Like any AI/ML model, LLMs are subject to human biases & limitations. Beyond the teams of engineers that originate the LLMs, we must also recognize our own biases and knowledge gaps, which can contribute to suboptimal results when creating any application. However, you don’t need to be the world’s best engineer to achieve the best results. The ideal Gen AI team is skilled in design, contextual understanding, collaboration/communication, forensics, and anticipation. A pure coder without these skills won’t be a very useful teammate.
The hype (good and bad) is justified to an extent. As with anything else, approach the topic without prejudice, and remember that hype can distract you from your goals.
The Pitfalls (and how to avoid them):
With all the advancements in Generative AI, it’s normal to feel the urge to integrate it into your business. However, one should be careful when embarking on this journey, so that your team doesn’t take any missteps along the way. A McKinsey report says it well: “Companies looking to score early wins with Gen AI should move quickly. But those hoping that gen AI offers a shortcut past the tough—and necessary—organizational surgery are likely to meet with disappointing results.”
That same report highlights three different company profiles: Makers (builders of custom LLMs), Takers (users of available Gen AI tools, often via APIs and subscription services), and Shapers (integrators of available models with proprietary data). A Maker approach is too expensive for most companies. That is to say, think twice before committing to creating custom models, or be prepared for the required investments. A hybrid between a Taker (for productivity improvements) and a Shaper (for gaining competitive advantage) is typically the most suitable. Moreover, building a prototype of a Gen AI application and scaling it to production are different animals. It’s estimated that model costs are only 10-15% of the total cost of the solution. You’ll also need to consider whether your team has structured its data appropriately to take full advantage of Generative AI models. Once you’ve finally implemented it, you’re likely not done paying for it, especially if you’re using a Taker approach, as you’ll pay for every call you make to an API. While the lower cost of an API is usually the reason for not building a model in-house, make sure you’re aware of all the types of calls you’ll make and your anticipated volumes, so that there’s no buyer’s remorse.
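A back-of-the-envelope calculation helps here. The sketch below estimates a monthly API bill from the types of calls and their volumes, then applies the 10-15% figure above to get a feel for the total cost of the solution. Everything in it is an assumption for illustration: the call profiles, the per-1,000-token prices, and the function names are made up, and real providers may price differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallProfile:
    """One type of API call and its expected monthly volume (all values hypothetical)."""
    name: str
    calls_per_month: int
    input_tokens: int   # average tokens sent per call
    output_tokens: int  # average tokens returned per call


def monthly_api_cost(
    profiles: list[CallProfile],
    price_in_per_1k: float,
    price_out_per_1k: float,
) -> float:
    """Sum the cost of every call type, pricing input and output tokens separately."""
    total = 0.0
    for p in profiles:
        per_call = (
            (p.input_tokens / 1000) * price_in_per_1k
            + (p.output_tokens / 1000) * price_out_per_1k
        )
        total += per_call * p.calls_per_month
    return total


def implied_total_cost(model_cost: float, model_share: float) -> float:
    """If model costs are model_share of the total, return the implied total."""
    if not 0 < model_share <= 1:
        raise ValueError("model_share must be in (0, 1]")
    return model_cost / model_share


# Hypothetical workload and prices, not real vendor pricing.
profiles = [
    CallProfile("summarize", calls_per_month=10_000, input_tokens=2_000, output_tokens=200),
    CallProfile("classify", calls_per_month=50_000, input_tokens=500, output_tokens=10),
]
api_cost = monthly_api_cost(profiles, price_in_per_1k=0.01, price_out_per_1k=0.03)
low_total = implied_total_cost(api_cost, 0.15)
high_total = implied_total_cost(api_cost, 0.10)
```

With these made-up numbers, the API bill comes to about $525 a month, and if that is only 10-15% of the total, the full solution would cost roughly $3,500 to $5,250 a month. The exact figures matter less than the habit of listing every call type before you commit.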
Another extremely important and often overlooked consideration is determining how compliant your solution is with relevant regulations. For example, if you are building a solution in the healthcare space, you’ll need to be sure not to send any protected health information (PHI) over an API call in order to stay HIPAA-compliant. The same goes for any application in highly regulated environments (e.g. finance, legal, government, etc.). You can avoid this by going down the open-source route, or by finding ways to prevent your LLM provider from retaining any sensitive information. Relatedly, how transparent do your results need to be for auditors and others? This consideration adds to the complexity of building a Gen AI application and should be discussed before writing a single line of code.
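As one small piece of that puzzle, you can scrub obvious identifiers from text before it ever leaves your system. The sketch below is only an illustration under my own assumptions: the regular expressions cover a few US-style formats (SSNs, phone numbers, emails, numeric dates), the placeholder labels are arbitrary, and pattern matching like this is not sufficient on its own for HIPAA de-identification. Names, addresses, and free-text details need far more than regexes.

```python
import re

# Illustrative patterns only; real PHI detection needs much broader coverage.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}


def redact(text: str) -> tuple[str, dict[str, int]]:
    """Replace each matched identifier with a placeholder such as [SSN].

    Returns the redacted text and a count of replacements per label,
    which can be logged for auditing without logging the values themselves.
    """
    counts: dict[str, int] = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts
```

Notice that a patient name in the text would pass straight through this function, which is exactly why a step like this should be one layer among several and should be reviewed with your compliance team.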
Lastly, how can you tell if your results are hitting the mark? The evaluation step is one of the most under-specified parts of the AI/ML cycle for businesses, even though it might seem obvious. It’s crucial (as it is with any type of model out there) to know what your measures of success are and to be able to iterate and improve on them. If your task is to summarize, how can you tell if the summary is concise enough, or meets your quality standard? If your task is to extract, are you extracting the right information consistently? Without some sort of performance monitoring, it isn’t accurate to call new iterations “experiments,” because the evaluations are only gut checks. Be sure to clearly define your measures before you start experimenting, to save time, money, and headaches.
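To show what “defining your measures” can look like, here is a small sketch of two simple metrics: precision, recall, and F1 for an extraction task scored against hand-labeled answers, and a compression ratio for summaries. The function names and the choice of metrics are my own assumptions; your use case may call for different ones, and conciseness is only one dimension of summary quality.

```python
from typing import Callable


def extraction_scores(predicted: set[str], expected: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 for one document's extracted items."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def compression_ratio(summary: str, source: str) -> float:
    """Summary length divided by source length, in words (lower is more concise)."""
    source_len = len(source.split())
    if source_len == 0:
        raise ValueError("source must not be empty")
    return len(summary.split()) / source_len


def evaluate_run(
    cases: list[tuple[str, set[str]]],
    extract: Callable[[str], set[str]],
) -> dict[str, float]:
    """Average the extraction scores of one pipeline version over a labeled test set.

    cases pairs each input text with its hand-labeled expected items;
    extract is the pipeline under test (for example, a wrapper around an LLM call).
    """
    if not cases:
        raise ValueError("cases must not be empty")
    scores = [extraction_scores(extract(text), expected) for text, expected in cases]
    return {
        key: sum(s[key] for s in scores) / len(scores)
        for key in ("precision", "recall", "f1")
    }
```

Running `evaluate_run` on the same labeled set after every prompt or model change is what turns an iteration into an experiment: you can see whether the numbers moved, rather than relying on a gut check.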
You might be asking by now if Gen AI is right for your business, and that’s also a great question! It’s worthwhile to consider simpler alternatives altogether. Will some basic logic/heuristics get you 70% of the way there? Is that enough? How about simple ML models? What is the risk of not innovating? All of these considerations depend on your unique business & use case, and whether you move forward or not will come down to the tradeoffs, with the most critical decision happening at the beginning. What direction should your team run in?
In conclusion:
Without a doubt, the field of Generative AI has advanced so rapidly that it can be difficult to keep up with the latest innovations, which seem to be introduced daily. There’s so much value already being captured by this technology, and much more still to be revealed. But as with anything new, there are real limitations, and one should tread carefully to discern hype from substance.
If your team is approaching this topic, reach out to me! I’ve been down this road and I can help your team avoid the pitfalls, save tons of time, and capture the most value.