---
title: "Keeping AI Up to Date with Streaming Data"
categories:
image: /images/blog/article-7-icon.png
author_staff_member: "LangStream"
date: November 22, 2023
---

If you've used ChatGPT a lot, you're likely familiar with messages such as: "As of my last update in April 2023, I don't have information on the winner of the 2023 UEFA Champions League final. For the most current results, please check the latest sports news from a reliable source."

This is one of the side effects of creating AI models. It underlines a key issue with large language models (LLMs): the static nature of their training data constrains their knowledge base. GPT-4, the LLM behind ChatGPT, can provide an incredibly detailed and broad snapshot of information — but only up to the point of the model's last update.

In their current stage of development, LLMs are inherently frozen in time, but the world doesn't stand still: it generates unceasing streams of new information. The artificial intelligence (AI) tools of the future must be able to analyze and act on these streams in real time.

In this article, we'll talk about how the technology already exists to stream data directly to AI solutions — and how [LangStream](https://docs.langstream.ai/about/what-is-langstream) is at the forefront of this advancement.


## The Lifecycle of LLM Data

The data lifecycle within an LLM begins with the data's inception and concludes when the data becomes outdated or new information supersedes it. This lifecycle directly shapes the user experience and overall effectiveness of these AI systems. Here's how:



1. **Ingestion and integration:** LLMs ingest vast datasets encompassing a wide array of knowledge. At this initial phase, the data is current, relevant, and rich with potential insights.
2. **Training and learning:** The LLM trains on the data, enabling it to learn and understand facts, patterns, and contexts.
3. **Deployment and interaction:** Once trained, the model is ready for interaction with users. It bases its responses on the learned data, which remains static within its architecture. During this stage, the potential for outdated and erroneous responses increases.
4. **Outdating and obsolescence:** As the world moves on and new data emerges, the previously learned information ages rapidly. A trove of current knowledge gradually becomes a collection of historical snapshots, still useful but increasingly disconnected from the present.
5. **Rejuvenation or retirement:** Without intervention, some of the data the LLM has trained on becomes obsolete, making the LLM prone to [hallucination](https://flyte.org/blog/getting-started-with-large-language-models-key-things-to-know). The model must work with new data, or it risks becoming increasingly inaccurate.


## How LLMs Learn

During the training stage, an LLM like GPT-4 ingests vast amounts of text from a curated dataset. Diverse sources contribute to this dataset, including books, websites, articles, and other text-rich media, providing a broad spectrum of knowledge.

Compiling this dataset involves crawling websites and other digital repositories to harvest their textual content. This is a time-intensive endeavor, not only due to the sheer volume of available data but also the need for careful selection and preprocessing to ensure the quality of information.

Once compiled, the training process itself can take weeks or even months, depending on the model's complexity and the computational resources available. During this time, the model assimilates the intricacies of language but is limited to the knowledge contained within its training set, which becomes static once training concludes.

The model learns by identifying patterns, correlations, and structures within this data, effectively coming to "understand" language through statistical inference. Of course, it doesn't truly understand language or anything else — at least not how humans define understanding. Instead, it generates responses based on patterns.

While much of what the model produces is factual, logical, and well-reasoned, it gets there through a sophisticated process of blind emulation, not through the application of actual reasoning and logic. This is why LLMs can sometimes hallucinate false information.
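
To make "identifying patterns through statistical inference" concrete, here is a deliberately tiny sketch in Python: a bigram model that counts which word follows which in a toy corpus, then generates text by sampling those counts. Real LLMs use neural networks with billions of parameters rather than count tables, but the principle of emitting statistically likely continuations is the same.

```python
import random
from collections import Counter, defaultdict

corpus = (
    "the cup final was played in june . "
    "the final was decided on penalties . "
    "the match was played in istanbul ."
).split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(start: str, length: int = 8) -> str:
    """Emit words by sampling the observed next-word distribution."""
    word, out = start, [start]
    for _ in range(length):
        candidates = follows.get(word)
        if not candidates:
            break
        word = random.choices(
            list(candidates), weights=list(candidates.values())
        )[0]
        out.append(word)
    return " ".join(out)

print(generate("the"))  # e.g. "the final was played in istanbul ."
```

Nothing in this loop checks whether the output is true; it only reproduces the statistics of its training text. Stale statistics therefore yield stale, or invented, answers.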


### Hallucination in LLMs

When a user asks an LLM a question it doesn't have the answer to, perhaps because the information relates to events after its last training date, the model may effectively guess by reproducing patterns it has learned. It can then confidently report incorrect, and sometimes nonsensical, information; such responses are called hallucinations.

One way to reduce the frequency of hallucinations is by empowering LLMs with the ability to ingest streaming data. With a dynamically refreshed knowledge base to draw from, there's less chance that the model ever needs to "guess" and potentially hallucinate.
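
One common pattern for wiring fresh knowledge into an LLM is retrieval-augmented generation (RAG): look up current documents first, then paste them into the prompt so the model answers from evidence rather than memory. Below is a minimal sketch that assumes the OpenAI Python client for the model call; `search_fresh_docs()` is a hypothetical stand-in for querying an index that a streaming pipeline keeps up to date (see the ingestion sketch later in this article).

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_fresh_docs(query: str, top_k: int = 3) -> list[dict]:
    """Hypothetical stand-in: a real system would query a vector index
    that a streaming pipeline continuously refreshes."""
    return [{"text": "Manchester City won the 2023 UEFA Champions League final."}][:top_k]

def answer_with_fresh_context(question: str) -> str:
    context = "\n\n".join(doc["text"] for doc in search_fresh_docs(question))
    prompt = (
        "Answer using ONLY the context below. If the context does not "
        "contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer_with_fresh_context("Who won the 2023 UEFA Champions League final?"))
```

Because the prompt instructs the model to stay inside the supplied context, a refreshed index largely replaces guesswork with lookup.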


## Staying Current with Streaming Data

A system that incorporates streaming data capabilities essentially operates on a continuous data ingestion model. New information is streamed and integrated into the model's knowledge base on a rolling basis and is immediately available when responding to prompts.

Here's a breakdown of the data lifecycle in a streaming data system (a code sketch of the ingestion side follows the list):



1. **Real-time ingestion:** The LLM captures information such as news articles, stock market movements, or traffic data as soon as it becomes available.
2. **Continuous integration:** This streaming data is preprocessed and formatted to be compatible with the LLM's learning architecture. Unlike in traditional systems, this step happens in near real time, ensuring minimal latency between data generation and availability for learning.
3. **Incremental learning:** The LLM continuously updates its parameters to incorporate the new information. It employs techniques like transfer learning or meta-learning, which allow the model to assimilate new data without requiring full retraining cycles.
4. **Dynamic deployment:** The updated model is deployed on the fly, replacing or augmenting the existing system. Users get responses that reflect the latest information.
5. **Feedback loops:** The system monitors user interactions and system outputs to identify anomalies or outdated information. It feeds these back into the ingestion phase to refine data streams further.
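
To ground steps 1 and 2, here is a minimal sketch of the ingestion side, assuming Apache Kafka as the transport (the same technology LangStream builds on) and the `kafka-python` client. The topic name, broker address, embedding function, and in-memory "vector index" are all illustrative stand-ins.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# --- Hypothetical stand-ins for the embedding model and vector store ---
def embed(text: str) -> list[float]:
    """Placeholder: a real pipeline would call an embedding model here."""
    return [float(len(text))]

vector_index: dict[str, dict] = {}  # placeholder for a real vector database

# Subscribe to a topic that upstream producers fill with fresh documents
# (news articles, price ticks, sensor readings, and so on).
consumer = KafkaConsumer(
    "fresh-documents",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Blocks forever, yielding each record the moment it arrives (step 1).
for message in consumer:
    doc = message.value
    text = doc["text"].strip()  # lightweight preprocessing (step 2)
    vector_index[doc["id"]] = {"vector": embed(text), "text": text}
    # The document is now immediately queryable by the serving layer.
```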


## Streaming Data System Applications

Not only does equipping an LLM-based system with streaming data capabilities reduce hallucinations and instances of the dreaded "As of my last update" warning, it also enables the system to do more. Streaming systems can perform critical roles in industries that rely on real-time data and updates, roles that a traditional, batch-fed LLM system couldn't fulfill:



* **Financial services:** Streaming systems can offer real-time financial insights, stock price updates, and market trend analyses.
* **Healthcare:** Streaming systems can give healthcare professionals the most current information, including the latest research, treatment protocols, and drug approvals. They can help monitor real-time patient data and alert healthcare providers to changes in patient conditions that may require immediate attention.
* **News and content aggregation:** Streaming systems can curate and summarize the latest news from multiple sources in real time, creating up-to-date news digests personalized to user interests. They can track social media trends and breaking news, informing users with timely content.
* **Crisis management:** During crises, streaming systems can provide live updates and safety information, and help coordinate emergency responses by analyzing current data from various sources. They can integrate data from sensors and social media to give real-time alerts and updates.
* **E-commerce:** Streaming systems can give customers the latest inventory changes, price adjustments, and product recommendations based on real-time shopping trends and availability.
* **Manufacturing:** AI systems with streaming data can predict equipment failures by continuously analyzing data from IoT devices and detecting anomalies before they lead to breakdowns (a minimal sketch follows this list).
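
As one concrete illustration of the manufacturing case, here is a hedged sketch of streaming anomaly detection: keep a rolling window of sensor readings and flag values that deviate sharply from the recent mean. The window size and threshold are illustrative, not recommendations.

```python
from collections import deque
import statistics

class RollingAnomalyDetector:
    """Flags readings far from the recent rolling mean (sketch only)."""

    def __init__(self, window: int = 100, threshold: float = 4.0):
        self.readings = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Return True if `value` looks anomalous versus recent history."""
        anomalous = False
        if len(self.readings) >= 10:  # wait for some history first
            mean = statistics.fmean(self.readings)
            stdev = statistics.pstdev(self.readings) or 1e-9
            anomalous = abs(value - mean) / stdev > self.threshold
        self.readings.append(value)
        return anomalous

detector = RollingAnomalyDetector()
for temperature in (70.1, 70.3, 69.8, 70.0) * 5 + (95.2,):  # sudden spike
    if detector.observe(temperature):
        print(f"Alert: abnormal reading {temperature}")
```

In production, the readings would arrive from the same kind of streaming pipeline described above, and alerts could feed back into the LLM system's context.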


## Summary

LLMs provide a broad and detailed snapshot of knowledge delivered with authority. However, for many practical purposes, batch-trained LLMs fall short in a world constantly generating new data. Without the ability to read streaming data, knowledge gaps can lead to hallucinations and an inability to answer certain questions.

Enabling LLMs like GPT-4 to ingest and learn from new information as it becomes available greatly increases the potential of this technology. With a dynamic and continuously updated knowledge base, an LLM can bring transformative benefits to an array of industries, from financial services to healthcare to crisis management. With streaming data, AI systems can supply real-time insights, improved decision-making, and timely, relevant responses to user inquiries.

LangStream is an open-source development platform that brings data streaming to AI application development. Harnessing Apache Kafka for its data handling and stream processing capabilities, LangStream offers developers a robust and familiar infrastructure for creating streaming data-ready AI solutions.
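
From the producing side, feeding such a pipeline can be as simple as publishing fresh documents onto a Kafka topic. A minimal sketch with the `kafka-python` client follows; the topic name, broker address, and message shape are illustrative, not LangStream defaults.

```python
import json
import time
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
)

# Publish a freshly scraped article; a downstream pipeline consuming
# "fresh-documents" can embed and index it within seconds.
producer.send(
    "fresh-documents",
    {
        "id": "article-2023-11-22-001",
        "text": "Example breaking-news text goes here.",
        "ingested_at": time.time(),
    },
)
producer.flush()  # block until the broker acknowledges the write
```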

Give your AI applications the power of streaming data with LangStream and [start building today](https://langstream.ai/quick-start/).