From “RAG”s To Riches: A Practical Discussion On The Use Of GenAI

Published: December 2, 2024

I. Introduction: The Rise of Generative AI

Generative AI (GenAI) is rapidly emerging as one of the most exciting advancements in artificial intelligence, unlocking unprecedented opportunities across industries. This cutting-edge technology enables machines to create new content, such as text, images, code, or complex simulations, by analyzing patterns within massive datasets.

Unlike traditional AI, which focuses on recognizing patterns or making predictions, GenAI goes further: it’s your buddy. Or an assistant. (Depending on how you view it.) True to its name, GenAI “generates” original outputs that are creative and highly adaptable. From crafting realistic images and videos to drafting human-like text or developing functional software code, its capabilities are reshaping how we approach problem-solving and innovation.

Applications of GenAI span a wide array of industries. In marketing, it powers personalized content creation and advertising campaigns. In healthcare, it accelerates drug discovery and medical research by simulating biological processes. The entertainment industry uses it to design immersive games and visual effects. In finance, it assists in generating insights from unstructured data and automating customer interactions. One of the most common ways of obtaining such context-aware results is Retrieval-Augmented Generation (RAG).

II. Understanding Retrieval-Augmented Generation (RAG)

A. What Is RAG?

Retrieval-Augmented Generation (RAG) is changing how AI systems handle information. It combines two strengths: retrieving accurate data and generating creative, human-like responses. Traditional generative models rely only on what they were trained on. This can limit their ability to provide correct or up-to-date information. Retrieval systems, on the other hand, can pull relevant facts but struggle to explain or connect them naturally. RAG bridges this gap. It retrieves the most valuable data and uses it to generate clear and context-rich answers. This makes AI responses more precise, relevant, and adaptable to real-world needs. With RAG, AI becomes more intelligent and more reliable.

B. How Does RAG Work?

RAG works in three stages: Retrieval, Fusion, and Generation.

Retrieval:

The RAG system starts by finding the most relevant information from external data sources. This is done using retrieval methods such as Dense Passage Retrieval (DPR) and BM25-based systems. DPR uses neural encoders to map queries and documents into dense vectors and matches them by meaning rather than exact wording, making it ideal for handling complex or nuanced requests. BM25, on the other hand, scores documents by how well their terms match the terms in the query, which works well for precise, keyword-driven searches. By combining these retrieval techniques, RAG pulls in accurate and relevant data tailored to the user's needs.
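To make the keyword-matching side of retrieval concrete, here is a minimal sketch of BM25 scoring in plain Python. It assumes whitespace tokenization, the commonly used defaults `k1=1.5` and `b=0.75`, and a smoothed IDF term; the sample documents and the helper names `tokenize` and `bm25_scores` are made up for illustration. A production system would more likely rely on a search engine or a retrieval library than on hand-rolled scoring.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 relevance score per document for the given query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Smoothed inverse document frequency; stays non-negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates with k1 and is normalized by document length via b.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical knowledge-base snippets.
documents = [
    "refund policy for annual subscriptions",
    "how to reset your account password",
    "shipping times for international orders",
]
scores = bm25_scores("reset password", documents)
best = max(range(len(documents)), key=scores.__getitem__)
print(documents[best])
```

Documents that share no terms with the query score zero here, which is exactly the gap that a dense retriever such as DPR is meant to cover.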

Fusion:

Once the data is retrieved, the system moves to the fusion phase. Here, it blends the retrieved information with the user's query to create a rich and comprehensive input. This step is crucial because it ensures the generative model has the right context to understand what the user is asking. Think of it like gathering the ingredients for a recipe—each piece of information adds flavor and depth, ensuring the final result is as complete and meaningful as possible.
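In practice, fusion often amounts to assembling a prompt that places the retrieved passages next to the user's question. The sketch below shows one possible layout; the template wording, the `build_prompt` name, and the `max_passages` cap are illustrative assumptions, not a fixed standard.

```python
def build_prompt(question, passages, max_passages=3):
    """Combine retrieved passages and the user's question into one model input."""
    # Keep only the top-ranked passages so the prompt stays within a token budget.
    selected = passages[:max_passages]
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(selected, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved passages, already sorted by relevance.
passages = [
    "Passwords can be reset from the account settings page.",
    "Reset links expire after 24 hours.",
    "Annual subscriptions renew automatically.",
    "International orders ship within 10 days.",
]
prompt = build_prompt("How do I reset my password?", passages)
print(prompt)
```

Capping the number of passages is one simple way to keep input tokens, and therefore cost, under control.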

Generation:

Finally, the generative model takes over. It uses the fused input to create a detailed, coherent, and human-like response. Unlike models that rely solely on pre-trained knowledge, a RAG system draws on freshly retrieved data, which makes the output both relevant and contextually accurate. Whether answering a complex question or drafting a thoughtful reply, this step combines creativity and precision to deliver a polished result.

Each stage in RAG builds on the last, creating a seamless flow from raw data to an insightful and tailored response. This structured approach makes RAG systems intelligent and adaptable, setting them apart in AI.
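The three stages can be wired together in a few lines. The sketch below is a toy under stated assumptions: `retrieve` ranks documents by simple word overlap rather than DPR or BM25, and `fake_generate` is a placeholder standing in for a real LLM call. All function names and sample documents are hypothetical.

```python
from typing import Callable, List


def retrieve(query: str, documents: List[str], top_k: int = 2) -> List[str]:
    """Stage 1: rank documents by how many query words they share (toy retriever)."""
    query_terms = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_terms & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def fuse(query: str, passages: List[str]) -> str:
    """Stage 2: merge the retrieved passages and the query into one prompt."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def rag_answer(query: str, documents: List[str],
               generate: Callable[[str], str], top_k: int = 2) -> str:
    """Run retrieval, fusion, and generation in sequence."""
    passages = retrieve(query, documents, top_k)
    prompt = fuse(query, passages)
    # Stage 3: hand the fused prompt to whatever model the caller supplies.
    return generate(prompt)


def fake_generate(prompt: str) -> str:
    """Placeholder for a real LLM call; it only reports the prompt size."""
    return f"(model output for a prompt of {len(prompt.split())} words)"


documents = [
    "refund policy for annual subscriptions",
    "how to reset your account password",
    "shipping times for international orders",
]
print(rag_answer("reset password", documents, fake_generate))
```

Passing the generator in as a callable keeps the pipeline independent of any particular model provider, so the same flow can be pointed at a hosted API or a self-hosted model.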

C. Examples of RAG in Action:

RAG excels in applications that demand accuracy and rely on external knowledge. From helping customers and employees to empowering doctors and students, it bridges the gap between raw data and meaningful insights, making it an invaluable tool across industries.

A few real-life use cases for RAG built on entry-level models are:

Customer Support Chatbots:

Models like Amazon Titan Text Express, Gemini 1.5 Flash, and GPT-3.5 Turbo can efficiently support customer service operations by automating responses to frequently asked questions, solving simple customer problems, and escalating complex issues to human agents. 

Document Summarization:

These models are well-suited to summarizing large volumes of documents, such as internal knowledge bases, legal documents, or customer support tickets. This can help organizations provide quick answers and improve productivity. 

Question Answering (QA) Systems:

A key component of Retrieval-Augmented Generation (RAG) systems, these models can answer questions by retrieving relevant information from databases or external knowledge sources and generating contextually appropriate responses. 

Content Creation:

From writing blogs and social media posts to generating marketing content, these models can help create contextually relevant, human-like text based on company-specific knowledge and guidelines. 

III. Cost Considerations for GenAI Projects: A Cloud Perspective

Well, in this economy, hardly anything costs less than a dime, let alone something as useful as GenAI tools. But how much does one need to invest in a project anyway? And which factors need to be accounted for?

A. Understanding Token-Based Pricing:

Leading cloud platforms like AWS, Google Cloud, and Azure often use a token-based pricing model for their Large Language Models (LLMs), the backbone of Generative AI (GenAI) services. This model determines the cost by the number of tokens processed rather than a flat fee. These providers charge for using their AI models based on how much text is input into and output from the system.

Every interaction with the model, whether it’s a question or a response, is measured in "tokens." A token is a chunk of text produced by the model's tokenizer. Depending on the language and the tokenizer, a token can be as small as a single character or as large as a whole word. The cost is divided into two categories: input tokens and output tokens.

  • Input tokens: The text you send to the model. For example, if you input a paragraph of text or a question, every token in it counts toward the cost.
  • Output tokens: The text the model generates in response. So, if the model replies with a lengthy explanation or a short summary, every token in the response counts toward the cost.

Tracking the number of tokens used, both input and output, helps you estimate costs and optimize interactions with AI models.
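The arithmetic behind token-based pricing is simple enough to sketch. The function below assumes prices quoted per 1,000 tokens, with separate rates for input and output; the rates in the example are placeholders, not any provider's actual prices, so check the current pricing page before budgeting.

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_1k, output_price_per_1k):
    """Cost of one model call when input and output tokens are billed separately."""
    input_cost = (input_tokens / 1000) * input_price_per_1k
    output_cost = (output_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost


# Placeholder rates in USD per 1,000 tokens (hypothetical, for illustration only).
INPUT_PRICE_PER_1K = 0.0005
OUTPUT_PRICE_PER_1K = 0.0015

cost = request_cost(
    input_tokens=1200,   # e.g. a question plus retrieved context
    output_tokens=300,   # e.g. a short answer
    input_price_per_1k=INPUT_PRICE_PER_1K,
    output_price_per_1k=OUTPUT_PRICE_PER_1K,
)
print(f"Estimated cost per request: ${cost:.5f}")
```

In a RAG application the retrieved passages count as input tokens, so the amount of context added per request tends to dominate the input side of the bill.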

B. Additional Cost Factors Beyond LLMs:

The total cost of using GenAI services goes beyond just token-based pricing. Compute, storage, networking, and vector database costs all contribute to the overall expenses. Understanding these components is essential for businesses and developers to estimate and manage their budgets effectively when deploying advanced AI models and services.

Compute Costs:

Running GenAI models requires powerful hardware like GPUs, and the more complex the model or data, the higher the compute costs. These are a major factor in the total cost of GenAI services, especially for large-scale or high-demand applications.

Storage Costs:

GenAI systems rely on large datasets for training and retrieval, making storage costs significant. Storing data like embeddings and other large datasets, such as medical records, adds up quickly as the data volume grows.

Networking Costs:

Networking costs involve data transfer between systems and cloud infrastructure. These costs increase with the amount of data transferred, especially in global deployments, making network optimization essential for controlling expenses.

Vector Database Costs:

Vector databases store and retrieve embeddings used in RAG systems, and the costs depend on database size and query complexity. Maintaining large-scale, efficient databases for quick searches can become costly as more queries and data are processed.
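A rough lower bound on vector storage can be estimated from the number of embeddings and their dimensionality. The sketch below assumes 32-bit floats (4 bytes per value) and ignores index structures, metadata, and replication, all of which add to the real footprint. The corpus size and dimension are hypothetical examples.

```python
def embedding_storage_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the embedding vectors alone (no index or metadata overhead)."""
    return num_vectors * dimensions * bytes_per_value


# Hypothetical corpus: one million chunks embedded into 768-dimensional vectors.
raw_bytes = embedding_storage_bytes(num_vectors=1_000_000, dimensions=768)
print(f"{raw_bytes / 1e9:.2f} GB of raw embeddings")
```

With these example numbers the raw vectors alone come to about 3.07 GB, which shows how quickly storage grows as documents are split into more chunks or embedded at higher dimensions.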

C. Private GenAI Services: An Alternative to Consider:

Private GenAI services, such as those offered by Initializ.ai, provide an attractive alternative to cloud-based solutions by allowing businesses to run AI models on their own infrastructure. This approach can lead to lower costs, as it avoids the usage fees charged by cloud providers. It also enhances AI security by keeping sensitive data within the organization’s network, reducing the risk of breaches. Additionally, private services offer greater flexibility, enabling businesses to choose and customize open-source models and vector databases to suit their specific needs, offering more control over performance and cost.

So, as one might expect, there’s a lot to think about, and the risk of miscalculation will always be there. However, a well-informed decision-making framework can help one optimize resources and achieve the best possible outcome. The following section aims to provide a structured approach to making that decision.

IV. Guidance Towards a Financial Decision

A. Key Factors To Consider Before Deciding Anything:

  • Scale of Operation: How many daily active users (DAU) will the GenAI application have?
  • Performance Requirements: What accuracy, speed, and complexity level is needed from the GenAI model?
  • Budget Constraints: What is the available budget for running the GenAI application?
  • Security Needs: What level of data privacy and security is required?

B. Making the Choice: A Decision Framework

One three-stage approach to such decision-making when adopting AI is detailed below:

Stage 1: Experimentation and ROI Validation:

● Identify Potential Use Cases: Begin by brainstorming and identifying use cases within the organization where AI, specifically generative AI, could offer value. This step involves understanding the capabilities of AI and aligning them with business needs.

● Test with POCs: Conduct Proof of Concept (POC) projects on the identified use cases using readily available tools and resources. The goal is to quickly assess each use case's feasibility and potential return on investment (ROI).

● Prioritize Based on ROI: Analyze the results of the POCs and prioritize use cases that demonstrate the most substantial potential for generating ROI. This step ensures that resources are focused on initiatives with the highest likelihood of success.

Stage 2: TCO Analysis and Production Deployment:

● Conduct a Thorough TCO Analysis: Once promising use cases are identified, perform a Total Cost of Ownership (TCO) analysis for each prioritized use case. This analysis should consider various cost factors, including:

  • Human Resources: Determine the need to upskill existing staff or hire new personnel with specialized AI skills.
  • Infrastructure: Evaluate the costs associated with infrastructure, including GPU usage, storage, memory, and databases. 
  • Model Training and Deployment: Consider the costs of fine-tuning existing models or leveraging pre-trained models and services.

● Choose the Deployment Method: Decide on the most cost-effective deployment method, weighing the pros and cons of:

  • Managed Services: To expedite deployment and reduce upfront costs, consider using managed services from cloud providers or platforms like Initializ.
  • Self-Hosted Solutions: Evaluate if specific components can be built and run in-house for greater cost efficiency at scale.

● Deploy and Validate in Production: Deploy the chosen solution into a production environment, carefully monitoring performance and gathering user feedback to validate the ROI assessment.

Stage 3: Scaling and Optimization:

● Iterate Based on Feedback: Continuously monitor the performance of the deployed solution, gathering user feedback and iterating on the solution to improve its effectiveness and ROI.

● Optimize Cost and Performance: Explore further cost optimization strategies, such as:

  • Fine-Tuning vs. RAG: Determine whether to invest in fine-tuning models for specific use cases or leverage Retrieval-Augmented Generation (RAG) techniques for less complex scenarios.
  • LoRA (Low-Rank Adaptation): Consider using LoRA to train lightweight adapters for multiple use cases on top of a single shared large language model to reduce inference costs.
  • GPU/CPU Fractioning: Explore techniques like GPU/CPU fractioning to optimize resource allocation and reduce costs.

● Scale Strategically: Once the use case proves its value and the solution stabilizes, scale the deployment strategically to meet the organization's growing demands. Consider whether to continue with the existing deployment approach or transition to pre-built platforms for enhanced scalability and cost optimization.
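To see why the LoRA option listed under cost optimization above is attractive, it helps to count trainable parameters. LoRA freezes a weight matrix of shape d_out × d_in and learns two small matrices of shapes d_out × r and r × d_in instead. The sketch below compares the two counts for a single layer; the layer size and rank are hypothetical examples, not taken from any specific model.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for LoRA: B is (d_out x rank) and A is (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer: a 4096 x 4096 projection adapted with rank 8.
full = full_finetune_params(4096, 4096)
lora = lora_params(4096, 4096, rank=8)
print(f"full fine-tuning: {full:,} parameters")
print(f"LoRA (rank 8):    {lora:,} parameters")
print(f"reduction factor: {full // lora}x")
```

Because each adapter is this small, several use-case-specific adapters can share one base model, which is where the serving savings come from.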

C. A Word of Caution:

Now that you are equipped with a strategy, it's easy to jump into the application part. However, intelligent financial decisions for Generative AI (GenAI) projects start with looking at the big picture. It’s not just about the upfront costs—success hinges on understanding long-term value and hidden expenses. To make the right call, dive deep into research, seek expert insights, and run pilot tests to see what works best. These steps help you fine-tune your approach, reduce risks, and ensure your investment delivers real impact. With careful planning and experimentation, you can unlock the full potential of GenAI while staying financially savvy.

V. Conclusion: Navigating the Cost Landscape of GenAI

Putting all of that together mathematically leads us to many equations and numbers. Before you reach for your own Excel sheet, here’s a reference cost-result comparison table for running RAG applications with entry-level models across AWS, Google Cloud, and Azure at different DAU levels:

Assumptions made:

  • Session Duration: 5 minutes per session. 
  • Messages per Minute: 4 messages exchanged per minute. 
  • Tokens per Message: 100 tokens (input and output combined). 
  • Total Tokens per Session:  5 minutes × 4 messages/minute × 100 tokens = 2,000 tokens 
  • Sessions per User per Day: 1 session.
  • Daily Active Users (DAU): This represents the number of users actively engaging with the system daily. 
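The assumptions above translate directly into token volume, which is the figure that pricing is then applied to. The sketch below reads the sessions-per-user figure as one session per day and assumes a 30-day month; the DAU levels in the loop are hypothetical examples, and no prices are applied.

```python
SESSION_MINUTES = 5
MESSAGES_PER_MINUTE = 4
TOKENS_PER_MESSAGE = 100          # input and output combined
SESSIONS_PER_USER_PER_DAY = 1     # assumption: one session per user per day


def tokens_per_session():
    """Total tokens exchanged in one session under the stated assumptions."""
    return SESSION_MINUTES * MESSAGES_PER_MINUTE * TOKENS_PER_MESSAGE


def monthly_tokens(dau, days=30):
    """Total tokens processed per month for a given number of daily active users."""
    return dau * SESSIONS_PER_USER_PER_DAY * tokens_per_session() * days


# Hypothetical DAU levels for illustration.
for dau in (100, 1_000, 10_000):
    print(f"DAU {dau:>6,}: {monthly_tokens(dau):>13,} tokens per month")
```

Token volume scales linearly with DAU, so any per-token price difference between providers is multiplied by the same factor.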

In the ever-changing world of GenAI, keeping a close eye on costs is key to staying ahead. Depending on your specific requirements, it may not always be financially wise to go all-in on AI. Either way, staying updated on the latest technologies, pricing models, and cost-saving strategies is essential to making smarter decisions and unlocking the full potential of your AI investments!