Extracting YouTube video data with OpenAI and LangChain

In this tutorial, we will learn how to extract YouTube video data using OpenAI and LangChain. We will fetch a video's transcript with the youtube-transcript package, generate embeddings of the transcript with OpenAI's models through LangChain, and store those embeddings in a vector store so we can query and retrieve information from the video based on similarity comparisons.

Retrieving the video transcript

To retrieve the transcript of a YouTube video, we will use the youtube-transcript package. This package provides a simple way to fetch the transcript of a YouTube video, which we can then consolidate into a single string.

First, we need to install the necessary packages. Run the following command:

npm install chalk dotenv youtube-transcript

Now, let's begin by creating a JavaScript file named index.js. Add the following code to it:

const { YoutubeTranscript } = require('youtube-transcript');

async function fetchTranscript(videoUrl) {
  try {
    // Fetching the video transcript as an array of segments
    const transcript = await YoutubeTranscript.fetchTranscript(videoUrl);

    // Consolidating the transcript segments into a single string
    const text = transcript.map((line) => line.text).join(' ');

    return text;
  } catch (error) {
    console.error('Error fetching transcript:', error.message);
    throw error;
  }
}

In the above code, we first import the YoutubeTranscript class from the youtube-transcript package. This class allows us to fetch the transcript of a YouTube video.

The fetchTranscript function is responsible for fetching and consolidating the transcript of a YouTube video. It takes a videoUrl parameter, which is the URL (or ID) of the YouTube video. Inside the try block, we call YoutubeTranscript.fetchTranscript to fetch the transcript as an array of segments. We then map over the segments, extract the text property from each one, and join them with a space separator to obtain a single string representing the transcript.
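To quickly verify the function, you can call it with the URL of any public video that has captions; the URL below is only a placeholder:

// Quick check — the URL here is a placeholder, not a specific video
fetchTranscript('https://www.youtube.com/watch?v=VIDEO_ID')
  .then((text) => console.log(text.slice(0, 200))) // print the first 200 characters
  .catch(() => process.exit(1)); // error is already logged inside fetchTranscript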

Generating and storing transcript embeddings

Next, we will generate embeddings for the transcript using OpenAI's embedding model through LangChain. We will then store the embeddings in a vector store for later retrieval and similarity comparisons.

First, install the necessary packages by running the following command:

npm install langchain @langchain/openai enquirer

Now, let's modify the index.js file as follows:

const { YoutubeTranscript } = require('youtube-transcript');
const { prompt } = require('enquirer');
const { green, red, blue } = require('chalk');
const { OpenAI, OpenAIEmbeddings } = require('@langchain/openai');

// Load environment variables
require('dotenv').config();

async function fetchTranscript(videoUrl) {
  // Same code as before...
}

async function generateEmbeddings(transcript) {
  try {
    // Creating the OpenAI embeddings model
    const embeddings = new OpenAIEmbeddings({
      openAIApiKey: process.env.OPENAI_API_KEY,
    });

    // Generating the transcript embedding
    const embedding = await embeddings.embedQuery(transcript);

    // Storing the embeddings in the vector store
    // ...
  } catch (error) {
    console.error('Error generating embeddings:', error.message);
    throw error;
  }
}

In the above code, we import the OpenAI integration classes from @langchain/openai, along with enquirer and chalk, and we use the dotenv library to load environment variables (such as our OpenAI API key) from a .env file.

The generateEmbeddings function is responsible for generating an embedding for the provided transcript using OpenAI's embedding model. It takes a transcript parameter, which is the transcript of the YouTube video. Inside the try block, we create an OpenAIEmbeddings instance, passing it our OpenAI API key from the environment.

Next, we call the embedQuery method on the embeddings instance to generate the transcript embedding. This method takes the transcript text as input and returns its embedding vector. Finally, we want to store the embedding in the vector store, but we haven't implemented this part yet.
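As a preview of what that storage step could look like, here is a minimal sketch using LangChain's in-memory vector store. It assumes the MemoryVectorStore class from langchain/vectorstores/memory and lets the store embed the text itself using the embeddings instance we created above:

const { MemoryVectorStore } = require('langchain/vectorstores/memory');

// A minimal sketch of the storage step (not part of the tutorial code yet)
async function storeTranscript(transcript, embeddings) {
  // fromTexts embeds the given texts with the provided embeddings model
  // and keeps the resulting vectors in memory
  const vectorStore = await MemoryVectorStore.fromTexts(
    [transcript],            // documents to store (here, the whole transcript)
    [{ source: 'youtube' }], // metadata for each document
    embeddings               // e.g. the OpenAIEmbeddings instance from above
  );

  return vectorStore;
}

In practice, you would usually split a long transcript into smaller chunks (for example with LangChain's RecursiveCharacterTextSplitter) before storing it, so that similarity search returns focused passages instead of the entire transcript.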

Retrieving information from the video

In this section, we will use LangChain and the OpenAI model to query and retrieve information stored in the vector store containing the transcript embeddings.

Add the following code to the index.js file:

async function retrieveInformation(query) {
  try {
    // Creating the OpenAI language model
    const model = new OpenAI({
      openAIApiKey: process.env.OPENAI_API_KEY,
      modelName: 'text-davinci-003',
    });

    // ...
  } catch (error) {
    console.error('Error retrieving information:', error.message);
    throw error;
  }
}

The retrieveInformation function is responsible for retrieving information from the vector store based on a query using the OpenAI language model. It takes a query parameter, which is the query string. Inside the try block, we instantiate the OpenAI language model from @langchain/openai, this time specifying text-davinci-003 as the model name.
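Although the full implementation comes in the next section, the heart of the retrieval step will be a similarity search against the vector store built from the transcript. As a rough sketch, assuming a vectorStore like the one outlined earlier, it might look something like this:

// A rough sketch of the similarity search step; vectorStore is assumed to be
// the in-memory vector store built from the transcript earlier
async function searchTranscript(vectorStore, query) {
  // Return the three transcript passages most similar to the query
  const results = await vectorStore.similaritySearch(query, 3);

  // Each result is a Document whose pageContent holds the matched text
  return results.map((doc) => doc.pageContent);
}

The matched passages can then be passed to the language model loaded above to produce a final answer to the query.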

We will continue implementing the rest of the retrieveInformation function in the next section.
