
Meet Jockey: A Conversational Video Agent Powered by LangGraph and the Twelve Labs API

Recent developments in artificial intelligence are changing the way humans interact with video content. The open-source conversational video agent "Jockey" is a great example of this innovation. Jockey delivers enhanced video processing and interaction by combining the powerful capabilities of the Twelve Labs APIs with LangGraph.

Twelve Labs offers modern video understanding APIs that extract rich information directly from video footage. Unlike traditional methods that rely on pre-generated captions, the APIs work with the video data itself, analyzing visuals, audio, on-screen text, and temporal relationships. This holistic approach allows videos to be understood more accurately and in context.

Video search, classification, question answering, and summarization are some of the key capabilities of the Twelve Labs APIs. With these APIs, developers can build applications for a variety of use cases, including AI-generated video content, interactive video FAQs, automated video editing, and content discovery. The scalability and enterprise-grade security of these APIs make them well suited to managing large video archives, creating new opportunities for video-centric applications.

With the release of LangGraph v0.1, LangChain introduced a scalable framework for building agentic and multi-agent applications. LangGraph's customizable API for cognitive architectures gives developers more precise control over code flow, prompts, and large language model (LLM) calls than its predecessor, the LangChain AgentExecutor. Additionally, LangGraph allows for human approval before tasks are executed and provides "time travel" capabilities to modify and resume agent runs, which facilitates human-agent collaboration.
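
The human-approval step described above can be sketched in plain Python (this is an illustrative stand-in, not the actual LangGraph API; the function names are hypothetical):

```python
# Minimal sketch of a human-approval gate of the kind LangGraph
# enables: the agent pauses before executing a task and only
# proceeds if a human (here, a callback) approves it.
def run_with_approval(task, execute, approve):
    # Ask for approval of the pending task before running it.
    if approve(task):
        return execute(task)
    # Rejected tasks are simply not executed.
    return None

result = run_with_approval(
    "trim video to highlights",
    execute=lambda t: f"executed: {t}",
    approve=lambda t: True,  # stand-in for a real human prompt
)
print(result)
```

In real LangGraph applications this pause point also supports "time travel": the saved state can be edited and the run resumed from that checkpoint.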

To complement this architecture, LangChain introduced LangGraph Cloud, currently in closed beta. LangGraph Cloud provides a scalable infrastructure for deploying LangGraph agents, managing servers and task queues to efficiently handle many concurrent users and large states. It interfaces with LangGraph Studio, enabling visualization and troubleshooting of agent trajectories based on real-world interaction patterns. Together, they let agent applications be developed and deployed faster.

Jockey's most recent release, v1.1, marks a substantial change from the original LangChain-based implementation. Built on LangGraph, Jockey benefits from improved scalability and features in both front-end and back-end operations. This change has streamlined Jockey's architecture, allowing more precise and efficient control of complex video workflows.

At its core, Jockey combines the reasoning abilities of LLMs with LangGraph's customizable framework to orchestrate the Twelve Labs video APIs. A network of LangGraph nodes, including Supervisor, Planner, Video-Editing, Video-Search, and Video-Text-Generation nodes, drives Jockey's decision-making. This configuration ensures smooth execution of video-related operations and fast handling of user requests.
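
The node network above can be sketched in plain Python (the function and key names here are illustrative, not Jockey's actual code): each node reads and updates a shared state, and the supervisor routes work through the planner and worker nodes.

```python
# Illustrative sketch of Jockey's node graph (hypothetical names):
# each node is a function over a shared state dict, and the
# supervisor dispatches planned steps to the matching workers.

def planner(state):
    # Break the user request into ordered (worker, task) steps.
    state["plan"] = [
        ("video-search", state["request"]),
        ("video-text-generation", "summarize results"),
    ]
    return state

def video_search(state, task):
    state.setdefault("results", []).append(f"search: {task}")
    return state

def video_text_generation(state, task):
    state.setdefault("results", []).append(f"text-gen: {task}")
    return state

WORKERS = {
    "video-search": video_search,
    "video-text-generation": video_text_generation,
}

def supervisor(state):
    # Route the request through the planner, then dispatch each
    # planned step to the worker node it names.
    state = planner(state)
    for worker_name, task in state["plan"]:
        state = WORKERS[worker_name](state, task)
    return state

state = supervisor({"request": "find clips of goals"})
print(state["results"])
```

In the real system each node wraps LLM calls and Twelve Labs API requests, but the control flow follows this same node-and-edge shape.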

The precise control that LangGraph provides at every stage of the workflow is one of its most remarkable features. By carefully controlling the flow of information between nodes, Jockey can optimize token consumption and improve the accuracy of node responses. This fine-grained control makes video processing more efficient and performant.

Jockey's architecture uses a multi-agent system to manage complex video-related tasks. It has three main parts: the supervisor, the planner, and the workers. As the central coordinator, the supervisor oversees the process and assigns tasks to the other nodes. It handles error recovery, ensures that the plan is followed, and triggers replanning when necessary.

The planner is responsible for breaking complex user requests down into digestible steps that workers can execute. This component is essential for workflows that involve multiple video processing stages. Workers carry out tasks according to the planner's strategy and include agents specialized in video search, video text generation, and video editing.
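
The supervisor's error-recovery role can be illustrated with a small, self-contained sketch (hypothetical names, not Jockey's actual implementation): failed worker steps are retried, and only exhausted retries are recorded as errors.

```python
# Hypothetical sketch of the supervisor's error-recovery behavior:
# retry a failing worker step, recording a failure only when the
# retry budget is exhausted.
def execute_with_recovery(plan, workers, state, max_retries=1):
    for worker_name, task in plan:
        for attempt in range(max_retries + 1):
            try:
                state = workers[worker_name](state, task)
                break  # step succeeded; move to the next one
            except RuntimeError:
                if attempt == max_retries:
                    state.setdefault("errors", []).append(worker_name)
    return state

def flaky_search(state, task):
    # Fails on the first call, succeeds on the retry.
    if not state.get("warmed_up"):
        state["warmed_up"] = True
        raise RuntimeError("transient failure")
    state["found"] = task
    return state

state = execute_with_recovery(
    [("video-search", "goal highlights")],
    {"video-search": flaky_search},
    {},
)
print(state)
```

A real supervisor would go further, triggering the planner to produce a revised plan instead of merely retrying, but the loop structure is the same.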

Jockey's modular architecture makes it easy to extend and customize. To accommodate more complex scenarios, developers can extend the state, modify prompts, or add workers for specific use cases. This adaptability makes Jockey a flexible platform on which to build sophisticated video AI applications.

In conclusion, Jockey is a great combination of Twelve Labs’ advanced video interpretation APIs and LangGraph’s adaptive agent framework. This combination creates new opportunities for intelligent video engagement and processing.


Tanya Malhotra is a final-year student at the University of Petroleum and Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a data science enthusiast with good analytical and critical thinking skills, as well as a keen interest in learning new skills, leading groups, and managing work in an organized manner.
