📺 Real-Time YouTube Trending Video Analyzer
The aim of this project was to gain a deep understanding of real-time data pipelines and how large tech companies use Kafka and Spark for streaming, monitoring, and analytics — by simulating such a system locally. Kafka and Spark are highly scalable systems typically used in production for tasks like clickstream analysis, fraud detection, and dynamic dashboarding. This project replicates a simplified version of that on a single local machine, proving that real-time processing concepts can be learned and implemented even without cloud infrastructure. Before I begin, here is a brief explanation of what Kafka and Spark are.
For a deeper dive into what Kafka and Spark are, how they work individually and as a powerful team, check out the full explanation here!
What is Kafka?
Kafka is like a supercharged message bus that lets different parts of your system talk to each other — without shouting across the room. It works by organizing messages into topics, which are like labeled folders where you drop your data (say, trending videos). Producers send data into these topics, and consumers pick it up at their own pace, kind of like subscribing to your favorite podcast.
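To make that concrete, here is a minimal consumer-side sketch in Python. It is only an illustration and assumes the kafka-python package and a broker on localhost:9092; the topic name matches the one used later in this project.

```python
# Consumer-side sketch (assumes the kafka-python package and a broker on localhost:9092):
# subscribe to a topic and read messages at your own pace, like a podcast feed.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "youtube-trending",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",    # start from the oldest message still on the broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    print(message.value)             # each message is one trending-video record
```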
What is Spark?
Spark is the brainy, fast-processing engine that takes all that data (maybe from Kafka!) and crunches it like a pro. It shines because it keeps data in memory while working, unlike older systems like Hadoop MapReduce that keep going back to the hard disk between steps. This makes it insanely fast — like, 100 times faster in some cases! At its core is the RDD (Resilient Distributed Dataset), a smart way to spread data across machines so it can survive crashes and be worked on in parallel.
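Here is a tiny RDD sketch (illustrative only, not part of this project's pipeline) showing data split across partitions, cached in memory, and reused across operations:

```python
# Tiny RDD sketch: the data is split across partitions, cached in memory,
# and reused in parallel for a second computation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()

likes = spark.sparkContext.parallelize([120, 450, 980, 330], numSlices=2)  # 2 partitions
likes.cache()                              # keep the partitions in memory after first use

print(likes.sum())                         # first action computes and caches the data
print(likes.map(lambda x: x * 2).sum())    # second pass reuses the cached partitions

spark.stop()
```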
Summary structure
This project summary is structured to provide a clear and comprehensive overview of the end-to-end data streaming pipeline built using Kafka, Spark, and Streamlit on a single local machine. It begins with a high-level architecture description of the system, followed by a short video demonstration of the pipeline in action. After the demo, I break down each component of the pipeline—Kafka, Spark, storage, and the dashboard—explaining their individual roles, how they interact, and what their equivalents look like in an industry-scale cloud environment. Finally, I reflect on key technical and conceptual learnings gained throughout the process, especially around real-time data flow, processing bottlenecks, and debugging distributed-like systems locally.
🔁 Architecture Overview
[Kafka Producer] → [Kafka Topic: youtube-trending] → [Spark Streaming Consumer] → [CSV Output] → [Streamlit Dashboard]
Each component mirrors a real-world role:
- Kafka Producer: Publishes video trend data to a Kafka topic.
- Kafka Topic: Acts as a distributed queue.
- Spark Consumer: Listens to the topic, aggregates likes over time.
- CSV Sink: Stores real-time output for visualization.
- Streamlit Dashboard: Displays most liked videos by country.
🖥️ Demo Walkthrough
- The video begins with starting the Kafka server in a terminal window.
- In the second terminal, a Kafka producer is started. It reads rows from multiple country-specific YouTube CSV files and sends data to a Kafka topic named youtube-trending.
- In the third terminal, a Spark Structured Streaming script is launched. This consumer reads data from the Kafka topic, extracts fields, and aggregates total likes per video title and country. Output is written as CSV files in a monitored folder.
- In the fourth terminal, the Streamlit dashboard is started. It reads the aggregated CSV output written by Spark and displays the visualizations.
- Finally, another terminal window shows new CSV files being created by Spark every few seconds, evidence that the pipeline is actively processing and aggregating data.
💻 Curious to see it in action? Check out the full project on GitHub and try running it on your own machine! If you run into any issues, feel free to reach out — I would love to help. 🚀
Components & Workflow
1. Kafka Setup (Messaging Layer)
Kafka is a distributed event streaming platform used to decouple data producers and consumers. Here is how I set it up:
# Format storage
bin/kafka-storage.sh format -t $(bin/kafka-storage.sh random-uuid) -c config/kraft/server.properties
# Start Kafka Server (KRaft mode – no Zookeeper)
bin/kafka-server-start.sh config/kraft/server.properties
# Create topic named "youtube-trending"
bin/kafka-topics.sh --create --topic youtube-trending --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
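As an optional sanity check (assuming the kafka-python package is installed), you can list the broker's topics from Python and confirm that youtube-trending shows up:

```python
# Optional sanity check: list the broker's topics and confirm "youtube-trending" exists.
from kafka.admin import KafkaAdminClient

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
print(admin.list_topics())   # expect "youtube-trending" in the output
admin.close()
```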
💡 Industry Analogy: In large companies like LinkedIn or Netflix, Kafka runs on a multi-node cluster, and producers send logs or event data (e.g., user clicks, video plays) in real time to different Kafka topics hosted on cloud infrastructure (AWS MSK, Confluent Cloud, etc.).
2. Kafka Producer (Simulating Data Streams)
The producer script reads multiple country-specific CSV files (USvideos.csv, INvideos.csv, etc.) and randomly samples rows to simulate real-time video trend data being published to the Kafka topic.
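The full script lives in the repo; a rough sketch of the core loop might look like this (assuming kafka-python and pandas, with column names following the public YouTube trending dataset; the one-second sleep is just an illustrative throttle):

```python
# Producer sketch: sample rows from country CSVs and publish them as JSON events.
import json
import random
import time

import pandas as pd
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Extend with more country files as needed; some files may need an explicit encoding=
files = {"US": "USvideos.csv", "IN": "INvideos.csv"}
frames = {country: pd.read_csv(path) for country, path in files.items()}

while True:
    country = random.choice(list(frames))
    row = frames[country].sample(1).iloc[0]          # a random row simulates a live event
    producer.send("youtube-trending", {
        "country": country,
        "title": row["title"],
        "likes": int(row["likes"]),
        "publish_time": row["publish_time"],
    })
    time.sleep(1)                                    # throttle to mimic a real stream
```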
💡 Industry Analogy: This simulates a use case like YouTube’s ingestion pipeline, where logs about video views or likes are published to Kafka every few seconds/minutes.
3. Spark Consumer (Real-Time Aggregation)
Apache Spark consumes messages from Kafka and performs streaming aggregations using Structured Streaming. The script extracts fields from Kafka JSON messages, converts publish_time to timestamp, applies watermarks to handle late data, and aggregates likes per video per country. The results are written in append mode to CSVs.
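A rough sketch of that consumer is below. The actual field names, watermark duration, and paths in the repo may differ, and running it requires Spark's spark-sql-kafka-0-10 connector package on the classpath.

```python
# Spark consumer sketch: parse JSON events from Kafka, watermark on publish_time,
# aggregate likes per video/country in 10-minute windows, and append results as CSV.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_timestamp, window, sum as sum_
from pyspark.sql.types import StructType, StringType, LongType

spark = SparkSession.builder.appName("youtube-trending-consumer").getOrCreate()

schema = (StructType()
          .add("country", StringType())
          .add("title", StringType())
          .add("likes", LongType())
          .add("publish_time", StringType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "youtube-trending")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("v"))
          .select("v.*")
          .withColumn("publish_time", to_timestamp("publish_time")))

likes_per_video = (events
                   .withWatermark("publish_time", "10 minutes")      # tolerate late data
                   .groupBy(window("publish_time", "10 minutes"), "country", "title")
                   .agg(sum_("likes").alias("total_likes"))
                   .select(col("window.start").alias("window_start"),
                           "country", "title", "total_likes"))

query = (likes_per_video.writeStream
         .format("csv")
         .option("path", "output/")                  # folder the dashboard watches
         .option("checkpointLocation", "checkpoint/")
         .outputMode("append")
         .start())

query.awaitTermination()
```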
💡 Industry Analogy: This is similar to how Uber aggregates trip events in real-time, using Spark and Kafka to generate dashboards or fraud alerts. In production, this would write to a distributed sink (e.g., Delta Lake, Snowflake, BigQuery) or trigger alerts via webhook/Slack.
4. Streamlit Dashboard (Visualization)
A separate Streamlit app reads the output CSV files generated by Spark and visualizes likes by country and trending videos.
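A minimal version of such an app could look like this (a sketch that assumes the column layout produced by the consumer sketch above; the real dashboard's charts and paths may differ):

```python
# Streamlit dashboard sketch: load the CSV files Spark keeps appending
# and chart likes by country plus the top trending videos.
import glob
import os

import pandas as pd
import streamlit as st

st.title("YouTube Trending: Most Liked Videos by Country")

# Spark's CSV sink writes header-less part files into this folder
files = [f for f in glob.glob("output/*.csv") if os.path.getsize(f) > 0]

if not files:
    st.info("Waiting for Spark to produce output...")
else:
    df = pd.concat(
        (pd.read_csv(f, header=None,
                     names=["window_start", "country", "title", "total_likes"])
         for f in files),
        ignore_index=True,
    )

    st.subheader("Total likes per country")
    st.bar_chart(df.groupby("country")["total_likes"].sum())

    st.subheader("Top trending videos by likes")
    top = df.groupby(["country", "title"], as_index=False)["total_likes"].sum()
    st.dataframe(top.sort_values("total_likes", ascending=False).head(10))
```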
💡 Industry Analogy: This mimics how monitoring dashboards (e.g., Grafana, Tableau) update in near-real-time to reflect backend streaming data — like YouTube Studio's real-time video analytics.
🏢 Local vs Industry Setup
| Component | Local Version | Industry Equivalent |
|---|---|---|
| Kafka | Single-node Kafka in KRaft mode | Kafka Cluster on AWS MSK or Confluent Cloud |
| Spark | PySpark on local machine | Spark on Databricks, AWS EMR, or Kubernetes |
| Storage | CSV files | Delta Lake, Redshift, S3, or Snowflake |
| Dashboard | Streamlit | Tableau, Power BI, Grafana with live connectors |
| Monitoring | Manual (CSV watcher) | CloudWatch, Prometheus, ELK Stack |
🧠 Learnings
- Understood how Kafka decouples producers and consumers via topics.
- Learned how Spark Structured Streaming processes real-time data as incremental, batch-style micro-batches.
- Dealt with edge cases like watermarking, checkpointing, and file sink behavior.
- Gained insight into how streaming data pipelines are built at scale.
- Realized the importance of understanding local development before jumping to cloud-based streaming platforms.