Analysis of Tedx Talks

Author

Soumyadipta Das

Data Source: Tedx Talks - Kaggle

Library

This code sets up an R environment for building an interactive Shiny application that processes and visualizes data. It loads necessary libraries for data manipulation (dplyr), creating interactive tables (DT), handling HTTP requests (httr2), working with JSON data (jsonlite), and generating interactive plots (plotly). Additionally, it sources custom helper scripts (mongodb_helper.R and helper.R) to manage MongoDB connections and provide utility functions. This setup allows the app to pull data from various sources, process it, and present it dynamically within an interactive web interface.

library(dplyr)
library(DT)
library(httr2)
library(jsonlite)
library(plotly)

source('mongodb_helper.R')
# source('api.R')
source('helper.R')

Read Transcript Data

This code reads TEDx talk transcript data into a data frame (df) from a CSV file named “sample_transcript.csv”. The commented line indicates an alternative method for reading data from a MongoDB database using a mongo_read() function, likely for a collection named ‘ted_talks_en’. Once the data is loaded, the glimpse() function from dplyr provides a quick overview of the data structure, including the column names, data types, and sample values, helping to understand the dataset’s contents before further analysis or processing.

df <- read.csv("data/sample_transcript.csv", header = TRUE)
# df <- mongo_read(table = 'ted_talks_en', db = 'sample_transcript', url = mongo_url)
df %>% glimpse()
Rows: 100
Columns: 19
$ talk_id        <int> 1, 92, 7, 53, 66, 49, 86, 94, 71, 55, 58, 54, 41, 65, 4…
$ title          <chr> "Averting the climate crisis", "The best stats you've e…
$ speaker_1      <chr> "Al Gore", "Hans Rosling", "David Pogue", "Majora Carte…
$ all_speakers   <chr> "{0: 'Al Gore'}", "{0: 'Hans Rosling'}", "{0: 'David Po…
$ occupations    <chr> "{0: ['climate advocate']}", "{0: ['global health exper…
$ about_speakers <chr> "{0: 'Nobel Laureate Al Gore focused the world’s attent…
$ views          <int> 3523392, 14501685, 1920832, 2664069, 65051954, 1208138,…
$ recorded_date  <chr> "2006-02-25", "2006-02-22", "2006-02-24", "2006-02-26",…
$ published_date <chr> "2006-06-27", "2006-06-27", "2006-06-27", "2006-06-27",…
$ event          <chr> "TED2006", "TED2006", "TED2006", "TED2006", "TED2006", …
$ native_lang    <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en", "…
$ available_lang <chr> "['ar', 'bg', 'cs', 'de', 'el', 'en', 'es', 'fa', 'fr',…
$ comments       <int> 272, 628, 124, 219, 4931, 48, 980, 919, 930, 59, 84, 81…
$ duration       <int> 977, 1190, 1286, 1116, 1164, 1198, 992, 1485, 1262, 153…
$ topics         <chr> "['alternative energy', 'cars', 'climate change', 'cult…
$ related_talks  <chr> "{243: 'New thinking on the climate crisis', 547: 'The …
$ url            <chr> "https://www.ted.com/talks/al_gore_averting_the_climate…
$ description    <chr> "With the same humor and humanity he exuded in \"An Inc…
$ transcript     <chr> "Thank you so much, Chris. And it's truly a great honor…

Predict transcript type using AI

Now we will send the transcript data to AI model (meta/llama-3.1-8b-instruct) to predict transcript topic, is the talks supports the topic and strength of support as follows,

# nim_output <- data.frame()
# for(i in 1:nrow(df)){
#   text <- paste0(
#   gsub("[\'\"]", " ", df$transcript[i]),
#   "State which option the commenter is most likely to favor (A, B, C, D, E, F, G).
#   State if the comment is 'For', 'Against', or 'Neutral' on option.
#   tate if the strength of the commenter's opinon on a scale from 'Extremely strong', 'Very strong',
#   'Strong', 'Somewhat strong', or 'Mild'.
#   No need any explanation or extra word.
#   Produce the output in json format (Strictly follow the format) like this:\n{\n\'favored_option\': \'\',\n\'option_opinion\':
#   '\',\n\'opinion_strength\': \'\'\n}"
#   )
# 
#   response_nim <- chat_nvidia(
#             text,
#             history = NULL,
#             temp = 0.5,
#             api_key = nv_api_key,
#             model_llm = "meta/llama-3.1-8b-instruct"
#           ) %>%
#         as.data.frame()
# 
#   temp <- (df[i,] %>% select(talk_id, title)) %>%
#     as.data.frame() %>%
#     bind_cols(response_nim)
# 
#   nim_output <- nim_output %>% bind_rows(temp)
# }
# write.csv(nim_output, "data/nim_output.csv", row.names = FALSE)

Transcript prediction glimplse

Let’s read the data of response what we received from LLM,

text_output <- read.csv("data/nim_output.csv", header = TRUE)
head(text_output)
  talk_id                                  title favored_option option_opinion
1       1            Averting the climate crisis              A            For
2      92        The best stats you've ever seen              C            For
3       7                       Simplicity sells              B        Against
4      53                    Greening the ghetto              A            For
5      66            Do schools kill creativity?              C            For
6      49 Behind the design of Seattle's library              B            For
  opinion_strength
1 Extremely strong
2      Very strong
3      Very strong
4 Extremely strong
5 Extremely strong
6 Extremely strong

We will reformat the data as follows,

favored_df <- data.frame(favored_option = c("A", "B", "C", "D", "E", "F", "G"),
                         favored_option_new = c("Nature or environment or climate", "Science or Technology", 
                                                "Education or knowledge", "Social Media", "Economy or Finance", 
                                                "Travel or new places or new experiences", "Others"))
opinion_df <- data.frame(option_opinion = c('For', 'Neutral', 'Against'), 
                         option_opinion_new = c(1,0,-1))
strength_df <- data.frame(opinion_strength = c('Extremely strong', 'Very strong', 'Strong', 'Somewhat strong', 'Mild'), 
                         opinion_strength_new = c('Extremely strong', 'Very strong', 'Strong', 'Somewhat strong', 'Mild'))

text_output <- text_output |> 
  left_join(favored_df) |>
  left_join(opinion_df) |>
  left_join(strength_df) |>
  mutate(count = 1) 
Joining with `by = join_by(favored_option)`
Joining with `by = join_by(option_opinion)`
Joining with `by = join_by(opinion_strength)`
head(text_output)
  talk_id                                  title favored_option option_opinion
1       1            Averting the climate crisis              A            For
2      92        The best stats you've ever seen              C            For
3       7                       Simplicity sells              B        Against
4      53                    Greening the ghetto              A            For
5      66            Do schools kill creativity?              C            For
6      49 Behind the design of Seattle's library              B            For
  opinion_strength               favored_option_new option_opinion_new
1 Extremely strong Nature or environment or climate                  1
2      Very strong           Education or knowledge                  1
3      Very strong            Science or Technology                 -1
4 Extremely strong Nature or environment or climate                  1
5 Extremely strong           Education or knowledge                  1
6 Extremely strong            Science or Technology                  1
  opinion_strength_new count
1     Extremely strong     1
2          Very strong     1
3          Very strong     1
4     Extremely strong     1
5     Extremely strong     1
6     Extremely strong     1

Visualization

Now we will visualize our findings as follows,

text_output_plot <- text_output %>%
  select(-favored_option, -option_opinion, -opinion_strength) %>%
  group_by(favored_option_new, option_opinion_new, opinion_strength_new) %>%
  summarise(count = sum(count, na.rm = TRUE)) %>%
  ungroup() %>% 
  mutate(opinion_strength_new = as.factor(opinion_strength_new))

fig <- plot_ly(text_output_plot) %>% 
  add_trace(
    x = ~favored_option_new,
    y = ~count, 
    color = ~opinion_strength_new, 
    name = ~opinion_strength_new, 
    type = "bar"
    ) %>% 
  layout(title = "Transcript Analysis", 
         xaxis = list(title = "Favoured Option"),
         yaxis = list(title = "Count"),
         legend = list(title = list(text = "Opinion Strength")),
         barmode = 'stack', 
         margin = list(t = 50)
         )

fig

This bar plot titled “Transcript Analysis” illustrates the distribution of TEDx talk topics, predicting the strength of alignment with various themes. The x-axis represents different favored options or topics, while the y-axis shows the count of TEDx talks. The color-coded legend indicates the strength of opinion, ranging from “Mild” to “Extremely strong.”

Key Insights:

  1. Education or Knowledge is the most dominant theme, with a high number of talks showing “Extremely strong” and “Very strong” alignment.
  2. Science or Technology also features prominently, with significant contributions in the “Extremely strong” category.
  3. Economy or Finance and Nature or Environment or Climate have comparatively fewer talks but show strong opinions.
  4. Social Media and Travel or New Places or New Experiences are the least discussed topics, with only minimal representation across all opinion strengths.
  5. Most topics have a substantial share of “Extremely strong” opinions, indicating that TEDx talks are generally focused on impactful and strongly aligned themes.

This analysis highlights that TEDx talks heavily favor themes related to education and technology, reflecting the platform’s emphasis on knowledge dissemination and innovation.