library(dplyr)
library(DT)
library(httr2)
library(jsonlite)
library(plotly)
source('mongodb_helper.R')
# source('api.R')
source('helper.R')
Analysis of TEDx Talks
Data Source: TEDx Talks - Kaggle
Library
This code sets up an R environment for building an interactive Shiny application that processes and visualizes data. It loads necessary libraries for data manipulation (dplyr), creating interactive tables (DT), handling HTTP requests (httr2), working with JSON data (jsonlite), and generating interactive plots (plotly). Additionally, it sources custom helper scripts (mongodb_helper.R and helper.R) to manage MongoDB connections and provide utility functions. This setup allows the app to pull data from various sources, process it, and present it dynamically within an interactive web interface.
Read Transcript Data
This code reads TEDx talk transcript data into a data frame (df) from a CSV file named “sample_transcript.csv”. The commented line indicates an alternative method for reading data from a MongoDB database using a mongo_read() function, likely for a collection named ‘ted_talks_en’. Once the data is loaded, the glimpse() function from dplyr provides a quick overview of the data structure, including the column names, data types, and sample values, helping to understand the dataset’s contents before further analysis or processing.
df <- read.csv("data/sample_transcript.csv", header = TRUE)
# df <- mongo_read(table = 'ted_talks_en', db = 'sample_transcript', url = mongo_url)
df %>% glimpse()
Rows: 100
Columns: 19
$ talk_id <int> 1, 92, 7, 53, 66, 49, 86, 94, 71, 55, 58, 54, 41, 65, 4…
$ title <chr> "Averting the climate crisis", "The best stats you've e…
$ speaker_1 <chr> "Al Gore", "Hans Rosling", "David Pogue", "Majora Carte…
$ all_speakers <chr> "{0: 'Al Gore'}", "{0: 'Hans Rosling'}", "{0: 'David Po…
$ occupations <chr> "{0: ['climate advocate']}", "{0: ['global health exper…
$ about_speakers <chr> "{0: 'Nobel Laureate Al Gore focused the world’s attent…
$ views <int> 3523392, 14501685, 1920832, 2664069, 65051954, 1208138,…
$ recorded_date <chr> "2006-02-25", "2006-02-22", "2006-02-24", "2006-02-26",…
$ published_date <chr> "2006-06-27", "2006-06-27", "2006-06-27", "2006-06-27",…
$ event <chr> "TED2006", "TED2006", "TED2006", "TED2006", "TED2006", …
$ native_lang <chr> "en", "en", "en", "en", "en", "en", "en", "en", "en", "…
$ available_lang <chr> "['ar', 'bg', 'cs', 'de', 'el', 'en', 'es', 'fa', 'fr',…
$ comments <int> 272, 628, 124, 219, 4931, 48, 980, 919, 930, 59, 84, 81…
$ duration <int> 977, 1190, 1286, 1116, 1164, 1198, 992, 1485, 1262, 153…
$ topics <chr> "['alternative energy', 'cars', 'climate change', 'cult…
$ related_talks <chr> "{243: 'New thinking on the climate crisis', 547: 'The …
$ url <chr> "https://www.ted.com/talks/al_gore_averting_the_climate…
$ description <chr> "With the same humor and humanity he exuded in \"An Inc…
$ transcript <chr> "Thank you so much, Chris. And it's truly a great honor…
Predict transcript type using AI
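Next, we send each transcript to an AI model (meta/llama-3.1-8b-instruct) to predict three things: the topic of the talk, whether the talk is for, against, or neutral on that topic, and how strongly it holds that position. The code below is commented out because its responses were saved to data/nim_output.csv: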
# nim_output <- data.frame()
# for(i in 1:nrow(df)){
# text <- paste0(
# gsub("[\'\"]", " ", df$transcript[i]),
# "State which option the commenter is most likely to favor (A, B, C, D, E, F, G).
# State if the comment is 'For', 'Against', or 'Neutral' on option.
# State if the strength of the commenter's opinion on a scale from 'Extremely strong', 'Very strong',
# 'Strong', 'Somewhat strong', or 'Mild'.
# No need any explanation or extra word.
# Produce the output in json format (Strictly follow the format) like this:\n{\n\'favored_option\': \'\',\n\'option_opinion\':
# '\',\n\'opinion_strength\': \'\'\n}"
# )
#
# response_nim <- chat_nvidia(
# text,
# history = NULL,
# temp = 0.5,
# api_key = nv_api_key,
# model_llm = "meta/llama-3.1-8b-instruct"
# ) %>%
# as.data.frame()
#
# temp <- (df[i,] %>% select(talk_id, title)) %>%
# as.data.frame() %>%
# bind_cols(response_nim)
#
# nim_output <- nim_output %>% bind_rows(temp)
# }
# write.csv(nim_output, "data/nim_output.csv", row.names = FALSE)
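The prompt above asks the model for a JSON-like reply with single-quoted keys, which is not strictly valid JSON, and small models sometimes add extra words around it. The following is a minimal sketch of defensive parsing, assuming the reply arrives as raw text. The helper name parse_llm_reply() is hypothetical and not part of the original helpers; chat_nvidia() may already take care of this step.

```r
library(jsonlite)

# Hypothetical helper (not in the original code): turn one raw model reply
# into a one-row data frame, or a row of NA values if the reply is unusable.
parse_llm_reply <- function(reply) {
  fields <- c("favored_option", "option_opinion", "opinion_strength")
  empty <- as.data.frame(as.list(setNames(rep(NA_character_, 3), fields)))

  # Keep only the first {...} span in case the model adds extra words.
  m <- regmatches(reply, regexpr("\\{[^{}]*\\}", reply))
  if (length(m) == 0) return(empty)

  # Swap single quotes for double quotes so jsonlite can parse the text.
  # Limitation: this would also alter apostrophes inside values.
  json_text <- gsub("'", "\"", m, fixed = TRUE)
  parsed <- tryCatch(fromJSON(json_text), error = function(e) NULL)
  if (is.null(parsed) || !all(fields %in% names(parsed))) return(empty)

  as.data.frame(parsed[fields], stringsAsFactors = FALSE)
}

# A well-formed reply and a reply with no JSON at all.
parse_llm_reply("{\n'favored_option': 'A',\n'option_opinion': 'For',\n'opinion_strength': 'Very strong'\n}")
parse_llm_reply("Sorry, I cannot answer that.")
```

Returning a row of NA values instead of stopping keeps a long loop over many transcripts running, and the failed rows can be found afterwards with is.na().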
Transcript prediction glimpse
Let’s read the responses we received from the LLM:
text_output <- read.csv("data/nim_output.csv", header = TRUE)
head(text_output)
talk_id title favored_option option_opinion
1 1 Averting the climate crisis A For
2 92 The best stats you've ever seen C For
3 7 Simplicity sells B Against
4 53 Greening the ghetto A For
5 66 Do schools kill creativity? C For
6 49 Behind the design of Seattle's library B For
opinion_strength
1 Extremely strong
2 Very strong
3 Very strong
4 Extremely strong
5 Extremely strong
6 Extremely strong
Next, we map the option letters to topic names and the opinions to numeric scores, then join these lookup tables onto the responses:
favored_df <- data.frame(favored_option = c("A", "B", "C", "D", "E", "F", "G"),
                         favored_option_new = c("Nature or environment or climate", "Science or Technology",
                                                "Education or knowledge", "Social Media", "Economy or Finance",
                                                "Travel or new places or new experiences", "Others"))
opinion_df <- data.frame(option_opinion = c('For', 'Neutral', 'Against'),
                         option_opinion_new = c(1,0,-1))
strength_df <- data.frame(opinion_strength = c('Extremely strong', 'Very strong', 'Strong', 'Somewhat strong', 'Mild'),
                          opinion_strength_new = c('Extremely strong', 'Very strong', 'Strong', 'Somewhat strong', 'Mild'))
text_output <- text_output |>
  left_join(favored_df) |>
  left_join(opinion_df) |>
  left_join(strength_df) |>
  mutate(count = 1)
Joining with `by = join_by(favored_option)`
Joining with `by = join_by(option_opinion)`
Joining with `by = join_by(opinion_strength)`
head(text_output)
talk_id title favored_option option_opinion
1 1 Averting the climate crisis A For
2 92 The best stats you've ever seen C For
3 7 Simplicity sells B Against
4 53 Greening the ghetto A For
5 66 Do schools kill creativity? C For
6 49 Behind the design of Seattle's library B For
opinion_strength favored_option_new option_opinion_new
1 Extremely strong Nature or environment or climate 1
2 Very strong Education or knowledge 1
3 Very strong Science or Technology -1
4 Extremely strong Nature or environment or climate 1
5 Extremely strong Education or knowledge 1
6 Extremely strong Science or Technology 1
opinion_strength_new count
1 Extremely strong 1
2 Very strong 1
3 Very strong 1
4 Extremely strong 1
5 Extremely strong 1
6 Extremely strong 1
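Because the labels come from a language model, a reply can fall outside the expected values (for example "Option B" instead of "B"). A left join keeps such rows but leaves the new column as NA. The sketch below shows one way to surface them, using made-up toy data in place of text_output and a shortened lookup table; it is an illustration, not a step from the original analysis.

```r
library(dplyr)

# Toy stand-in for text_output; the last row holds an off-list answer.
toy_output <- data.frame(
  talk_id = c(1, 92, 7),
  favored_option = c("A", "C", "Option B")
)

# Shortened version of the lookup table used above.
favored_lookup <- data.frame(
  favored_option = c("A", "B", "C"),
  favored_option_new = c("Nature or environment or climate",
                         "Science or Technology",
                         "Education or knowledge")
)

# Naming the key explicitly also silences the "Joining with" message.
joined <- left_join(toy_output, favored_lookup, by = "favored_option")

# Rows whose label had no match in the lookup table.
unmatched <- filter(joined, is.na(favored_option_new))
unmatched
```

If unmatched has any rows, those talks would be grouped under NA in later summaries, so it is worth inspecting them before plotting.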
Visualization
Now we will visualize our findings as follows,
text_output_plot <- text_output %>%
  select(-favored_option, -option_opinion, -opinion_strength) %>%
  group_by(favored_option_new, option_opinion_new, opinion_strength_new) %>%
  summarise(count = sum(count, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(opinion_strength_new = as.factor(opinion_strength_new))
fig <- plot_ly(text_output_plot) %>%
  add_trace(
    x = ~favored_option_new,
    y = ~count,
    color = ~opinion_strength_new,
    name = ~opinion_strength_new,
    type = "bar"
  ) %>%
  layout(title = "Transcript Analysis",
         xaxis = list(title = "Favoured Option"),
         yaxis = list(title = "Count"),
         legend = list(title = list(text = "Opinion Strength")),
         barmode = 'stack',
         margin = list(t = 50)
  )
fig
This stacked bar plot, titled “Transcript Analysis”, shows how the talks are distributed across the predicted topics. The x-axis lists the favored options (topics), the y-axis shows the number of talks, and each bar is split by color according to the predicted opinion strength, which ranges from “Mild” to “Extremely strong.”
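One detail about the legend: as.factor() orders its levels alphabetically, so “Extremely strong” sorts before “Mild”. If the legend should run from weakest to strongest, the levels can be set explicitly. This is a small sketch on a toy vector, and it assumes plotly follows the factor's level order when it builds the traces.

```r
# Desired order, from weakest to strongest.
strength_levels <- c("Mild", "Somewhat strong", "Strong",
                     "Very strong", "Extremely strong")

# Toy vector standing in for opinion_strength_new.
x <- c("Very strong", "Mild", "Extremely strong")

# Default behaviour: levels are sorted alphabetically.
alphabetical <- levels(as.factor(x))

# Explicit levels keep the intended order, including unused levels.
ordered_strength <- factor(x, levels = strength_levels)
levels(ordered_strength)
```

In the pipeline above, this would mean replacing as.factor(opinion_strength_new) with factor(opinion_strength_new, levels = strength_levels).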
Key Insights:
- Education or Knowledge is the most dominant theme, with a high number of talks showing “Extremely strong” and “Very strong” alignment.
- Science or Technology also features prominently, with significant contributions in the “Extremely strong” category.
- Economy or Finance and Nature or Environment or Climate have comparatively fewer talks but show strong opinions.
- Social Media and Travel or New Places or New Experiences are the least discussed topics, with only minimal representation across all opinion strengths.
- Most topics have a substantial share of “Extremely strong” opinions, which suggests that the model rated most talks in this sample as taking a strong position on their topic.
In this 100-talk sample, the predicted labels lean heavily toward education and technology, which is consistent with the platform’s emphasis on knowledge dissemination and innovation. Because the labels come from a single LLM pass, these counts are best read as indicative.
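The observations above are read off the chart, and they can be cross-checked with a plain tabulation. The sketch below uses a small made-up data frame with the same column names as text_output; the values are illustrative only and are not the real results.

```r
library(dplyr)

# Made-up rows with the same column names as text_output.
toy <- data.frame(
  favored_option_new = c("Education or knowledge", "Education or knowledge",
                         "Education or knowledge", "Science or Technology",
                         "Science or Technology",
                         "Nature or environment or climate"),
  opinion_strength_new = c("Extremely strong", "Extremely strong",
                           "Very strong", "Extremely strong",
                           "Very strong", "Extremely strong")
)

# Talks per topic, most frequent first.
topic_counts <- toy %>% count(favored_option_new, sort = TRUE)

# Share of each opinion strength across all talks.
strength_share <- toy %>%
  count(opinion_strength_new) %>%
  mutate(share = n / sum(n))

topic_counts
strength_share
```

Running the same two summaries on the real text_output gives exact numbers to quote next to each insight.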