During our weekly development reviews we get together over lunch and talk shop, and we usually watch a video while everyone eats. We’ve found that TED talks are almost always interesting and give us a lot to talk about, even though they rarely relate directly to the technologies that we use.
After a few years of watching TED talks as a group, we started to pick up on certain recurring themes and noticed that TED talks seem to have a language of their own. Being data nerds, we couldn’t leave it at a simple intuition, so we decided to build some software to analyze the talks and find out whether our intuition was misguided. Over the next several weeks we’ll be devoting some of our free time to this analysis in Project TED.
Before we could do any kind of analysis, we needed to make sure we had a good data set. Lucky for us, TED provides spell-checked transcripts for each talk. The downside is that we couldn’t find them in a downloadable format. We solved our problem with a simple web scraper based on Mechanize.
Our scraper visits the index pages for the talks and extracts a link to each talk. We modify this link to point to the English version of the interactive transcript and visit it. Once we’ve loaded the transcript, we reach into the DOM and extract the element that contains the transcript text. From there we strip the HTML elements, leaving only text behind. The text is then saved to a file in the transcripts directory. While the files aren’t the prettiest, they do contain the full text of the transcript for almost every TED talk (a few were never transcribed). We have a data set!
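As a rough illustration of the steps above (not the actual Mechanize-based scraper), here is a minimal Python sketch that uses only the standard library. The `/talks/` link prefix, the `/transcript?language=en` URL scheme, and every function and class name are assumptions made for the example. Fetching the pages over HTTP is left out, so the functions work on HTML strings that have already been downloaded.

```python
import re
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collects unique href values of <a> tags that start with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(self.prefix) and href not in self.links:
            self.links.append(href)


class TextExtractor(HTMLParser):
    """Keeps only text nodes and drops every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_talk_links(index_html, prefix="/talks/"):
    """Return talk links found on an index page (the prefix is an assumption)."""
    parser = LinkCollector(prefix)
    parser.feed(index_html)
    parser.close()
    return parser.links


def transcript_link(talk_link, language="en"):
    """Rewrite a talk link to point at its transcript (hypothetical URL scheme)."""
    return talk_link.rstrip("/") + "/transcript?language=" + language


def strip_html(fragment):
    """Remove all markup from an HTML fragment and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()


def save_transcript(slug, text, directory="transcripts"):
    """Write the plain text to <directory>/<slug>.txt and return the path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / (slug + ".txt")
    path.write_text(text, encoding="utf-8")
    return path
```

A driver loop would call `extract_talk_links` on each index page, fetch the page at `transcript_link(link)`, pass the transcript element's HTML to `strip_html`, and hand the result to `save_transcript`. Stripping tags this way also explains why the output files aren't pretty: paragraph breaks and timing markers collapse into one long run of text.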
Our next step will be to clean up the data and import it into a system for deeper analysis. Check back for Part II soon.
Edit: see Part II for more