Last week we kicked off Part I of Project TED with a quick introduction to the project and talked a little about our scraper. This week we’ll talk about ingesting the scraped data into a database for further analysis.
Our scraper leaves us with one text file per TED talk; each file contains the complete spell-checked transcript of a talk posted on TED.com. On their own the files aren’t great to work with: because the text was pulled out of HTML and forced into a plain text file, we see a lot of stray whitespace, and the words appear in sentence case. To do even a rudimentary analysis we need to represent each instance of a word in a normalized format; in our case we chose all lowercase. And since we’re working with data from over 2,000 talks, it became apparent early on that we’d need a database.
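The post doesn’t show the project’s actual normalization code, but a minimal sketch of the idea might look like this (the function names and the exact punctuation handling are our own assumptions, not the project’s implementation):

```python
def normalize(token):
    """Lowercase a raw token and strip surrounding whitespace
    and common punctuation left over from the HTML extraction."""
    return token.strip().lower().strip(".,;:!?\"'()")

def tokenize(text):
    """Split transcript text on whitespace and normalize each word,
    dropping anything that normalizes to an empty string."""
    return [w for w in (normalize(t) for t in text.split()) if w]
```

With this, `tokenize("The  world,\n  and People!")` yields `["the", "world", "and", "people"]` — every occurrence of a word maps to the same lowercase form regardless of where it appeared in a sentence.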
While we mainly use Postgres for our daily database needs, it didn’t seem like a great fit for this project: we’re more concerned with the relationships between words and the talks they appear in than with the words themselves. We decided to use neo4j, a graph database, for our analysis.
Our database is extremely simple. We have a Talk entity and a Word entity, related by an Appearance relationship that includes a position property indicating where in the text the word occurs. Our ingestor visits each transcript file and runs through it word by word. Each word is converted to its normalized form, then either inserted into the database or matched against the existing database node. We then create an Appearance relationship that ties the word to its position in the talk. The end result is a complete list of words, where each word has a relationship for every appearance in every talk.
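The ingestion step described above could be sketched roughly as follows. The Talk, Word, and Appearance names come from the post; the Cypher query, the helper function, and the property names (`text`, `title`) are illustrative assumptions, not the project’s actual code:

```python
# One MERGE per word gives us a single Word node per normalized word
# (insert-or-reference), while each occurrence gets its own Appearance
# relationship carrying the word's position within the talk.
APPEARANCE_QUERY = """
MERGE (w:Word {text: $word})
MERGE (t:Talk {title: $talk})
CREATE (w)-[:APPEARANCE {position: $position}]->(t)
"""

def appearance_rows(talk_title, transcript):
    """Build one parameter dict per word occurrence, ready to pass to
    something like session.run(APPEARANCE_QUERY, **row) with a neo4j driver."""
    rows = []
    for position, raw in enumerate(transcript.split()):
        word = raw.strip().lower().strip(".,;:!?\"'()")
        if word:
            rows.append({"talk": talk_title, "word": word, "position": position})
    return rows
```

Feeding a transcript through `appearance_rows` produces one row per word occurrence, so a word that appears ten times in a talk ends up with ten Appearance relationships to that Talk node, each with a distinct position.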
One of the downsides to our approach is that words such as “the”, “and”, and “I” end up in our database with as much weight as words such as “people”, “world”, and “technology”. We’ll dig into dealing with these common words in the next part. Keep following along here and on GitHub to see how we proceed.
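A toy example shows why this matters for any raw frequency ranking (the sample text is made up, not drawn from a real talk):

```python
from collections import Counter

# In raw counts, function words swamp content words: "the" dominates
# even though it tells us nothing about what the talk is about.
sample = "the world and the people and the technology the"
counts = Counter(sample.split())
# counts["the"] is 4, versus 1 each for "world", "people", "technology",
# so the top of any most-common list is pure stop words.
```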