What Is The List?

The List is perhaps one of the most over-engineered movie rankings in existence. It combines movie ratings from a mixture of online sources and real people's opinions. What started as a simple CSV file of films we had watched has turned into a series of databases and data lakes with the sole purpose of helping us answer the question "which movies are the best?" or, even more importantly, "what should we watch next?".

There's Some Pretty Bad Films On Here?

Some of the first films to make it onto The List are not what anyone would describe as good cinema. Things like Attack of the Killer Tomatoes or Transmorphers are not traditionally good films. However, on The List we treat every film as equal until it has been watched. Just as it's good to watch the all-time greats, it's also important to watch those B or even Z movies, because you never know whether you might enjoy them. These movies can have just as much entertainment value, and so on The List we watch everything!

The Architecture

The List's architecture is split 50-50 between cloud and on-prem. On premises we handle anything related to watching or rating films, for instance sourcing where to watch as well as media server integrations. Once a movie has been watched it is sent to the cloud data lake, which comprises a series of Parquet files containing things like ratings and movie metadata, as well as more complex tables such as movie relationships. The cloud is also where we store the shortlist of films which have not yet been watched. These films also have full metadata, which feeds into a series of selection algorithms that pick the next film to watch. These algorithms use the full extent of the information available in the data lake, and use things like graph searches to identify similar films.

As an example, we keep a full cast list for every movie. This means we have a network graph that connects movies by actors, which we can then use to identify things like communities, as well as to estimate how good a film is based on films that share a cast member.
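A minimal sketch of that idea, using only the Python standard library. The film titles, actors, ratings and the function names build_cast_graph and estimate_rating are all made up for illustration, and the weighting (number of shared cast members) is an assumption rather than how The List actually scores films.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: movie title -> set of cast members.
CAST = {
    "Film A": {"Actor 1", "Actor 2"},
    "Film B": {"Actor 2", "Actor 3"},
    "Film C": {"Actor 3", "Actor 4"},
    "Film D": {"Actor 5"},
}

# Hypothetical ratings for films that have already been watched.
RATINGS = {"Film A": 8.0, "Film C": 4.0}


def build_cast_graph(cast):
    """Return {movie: {neighbour: shared_cast_count}} for movies sharing cast."""
    graph = defaultdict(dict)
    for a, b in combinations(sorted(cast), 2):
        shared = len(cast[a] & cast[b])
        if shared:
            graph[a][b] = shared
            graph[b][a] = shared
    return graph


def estimate_rating(movie, graph, ratings):
    """Weighted mean rating of watched neighbours, or None if there are none."""
    weighted = [
        (ratings[neighbour], weight)
        for neighbour, weight in graph.get(movie, {}).items()
        if neighbour in ratings
    ]
    if not weighted:
        return None
    total_weight = sum(weight for _, weight in weighted)
    return sum(rating * weight for rating, weight in weighted) / total_weight


graph = build_cast_graph(CAST)
print(estimate_rating("Film B", graph, RATINGS))  # average of its two rated neighbours
print(estimate_rating("Film D", graph, RATINGS))  # no shared cast, so no estimate
```

Films with no rated neighbours get no estimate at all, which seems safer than guessing a default score.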

The website itself is almost entirely static, with most of the rendering and calculations actually being done client side. To achieve this we keep a web-ready data cache which can address 90% of all the website's needs. For more complex requirements we instead have a series of Lambdas behind an API gateway which have access to the original Parquet data lake. A simplified version of the architecture can be seen below:

How Are Rankings Calculated?

The List rankings are driven only by non-online sources and are calculated by looking at the principal component decomposition of all ratings. This has a few side effects:

Ratings Are Normalised: Given that people tend not to be great at producing normally distributed ratings, we adjust each rate source's ratings until they have an average of zero and a variance of 1.
Consistent Rate Sources Are Preferred: If a rate source has not rated a movie, it is assumed that they rated that film as perfectly average. The effect on the principal component analysis is that a rate source that has not rated many films carries less weight in the principal component than a rate source that has rated a lot of films.
Ratings Are Not 100% Stable: This is becoming less and less of a problem, but it is entirely possible for movies to switch places over time as certain rate sources gain more ratings. This is somewhat intended, as it represents their ratings becoming more accurate.
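
The steps above can be sketched roughly as follows, in plain Python. Everything here is illustrative: the rate source names, the ratings, and the functions normalise, first_component and rank_movies are made up. It also assumes a film's score is its projection onto the first principal component, which the description above does not spell out.

```python
import math


def normalise(source_ratings):
    """Rescale one rate source's ratings to mean 0 and variance 1."""
    values = list(source_ratings.values())
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        # A source that rates everything the same carries no information.
        return {movie: 0.0 for movie in source_ratings}
    return {movie: (r - mean) / std for movie, r in source_ratings.items()}


def first_component(rows, iterations=200):
    """Leading eigenvector of X^T X (X = rows), found by power iteration."""
    k = len(rows[0])
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        # Multiply by X^T X without forming it: w = X^T (X v).
        proj = [sum(x * y for x, y in zip(row, v)) for row in rows]
        w = [sum(row[j] * p for row, p in zip(rows, proj)) for j in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def rank_movies(ratings):
    """Return (movies ordered best first, {source: loading on the component})."""
    sources = sorted(ratings)
    movies = sorted({m for per_source in ratings.values() for m in per_source})
    normed = {s: normalise(ratings[s]) for s in sources}
    # Unrated films are treated as perfectly average, i.e. 0 after normalising.
    rows = [[normed[s].get(m, 0.0) for s in sources] for m in movies]
    v = first_component(rows)
    if sum(v) < 0:
        # The sign of a component is arbitrary; orient it so higher is better.
        v = [-x for x in v]
    scores = {m: sum(x * y for x, y in zip(row, v)) for m, row in zip(movies, rows)}
    ranking = sorted(movies, key=lambda m: scores[m], reverse=True)
    return ranking, dict(zip(sources, v))


# Hypothetical rate sources; "carol" has only rated two of the four films.
RATINGS = {
    "alice": {"Film A": 9, "Film B": 7, "Film C": 3, "Film D": 1},
    "bob": {"Film A": 8, "Film B": 6, "Film C": 4, "Film D": 2},
    "carol": {"Film A": 10, "Film D": 2},
}

ranking, loadings = rank_movies(RATINGS)
print(ranking)
print(loadings)
```

In this toy data carol ends up with a smaller loading than alice or bob, because her two unrated films are filled in as zeros and so contribute nothing to the component.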