What Is The List?

The List is perhaps one of the most over-engineered movie rankings in existence. It combines movie ratings from a mixture of online sources and real people's opinions. What started as a simple CSV file of films we had watched has now turned into a series of databases and data lakes with the sole purpose of helping us answer the question "which movies are the best?" or the even more important question, "what should we watch next?"

There's Some Pretty Bad Films On Here?

Some of the first films to make it onto The List are not what anyone would describe as good cinema. Attack of the Killer Tomatoes and Transmorphers are not traditionally good films, but on The List we treat every film as equal until it has been watched. Just as it's good to watch the all-time greats, it's also important to watch those B or even Z movies, because you never know whether you might enjoy them. These movies can have just as much entertainment value, and so on The List we watch everything!

The Architecture

The List's architecture is split roughly 50-50 between cloud and on-premises. On-premises we handle anything related to watching or rating films, for instance sourcing where to watch as well as media server integrations. Once a movie has been watched, it is sent to the cloud data lake, which comprises a series of Parquet files containing things like ratings and movie metadata, as well as more complex tables such as movie relationships. The cloud is also where we store the shortlist of films that have not yet been watched. These films also have full metadata, which feeds into a series of selection algorithms that pick the next film to watch. These algorithms draw on everything we have available in the data lake and use techniques such as graph searches to identify similar films.

For example, we keep a full cast list for every movie. This gives us a network graph connecting movies through shared actors, which we can use to identify communities of films and to estimate how good a film is based on other films that share a cast member.
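
To make the cast graph idea concrete, here is a minimal sketch in Python. It is not the actual implementation behind The List: the film titles, actor names, ratings, and the function names `build_cast_graph` and `estimate_rating` are all hypothetical, and the estimate shown (an average of neighbouring films' ratings, weighted by the number of shared cast members) is just one simple way such an estimate could be made.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: film title -> set of cast members.
CAST = {
    "Film A": {"Actor 1", "Actor 2"},
    "Film B": {"Actor 2", "Actor 3"},
    "Film C": {"Actor 1", "Actor 3", "Actor 4"},
    "Film D": {"Actor 5"},
}

# Hypothetical ratings for films that have already been watched.
RATINGS = {"Film A": 8.0, "Film B": 4.0}


def build_cast_graph(cast):
    """Connect every pair of films that share at least one cast member.

    The edge weight is the number of cast members the two films share.
    """
    graph = defaultdict(dict)
    for a, b in combinations(sorted(cast), 2):
        shared = len(cast[a] & cast[b])
        if shared:
            graph[a][b] = shared
            graph[b][a] = shared
    return graph


def estimate_rating(film, graph, ratings):
    """Estimate a film's rating from rated films that share cast with it.

    Returns the average of the neighbours' ratings, weighted by the
    number of shared cast members, or None if no rated neighbour exists.
    """
    neighbours = graph.get(film, {})
    rated = [(ratings[n], w) for n, w in neighbours.items() if n in ratings]
    if not rated:
        return None
    total_weight = sum(w for _, w in rated)
    return sum(r * w for r, w in rated) / total_weight


graph = build_cast_graph(CAST)
print(estimate_rating("Film C", graph, RATINGS))  # 6.0
print(estimate_rating("Film D", graph, RATINGS))  # None
```

In this toy data, Film C shares one actor with each of Film A and Film B, so its estimate is the plain average of their ratings. Film D shares no cast with anything, so no estimate is possible.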

The website itself is almost entirely static, with most of the rendering and calculations done client-side. To achieve this we keep a web-ready data cache which can address 90% of the website's needs. For more complex requirements we have a series of lambdas behind an API gateway which have access to the original Parquet data lake. A simplified version of the architecture can be seen below:

How Are Rankings Calculated?

The List rankings are driven only by non-online sources and are calculated by looking at principal component decompositions of all ratings. This has a few side-effects: