User-generated videos have an underlying structure: a starting point, an ending, and a series of objective steps in between. In this project, we aim to discover this underlying structure without any labels or supervision, simply by watching a large collection of videos. We accomplish this with a joint unsupervised model over both visual (video) and language (speech) information.
In a nutshell, our algorithm starts with a single YouTube query such as “How to make an omelette?” and downloads a large number of videos. It then discovers activities and parses each video in terms of these activities. We call the resulting joint parse a storyline. For 5 videos, it looks like the figure above. For another query, “How to make a milkshake?”, the resulting storyline is below. We visualize the storylines as temporal segmentations of the videos alongside the ground-truth segmentation. We also color-code the discovered activity steps and visualize their key-frames and the automatically generated captions.
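At a high level, this cluster-then-parse idea can be sketched as: pool segment features from all videos, cluster them into activity steps, and read each video off as a sequence of those steps. Below is a minimal toy sketch in Python; the feature vectors, the activity "centers", and the plain k-means clustering are all illustrative stand-ins for the actual joint visual-language model, not its implementation.

```python
import random

random.seed(0)

# Hypothetical activity "centers" (e.g. crack eggs, whisk, fry), standing
# in for clusters of joint visual + speech segment embeddings.
CENTERS = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

def toy_segment(center):
    """A noisy stand-in for one video segment's feature vector."""
    return [c + random.uniform(-0.1, 0.1) for c in center]

# Two toy "videos", each a sequence of segments drawn from the centers.
videos = [
    [toy_segment(CENTERS[i]) for i in (0, 0, 1, 2)],
    [toy_segment(CENTERS[i]) for i in (0, 1, 1, 2)],
]

def kmeans(points, init_idx, iters=10):
    """Minimal k-means; returns one cluster id per point."""
    centroids = [list(points[i]) for i in init_idx]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared distance.
        for i, p in enumerate(points):
            labels[i] = min(range(len(centroids)),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centroids[c])))
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Pool every segment from every video and cluster into activity steps.
all_segments = [seg for v in videos for seg in v]
labels = kmeans(all_segments, init_idx=[0, 2, 3])  # seeded for the toy data

# Parse each video as its sequence of discovered activities (a "storyline"),
# collapsing repeated consecutive steps.
storylines, offset = [], 0
for v in videos:
    seq = labels[offset:offset + len(v)]
    offset += len(v)
    storylines.append([seq[0]] + [s for a, s in zip(seq, seq[1:]) if s != a])

print(storylines)  # both toy videos share the same discovered step sequence
```

On this toy data, both videos parse into the same sequence of discovered steps, illustrating how a shared storyline can emerge from a collection of videos answering the same query.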
In this video, we show a collection of discovered activities and their descriptions. Our algorithm generates these activities and descriptions in a fully unsupervised way from a large collection of YouTube videos.