How the artificial intelligence system that draws on a database of 50 million songs works
By Desirée Jaimovich. November 12.
Google recently added a new tool to its search engine that makes it possible to discover the name of a song simply by humming or whistling the tune. The feature, unveiled on October 15, is available worldwide through the Google app on iOS and Android smartphones and works in 22 languages. So what is the secret behind this technology?
The secret is machine learning. When a song is hummed, a machine learning model transforms that audio into a sequence of numbers representing the melody. Those sequences are then compared against a database of 50 million songs, as Christian Frank, manager and software engineer at Google, summarized in a press conference in which Infobae participated.
This task, which sounds simple when described in a single paragraph, is actually quite complex. Getting artificial intelligence to correctly match a hummed tune to a song is not easy. Songs contain instruments, backing vocals, and many other elements quite unlike a hummed melody, which itself varies greatly depending on who sings it.
The challenge, then, was to somehow solve this puzzle. To do so, Google's engineers chose a solution that, unlike other existing methods, produces an embedding of the melody directly from a spectrogram, without generating an intermediate representation.
Hum to Search, Google's feature, uses a machine learning system that compares different audio fragments after a process of analysis and selection
The system matches a hummed tune directly to the original (polyphonic) recordings, without needing a hummed version of each song. This approach greatly simplifies the feature's database, allowing it to be constantly updated with original recordings from around the world.
The neural network is trained with input data (in this case, pairs of hummed or sung audio with recorded audio) to produce embeddings for each input, which are then used to match a hummed melody.
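The article does not say which loss function Google uses; one common way to train an embedding network on such pairs is a triplet loss, which pulls the embedding of a hummed clip toward the embedding of its matching recording and pushes it away from unrelated ones. The sketch below is an illustrative stand-in with made-up vectors, not Google's method.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    # anchor:   embedding of the hummed clip
    # positive: embedding of the matching studio recording
    # negative: embedding of an unrelated recording
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    # Loss is zero once the match is closer than the non-match by `margin`.
    return max(0.0, d_pos - d_neg + margin)

hum = np.array([0.8, 0.2])     # hypothetical hum embedding
match = np.array([0.7, 0.3])   # same melody, original recording
other = np.array([-0.5, 0.9])  # a different song

loss = triplet_loss(hum, match, other)
print(loss)  # 0.0: the matching pair is already much closer than the non-match
```

Training repeatedly nudges the network's weights to drive this loss toward zero across many such triplets.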
Thus, the machine learning model learns to generate an embedding for a hummed melody that is similar to the embedding of the song's reference recording. Finding the right song then becomes a matter of searching for similar embeddings in a database of reference embeddings computed from the original audio.
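That lookup step amounts to a nearest-neighbor search over embedding vectors. The following minimal sketch illustrates the idea with invented song names and embeddings, using cosine similarity as a stand-in for whatever metric the real system employs.

```python
import numpy as np

def cosine_similarity(query, matrix):
    # Cosine similarity between one query vector and each row of a matrix.
    query = query / np.linalg.norm(query)
    rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return rows @ query

# Hypothetical reference database: one embedding per song, precomputed
# from the original (polyphonic) recordings.
song_names = ["Song A", "Song B", "Song C"]
reference_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.3, 0.3, 0.9],
])

# Embedding of the user's hummed melody (made up for the example).
hummed_embedding = np.array([0.85, 0.15, 0.05])

scores = cosine_similarity(hummed_embedding, reference_embeddings)
best_match = song_names[int(np.argmax(scores))]
print(best_match)  # Song A: its embedding is closest to the hum
```

At Google's scale (50 million songs), an exhaustive comparison like this would be replaced by an approximate nearest-neighbor index, but the principle is the same.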
The tool is available, from iOS and Android, in 22 languages
The data to train the system
Training the machine learning model required pairs of songs (recorded and sung). The database originally contained sung song segments, and only a few of them were hummed.
Additional hummed-melody data was then generated from audio produced with SPICE, a pitch extraction model developed as part of the Freddie Meter project, an app that measures how closely the user's singing resembles that of the legendary Freddie Mercury.
The next step was to replace this simple tone generator with a neural network that produces audio resembling a real hummed or whistled melody.
The final stage was to augment the training data by mixing and matching the audio samples. If there were clips of two different singers performing the same tune, those clips were lined up so the model was shown a pair of audios representing the same melody.
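That mixing-and-matching step can be pictured as grouping clips by the melody they contain and pairing up every two renditions of the same tune. The clip labels and melody identifiers below are invented for illustration.

```python
from itertools import combinations

# Hypothetical training clips, each tagged with the melody it contains.
clips = [
    ("clip_01", "melody_A"),  # singer 1
    ("clip_02", "melody_A"),  # singer 2, same tune
    ("clip_03", "melody_B"),
    ("clip_04", "melody_B"),
]

# Line up every pair of clips that share a melody: each pair shows the
# model two different renditions of the same tune.
training_pairs = [
    (a, b)
    for (a, melody_a), (b, melody_b) in combinations(clips, 2)
    if melody_a == melody_b
]
print(training_pairs)  # [('clip_01', 'clip_02'), ('clip_03', 'clip_04')]
```

Each resulting pair becomes one training example teaching the network that two very different-sounding audios can carry the same melody.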
This augmentation and superposition of training data allowed the neural network model to recognize hummed or sung songs.
As you can see, the "behind the scenes" work that lets you find a song's name on Google just by humming it took several months, involving data selection, plenty of trial and error, and the implementation of a neural network that continues to be fed melodies from around the world.