Music creation has never been as accessible as it is now. Gone are the days of classical composers, sheet music, and prohibitively expensive studio time when only trained, bankrolled musicians had the opportunity to transcribe notes onto a page. As technology has changed, so too has the art of music creation—and today it is easier than ever for experts and novices alike to compose, produce, and distribute music.
Now, musicians use a computer-based digital standard called MIDI (pronounced “MID-ee”). MIDI acts like sheet music for computers, describing which notes are played and when—in a format that’s easy to edit. But creating music from scratch, even using MIDI, can still be very tedious. If you play piano and have a MIDI keyboard, you can create MIDI by playing. But if you don’t, you must create it manually: note by note, click by click.
To help solve this problem, Spotify’s machine learning experts trained a neural network to predict MIDI note events when given audio input. The network is packaged in a tool called Basic Pitch, which we just released as an open source project.
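For readers who want a concrete picture of what “MIDI note events” are, here is a minimal Python sketch. It is not Basic Pitch’s code or its API. The `NoteEvent` class and the helper functions are hypothetical names made up for this illustration. The sketch assumes standard tuning (A above middle C at 440 Hz, which is MIDI note 69) and the common convention that middle C (MIDI note 60) is called C4. It shows only the final bookkeeping step, mapping an already-known frequency to a note number. The hard part that a neural network has to learn is working out from raw audio which frequencies are sounding and when.

```python
import math
from dataclasses import dataclass

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


@dataclass
class NoteEvent:
    """Hypothetical, simplified note event: which pitch sounds, and when."""
    start_s: float   # onset time in seconds
    end_s: float     # release time in seconds
    midi_pitch: int  # MIDI note number, 0-127 (60 = middle C, 69 = A 440 Hz)


def frequency_to_midi(freq_hz: float) -> int:
    """Map a frequency to the nearest MIDI note number, assuming A4 = 440 Hz."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


def midi_to_name(midi_pitch: int) -> str:
    """Name a MIDI note, assuming the convention that note 60 is C4."""
    octave = midi_pitch // 12 - 1
    return f"{NOTE_NAMES[midi_pitch % 12]}{octave}"


# Three notes of a C major chord played one after another.
melody = [
    NoteEvent(0.0, 0.5, frequency_to_midi(261.63)),
    NoteEvent(0.5, 1.0, frequency_to_midi(329.63)),
    NoteEvent(1.0, 2.0, frequency_to_midi(392.00)),
]

for note in melody:
    print(f"{note.start_s:.1f}s to {note.end_s:.1f}s: {midi_to_name(note.midi_pitch)}")
```

Because each event is just a pitch plus a start and end time, editing the music means changing a few numbers, which is what makes MIDI so much easier to edit than recorded audio.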
“Basic Pitch makes it easier for musicians to create MIDI from acoustic instruments—for example, by singing their ideas,” says Rachel Bittner, a research manager at Spotify who is focused on applied machine learning on audio. “It can also give musicians a quick ‘starting point’ transcription instead of having to write down everything manually, saving them time and resources. Basically, it allows musicians to compose on the instrument they want to compose on. They can jam on their ukulele, record it on their phone, then use Basic Pitch to turn that recording into MIDI. So we’ve made MIDI, this standard that’s been around for decades, more accessible to more creators. We hope this saves them time and effort while also allowing them to be more expressive and spontaneous.”
For the Record asked Rachel to tell us more about the thinking and development that go into Basic Pitch and other machine learning efforts, and how the team decided to open up the tool for anyone to access and to innovate on.
Help us understand the basics. How are machine learning models being applied to audio?
On the audio ML (machine learning) teams at Spotify, we build neural networks—like the ones that are used to recognize images or understand language—but ours are designed specifically for audio. Similar to how you ask your voice assistant to identify the words you’re saying and also make sense of the meaning behind those words, we’re using neural networks to understand and process audio in music and podcasts. This work combines our ML research and practices with domain knowledge about audio—understanding the fundamentals of how music works, like pitch, tone, tempo, the frequencies of different instruments, and more.
What are some examples of machine learning projects you’re working on that align with our mission to give “a million creative artists the opportunity to live off their art”?
Spotify enables creators to reach listeners and listeners to discover new creators. A lot of our work helps with this in indirect ways—for example, identifying tracks that might go well together on a playlist because they share similar sonic qualities like instrumentation or recording style. Maybe one track is already a listener’s favorite and the other one is something new they might like.
We also build tools that help creative artists actually create. Some of our tech is in Soundtrap, Spotify’s digital audio workstation (DAW), which is used to produce music and podcasts. It’s like having a complete studio online. And then there’s Basic Pitch, a stand-alone tool for converting audio into MIDI that we just released as an open source project. We also built an online demo, so anyone can use it to transcribe the musical notes in a recording (including voice, guitar, or piano) into MIDI.
Unlike similar ML models, Basic Pitch is not only versatile and accurate at doing this, but it’s also fast and computationally lightweight. So the musician doesn’t have to sit around forever waiting for their recording to process. And on the technological and environmental side, it uses way less energy—we’re talking orders of magnitude less—compared to other ML models. We named the project Basic Pitch because it can also detect pitch bends in the notes, which is a particularly tricky problem for this kind of model. But also because the model itself is so lightweight and fast.
What else makes Basic Pitch a unique machine learning project for Spotify?
I mentioned before how computationally lightweight it is, and that’s a good thing. In my opinion, the ML industry tends to overlook the environmental and energy impact of its models. Usually with ML models like this, whether it’s for processing images, audio, or text, you throw as much processing power as you can at the problem as the default method for reaching some level of accuracy. But from the beginning, we had a different approach in mind: We wanted to see if we could build a model that was both accurate and efficient, and if you have that mindset from the start, it changes the technical decisions you make in how you build the model. Not only is our model as accurate as (or even more accurate than) similar models, but since it’s lightweight, it’s also faster, which is better for the user, too.
What’s the benefit of open sourcing this tool?
It gives more people access to it since anyone with a web browser can use the online demo. Plus, we believe the external contributions from the open source community help it evolve as software to create a better, more useful product for everyone. For example, while we believe Basic Pitch solves an important problem, the quality of the MIDI that our system (and others’) produces is still far from human-level accuracy. By making it available to creators and developers, we can use our individual knowledge and experience with the product to continue to improve that quality.
What’s next for Basic Pitch in this area?
There’s so much potential for what we can do with this technology in the future. For example, Basic Pitch could eventually be integrated into a real-time system, allowing a live performance to be automatically accompanied by other MIDI instruments that “react” to what the performer is doing.
Additionally, we shared an early version of Basic Pitch with Bad Snacks, an artist-producer who has a YouTube channel where she shares production tips with other musicians. She’s been playing around with Basic Pitch, and we’ve already made improvements to it based on her feedback, fixing how the online demo handles MIDI tempo, and other things to make it work better for a musician’s workflow. We partnered with her to use Basic Pitch to create an original composition, which she released as a single on Spotify. She even posted a behind-the-scenes video on her channel showing how she used Basic Pitch to create the track. The violin solo section is particularly cool.
But it’s not just artists and creators that we’re excited about. We’re equally looking forward to seeing what everyone in the open source developer community does with it. We expect to discover many areas for improvement, along with new possibilities for how it could be used. We’re proud of the research that went into Basic Pitch and we’re happy to show it off. We’ll be even happier if musicians start using it as part of their creative workflows. Share your compositions with us!
Created a cool track using Basic Pitch? Share it on Twitter with the hashtag #basicpitch and tag the team @SpotifyEng.