
Be With You (与你同在)

Music is better than words; please enjoy:

Jiatong · Be with you (与你同在)

Motivation and Highlights

Our song creation stems from our thinking about the relationship between humans and artificial intelligence (AI). AI was originally designed to support humans by assisting with simple problems, but as it continues to develop, we expect that it will eventually be able to understand human beings deeply and share joy with them. We imagine this future in our song, which was born from a combination of deep learning models and close collaboration with human creators.

The highlights are as follows:

Workflow, Collaboration, and Songwriting Process

General workflow

Our song creation process consists of three general stages: Lyric & Theme Generation, Music Score Preparation, and Music Production, carried out in sequence as shown in Figure 1. In the Lyric & Theme Generation stage, the lyric generation and theme generation models iteratively refine the lyrics and their corresponding theme with the help of the general public. The resulting lyrics-theme pairs are then expanded with a melody generation model and further polished in the Music Score Preparation stage. In the last stage, the music score is passed to the music production team and several singing voice synthesis models to produce the final song. Note that humans and AI models collaborate closely in all three stages.

[Figure 1: The three-stage song creation workflow: Lyric & Theme Generation, Music Score Preparation, and Music Production.]
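To make the staging concrete, here is a minimal, runnable sketch of the pipeline as three chained functions. Every function body is a hypothetical placeholder for the models and human steps named above, not their real interfaces.

```python
# A minimal, runnable sketch of the three-stage pipeline described above.
# All function bodies are hypothetical placeholders standing in for the
# models and human steps named in the text.

def lyric_and_theme_generation(keywords: list[str]) -> tuple[str, str]:
    """Stage 1: produce lyrics and a matching musical theme.
    Public voting and iterative refinement are abstracted away."""
    lyrics = "generated lyrics seeded by: " + ", ".join(keywords)
    theme = "theme derived from the rhythm template of the lyrics"
    return lyrics, theme

def music_score_preparation(lyrics: str, theme: str) -> str:
    """Stage 2: expand the lyrics-theme pair into a full melody and let
    the arranger/songwriter organize it into a music score."""
    melody = f"melody expanded from [{theme}]"
    return f"score combining [{lyrics}] with [{melody}]"

def music_production(score: str) -> str:
    """Stage 3: synthesize singing voices, record human parts, and mix."""
    vocals = f"synthesized and human vocals for [{score}]"
    return f"final mix of [{vocals}]"

if __name__ == "__main__":
    lyrics, theme = lyric_and_theme_generation(["未来", "希望"])
    score = music_score_preparation(lyrics, theme)
    print(music_production(score))
```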

Detailed workflow

Given the general workflow described above, our detailed workflow for each stage is illustrated in Figure 2 below. In all three stages, AI models (orange blocks), musical specialists (yellow blocks), and the general public (blue blocks) are interconnected.

[Figure 2: Detailed workflow of the three stages, showing how AI models (orange), musical specialists (yellow), and the general public (blue) interact.]

Lyric & theme generation (first stage)

In the first stage, we apply a pre-trained lyric generation model (i.e., a neural language model) to generate N short lyric samples from keywords that are randomly inserted according to the target sentence lengths. We then invite the general public to vote on the generated samples through social media (i.e., WeChat). The voting helps us narrow down the major concepts we want to convey in the song. Afterwards, we extract format information, namely a rhythm template, from the lyrics. Given the rhythm template, we apply a theme generation model that generates the corresponding musical theme from the lyrics. Initially, the theme generation is not stable because of the limited amount of open-source, annotated Mandarin data. Therefore, our team invites musical professionals to go over thousands of openly accessible Mandarin pop songs and annotate them with structural information (e.g., intro, verse, chorus). With this fine-grained annotated data, we observe much more reasonable melodies from the model.

Because this procedure relies only on prosodic information rather than semantic information, the theme generation model can produce quite diverse themes. We therefore again rely on the power of the crowd (i.e., voting) to find a reasonable lyrics-theme match.
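In its simplest form, the rhythm template extracted here can be thought of as the number of sung syllables (Chinese characters) in each lyric line. The sketch below illustrates that idea; the actual template used by our models carries richer format information, so treat the function and its output as a simplified assumption.

```python
import re

def extract_rhythm_template(lyrics: str) -> list[int]:
    """Toy rhythm template: the number of Chinese characters
    (roughly, sung syllables) in each lyric line."""
    template = []
    for line in lyrics.splitlines():
        # Keep only CJK characters; drop punctuation and whitespace.
        syllables = re.findall(r"[\u4e00-\u9fff]", line)
        if syllables:
            template.append(len(syllables))
    return template

if __name__ == "__main__":
    sample = "带我穿过迷惘\n未知染上了渴望\n释放我想象"
    print(extract_rhythm_template(sample))  # -> [6, 7, 5]
```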

Music score preparation (second stage)

Given the matched lyrics and music theme, we produce the melody for the whole song using a melody generation model trained on our fine-grained dataset. As in the first stage, we conduct voting to strike a balance between the judgment of our musical advisors and public taste. Guided by the musical advisors, the music arranger and the songwriter on our team re-organize the matched melody and lyrics into a full music score. We initially expect the song to be performed as a solo; however, the generated melody suggests new possibilities, and a single singer limits the performance, so we decide on a duet format with both male and female singers.
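One way to picture balancing advisor judgment against public voting is a simple weighted score over candidate melodies. The weight, score ranges, and numbers below are purely illustrative assumptions; our actual selection involved discussion rather than a single formula.

```python
# Illustrative only: blend advisor ratings with public vote shares.
# The 50/50 weighting is an assumption, not the rule we actually used.

def blended_score(advisor_rating: float, public_votes: int,
                  total_votes: int, advisor_weight: float = 0.5) -> float:
    """advisor_rating is assumed to lie in [0, 1]; public support is the
    candidate's share of all votes. Returns a score in [0, 1]."""
    public_share = public_votes / total_votes if total_votes else 0.0
    return advisor_weight * advisor_rating + (1 - advisor_weight) * public_share

if __name__ == "__main__":
    candidates = {
        "melody_a": blended_score(0.9, 150, 400),  # advisors prefer it
        "melody_b": blended_score(0.6, 250, 400),  # public prefers it
    }
    print(max(candidates, key=candidates.get))  # -> melody_a
```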

Music production (third stage)

Because of regional regulations, it is difficult to acquire enough resources (e.g., available singers, recording studios, and instrumentalists). In the music production stage, we therefore minimize these dependencies by working mostly with resources already at hand. For instruments, we mainly use the sound libraries available in Logic, Kontakt, and Sibelius. We also use several singing voice synthesis (SVS) models to construct the first demo. Neural networks can generate reasonable singing voices, but the results still sometimes sound rigid. To tune the SVS outputs, we use a parametric vocoder that exposes the pitch contour of the singing voice for visualization, which enables even members of the general public to adjust the voice for naturalness. Overall, this process allows fast iteration and remote collaboration overseas.
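The sketch below shows the general idea of exposing and editing a pitch contour with a parametric vocoder, using the WORLD vocoder via the pyworld package. The choice of pyworld, the file names, and the uniform one-semitone shift are illustrative assumptions rather than our exact tuning procedure.

```python
# Sketch: expose and edit the F0 (pitch) contour of a synthesized vocal
# with a parametric vocoder, then resynthesize. File names and the
# global pitch shift are placeholders.
import numpy as np
import soundfile as sf
import pyworld as pw
import matplotlib.pyplot as plt

wav, fs = sf.read("svs_take.wav")      # hypothetical SVS output
if wav.ndim > 1:
    wav = wav.mean(axis=1)             # WORLD expects a mono signal
wav = wav.astype(np.float64)

# Decompose into F0, spectral envelope, and aperiodicity.
f0, sp, ap = pw.wav2world(wav, fs)

# Visualize the contour so that non-experts can spot unnatural jumps.
plt.plot(f0)
plt.xlabel("frame")
plt.ylabel("F0 (Hz)")
plt.show()

# Example edit: raise every voiced frame by one semitone. In practice,
# the contour would be adjusted locally where it sounds rigid.
f0_edited = np.where(f0 > 0, f0 * 2 ** (1 / 12), 0.0)

# Resynthesize with the edited pitch contour and save the result.
tuned = pw.synthesize(f0_edited, sp, ap, fs)
sf.write("svs_take_tuned.wav", tuned, fs)
```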

For the final production, we use both human and AI-synthesized voices in the duet and mix the tracks for a better overall performance.

Human-AI collaboration

AI Models

For lyric generation, we adopt T5, a pre-trained sequence-to-sequence language model, and fine-tune it in the SongNet framework. In addition, we introduce extra musical information, such as rhythm patterns in the lyrics, to generate with fine-grained formats and better content control; this suggests a preliminary verse/chorus structure for the generated lyrics. The model is a unified model for any input with rhythm information, so we also use it to generate lyrics from human-modified melodies, closing the feedback loop illustrated in the first stage (Lyric & theme generation) of Figure 2. More technical details will be released at other technical conferences. For both theme generation and melody generation, we use TeleMelody to generate melodies conditioned on lyrics. This model uses a rhythm template as the bridge between lyrics and melody; because it aligns melody and lyrics automatically, we use it for our conditional theme and melody generation (as illustrated in Figure 2). For singing synthesis, we produce our music with ACE Singer, which is based on RefineGAN, and use Bytesing V3 for acoustic modeling.
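As an illustration of the lyric generation step, the snippet below samples N candidate lyrics from a T5-style sequence-to-sequence model with the Hugging Face transformers API. The checkpoint name and the keyword/format prompt are placeholders; our fine-tuned model and its exact input format are not released here, so this only shows the general sampling setup.

```python
# Sketch of keyword- and format-conditioned lyric sampling with a
# seq2seq model. The checkpoint name and prompt layout are placeholders.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "path/to/finetuned-t5-lyrics"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Keywords plus a toy rhythm/format spec (target line lengths).
prompt = "keywords: 未来 希望 翅膀 ; format: 6 7 5 7"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,          # sample instead of greedy decoding
    top_p=0.9,
    temperature=0.8,
    num_return_sequences=5,  # N candidates to put up for public voting
    max_new_tokens=64,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```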

Diversity, Ethical, and Cultural Considerations

We care about diverse perspectives across cultural, gender, and musical backgrounds.

Lyrics

Singer | AI
世界曾寂静得一如既往 (The world was as silent as ever)  
直到你在我身旁 (Until you are by my side)  
生命便倔强得无人可挡 (My life is stubborn and unstoppable)  
你就是我的翅膀 (Since you are my wings)  
  你带我去远方 (You take me far away)
  陪我穿越时空 哦 (Through time and space, oh)
  我听见你在唱 (I hear you singing)
  这未来如你所想 (The future will turn out as your wish)
带我穿过迷惘 (Take me through confusion) 带我穿过迷惘 (Take me through confusion)
未知染上了渴望 (The unknown is sparked by our desire) 未知染上了渴望 (The unknown is sparked by our desire)
释放我想象 (Release my imagination) 释放我想象 (Release my imagination)
未来不同寻常 (The future will be extraordinary) 未来不同寻常 (The future will be extraordinary)
带我告别惆怅 (Take me to bid farewell to melancholy) 带我告别惆怅 (Take me to bid farewell to melancholy)
时间洒满了希望 (Time is full of hope) 时间洒满了希望 (Time is full of hope)
世界的万象 (Everything in the world) 世界的万象 (Everything of the world)
咫尺可及的力量 (The magic to touch and to feel the world) 咫尺可及的力量 (The magic to touch and to feel the world)
就在我手掌 (Is now in our hands) 就在我手掌 (Is now in our hands)
世界曾寂静得一如既往(The world was as silent as ever)  
直到你在我身旁 (Until you are by my side)  
生命便倔强得无人可挡 (My life is stubborn and unstoppable)  
你就是我的翅膀 (Since you are my wings)  
带我穿过迷惘 (Take me through confusion) 过迷惘 (Through confusion)
未知染上了渴望(The unknown is sparked by our desire) 未知染上了渴望 (The unknown is sparked by our desire)
释放我想象 (Release my imagination) 我想象 (My imagination)
未来不同寻常 (The future will be extraordinary) 未来并不同寻常 (The future will be extraordinary)
带我告别惆怅 (Take me to bid farewell to melancholy) 啦 (La)
时间洒满了希望 (Time is full of hope) 啦啦 (Lala)
世界的万象 (Everything of the world) 啦 (La)
咫尺可及的力量 (The magic to touch and to feel the world) 啦 (La)
就在我手掌 (Is now in our hands) 啦 (La)
  跟着我跟着我 (Follow me, follow me)
去探索 (To explore) 去浩瀚世界去探索 (To explore the vast world)
  听我说听我说 (Listen to me, listen to me)
回忆也会更深刻 (Memories will be deeper) 回忆也会更深刻 (Memories will be deeper)
怀念的梦想的 (Missing and dreaming) 怀念的梦想的 (Missing and dreaming)
每一天都能叫醒耳朵 (Wake up your ears every day) 每一天都能叫醒耳朵 (Wake up your ears every day)
啦啦啦啦 (Lalalala) 告诉我告诉我 (Tell me tell me)
你还想去体会什么 (What else do you want to experience) 你还想去体会什么 (What else do you want to experience)
带我穿过迷惘 (Take me through confusion) 过迷惘 (Through confusion)
未知染上了渴望 (The unknown is sparked by our desire) 未知染上了渴望 (The unknown is sparked by our desire)
释放我想象 (Release my imagination) 释放我想象 (Release my imagination)
未来不同寻常 (The future will be extraordinary) 未来不同寻常 (The future will be extraordinary)
带我告别惆怅 (Take me to bid farewell to melancholy) 带我告别惆怅 (Take me to bid farewell to melancholy)
时间洒满了希望 (Time is full of hope) 时间洒满了美好希望 (Time is full of beautiful hope)
世界的万象 (Everything of the world) 世界的万象 (Everything of the world)
咫尺可及的力量 (The magic to touch and to feel the world) 咫尺可及的力量 (The magic to touch and to feel the world)
就在我手掌 (Is now in our hands)  

Our team

The AIM3 team is a joint team with both musical specialists and AI researchers. The passion for music and computer algorithms brought us together across the Pacific Ocean from various backgrounds. Our team members are primarily from academic institutions. Some are also from the musical, technical or financial industries.

Team members

Jiatong Shi (Producer & AI researcher, Carnegie Mellon University)

Hang Yin (Music arranger & Songwriter, Xi’an Lingxin Culture Media)

Tao Qian (AI engineer, Renmin University of China & ByteDance AI Lab SA)

Keyi Zhang (Songwriter, Zhonghe Capital)

Yuning Wu (AI engineer and image designer, Renmin University of China)

Shuai Guo (AI engineer, Renmin University of China)

Zhaodong Yao (Musical advisor, Tottori University)

Huazhe Li (Musical advisor, Tsinghua University)

Peter Wu (Technical advisor, University of California, Berkeley)

Qin Jin (General advisor, Renmin University of China)

References