The podcast Generation Do It Yourself is one of the most listened-to in France. I listen to it every weekend, so I decided to use my programming skills to scrape its audio from the website. This post will also show you how to download audio from your own favorite podcasts.
Prerequisites
Before we dive into the details, install the prerequisites by running: pip install requests beautifulsoup4 lxml tqdm (lxml provides the "xml" parser that BeautifulSoup uses to read the RSS feed).
Understanding the Python Script
Let's delve into the Python script that automates downloading podcast episodes. It is built around a class named AudioLoader, whose methods load, update, download, and process the audio data. The snippets assume these imports at the top of the file: json, os, re, requests, and BeautifulSoup (from bs4).
Class Initialization
The AudioLoader class is initialized with a data_path parameter: the directory where the downloaded audio files are saved and where the record of already-loaded episodes is kept. The loaded_episodes set stores the IDs of those episodes.
class AudioLoader(object):
    def __init__(self, data_path):
        self.data_path = data_path
        self.loaded_episodes = self.load_loaded_episodes()
Loading and Updating Episode Details
The load_loaded_episodes method checks for a loaded_episodes.json file in data_path, creates it with an empty list if it doesn't exist, and returns its contents as a set.
The update_loaded_episodes method adds a new episode ID to the loaded_episodes set and rewrites the JSON file.
    def load_loaded_episodes(self):
        if not os.path.exists(os.path.join(self.data_path, 'loaded_episodes.json')):
            with open(os.path.join(self.data_path, 'loaded_episodes.json'), 'w') as f:
                json.dump([], f)
        with open(os.path.join(self.data_path, 'loaded_episodes.json'), 'r') as f:
            return set(json.load(f))

    def update_loaded_episodes(self, episode_id):
        self.loaded_episodes.add(episode_id)
        with open(os.path.join(self.data_path, 'loaded_episodes.json'), 'w') as f:
            json.dump(list(self.loaded_episodes), f)
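The load-and-update cycle described above can be tried in isolation. The following is a minimal sketch, assuming standalone helper functions (load_ids and save_ids are hypothetical names, not part of the script) and made-up episode IDs; it runs in a temporary directory so nothing is left on disk.

```python
import json
import os
import tempfile


def load_ids(path):
    """Return the set of IDs stored at `path`, creating an empty file if needed."""
    if not os.path.exists(path):
        with open(path, "w") as f:
            json.dump([], f)
    with open(path, "r") as f:
        return set(json.load(f))


def save_ids(path, ids):
    """Write the set back as a sorted JSON list (JSON has no set type)."""
    with open(path, "w") as f:
        json.dump(sorted(ids), f)


# Demo in a throwaway directory (hypothetical episode IDs).
with tempfile.TemporaryDirectory() as tmp:
    state_file = os.path.join(tmp, "loaded_episodes.json")
    ids = load_ids(state_file)       # first run: empty set, file created
    ids.add("123")
    ids.add("COVID-1")
    save_ids(state_file, ids)
    reloaded = load_ids(state_file)  # later run: the IDs survive
    print(sorted(reloaded))
```

The set has to be converted to a list before writing because the json module cannot serialize Python sets; sorting it here is just a choice that keeps the file stable between runs.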
Fetching and Processing Data
The load_data method fetches the podcast's RSS feed with requests and parses it with BeautifulSoup, returning one element per episode (each item tag in the feed).
The process_data method iterates over all episodes and filters out those with certain markers in the title (like "[EXTRAIT]"). It also skips episodes that have already been loaded, and returns a dictionary that maps each saved file name to its episode title.
    def load_data(self, feed_url):
        page = requests.get(feed_url)
        soup = BeautifulSoup(page.content, "xml")
        return soup.find_all('item')
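To see what load_data is working with, it helps to look at the shape of an RSS item: a title tag and an enclosure tag whose url attribute points at the audio file. The sketch below is an offline illustration, not the script's method: it swaps BeautifulSoup for the standard library's xml.etree.ElementTree and uses a made-up, trimmed-down feed (real feeds carry many more tags per item). The name list_items is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed with two items.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample podcast</title>
    <item>
      <title>#1 - Jane Doe - Building things</title>
      <enclosure url="https://example.com/audio/1.mp3" type="audio/mpeg" length="123"/>
    </item>
    <item>
      <title>[EXTRAIT] #1 - Jane Doe - Building things</title>
      <enclosure url="https://example.com/audio/1-extract.mp3" type="audio/mpeg" length="45"/>
    </item>
  </channel>
</rss>"""


def list_items(feed_xml):
    """Return (title, audio_url) pairs for every <item> in an RSS document."""
    root = ET.fromstring(feed_xml)
    pairs = []
    for item in root.iter("item"):
        title = item.findtext("title")
        url = item.find("enclosure").get("url")
        pairs.append((title, url))
    return pairs


items = list_items(SAMPLE_FEED)
for title, url in items:
    print(title, "->", url)
```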
    def process_data(self):
        all_name = set()
        audio_info = {}
        audio_to_skip = ["[EXTRAIT]", "[EXTRACT]", "[REDIFF]"]
        data = self.load_data("https://rss.art19.com/generation-do-it-yourself")
        for episode in data:
            link = episode.find("enclosure")["url"]
            title = episode.find("title").text
            # Build an ID from the title: drop the last " - " segment, "#", and unsafe characters.
            episode_id = " ".join(title.split(" - ")[:-1]).replace("#", "")
            episode_id = re.sub(r'[%/!@#*$?+^\\]', '', episode_id)
            # Skip extracts and reruns.
            if any(marker in title for marker in audio_to_skip):
                continue
            try:
                episode_id = self.simplify_name(episode_id)
            except IndexError:
                # simplify_name fails on an empty ID (title without " - "): report and skip.
                print(title)
                continue
            # Compare the simplified name, since that is what update_loaded_episodes stores.
            if episode_id in self.loaded_episodes:
                continue
            if episode_id in all_name:
                episode_id = episode_id + "-1"
            audio_info[episode_id] = title
            all_name.add(episode_id)
            self.download_episode(link, episode_id)
            self.update_loaded_episodes(episode_id)
        return audio_info
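The two title-handling steps in process_data (filtering on markers and deriving an ID) can be checked without touching the network. Here is a minimal sketch that pulls them out as standalone functions; should_skip and raw_episode_id are hypothetical names, and the titles are invented for illustration.

```python
import re

SKIP_MARKERS = ["[EXTRAIT]", "[EXTRACT]", "[REDIFF]"]


def should_skip(title):
    """True when the title carries one of the markers for extracts or reruns."""
    return any(marker in title for marker in SKIP_MARKERS)


def raw_episode_id(title):
    """Drop the last ' - ' segment, remove '#', then strip unsafe characters."""
    episode_id = " ".join(title.split(" - ")[:-1]).replace("#", "")
    return re.sub(r"[%/!@#*$?+^\\]", "", episode_id)


titles = [
    "#123 - Jane Doe - Building things",            # hypothetical regular episode
    "[EXTRAIT] #123 - Jane Doe - Building things",  # hypothetical extract
    "No separator here",                            # title without " - "
]
for t in titles:
    print(should_skip(t), repr(raw_episode_id(t)))
```

Note the last case: a title with no " - " separator produces an empty ID, so whatever consumes the ID needs to handle that before using it as a file name.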
Downloading Episodes
The download_episode method downloads an episode's audio file and saves it as an MP3 under a simplified name derived from the title.
    def download_episode(self, episode_url, audio_name):
        audio = requests.get(episode_url)
        with open(os.path.join(self.data_path, audio_name + ".mp3"), "wb") as fp:
            fp.write(audio.content)
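Reading audio.content loads the whole episode into memory before writing it, which can be heavy for long episodes. As a possible variation (not what the script above does), the sketch below streams the file to disk in chunks using only the standard library (urllib.request and shutil.copyfileobj). The function name download_to_file is hypothetical, and the demo uses a local file:// URL as a stand-in for a real episode URL so it runs offline.

```python
import os
import pathlib
import shutil
import tempfile
import urllib.request


def download_to_file(url, dest_path, chunk_size=1024 * 1024):
    """Stream `url` to `dest_path` chunk by chunk and return the bytes written."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as fp:
        shutil.copyfileobj(response, fp, chunk_size)
    return os.path.getsize(dest_path)


# Offline demo: a local file plays the role of the remote MP3.
with tempfile.TemporaryDirectory() as tmp:
    source = pathlib.Path(tmp) / "source.mp3"
    source.write_bytes(b"\x00" * 3_000_000)  # fake audio payload
    target = os.path.join(tmp, "episode.mp3")
    written = download_to_file(source.as_uri(), target)
    print(written)
```

If you prefer to stay with requests, the same idea is available there by passing stream=True to requests.get and writing the chunks yielded by response.iter_content(chunk_size=...).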
Simplifying File Names
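The simplify_name method shortens an episode ID into a clean, concise file name by keeping only its first few words: two for IDs starting with "COVID" or "hors" (case-insensitive), three for IDs starting with "Early", and only the first word (typically the episode number) otherwise.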
    def simplify_name(self, file_name: str):
        file_name = file_name.strip()
        if file_name.startswith("COVID"):
            new_filename = "-".join(file_name.split()[:2])
        elif file_name.lower().startswith("hors"):
            new_filename = file_name.split()
            new_filename = "-".join(new_filename[:2])
        elif file_name.startswith("Early"):
            new_filename = "-".join(file_name.split()[:3])
        else:
            new_filename = file_name.split()[0]
        return new_filename
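To make the naming rules concrete, here is the same logic as a standalone function applied to a few inputs. The episode IDs are invented for illustration and may not match the podcast's actual titles.

```python
def simplify_name(file_name: str) -> str:
    """Standalone copy of the method's logic, for experimenting with inputs."""
    file_name = file_name.strip()
    if file_name.startswith("COVID"):
        return "-".join(file_name.split()[:2])
    if file_name.lower().startswith("hors"):
        return "-".join(file_name.split()[:2])
    if file_name.startswith("Early"):
        return "-".join(file_name.split()[:3])
    return file_name.split()[0]


# Hypothetical episode IDs, one per branch.
samples = [
    "123 Jane Doe",
    "COVID 3 Jane Doe",
    "Hors-série 5 Jane Doe",
    "Early Stage 7 Jane Doe",
]
simplified = [simplify_name(s) for s in samples]
for before, after in zip(samples, simplified):
    print(f"{before!r} -> {after!r}")
```

Because regular episodes are reduced to their first word, two IDs that share a first word end up with the same file name, which is why process_data appends "-1" when it sees a name twice.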
Conclusion
With this Python script, you can automatically load and download episodes of the Generation Do It Yourself podcast, saving you time and effort. You can customize the script to work with other podcasts by modifying the RSS feed URL and adjusting the naming conventions in the simplify_name method.
Feel free to extend this script with more features, such as adding metadata to the audio files.