Jekyll auto posts from YouTube feeds
Simple Python automation to create markdown posts for Jekyll
Note: This post was originally published on my Dev.to profile. Code blocks are still language-agnostic here on Substack 🤷‍♂️.
🧩 The problem
I wanted to automate a boring and repetitive workflow: every time a new video is published on my YouTube channel, I want an associated post on my personal Jekyll blog.
Until very recently I handled all of this by hand. The process was very tedious because, except for the video summary (more on this later), it was merely a copy-paste of fields into the YAML front matter.
✅ The solution
📶 Feeds
The solution is to leverage the RSS feeds provided by YouTube itself, plus a few Python libraries:
- 📶 feedparser: the core dependency
- 💻 PyYAML: parses the existing front matter
- 🐌 slugify: not crucial for this use case, but handy (e.g. for post filenames)
Every YouTube channel, in fact, has a feed at this URL format:
`https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}`
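For example, fetching and parsing a channel feed is a single feedparser call (the channel ID below is just a placeholder):
```
import feedparser

# Placeholder channel ID: substitute your own.
feed = feedparser.parse(
    'https://www.youtube.com/feeds/videos.xml?channel_id=UCxxxxxxxxxxxxxxxxxxxxxx'
)
for entry in feed.entries:
    print(entry.published, entry.title, entry.link)
```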
⚙️ Algorithm
Essentially the algorithm works like this:
1. get and parse the feed file
2. for each news item (`entry`) in the feed, extract:
   1. URL
   2. title
   3. published date
   4. tags (via the video description, using a regex)
3. create a new markdown file in the `_posts` directory using the variables from step 2, leaving existing auto-posts untouched (see the sketch after the template code below)
Steps 1 and 2 are quite simple, thanks to a list comprehension:
```
import datetime

import feedparser

# STANDARD_TAGS is a list of fallback tags defined elsewhere in the script.

def extract_feed_youtube_data(feed_source: str) -> list[dict]:
    d = feedparser.parse(feed_source)
    data = [
        {
            'url': e['link'],
            'title': e['title'],
            # feedparser generates a Python 9-tuple in UTC.
            'published': datetime.datetime(*e['published_parsed'][:6]),
            # Use a regex for this.
            # If there are no hashtags available just use default ones.
            # Copy STANDARD_TAGS so the later in-place '+=' cannot
            # mutate the module-level list.
            'tags': get_video_tags(e['summary'])
            if ('summary' in e and e['summary'] != '')
            else list(STANDARD_TAGS),
            # The raw video description is kept but not used at the moment.
            'sm': e.get('summary', ''),
        }
        for e in d.entries if 'link' in e
    ]
```
Before returning the data you can also perform a cleanup:
```
    # ...continuing extract_feed_youtube_data().
    for e in data:
        # Always add the default tags. This also works in case videos
        # do not have a description.
        e['tags'] += STANDARD_TAGS
        # Deduplicate and sort.
        e['tags'] = sorted(set(e['tags']))
        e['summary'] = ' '.join(e['tags'])
    return data
```
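The `get_video_tags` helper is not shown in the post; a minimal sketch, assuming the description carries hashtags like `#python`, could be:
```
import re

def get_video_tags(summary: str) -> list[str]:
    # Extract '#hashtag' occurrences from the video description
    # and normalize them to lowercase tag names.
    return [tag.lower() for tag in re.findall(r'#(\w+)', summary)]
```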
Step 3 involves an f-string. We need to take care of specific fields to avoid Jekyll throwing YAML parsing errors when quotes appear in the values; this can happen in the `title` and `description` fields specifically.
```
def create_markdown_blog_post(posts_base_directory: str, data: dict):
    return f"""---
title: |-
  {data["title"]}
tags: [{youtube_tags_to_jekyll_front_matter_tags(data["tags"])}]
related_links: ['{data["url"]}']
updated: {datetime_object_to_jekyll_front_matter_utc(data["published"])}
description: |-
  {data["title"]}
lang: 'en'
---

{generate_youtube_embed_code(get_youtube_video_id(data["url"]))}

<div markdown="0">
{data["summary"]}
</div>

*Note: post auto-generated from YouTube feeds.*
"""
```
🎯 Result
As you can see, the content of each auto-post is very basic: besides the standard fields, the body contains an HTML YouTube embed and the list of hashtags extracted from the video description with a regex. The description (`summary`) was left out on purpose. The idea for the future is to get the video transcription and let an LLM (via Ollama) generate a summary, which I would of course then proofread manually. I also cannot copy the video description verbatim because of SEO (duplicate content).
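A rough sketch of that future pipeline, assuming the youtube-transcript-api and ollama packages (the calls and the model name are illustrative assumptions, not code from the repo):
```
import ollama
from youtube_transcript_api import YouTubeTranscriptApi

def summarize_video(video_id: str) -> str:
    # Fetch the (auto-generated) transcript and flatten it to text.
    transcript = YouTubeTranscriptApi.get_transcript(video_id)
    text = ' '.join(chunk['text'] for chunk in transcript)
    # Ask a local model (via Ollama) for a short summary.
    response = ollama.generate(
        model='llama3.2',  # any locally pulled model works
        prompt=f'Summarize this video transcript in two sentences:\n{text}',
    )
    return response['response']
```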
Another improvement could be replicating a subset of the auto-post fields to send toots to Mastodon via its API.
▶️ Running
At the moment the script is triggered by a local pre-commit hook, which also installs the Python dependencies in a separate environment:
```
- repo: local
  hooks:
    - id: generate_posts
      name: generate_posts
      entry: python ./.scripts/youtube/generate_posts.py
      verbose: true
      always_run: true
      pass_filenames: false
      language: python
      types: [python]
      additional_dependencies: ['feedparser>=6,<7', 'python-slugify[unidecode]>=8,<9', 'PyYAML>=6,<7']
```
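With `always_run: true` the hook fires on every commit, regardless of which files changed; you can also trigger it manually with `pre-commit run generate_posts`.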
🎉 Conclusion
And that's it, really. This script saves time, and it's exactly the kind of automation I like. Thankfully YouTube still provides RSS feeds, although I have already had to fix the script once to adapt to structural changes on their side.
If you are interested in the source code, you can find it in its repo.
You can comment here and check out my YouTube channel for related content!