Developing gallery-dl Extractors to Support New Sites
Table of Contents
- Framework Architecture
- Basic Concepts
- Creating Your First Extractor
- Advanced Extractor Development
- Common Patterns and Examples
- Testing and Debugging
- Reference Documentation
Framework Architecture
Overview
Gallery-dl is designed to efficiently download media (primarily images and videos) from various websites. It follows a modular architecture where each website has its own specialized "extractor" component that understands the website's structure and can locate downloadable content from a given URL.
Key Components
- Extractors: Python classes that implement website-specific logic to identify and download content.
- Message System: Communication mechanism between extractors and the download engine.
- Configuration System: Customizes the behavior of extractors through user options.
- Download Engine: Handles actual downloading of content identified by extractors.
- Archive System: Tracks previously downloaded content to avoid duplicates.
How Extractors Fit In
Extractors are at the heart of the framework and serve as "connectors" between websites and the download engine:
[User Input URL] → [Extractor Selection] → [Appropriate Extractor] →
→ [Content Extraction] → [Message Queue] → [Download Engine] → [File System]
Each extractor implements a specific pattern recognition mechanism through regular expressions, allowing the framework to route URLs to the appropriate extractor. The extractor then analyzes the page, identifies downloadable content, and passes this information to the download engine through a standardized message system.
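The routing step can be sketched as follows. This is a simplified, hypothetical illustration of pattern-based selection; the class names and the `find_extractor()` helper are not part of gallery-dl's actual API:

```python
import re

# Hypothetical sketch of pattern-based extractor selection;
# gallery-dl's real implementation lives in its extractor package.
class ExampleImageExtractor:
    pattern = r"https?://example\.com/image/(\w+)"

class ExampleGalleryExtractor:
    pattern = r"https?://example\.com/gallery/(\w+)"

EXTRACTORS = [ExampleImageExtractor, ExampleGalleryExtractor]

def find_extractor(url):
    """Return the first extractor class whose pattern matches, plus the match."""
    for cls in EXTRACTORS:
        match = re.match(cls.pattern, url)
        if match:
            return cls, match
    return None, None
```

Given `"https://example.com/gallery/abc123"`, this returns `ExampleGalleryExtractor` and a match whose first group is the gallery ID.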
Basic Concepts
Extractor Class Hierarchy
Extractors are organized in a class hierarchy that promotes code reuse:
- Extractor: Base class for all extractors, handles common functionality
  - GalleryExtractor: For image galleries with multiple images
  - ChapterExtractor: Specialized for manga/comic chapters
  - MangaExtractor: For manga series with multiple chapters
  - BaseExtractor: For handling multiple domains with similar structures
- AsynchronousMixin: Adds asynchronous capability to extractors
This hierarchy allows specialized behavior while sharing common functionality.
Message System
The framework uses a message-based system to communicate between components:
- Message.Directory: Specifies the destination directory for subsequent downloads
- Message.Url: Indicates a resource to be downloaded
- Message.Queue: Schedules another URL to be processed by a different extractor
- Message.Version: Specifies the message protocol version
Extractors yield these messages to communicate with the download engine.
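A download engine consuming these messages might look like the following simplified sketch. The numeric `Message` values and the dispatch loop are illustrative stand-ins, not gallery-dl's actual job code:

```python
# Illustrative sketch of a download engine dispatching on message type.
# The Message constants here are stand-ins; gallery-dl defines its own.
class Message:
    Version = 1
    Directory = 2
    Url = 3
    Queue = 6

def run(messages):
    """Collect (directory-metadata, url, file-metadata) download jobs."""
    directory = None
    downloads = []
    for msg in messages:
        kind = msg[0]
        if kind == Message.Directory:
            directory = msg[1]      # applies to all subsequent files
        elif kind == Message.Url:
            _, url, data = msg
            downloads.append((directory, url, data))
        elif kind == Message.Queue:
            pass                    # would be routed to another extractor
    return downloads

msgs = [
    (Message.Version, 1),
    (Message.Directory, {"title": "demo"}),
    (Message.Url, "https://example.com/1.jpg", {"num": 1}),
]
```

Running `run(msgs)` yields one download job filed under the `"demo"` directory metadata.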
Configuration System
The framework employs a hierarchical configuration system:
- Global default configuration
- Category-specific configuration (e.g., for all "flickr" extractors)
- Subcategory-specific configuration (e.g., for "flickr:user" extractors)
- Instance-specific configuration (for a specific URL)
Extractors can access their configuration through the config() method.
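The lookup order can be illustrated with a simplified, hypothetical sketch in which the most specific scope that defines a key wins (gallery-dl's real config module also merges config files and command-line options):

```python
# Simplified sketch of hierarchical config lookup; not gallery-dl's
# actual config module. The most specific scope defining a key wins.
GLOBAL = {"retries": 4, "timeout": 30}
CATEGORY = {"flickr": {"timeout": 60}}
SUBCATEGORY = {"flickr:user": {"retries": 8}}

def config(category, subcategory, key, default=None):
    for scope in (
        SUBCATEGORY.get(f"{category}:{subcategory}", {}),
        CATEGORY.get(category, {}),
        GLOBAL,
    ):
        if key in scope:
            return scope[key]
    return default

# config("flickr", "user", "retries") -> 8   (subcategory wins)
# config("flickr", "user", "timeout") -> 60  (category fallback)
```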
Basic Extractor Structure
A minimal extractor requires:
- A class that inherits from Extractor
- A pattern attribute with a regular expression to match URLs
- A category and subcategory designation
- An items() method that yields appropriate messages
    from .common import Extractor, Message
    from .. import text

    class SimpleExtractor(Extractor):
        category = "example"
        subcategory = "simple"
        pattern = r"https?://example\.com/(\d+)"

        def items(self):
            yield Message.Version, 1
            # Get page content
            page = self.request(self.url).text
            # Extract data
            data = {"id": self.match.group(1)}
            image_url = "https://example.com/image.jpg"
            # Yield directory info
            yield Message.Directory, data
            # Yield image URL
            yield Message.Url, image_url, data
Creating Your First Extractor
Step 1: Setting Up
Before creating an extractor, understand what content you want to extract:
- Identify the website's URL pattern
- Determine how to access the content (direct HTML, API, etc.)
- Understand how the website organizes its content
Step 2: Creating the Basic Structure
Let's create a simple extractor for a fictional image hosting site "imagex.com":
    # gallery_dl/extractor/imagex.py
    from .common import Extractor, Message
    from .. import text

    class ImagexExtractor(Extractor):
        """Base class for imagex extractors"""
        category = "imagex"
        root = "https://imagex.com"
Step 3: Implementing a Single Image Extractor
    class ImagexImageExtractor(ImagexExtractor):
        """Extractor for single images from imagex.com"""
        subcategory = "image"
        pattern = r"(?:https?://)?(?:www\.)?imagex\.com/image/([a-zA-Z0-9]+)"
        filename_fmt = "{category}_{id}.{extension}"
        archive_fmt = "{id}"

        def __init__(self, match):
            ImagexExtractor.__init__(self, match)
            self.image_id = match.group(1)

        def items(self):
            url = f"{self.root}/image/{self.image_id}"
            page = self.request(url).text

            # Extract image URL using text.extract function
            image_url = text.extract(page, '<img src="', '"')[0]

            # Prepare metadata
            data = {
                "id": self.image_id,
                "url": image_url,
            }
            text.nameext_from_url(image_url, data)

            yield Message.Directory, data
            yield Message.Url, image_url, data
Step 4: Implementing a Gallery Extractor
    class ImagexGalleryExtractor(ImagexExtractor):
        """Extractor for image galleries from imagex.com"""
        subcategory = "gallery"
        directory_fmt = ("{category}", "{gallery_id} {title}")
        filename_fmt = "{category}_{gallery_id}_{num:>03}.{extension}"
        archive_fmt = "{gallery_id}_{id}"
        pattern = r"(?:https?://)?(?:www\.)?imagex\.com/gallery/([a-zA-Z0-9]+)"

        def __init__(self, match):
            ImagexExtractor.__init__(self, match)
            self.gallery_id = match.group(1)

        def items(self):
            url = f"{self.root}/gallery/{self.gallery_id}"
            page = self.request(url).text

            # Extract gallery title
            title = text.extract(page, '<h1>', '</h1>')[0]

            gallery_data = {
                "gallery_id": self.gallery_id,
                "title": title or self.gallery_id,
            }
            yield Message.Directory, gallery_data

            # Find all image containers
            image_containers = text.extract_iter(
                page, '<div class="image-container">', '</div>')

            for num, container in enumerate(image_containers, 1):
                # Extract image URL and ID
                image_url = text.extract(container, 'src="', '"')[0]
                image_id = text.extract(container, 'data-id="', '"')[0]

                # Prepare image metadata
                data = {
                    "gallery_id": self.gallery_id,
                    "id": image_id,
                    "num": num,
                }
                text.nameext_from_url(image_url, data)

                # Add gallery metadata
                data.update(gallery_data)

                yield Message.Url, image_url, data
Step 5: Adding to Framework
Add your extractor to the module list in gallery_dl/extractor/__init__.py:
# gallery_dl/extractor/__init__.py
modules = [
# ...
"imagex",
# ...
]
Step 6: Testing
Test your extractor with a URL:
$ gallery-dl -v "https://imagex.com/gallery/abc123"
Advanced Extractor Development
Working with Website APIs
Many websites offer APIs that provide data in structured formats like JSON. Here's an example using Flickr's API:
    class FlickrAPIClient:
        """Minimal interface for the Flickr API"""
        API_URL = "https://api.flickr.com/services/rest/"
        API_KEY = "your_api_key"

        def __init__(self, extractor):
            self.extractor = extractor

        def photos_getInfo(self, photo_id):
            """Get information about a photo"""
            params = {
                "method": "flickr.photos.getInfo",
                "photo_id": photo_id,
                "api_key": self.API_KEY,
                "format": "json",
                "nojsoncallback": "1",
            }
            return self.extractor.request(
                self.API_URL, params=params).json()["photo"]
Authentication and Session Handling
Some websites require authentication to access content:
    def login(self):
        """Login and set necessary cookies"""
        username, password = self._get_auth_info()
        if username:
            self.log.info("Logging in as %s", username)
            url = self.root + "/login"
            data = {
                "username": username,
                "password": password,
                "remember": "1",
            }
            response = self.request(url, method="POST", data=data)
            if not response.cookies.get("sessionid"):
                raise exception.AuthenticationError("Login failed")
            return True
        return False
Handling Pagination
Many websites implement pagination for content spanning multiple pages:
    def images(self, page):
        """Return all image URLs from a paginated gallery"""
        url = self.gallery_url
        images = []
        page_num = 1

        while True:
            self.log.info("Downloading page %d", page_num)
            response = self.request(url)

            # Extract images from current page
            page_images = self._extract_images_from_page(response.text)
            images.extend(page_images)

            # Look for next page link
            next_url = text.extract(
                response.text, 'class="next" href="', '"')[0]
            if not next_url:
                return images

            url = self.root + next_url
            page_num += 1
Handling Lazy Loading
Some websites load images dynamically using JavaScript:
    def _extract_images_from_page(self, page):
        """Extract both static and lazy-loaded images"""
        images = []

        # Extract static images
        for url in text.extract_iter(page, '<img src="', '"'):
            if "/placeholder.jpg" not in url:
                images.append(url)

        # Extract lazy-loaded images
        for url in text.extract_iter(page, 'data-src="', '"'):
            if url not in images:
                images.append(url)

        return images
Common Patterns and Examples
Example 1: Image Gallery Extractor (like Desktopography)
For websites primarily focused on image galleries:
    class DesktopographyExhibitionExtractor(DesktopographyExtractor):
        """Extractor for a yearly desktopography exhibition"""
        subcategory = "exhibition"
        pattern = r"https?://desktopography\.net/exhibition-([^/?#]+)/"

        def __init__(self, match):
            DesktopographyExtractor.__init__(self, match)
            self.year = match.group(1)

        def items(self):
            url = "{}/exhibition-{}/".format(self.root, self.year)
            base_entry_url = "https://desktopography.net/portfolios/"
            page = self.request(url).text

            data = {
                "_extractor": DesktopographyEntryExtractor,
                "year": self.year,
            }

            for entry_url in text.extract_iter(
                    page,
                    '<a class="overlay-background" href="' + base_entry_url,
                    '">'):
                url = base_entry_url + entry_url
                yield Message.Queue, url, data
Example 2: Blog Post Extractor (like Blogger)
For websites with blog posts containing images:
    class BloggerPostExtractor(BloggerExtractor):
        """Extractor for a single blog post"""
        subcategory = "post"
        pattern = r"[\w-]+\.blogspot\.com(/\d\d\d\d/\d\d/[^/?#]+\.html)"

        def __init__(self, match):
            BloggerExtractor.__init__(self, match)
            self.path = match.group(match.lastindex)

        def posts(self, blog):
            return (self.api.post_by_path(blog["id"], self.path),)
Example 3: API-Based Extractor (like Flickr)
For websites with comprehensive APIs:
    class FlickrImageExtractor(FlickrExtractor):
        """Extractor for individual images from flickr.com"""
        subcategory = "image"
        pattern = (r"(?:https?://)?(?:www\.|secure\.|m\.)?"
                   r"flickr\.com/photos/[^/?#]+/(\d+)")

        def items(self):
            photo = self.api.photos_getInfo(self.item_id)
            self.api._extract_metadata(photo)

            if photo["media"] == "video" and self.api.videos:
                self.api._extract_video(photo)
            else:
                self.api._extract_photo(photo)

            photo["user"] = photo["owner"]
            photo["title"] = photo["title"]["_content"]
            photo["comments"] = text.parse_int(photo["comments"]["_content"])
            photo["description"] = photo["description"]["_content"]
            photo["date"] = text.parse_timestamp(photo["dateuploaded"])
            photo["id"] = text.parse_int(photo["id"])

            url = self._file_url(photo)
            yield Message.Directory, photo
            yield Message.Url, url, text.nameext_from_url(url, photo)
Example 4: File Hosting Extractor (like Catbox)
For simple file hosting websites:
    class CatboxFileExtractor(Extractor):
        """Extractor for catbox files"""
        category = "catbox"
        subcategory = "file"
        archive_fmt = "(unknown)"
        pattern = r"(?:https?://)?(?:files|litter|de)\.catbox\.moe/([^/?#]+)"

        def items(self):
            url = text.ensure_http_scheme(self.url)
            file = text.nameext_from_url(url, {"url": url})
            yield Message.Directory, file
            yield Message.Url, url, file
Testing and Debugging
Test Framework
The framework includes a test system to validate extractor functionality:
    # test/results/imagex.py
    # Import your extractor:
    from gallery_dl.extractor import imagex

    # Create test cases and assert the outcome of that test.
    {
        "#url": "https://www.imagex.com/image/testimage2239",
        "#id" : "testimage2239",
    },
Common Issues and Solutions
1. URL Pattern Not Matching
Issue: Extractor not being recognized for a URL
Solution: Test your regex pattern separately:
import re
pattern = r"(?:https?://)?example\.com/gallery/(\d+)"
url = "https://example.com/gallery/123"
match = re.match(pattern, url)
print(bool(match), match.groups() if match else None)
2. Element Not Found
Issue: text.extract() returns None or empty string
Solution: Print the page content to see actual structure:
    def items(self):
        page = self.request(self.url).text
        with open("debug.html", "w", encoding="utf-8") as f:
            f.write(page)
        # Continue with extraction...
Debugging Techniques
1. Enable Verbose Logging:

       $ gallery-dl -v URL

2. Dump HTTP Responses:

       def __init__(self, match):
           Extractor.__init__(self, match)
           self._write_pages = True

3. Examine Request Headers:

       def items(self):
           response = self.request(self.url)
           print("Request Headers:", response.request.headers)
           print("Response Headers:", response.headers)
Reference Documentation
Base Classes
Extractor
The base class for all extractors.
Attributes:
- category: Site identifier (e.g., "flickr")
- subcategory: Content type (e.g., "image", "gallery")
- pattern: Regular expression to match URLs
- directory_fmt: Format string for directory names
- filename_fmt: Format string for file names
- archive_fmt: Format string for archive entries
Methods:
- items(): Yields messages for downloading
- request(url, ...): Makes HTTP requests
- config(key, default=None): Gets configuration values
- log.info/debug/warning/error(...): Logging functions
GalleryExtractor
Base class for gallery extractors.
Methods:
- metadata(page): Returns gallery metadata
- images(page): Returns a list of image URLs and metadata
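The division of labor between metadata() and images() can be illustrated with a toy stand-in for the base class. This is a hypothetical sketch: the real GalleryExtractor also performs the HTTP request and yields Directory/Url messages rather than returning a list.

```python
# Toy stand-in illustrating how a GalleryExtractor subclass supplies
# metadata() and images() while the base class drives enumeration.
# This is a sketch, not gallery-dl's real base class.
class ToyGalleryExtractor:
    def run(self, page):
        data = self.metadata(page)
        results = []
        for num, (url, imgdata) in enumerate(self.images(page), 1):
            entry = dict(data, num=num, **(imgdata or {}))
            results.append((url, entry))
        return results

class MyGallery(ToyGalleryExtractor):
    def metadata(self, page):
        # would normally parse the gallery page
        return {"title": "demo gallery"}

    def images(self, page):
        # would normally return (url, metadata) pairs parsed from the page
        return [
            ("https://example.com/1.jpg", None),
            ("https://example.com/2.jpg", {"width": 800}),
        ]
```

The base class numbers each image and merges the gallery metadata into every per-file entry, which is exactly the data the directory_fmt and filename_fmt format strings draw from.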
ChapterExtractor
Specialized extractor for manga/comic chapters.
MangaExtractor
Extractor for manga series with multiple chapters.
Message Types
- Message.Version: Protocol version identifier
- Message.Directory: Directory information for subsequent files
- Message.Url: URL to be downloaded
- Message.Queue: URL to be processed by another extractor
Utility Functions
- text.extract(text, start, end): Extracts text between markers
- text.extract_iter(text, start, end): Iterates over all matches
- text.nameext_from_url(url, data=None): Extracts filename and extension
- text.parse_int(string, default=0): Converts string to integer
- text.parse_timestamp(string): Converts timestamp string to datetime
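The semantics of text.extract can be approximated with a small re-implementation (a sketch for illustration; the real function in gallery_dl.text behaves similarly and likewise accepts a starting position):

```python
# Approximate re-implementation of text.extract, for illustration only.
def extract(txt, begin, end, pos=0):
    """Return (substring between begin and end, position after end),
    or (None, original pos) if either marker is not found."""
    try:
        first = txt.index(begin, pos) + len(begin)
        last = txt.index(end, first)
        return txt[first:last], last + len(end)
    except ValueError:
        return None, pos

html = '<h1>Title</h1><img src="/a.jpg">'
# extract(html, "<h1>", "</h1>")[0]  -> "Title"
# extract(html, 'src="', '"')[0]     -> "/a.jpg"
```

Returning the end position alongside the value is what lets repeated calls walk forward through a page without re-scanning from the start.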
Configuration Options
Common configuration options for extractors include:
- username/password: Login credentials
- cookies: Cookies for authenticated sessions
- retries: Number of times to retry failed requests
- sleep-request: Time to wait between requests
- timeout: Request timeout in seconds
- proxy: Proxy server to use
- verify: Whether to verify SSL certificates
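In gallery-dl's JSON configuration file these options are set globally under "extractor" or per category; a snippet for the hypothetical imagex example might look like:

```json
{
    "extractor": {
        "retries": 4,
        "timeout": 30.0,
        "imagex": {
            "username": "myname",
            "password": "mypassword",
            "sleep-request": 1.0
        }
    }
}
```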
gallery-dl: user's wiki