[generic] Check for valid feeds #5

Open
opened 2022-04-23 18:44:48 -04:00 by 0x80 · 2 comments
Owner

When rsstube calls its generic extractor, it checks for a success (2xx) HTTP status code for downloaded pages, but it doesn't check the downloaded page to see if it's a valid feed.

This is especially problematic because some JS-heavy web apps (such as PeerTube, Funkwhale, and Pleroma) always respond 200 and then use client-side JS to render any error messages (such as 404s). When rsstube tries its generic extractor on these sites, it causes false positives, as any URL appears to be valid.

A fix for the false positives could be to simply disable the generic extractor on (some or all) known software, but it would be better in general to properly check if downloaded pages are actually RSS/Atom feeds.
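As a rough illustration of what such a content check might look like (a minimal sketch, not rsstube's actual code): the function below accepts a response only if the status is 2xx and the body parses as XML with a feed-like root element. The names `looks_like_feed`, `local_name`, and `FEED_ROOTS` are hypothetical, and treating any `RDF` root as a feed is an assumption that may be too generous.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of root element names (namespace stripped) treated as feeds:
# RSS 2.0 uses <rss>, Atom uses <feed>, RSS 1.0 uses <rdf:RDF>.
FEED_ROOTS = {"rss", "feed", "RDF"}


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def looks_like_feed(status: int, body: bytes) -> bool:
    """Return True only if the status is 2xx and the body has a feed-like XML root."""
    if not 200 <= status < 300:
        return False
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not well-formed XML (for example, a typical HTML app shell).
        return False
    return local_name(root.tag) in FEED_ROOTS
```

With a check like this, a JS app shell served with status 200 would be rejected because its root element is `html` (or because it fails to parse as XML at all), while RSS and Atom documents would pass.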

Author
Owner

Basic check implemented in a1475943a7: https://code.negativezero.link/0x80/rsstube/commit/a1475943a7484cb732e0be0a8dc1391acfe643a3
Author
Owner

I'd rather not pull in something like feedparser. rsstube doesn't need to read the feed, just figure out whether it is one. I think just checking for required elements is fine. If issues come up, I'll try to address them, but I'd rather err on the side of not discounting things that are meant to be feeds but may be missing elements or improperly formatted. The goal is to identify intended feeds, not to run them through a validity checker.
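One way to picture this kind of lenient, dependency-free check (a sketch under assumptions, not necessarily what the commit does): rather than parsing the document, scan the start of the body for an opening `rss`, `feed`, or `rdf:RDF` tag, so that truncated or malformed feeds still count. The names `FEED_TAG` and `probably_intended_feed` and the 4096-byte scan limit are hypothetical choices.

```python
import re

# Matches an opening <rss>, <feed>, or <rdf:RDF> tag, case-insensitively.
# The trailing [\s>] avoids matching longer tag names such as <feedback>.
FEED_TAG = re.compile(rb"<(rss|feed|rdf:RDF)[\s>]", re.IGNORECASE)


def probably_intended_feed(body: bytes, scan_limit: int = 4096) -> bool:
    """Return True if a feed-style opening tag appears near the start of the body.

    No XML parsing is done, so malformed or truncated feeds are still accepted.
    """
    return FEED_TAG.search(body[:scan_limit]) is not None
```

The trade-off is deliberate: this errs toward accepting anything that appears intended as a feed, at the cost of possibly accepting a non-feed page that happens to contain one of these tags early on.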

Reference: 0x80/rsstube#5