I’m a newbie to ActivityPub so please be patient with me.
All introductions to ActivityPub describe how a user on server A subscribes to a specific community on server B, after which server A is informed about changes in that community.
But on Lemmy it’s possible to look at the posts of all communities. For a single concrete community that would be relatively easy: server A gets a request to serve the top posts of a community on server B, so A simply asks B for them.
But there is also the “posts from all communities” tab on the Lemmy front page. This raises some questions:
Does each Lemmy instance have a full copy of all posts of all communities? If so: how are new instances discovered? Does each instance distribute all updates to all other instances?
If each Lemmy instance has only a partial dataset (this theory is backed by [1]: “Only if at least one user on your instance subscribes to the remote community, will the community send updates to your instance.”), then how is the “all posts” view composed? Is it in reality not “all” but only “all posts that at least one user of this instance is subscribed to”?
If this is the case: what happens if a bad actor subscribes to all communities of all servers? Is there a maximum number of subscriptions per user?
The background to these questions: I’m looking for a way to subscribe to all events of all Lemmy instances, to be able to build statistics about upvotes, new posts, comments, etc. There seems to be a suitable API endpoint for Mastodon [2], but nothing for Lemmy?!
> If each Lemmy instance has only a partial dataset
You can stop saying if. It is nearly certain that any instance only has a partial dataset in the same way that a search engine only indexes a partial dataset of every web page.
> If this is the case: what happens if a bad actor subscribes to all communities of all servers?
There are bots that were built to do exactly that. I wouldn’t call them bad actors unless the instance owner prohibited such actions.
so the instances only save the metadata/title of federated posts, but when a user wants to see the comments or content, then the other instances are queried for more details?
what are the bots good for?
> is it in reality not “all” but only “all posts that at least one user of this instance is subscribed to”?
Exactly this, yes. Not literally ‘all’ (a brand new instance would have nothing in its All feed). This is what was meant by ‘partial data set’ - everything for a subscribed community (from the moment it was subscribed to), but nothing for a community that no-one’s subscribed to.
Some instances run bots to populate their All feed beyond what would happen naturally (with the idea being that the bot unsubscribes once a human subscribes).
interesting. thanks.
so this would mean that if i wanted to receive an event for each upvote/comment/post in the lemmy fediverse, i would have to create my own instance in the ActivityPub space, subscribe to all communities (there is no single wildcard call (?), so i would have to subscribe to each of the ~30k communities individually and also watch for new communities), and then i could use the ActivityPub protocol to have the instances feed me their events?
there are currently about 600 instances and 30k communities, but only ~2k communities have more than 600 subscribers (according to [0]). does this mean that those bots only subscribe to communities above a certain threshold?
Yeah. There’s no wildcard call. One thing you could do to script it would be to pull JSONs from https://data.lemmyverse.net - use one for the initial effort, then subsequent ones to track new communities. You’d definitely want to filter it - as you’ve noticed, the vast majority of that 30k are dead, spam, or something you wouldn’t want for one reason or another (e.g. communities from instances you’ve defederated from).
As for what bots do, it depends on how they were programmed I suppose. There’s a bonkers one on https://leaf.dance that just seems to crawl comments and subscribe to any ! links it finds, but there are others (I can’t remember their names) where it’s more of a manual job (the mods of a community submit the details to it).
It’s also possible to just pull all posts from an instance. The API is easy to understand.
main.rs
```rust
use serde::{Deserialize, Serialize};
use std::collections::HashMap;

fn main() {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();
    let to_page = Some(5);
    let posts = rt.block_on(get_posts_to(to_page)).unwrap();
    println!("-----------");
    println!(
        "All posts to page {} as JSON:",
        to_page.map(|v| v.to_string()).unwrap_or("infinity".into())
    );
    println!("-----------");
    println!("{}", serde_json::to_string(&posts).unwrap());
}

#[derive(Serialize, Deserialize, Debug, Clone)]
struct PostData {
    id: usize,
    name: String,
}

#[derive(Serialize, Deserialize, Debug)]
struct PageItem {
    post: PostData,
}

#[derive(Serialize, Deserialize, Debug)]
struct PostPageResult {
    posts: Vec<PageItem>,
}

async fn get_page(index: usize) -> Result<HashMap<usize, PostData>, ()> {
    let result = reqwest::get(format!(
        "https://programming.dev/api/v3/post/list?dataType=Post&listingType=All&sort=New&page={}",
        index
    ))
    .await;
    if let Ok(r) = result {
        if let Ok(text) = r.text().await {
            if let Ok(data) = serde_json::from_str::<PostPageResult>(&text) {
                let map = data.posts.iter().fold(HashMap::new(), |mut map, item| {
                    map.insert(item.post.id, item.post.clone());
                    map
                });
                if !map.is_empty() {
                    return Ok(map);
                }
            } else {
                println!("{:?}", serde_json::from_str::<PostPageResult>(&text));
            }
        }
    }
    Err(())
}

/// If `page` is not `None`, stop after that many pages; otherwise continue until a request fails.
async fn get_posts_to(page: Option<usize>) -> Result<HashMap<usize, PostData>, ()> {
    let mut idx = 1;
    let mut map = HashMap::new();
    while let Ok(more_posts) = get_page(idx).await {
        println!("page: {}, {:#?}", idx, more_posts);
        map.extend(more_posts.into_iter());
        idx += 1;
        if let Some(page) = page {
            if idx > page {
                break;
            }
        }
    }
    Ok(map)
}
```
Cargo.toml
```toml
[package]
name = "lemmyposts"
version = "0.1.0"
edition = "2021"

[dependencies]
reqwest = "0.12.7"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0.128"
tokio = { version = "1.4", features = ["rt"] }
```
Thanks for the suggestion and the code snippets.
i want to see how votes/comments accumulate over time on a post, so i would have to poll the “all” posts endpoint at a regular interval. but then i would either only see new posts with a small number of comments/upvotes, or posts that are already upvoted, or i would have to download all posts at a regular interval, which seems impossible to me.
Comments are also easy, the API allows pulling them by latest too. If I was writing a search engine, I would probably just track all known instances and just pull local content from them instead of deduplicating. I haven’t really looked at how votes are federated though, so that might be more complicated to keep updated.
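For example, polling an instance’s newest local comments could use the same style of endpoint as the post list in the snippet above. A small helper for building the polling URL might look like this - a sketch assuming the `comment/list` parameters mirror `post/list`:

```rust
// Builds the URL for polling an instance's local comments, newest first.
// Endpoint and parameter names are assumed to mirror the post list used
// above; verify against the instance's actual API version before use.
fn comment_list_url(instance: &str, page: usize) -> String {
    format!(
        "https://{}/api/v3/comment/list?listingType=Local&sort=New&page={}",
        instance, page
    )
}

fn main() {
    // → https://programming.dev/api/v3/comment/list?listingType=Local&sort=New&page=1
    println!("{}", comment_list_url("programming.dev", 1));
}
```

Using `listingType=Local` per instance avoids pulling the same federated content twice, which is the deduplication point made above.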
I expect syncing posts and comments from all instances to be mostly easy. In the past I was able to pull all posts and comments from smaller instances in less than 10 minutes. It’s mostly just text, so it doesn’t take that long. After the initial pull, it can be kept mostly up to date by pulling new content back to the last date received, which should take much less time than the first sync.
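The “pull back to the last date received” idea can be sketched as a simple cutoff over newest-first pages. This assumes the timestamps are RFC 3339 strings in the same timezone (as Lemmy returns them), where plain string comparison matches chronological order:

```rust
/// Walks a newest-first page of timestamps and keeps only items
/// published after the last successful sync. Stops at the first
/// older item, since everything after it was already synced.
fn newer_than<'a>(published: &'a [&'a str], last_sync: &str) -> Vec<&'a str> {
    published
        .iter()
        .copied()
        .take_while(|ts| *ts > last_sync)
        .collect()
}

fn main() {
    // Newest-first timestamps, as returned with sort=New.
    let page = [
        "2024-06-03T10:00:00Z",
        "2024-06-02T09:00:00Z",
        "2024-06-01T08:00:00Z",
    ];
    let fresh = newer_than(&page, "2024-06-02T00:00:00Z");
    println!("{:?}", fresh);
}
```

In a real sync loop you would apply this per page and stop paging as soon as the cutoff is hit.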
I’ve noticed there’s lots of stuff on Lemmy that fails to federate to other instances. I think there’s also actually a 3000€ reward at the moment for improving federation, so if you spend very much time on it, it might be a good idea to see if it can be claimed. Though, I don’t really know how the milestone system works, and it might only be available to inside contributors.