Social Media Data Collection for Analysis

Hey all,

I was curious if anyone knows the go-to solution these days for collecting data from social media. In particular, I am interested in Twitter and, perhaps more importantly, Reddit. I would like to be able to scrape where APIs fall short, so it should have modules for web scraping.

I am aware of software like Huginn but am wondering if there's something better out there; specifically, I'd like it to work with Reddit (PRAW?).

The idea is to collect data to analyze who is posting what, when. Bonus points if I can do something like ETL for sentiment analysis (say, Google Cloud or similar). Then I need to figure out how to access this data with Python or R.
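
Roughly what I have in mind for the sentiment step, as a sketch only (this assumes the google-cloud-language 2.x client with application credentials already configured, and the sample text is made up):

```python
# Sketch only: assumes the google-cloud-language 2.x package is installed and
# GOOGLE_APPLICATION_CREDENTIALS points at a service account key.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="This forum thread went better than expected!",  # made-up sample
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_sentiment(request={"document": document})
# score runs -1..1 (negative..positive); magnitude is overall emotional strength
print(response.document_sentiment.score, response.document_sentiment.magnitude)
```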

It doesn't need to be low/no-code. Preference for standard academic or industry approaches, assuming they can be installed in a homelab for free…

Thanks!

You are not likely to get much of a positive response to this post on this forum, as this kind of tool is very abusable. While your intentions might be pure, others reading this forum may take this information and use it for nefarious purposes.

Abusable in which sense? Is public data not free to be collected and analyzed?

Is it any more abusive than scraping a retail website for PC component price changes? I think a reasonable response is no, though the sysadmin of the website you’re scraping probably feels differently.

If these social media outlets are really intended to be town squares, why the fear of sunshine? :laughing: I will figure it out either way…

That's something different entirely. I'm asking how to scrape publicly available data, not how to trick people into filling out surveys which are then sold on to political parties, insurers, and the like.

Obviously, but at the end of the day, it's how the general public reacts to and feels about these things. If this thread were an issue, I would have deleted/locked it. My post was just to inform you how the general public feels about it, since your original post has been flagged by community members specifically due to this concern.

Understood. Thank you.

Morally speaking, web scraping is a grey area; for some websites, scraping data is technically against the TOS. The same tech companies that brought us image and language models used web scraping to build their datasets.

If you want to stay in Python, you'll want to look into Selenium, as it is able to control a web browser and send commands to it. With user-agent setups, rate limiting, and other spoofing techniques, you can trick some websites into thinking you are a user browsing the page.
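
For instance, a minimal sketch of that pattern, assuming Selenium 4 with Chrome installed locally; the URL, CSS selector, and user-agent string are placeholders rather than a real target:

```python
# Minimal sketch: assumes Selenium 4+ (which manages chromedriver itself) and
# a local Chrome install. The URL and selector below are placeholders.
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")
# Present a plausible desktop user agent instead of the headless default.
options.add_argument(
    "--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

driver = webdriver.Chrome(options=options)
try:
    for page in range(1, 4):
        driver.get(f"https://example.com/posts?page={page}")  # placeholder URL
        for post in driver.find_elements(By.CSS_SELECTOR, ".post"):  # placeholder selector
            print(post.text)
        time.sleep(5)  # crude rate limiting; randomize this in practice
finally:
    driver.quit()
```

The sleep is the crudest possible rate limiter; in practice you would randomize delays and check the site's robots.txt before hammering it.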

Check out the libraries here

This is a good resource if the website has an API accessible on the page

If you are looking at Reddit, someone actually made their own API
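
And since PRAW came up in the original post, here is a minimal sketch of pulling who-posted-what-when from a subreddit; the credentials and subreddit name are placeholders you get by registering a script app on Reddit:

```python
# Minimal sketch with PRAW; client_id/client_secret are placeholders that come
# from registering a "script" app at https://www.reddit.com/prefs/apps
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder
    client_secret="YOUR_CLIENT_SECRET",  # placeholder
    user_agent="homelab-research-script by u/yourname",
)

# Newest submissions: who posted what, and when (created_utc is a Unix time).
for submission in reddit.subreddit("test").new(limit=25):
    print(submission.created_utc, submission.author, submission.title)
```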

If you want to work in web development, you might consider learning Puppeteer with JavaScript

On a side note, as you gather and analyze your data, you might want to consider using graphs, especially with social media.
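
As a sketch of the graph idea, assuming networkx; the names and interactions are invented stand-ins for whatever your scraper actually collects:

```python
# Sketch of the graph idea with networkx; the interactions are made up.
import networkx as nx

G = nx.DiGraph()
# Each pair is (commenter, original poster); repeats bump the edge weight.
interactions = [
    ("alice", "bob"), ("alice", "bob"), ("carol", "alice"), ("bob", "carol"),
]
for src, dst in interactions:
    if G.has_edge(src, dst):
        G[src][dst]["weight"] += 1
    else:
        G.add_edge(src, dst, weight=1)

# A first pass at spotting who drives activity in the community.
print(nx.degree_centrality(G))
```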

Also relevant

Abusable in the sense that someone can copy the information and sell it for profit to spammers, scammers, and any other nefarious individuals.
And those of us who value our privacy suffer for it.

Consider a massive increase in robocalls, spam bombs in your email, browser bombs on your computer.
Hijacked computers and phones.
Not to mention identity theft.

No matter how good the intentions are, telemetry is a very sore subject for most tech users in the world.

Awesome! Thanks so much!

That’s fair to a point. Apropos of nothing, I’m more interested in analyzing the abuse of social media by automated or otherwise intentional means.

However, people who value privacy probably know better than to post personally identifiable information online.

As to the rest of your comment, I don’t think my creating an activity (time of day) heatmap or doing sentiment analysis on select accounts would be relevant to spam, robocalls or more sinister things.

Telemetry is a sore point, but it is not necessarily all evil. Some telemetry is used by developers to fix bugs, and when we turn it off (like most of us do by default), the developers get less diagnostic info (alongside whatever else the marketing department wants :wink:) and consequently (perhaps!) driver support is worse!

What needs to be considered here is the fact that if you are in your home office scraping a site halfway across the globe, you are not directly connected to their server.
Your connection is routed through numerous ISPs and gateways.
While VPNs offer higher security, they are far from bulletproof.

Consider the data breaches that occur at major entities.
How much information was leaked?
And to whom?
I can tell you personally! It takes months to years to recover from identity theft.
Not to mention the toll the stress takes on you.

Design it and employ it as you wish; I'm not stopping you. I am just telling you that you might not get a warm reception from others.