Can I use OpenVPN (or similar) to buffer very bursty connections?

This may be a stupid question. I'm a computer science student and I'm harvesting financial data. It's quite a bit, usually 3-5mbit/s, but it's very bursty.
When the market is busy it maxes out my 50mbit/s internet connection for a couple of seconds and then falls back to 5-10mbit/s. But I seem to be missing data at the times when my connection is maxed out.

So I thought it might be possible to set up a VPN on a 100mbit+ connection and max out the network buffers and queue length.

Or is there a solution other than faster internet?

Any reason you're not just harvesting the data from a cloud compute instance? AWS, Azure, Linode, and Digital Ocean will all handle 500Mbit+ on their cheapest plans.


If the weakest link is your connection, the slight overhead of OpenVPN might not help.

Could you look at a proxy host, like on a VPS? They should have plenty wide enough pipes to ingest, if you can then get the data down off it.

Faster internet would remove the bottleneck, but might be a longer-term investment than a remote machine?

Yes, I've looked into a VPS, but as I said I'm a student and found the prices not very attractive, since it will be months until I've gathered a useful amount of data. Also I do quite a lot of processing, and the burstiness also applies to the CPU, so it would be running at 10% util just to have the headroom to cope with the bursts.

Also I'm at the point where I get rate-limited by the exchanges when I try to harvest more data, so two VPNs would allow me to harvest twice as much.
Besides that, it's absolutely infuriating that 99% of the time my connection is enough.
And if I pay a bunch of money, I might as well get faster internet.

Any VPN that is effectively adding an extra hop isn't going to do you much good, as it's just another point to lose packets.

The only way it can buffer is if it has really big queues, or if the VPN somehow splits the TCP connection and buffers those megabytes itself.

I don't know what transfer or queueing tech you are using to get the exchange data, but what would help is a machine that can split the connection, or some message queue system that acknowledges all messages so nothing is lost.

Do the exchanges not offer an option to have guaranteed delivery to avoid data loss?

So we can get free credit for cloud providers, which resolves the bandwidth issue.

$100 free: $100 Linode Credit | Linode
Azure/DO credit: GitHub Student Developer Pack - GitHub Education

Depending on how hard it is to scrape the data, we could discuss a worker model to spread the requests across a few IPs, though the credit should be a good start!

You could use a VPN, but it's the suckiest way on earth to do it.

How long does the “burst” last? … is this when exchanges open?

I think I'd consider using the $100 Linode credit to get a $5/month VPS and would just leave the thing running, writing to disk. (The $100 you get for free will last you over a year and a half on the cheapest VPS, and you get 1TB of transfer per month, which works out to roughly 3mbps average.)

You can SFTP-mount the home dir from Linode and tail the data file (or files, if you e.g. split them hourly) as they're being written, and ingest them locally.

You don't need tail if you don't care about pseudo-realtime processing: you can download to a file with a .part at the end, and change to a new file every hour. Once you close the file, rename it to remove the .part. That indicates it's complete for processing.
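A minimal sketch of that rotation scheme in Python (the hourly naming and the .jsonl extension are my assumptions, not anything specific from this thread):

```python
import os
import time

def open_segment(directory, now=None):
    """Open a new hourly segment file with a .part suffix (still being written)."""
    now = now or time.gmtime()
    name = time.strftime("%Y%m%d-%H", now) + ".jsonl.part"
    return open(os.path.join(directory, name), "a")

def finalize_segment(f):
    """Close the segment and drop the .part suffix to mark it complete."""
    f.close()
    final = f.name[:-len(".part")]
    os.rename(f.name, final)
    return final
```

The consumer side only ever picks up files without the .part suffix, so a half-written segment is never processed.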

This way, you’ll even have the data recorded, so you can tweak stuff.

But if you want a more realtime-y, maybe a few seconds delayed feed, just use tail to pipe the data into your app over stdin, or implement "tail" in your language of choice.
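Implementing a basic "tail -f" yourself is only a few lines; a Python sketch (the polling interval is an arbitrary choice):

```python
import time

def follow(path, poll_interval=0.5):
    """Yield lines appended to `path`, polling like `tail -f`."""
    with open(path, "r") as f:
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(poll_interval)  # nothing new yet, wait and retry
```

Unlike tail -f, this starts from the beginning of the file; add `f.seek(0, 2)` before the loop if you only want lines written after startup.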

To get an even more real-time feed (sub-1s), you could try stuffing the data into Kafka on your VPS, but implementing your own simple/slow pub-sub in something like Python or Go is a piece of cake.
I wish more people would do this at least once; it's a very useful programming exercise for distributed computing environments, because you can organize servers into a tree and serve many, many watchers or subscribers from a single source that wouldn't be able to handle them otherwise.

(Just don't expose this hacky server to the internet; expose it to localhost or a Unix socket only, and use SSH port forwarding if you want to develop this.)
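The core of such a pub-sub is just fan-out: every subscriber gets its own queue, and publishing copies each message into all of them. A minimal in-process Python sketch (the socket/SSH plumbing around it is left out):

```python
import queue
import threading

class PubSub:
    """Tiny fan-out hub: publish() copies a message into every subscriber's queue."""

    def __init__(self):
        self._lock = threading.Lock()
        self._subscribers = []

    def subscribe(self):
        """Register a new subscriber and return its private queue."""
        q = queue.Queue()
        with self._lock:
            self._subscribers.append(q)
        return q

    def publish(self, message):
        """Deliver one message to every current subscriber."""
        with self._lock:
            for q in self._subscribers:
                q.put(message)
```

Each subscriber drains its queue at its own pace; wrap the queues in socket writers bound to localhost and you have the hacky server described above. Chaining hubs, where one hub subscribes to another, gives you the tree shape.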


May I ask what you're using to stream the exchange data? What does the API look like? Is it free to use? Do you have your code in some public repo?


VPN stuff isn't my domain, but I am familiar with university life. At my university, there were compute clusters available to faculty and students in research, in addition to Linux workstations that engineering students could remote into. I'd recommend speaking with your academic advisor to see if they could connect you to resources or people who might be able to help with your project.

Alternatively, you could apply for the deals mentioned earlier, but you might be more limited compared to the university.

Thank you all for your suggestions; cloud seems to be the way to go.
My issue with shared cloud instances is that I can't judge how much CPU performance I have, and since they are shared the performance is variable (at least to my understanding). Is there a VM configuration to emulate the CPU performance, or would I need to test all the hyperscaler free offerings? A quick Google search yields that Azure, AWS, Google and Oracle have free instances.

I think the free credit expires after 60 days, so you will only have it free for 2 months!

AWS does offer 12 months of their basic VPS for free. I can't remember the specs off the top of my head, but I do know that when you do pay, Linode is cheaper. I was paying $10 USD/mo for that formerly free AWS instance after the 12 months expired, whereas Linode gives me the equivalent resources for half that!

Looking at AWS, the t4g instance seems to be exactly what I want. I don't need much RAM at all. Anybody have experience with t4g?

Why would you need a lot of CPU?

I'm looking at t4g.nano/micro/small; I wouldn't classify that as a lot of CPU.

t4g is not based on x86 but on ARM for its CPU. No issue if the code/programs you use support it; otherwise you MAY have more work compiling things.

Outside of this give it a go! AWS is a common cloud provider and any skills you gain here will help in future employment.

Yes, I know; it's Java and Python scripts.
However, I use zstd-jni, which is C-linked I guess, and I don't quite know if they have a compatible package:
https://repo1.maven.org/maven2/com/github/luben/zstd-jni/1.5.1-1/
They've got a zstd-jni-1.5.1-1-linux_arm.jar, but I don't have any ARM experience besides Apple M1.
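One thing worth checking before picking a jar classifier: what architecture string the instance actually reports, since 32-bit "arm" and 64-bit "aarch64" are different targets. A quick check (Python, since the stack already includes Python scripts):

```python
import platform

# AWS Graviton (t4g) instances typically report "aarch64";
# Intel/AMD instances report "x86_64".
print(platform.machine())
```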


I think this is the biggest free tier, especially if you go with ARM (4 cores, 24G RAM, 10TiB egress/mo): Cloud Free Tier | Oracle Ireland

(haven’t used personally, but I’m now tempted to open an account).


I'm looking at t4g.nano/micro/small; I wouldn't classify that as a lot of CPU.

Yeah, but I'm thinking, if you're just relaying a few megabits of data, surely a potato is good enough for that?


Well, the issue is that the bursts are about 50+ mbits, maybe 100mbits.
The bursts are the issue, so I want headroom.

Thanks for sharing! That was enough to convince me to try it out. I launched just a 1-ARM-core, 6GB Oracle Linux instance and I'm thoroughly impressed with the network performance. Going to set it up as a WireGuard/Pi-hole server to see if it can replace my existing $6/m service.

I'm getting a consistent 500Mbit up/down on Oracle's free tier with no jitter. Might want to consider trying it before AWS.

Speedtest Results

Speedtest by Ookla

 Server: Whitesky Communications LLC - Ashburn, VA (id = 23373)
    ISP: Oracle Cloud
Latency:     0.50 ms   (0.00 ms jitter)

Download: 492.94 Mbps (data used: 682.7 MB )
Upload: 484.07 Mbps (data used: 695.7 MB )
Packet Loss: 0.0%
Result URL: Speedtest by Ookla - The Global Broadband Speed Test
[opc@instance-20220108-1248 ~]$ ./speedtest

Speedtest by Ookla

 Server: Netprotect - Ashburn, VA (id = 37568)
    ISP: Oracle Cloud
Latency:     0.52 ms   (0.03 ms jitter)

Download: 495.22 Mbps (data used: 677.6 MB )
Upload: 490.50 Mbps (data used: 618.0 MB )
Packet Loss: Not available.
Result URL: Speedtest by Ookla - The Global Broadband Speed Test
[opc@instance-20220108-1248 ~]$ ./speedtest

Speedtest by Ookla

 Server: Whitesky Communications LLC - Ashburn, VA (id = 23373)
    ISP: Oracle Cloud
Latency:     0.50 ms   (0.00 ms jitter)

Download: 489.15 Mbps (data used: 665.7 MB )
Upload: 495.24 Mbps (data used: 688.4 MB )
Packet Loss: 0.0%
Result URL: Speedtest by Ookla - The Global Broadband Speed Test
[opc@instance-20220108-1248 ~]$ ./speedtest

Speedtest by Ookla

 Server: Netprotect - Ashburn, VA (id = 37568)
    ISP: Oracle Cloud
Latency:     0.51 ms   (0.03 ms jitter)

Download: 485.59 Mbps (data used: 694.4 MB )
Upload: 498.68 Mbps (data used: 653.5 MB )
Packet Loss: Not available.
Result URL: Speedtest by Ookla - The Global Broadband Speed Test

I meant CPU headroom, because I do quite a lot of parsing. And the bursts might be 100mbit, so the CPU has to be able to keep up, so it has to be overprovisioned.

But I might as well try the Oracle ARM server.

@Four0Four could you run some CPU benchmarks?