Yo, fellow enthusiasts! I hope you’re ready to talk about the importance of focusing on open source machine learning models.
We all know that time and energy are valuable resources, and we want to make sure that we are not stuck depending on Megacorps. That’s why I believe that open source machine learning models should be at the forefront of our priorities.
Here’s why:
Flexibility: Open source models give us the freedom to adapt and improve the algorithms to fit our specific needs. No more being limited by proprietary systems.
Collaboration: The beauty of open source is that it encourages collaboration and sharing among the community. Imagine having access to the knowledge and expertise of the best data scientists in the world, all working together to improve the model.
Cost-effectiveness: Open source machine learning models are typically free, making them a great choice for those who don’t have the resources for a proprietary solution.
Increased transparency: Proprietary machine learning models can often be difficult to understand, leaving us with limited insight into how they make predictions. With open source models, the code is readily available for examination, leading to greater transparency and accountability.
I hope you liked the joke, but seriously: let’s take advantage of these benefits and make open source models a priority.
We’ve passed the 500-tree mark in the ready_for_export state … 25% of our initial 2k-tree goal is complete. Big thanks to everyone who contributed so far!
Current message stats:
lang | count
-------+------
hu | 5
de | 1064
ja | 30
fr | 484
zh | 76
en | 10917
es | 171
pt-BR | 134
ru | 594
vi | 4
On the way to 50k
lang | count
-------+------
vi | 41
hu | 23
de | 1667
ja | 161
ko | 8
fr | 946
tr | 2
zh | 181
en | 19136
es | 1066
uk-UA | 32
pt-BR | 258
ru | 1789
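For anyone who wants to track progress toward the goal, the second table above can be tallied with a quick script (counts copied straight from the table; the 50k target comes from the heading):

```python
# Per-language message counts, as listed in the table above.
counts = {
    "vi": 41, "hu": 23, "de": 1667, "ja": 161, "ko": 8,
    "fr": 946, "tr": 2, "zh": 181, "en": 19136, "es": 1066,
    "uk-UA": 32, "pt-BR": 258, "ru": 1789,
}

GOAL = 50_000
total = sum(counts.values())
print(f"{total} / {GOAL} messages ({total / GOAL:.1%})")  # → 25310 / 50000 messages (50.6%)
```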
Latest update: we’re moving on to data filtering/training!
Hey @everyone I just wanted to give a short update on the current status of the project.
Most important thing: we’re currently processing the data and IT LOOKS AWESOME, no joke. We have around 7k complete trees to export, that’s over 100k(!) messages, meaning we’ve gotten millions of individual contributions, and that’s all thanks to YOU.
And the quality of the data is beyond what I ever expected. The vast majority of contributions are super high quality, people really try their best, and the filtering systems we have in place also work very well to weed out any crap. So huge respect to everyone and thank you so much. I’m very sure this will already be a huge contribution to the field, no matter what happens next, and I’m very sure no amount of paid crowd-workers from OpenAI can ever match the quality of people who are really putting their soul into something. And we have not only English data, but many, many languages, congratulations to all!
What’s happening now:
we are working on exporting v1 of the dataset: we need to run some cleaning, filter out PII as much as possible, scramble user IDs, etc., and determine the best export format. That doesn’t mean data collection stops, the more the merrier, and we hope we can release many future versions. Plus, now that we train models, we can interleave human and model data and really test how good they are!
as said, we are currently training the initial batch of models. So far it looks pretty good, but as also said, the data isn’t fully processed yet, so I hope those improve as well. Our goal is to soon release v1 of the data, models, and the inference system, although maybe those will come one by one; we’ll see
if you want to help, there are still plenty of GH issues, and of course data contributions are still extremely welcome!
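On the "scramble user IDs" step mentioned above: one common way to do this kind of pseudonymization is keyed hashing, so the same user always maps to the same opaque ID (keeping conversation threads linked) while the raw ID never appears in the export. A minimal sketch, assuming HMAC-SHA256 with a server-side secret (this is an illustration, not the project’s actual anonymization code):

```python
import hmac
import hashlib

# Hypothetical secret held server-side; if it is discarded after export,
# the pseudonyms cannot be linked back to real user IDs.
SECRET_KEY = b"replace-with-a-random-secret"

def scramble_user_id(user_id: str) -> str:
    """Map a raw user ID to a stable pseudonym via HMAC-SHA256."""
    digest = hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened pseudonym for readability

# The same input always yields the same pseudonym, so a user's messages
# stay grouped in the dataset, but distinct users get distinct pseudonyms.
```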