How Voting Algorithms Shape Decisions at Deep Projects
Purpose of the experiment
What is the first thing someone thinks of when asked about the mechanics of voting? They either go deep and start theorizing about democracy, or they just think of the previous election (no in-between, I know that’s not reality, but it makes my point easier). Yes, we can and should discuss the bigger topics; nonetheless, this post is limited to voting algorithms. In a nutshell, I wanted to select a sample of voting processes and compare them using a sample of evaluators and a sample of objects to evaluate. A lot of samples, so take it easy with the generalization…
Well, here three methods are compared: pairwise, quadratic, and graded (score) voting. The goal was to understand, in a universe such as the DEEP Funding reviewers’, how each process would play out, how it would compare against the benchmark of the official community evaluations, and what each individual reviewer’s experience would be.
Initially, we began with 15 human reviewers, but then we decided to explore Large Language Models (LLMs), so some changes were made to the target metrics and to the scope of the original objectives.
Why it matters
It doesn’t… Just kidding! It matters because standards and rules matter. So the decision on how to compare the voting algorithms should be made logically, i.e., which processes are cheaper, more intuitive, and more accurate (accuracy as a broad term here, not diving into any statistical definition just yet). Also, could LLMs assist human reviewers? What degree of autonomy would we give them to decide for us? In a community like DEEP, how can we better select, filter, slice, and dice the ideas, projects, and proposals so we always surface the better-explained, more scalable, more relevant ones?
If you are really interested, be my guest and delve into social choice theory and how different voting methods can yield different winners even with the same input preferences. I’m sure you will find plenty of interesting material from authors like Nicolas de Condorcet, Jean-Charles de Borda, and Kenneth Arrow (the first two were 18th-century Frenchmen; I wonder what was happening there at the time…)
What this post will cover (and what it will not):
Covering…
Not covering…
A function that aggregates individual preferences into a group decision.
Voting methods and how they work
Why they differ
The same preference profile can produce different winners depending on the rule used.
If that’s not obvious enough, try rating the next movie you watch. Maybe it’s an excellent movie and you give it a perfect 10. Would you still give it a perfect 10 when comparing it to The Lord of the Rings: The Return of the King (I’m a little biased)? Or would you lower the score? What if the vote had to be distributed among all the movies you saw today? Yes, the transformative black box representing the voting method matters!
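To make this concrete, here is a tiny sketch with made-up numbers (not data from this experiment) where graded and pairwise voting crown different winners from the exact same scores:

```python
# Hypothetical scores from three reviewers: proposal A has one enthusiastic
# backer, proposal B has broad lukewarm support.
scores = {"A": [10, 4, 4], "B": [5, 5, 5]}

# Graded (score) voting: rank by the arithmetic mean of the raw scores.
graded = {p: sum(s) / len(s) for p, s in scores.items()}
graded_winner = max(graded, key=graded.get)  # A wins (6.0 vs 5.0)

# Pairwise voting: each reviewer backs whichever proposal they scored higher.
wins = {"A": 0, "B": 0}
for a, b in zip(scores["A"], scores["B"]):
    wins["A" if a > b else "B"] += 1
pairwise_winner = max(wins, key=wins.get)  # B wins (2 of 3 head-to-heads)

print(graded_winner, pairwise_winner)  # A B
```

Same preferences in, different winner out, depending only on the aggregation rule.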
Here, the experiment was divided into three main parts: the people, i.e., the 15 human reviewers and an LLM with a personality disorder (a feature, not a bug, of course); the proposals, the objects to be evaluated, in this case the Beneficial Generalized Intelligence (BGI) proposals submitted in 2025; and, last but not least, the data, i.e., the choices the reviewers made about the proposals.
1. The people
Reviewers were selected internally from the Review Circle, and a personality vector embedded in an LLM was used to simulate the other 15 non-human reviewers.
2. The proposals
15 projects were submitted during our BGI round, and each reviewer evaluated every proposal through the lens of each voting algorithm.
3. The data
Each combination of reviewer, voting algorithm, and proposal resulted in a specific outcome. Some preprocessing was needed to arrive at a resulting score and, later on, a ranking.
Graded (score) voting was the easiest: there’s no evaluation relative to the subset of proposals, so a project simply keeps the score it was given. Quadratic voting, on the other hand, uses the number of tokens allocated to a project relative to the total amount. Something similar happens with pairwise voting: its score is the number of wins divided by the total number of match-ups.
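A minimal sketch of the three normalizations just described (the function names and the 0-to-10 graded scale are my assumptions, not the experiment’s exact implementation):

```python
def graded_score(raw: float, scale: float = 10.0) -> float:
    """Graded voting: the raw score, normalized by the scale's maximum."""
    return raw / scale

def quadratic_score(tokens: float, total_tokens: float) -> float:
    """Quadratic voting: tokens spent on a proposal over the total available."""
    return tokens / total_tokens

def pairwise_score(wins: int, matchups: int) -> float:
    """Pairwise voting: head-to-head wins over total match-ups."""
    return wins / matchups

# e.g. a proposal scored 8/10, given 25 of 100 tokens, winning 9 of 14 duels:
print(graded_score(8), quadratic_score(25, 100), pairwise_score(9, 14))
```

All three land in [0, 1], which is what makes the cross-method comparisons below possible.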
After all this information is gathered, we can calculate some correlation metrics (Spearman’s rho, Kendall’s tau) and do some basic statistics on the dataset.
Let’s go back to our purpose here: to compare each voting algorithm against the rest and to analyze LLM-aided evaluation.
This is how the distributions looked after the scores were preprocessed for each voting method.
Variance here is the population variance: the sum of squared differences between each instance and the mean, divided by the number of instances. And the average is the arithmetic mean: the sum of the instances divided by their count (not to be mistaken for other measures of central tendency, such as the median or the mode).
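In code, those two definitions look like this (equivalent to Python’s built-in `statistics.mean` and `statistics.pvariance`):

```python
def mean(xs):
    """Arithmetic mean: sum of the instances divided by their count."""
    return sum(xs) / len(xs)

def pop_variance(xs):
    """Population variance: mean squared deviation from the mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(mean([1, 2, 3, 4]), pop_variance([1, 2, 3, 4]))  # 2.5 1.25
```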
Some binning (grouping the evaluation values into buckets, or intervals) was first applied to make the chart easier to read, and then we can get a sense of how each method behaves:
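The binning step can be sketched like this (0.1-wide buckets over normalized [0, 1] scores are my assumption; the actual chart may use different intervals):

```python
from collections import Counter

def bin_scores(scores, width=0.1):
    """Count how many normalized scores fall into each width-wide bucket,
    clamping 1.0 into the last bucket so [0, 1] maps to exactly 1/width bins."""
    n_bins = round(1 / width)
    return Counter(min(int(s / width), n_bins - 1) for s in scores)

# 0.05 -> bucket 0, 0.15 and 0.17 -> bucket 1, 1.0 -> last bucket (9)
print(bin_scores([0.05, 0.15, 0.17, 1.0]))
```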
Target metrics used and the associated dimensions’ results
Not everything can be broken down into a metric, and some things remain qualitative indicators. Besides having correlation metrics, mean average error, and variance to get a sense of consistency, we also wanted to understand whether an algorithm was easy to understand, how time-consuming it was, and what the reviewers’ personal preferences were. Here’s a more concise display of everything put together:
Demystifying famous metrics:
Spearman’s Rho (ρ): A rank correlation coefficient. Measures whether the ordering of items is preserved (monotonic relationship). Range: -1 (perfect inverse) → 0 (no correlation) → +1 (perfect match). Higher = better alignment with real rank.
Kendall’s Tau (τ): Another rank correlation metric, based on pairwise concordance/discordance. Range: -1 (perfect inverse) → 0 (no agreement) → +1 (perfect match). More robust when ties or small ranking sets exist. Higher = better.
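For the curious, here is a plain-Python sketch of both coefficients for tie-free rankings (real analyses would typically use `scipy.stats.spearmanr` and `scipy.stats.kendalltau`, which also handle ties; the example rankings are hypothetical):

```python
from itertools import combinations

def spearman_rho(r1, r2):
    """Spearman's rho for two tie-free rankings of the same n items."""
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def kendall_tau(r1, r2):
    """Kendall's tau: (concordant - discordant) pairs over all pairs."""
    n = len(r1)
    s = sum(1 if (r1[i] - r1[j]) * (r2[i] - r2[j]) > 0 else -1
            for i, j in combinations(range(n), 2))
    return s / (n * (n - 1) / 2)

human = [1, 2, 3, 4, 5]   # hypothetical benchmark ranking
method = [2, 1, 3, 5, 4]  # hypothetical ranking from one voting method
print(spearman_rho(human, method), kendall_tau(human, method))  # 0.8 0.6
```

Note how two adjacent swaps cost less under tau (0.6) than a full reversal would, which is why tau is often preferred for small ranking sets like our 15 proposals.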
In the end, we don’t have an end. As mentioned, this is an experiment without any inferential analysis. What we set out to do was start a conversation about best practices when it comes to (again) choosing a voting process and LLM-aided evaluation.
With that said, here are some other conclusions (not to be generalized):
In light of this, the analysis suggested that the choice of a “best” voting method should not rest solely on quantitative performance (MARE, Spearman, Kendall), but must also consider contextual priorities: whether simplicity, fairness, robustness, or expressiveness is most valued.
Ultimately, the dashboard developed here enables stakeholders to navigate these trade-offs:
Also, going back to the second goal: LLMs are ultimately a tool. We found it hard to correlate the LLM’s scores with human evaluations, but they sure are consistent. The results will depend on numerous factors, such as how the model is trained, how it’s prompted to act, what criteria it’s using, …
We don’t want to blindly use LLMs that just predict the next word without strict, deterministic rules. But we also don’t want to ignore technological trends that could make our lives easier.
I definitely vouch for a hybrid approach, with LLMs helping humans by:
Injecting expert intuition where formal criteria fall short.
We should talk, experiment, test, collaborate, and begin discussions. Not everything is set in stone, and procedures can and should be optimized and improved.
I hope that, with this little venture, we can start other, more defining challenges that will bring us closer to the creation of a democratic, decentralized, beneficial AGI by fulfilling ideas and empowering many builders across our big, blue world.