
How Voting Algorithms Shape Decisions at Deep Projects

Gustavo Lodi

Mar 27, 2026


Introduction

Purpose of the experiment

What is the first thing someone thinks of when asked about the mechanics of voting? They either go deep and start theorizing about democracy, or they just think of the previous election (no in-between; I know that’s not reality, but it makes my point easier). Yes, we can and should converse on the bigger topics; nonetheless, this post is limited to voting algorithms. In a nutshell, I wanted to select a sample of voting processes and compare them using a sample of evaluators and a sample of objects to evaluate. A lot of samples, so take it easy with the generalization…

 

Well, here three methods are compared: pairwise, quadratic, and graded (score). This was done to understand, in a universe such as that of the Deep Funding reviewers, how each process would play out, how each would compare against the benchmark of the official community evaluations, and what each individual reviewer’s experience would be.

Initially, we began with 15 human reviewers, but then we decided to also explore Large Language Models, so some changes were made to the target metrics and to the scope of the original objectives.

Why it matters

It doesn’t… Just kidding! It matters because standards and rules matter, so the decision on how to compare the voting algorithms should be made logically, i.e., which processes are cheaper, more intuitive, and more accurate (accuracy as a broad term here, not diving into any statistical definition just yet). Also, could LLMs assist human reviewers? What degree of autonomy would we give them to decide for us? In a community like DEEP, how can we better select, filter, slice, and dice the ideas, projects, and proposals to always surface the better-explained, more scalable, more relevant ones?

If you are really interested, be my guest and delve into social choice theory and how different voting methods can yield different winners even with the same input preferences. I’m sure you will find plenty of interesting material from authors like Nicolas de Condorcet, Jean-Charles de Borda, and Kenneth Arrow (the first two were 18th-century Frenchmen; I wonder what happened there at that time…).

What this post will cover (and what it will not):

Covering…

  1. A walkthrough of the voting algorithms tested.
  2. How the experiment was set up.
  3. Evaluation methods and target metrics of comparison.
  4. Key insights.

Not covering…

  1. Homo Sapiens Sapiens’ generalization.
  2. An official ranking of voting methods to be interpreted as universal truth.
  3. Transformers and LLMs’ secret design decisions implemented by OpenAI.

 

Background: Voting Methods in Theory

What is a voting method

A function that aggregates individual preferences into a group decision.

Voting methods and how they work

  • Graded (Score) method: each reviewer assigns each proposal a score from 0 to 10, independently of the other proposals.

  • Quadratic method: each reviewer spends votes from a fixed token budget, with each additional vote on the same proposal costing quadratically more tokens, so strong preferences are expensive to express.

  • Pairwise method: elect the object that would win every head-to-head matchup. In our case, we need a score for every project/proposal, so we count the wins and normalize that value.

Why they differ

The same preference profile can produce different winners depending on the rule used. 

If not obvious enough, try evaluating the next movie you watch. Maybe it’s an excellent movie, you give it a perfect 10. Would you give it a perfect 10 when comparing it to Lord of the Rings: Return of the King (I’m a little biased)? Or would you lower the score? What if the vote needed to be distributed among the movies you saw today? Yes, the transformative black box representing the voting method matters!

Experiment Design

Here, the experiment was divided into three main parts: the people (the 15 human reviewers and an LLM with a personality disorder; a feature, not a bug, of course), the proposals (the objects to be evaluated; in this case, the Beneficial Generalized Intelligence (BGI) proposals submitted in 2025), and, last but not least, the data, i.e., the choices made by the reviewers on the proposals.

 

1.  The people

Reviewers were selected internally from the Review Circle, and a personality vector embedded in an LLM was used to simulate the other 15, non-human, reviewers.

 

2.  The proposals

15 projects were submitted during our BGI round, and each reviewer evaluated each proposal through the lens of each voting algorithm.

 

3.  The data

Each combination of reviewer, voting algorithm, and proposal resulted in a specific outcome. Some preprocessing was needed to end up with a resulting score and, later on, a ranking.

Graded (Score) voting was the easiest: there’s no evaluation relative to the rest of the proposals, so a project simply keeps the score it was given. Quadratic voting, on the other hand, takes the number of tokens spent relative to the total budget. Something similar occurs with pairwise voting: its score is the number of wins divided by the total number of match-ups.
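The three preprocessing rules above can be written as small helpers. This is a minimal sketch, not the actual pipeline: the function names, the 0-to-1 rescaling of graded scores, and the toy ballots are illustrative assumptions.

```python
# Illustrative sketch of turning one reviewer's raw ballots into
# comparable 0-1 scores under each voting method.

def graded_score(raw, max_score=10):
    """Graded (Score): the score as given, rescaled to 0-1 for comparability."""
    return {p: s / max_score for p, s in raw.items()}

def quadratic_score(tokens):
    """Quadratic: tokens spent on each proposal relative to the total spent."""
    total = sum(tokens.values())
    return {p: t / total for p, t in tokens.items()}

def pairwise_score(wins, n_proposals):
    """Pairwise: head-to-head wins divided by the match-ups each proposal faced."""
    matchups = n_proposals - 1  # each proposal meets every other proposal once
    return {p: w / matchups for p, w in wins.items()}
```

With these in hand, every (reviewer, method, proposal) cell of the dataset lands on the same 0-1 scale before ranking.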

After all this information is gathered, we can calculate some correlation metrics (Spearman’s rho, Kendall’s tau) and do some basic statistics on the dataset.
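For the curious, here is a hand-rolled sketch of both rank correlations. In practice a library routine (e.g. `scipy.stats.spearmanr` / `kendalltau`) would be used; the versions below ignore ties for brevity, and the numbers in the usage note are made up.

```python
def ranks(xs):
    """1-based ranks of a list of scores (no ties assumed, for brevity)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman's rho via the classic rank-difference formula."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall(xs, ys):
    """Kendall's tau: concordant minus discordant pairs, normalized."""
    n = len(xs)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            conc += s > 0
            disc += s < 0
    return (conc - disc) / (n * (n - 1) / 2)
```

For example, `spearman([0.9, 0.4, 0.7, 0.2, 0.8], [0.8, 0.3, 0.6, 0.1, 0.9])` gives 0.9: the two score lists order the proposals almost identically, with one swap at the top.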

 

Results

Let’s go back to our purpose here: compare how each voting algorithm performs against the rest and analyze LLM-aided evaluation.

This is how the distributions looked after the scores were preprocessed for each voting method.

Variance here is the population variance: the sum of squared differences between instances and the mean, divided by the number of instances. The average is the arithmetic mean: the sum of instances divided by their count (not to be mistaken for other measures of central tendency, such as the median or mode).
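As a quick sketch, these two definitions in code (the sample values are illustrative; Python's `statistics.pvariance` and `statistics.mean` compute the same quantities):

```python
def mean(xs):
    """Arithmetic mean: sum of instances divided by their count."""
    return sum(xs) / len(xs)

def population_variance(xs):
    """Population variance: average squared deviation from the mean (divide by N, not N-1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)
```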

 

Some binning (distribution of evaluation values in these buckets or intervals) was first applied to make the chart easier to understand, and then we can kind of get a sense of how each method works:

  • Graded (Score) method: a roughly “normal” distribution leaning toward the high end. Reviewers are (maybe) being optimistic because no relative comparison is made within the sample: each project is evaluated without any constraint with respect to the others.

  • Quadratic method: With each vote costing tokens, we can now understand the distribution skewing to the left. The evaluator has to remove tokens from a given project if they want to better score others.

  • Pairwise method: an absolute mess of a variance; no order, just chaos. A lot of range, no regions of concentration. A good reflection of a contradiction reviewers can run into: the Condorcet paradox, where you get a cycle you shouldn’t: project A > project B > project C > project A.
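The cycle at the heart of the paradox can be checked mechanically. A minimal sketch, where `beats` is a hypothetical map from each project to the set of projects it defeated head-to-head:

```python
def has_cycle(beats):
    """Depth-first search for a preference cycle such as A > B > C > A."""
    def visit(node, path):
        if node in path:
            return True  # we came back to a project already on this path
        return any(visit(nxt, path | {node}) for nxt in beats.get(node, ()))
    return any(visit(start, set()) for start in beats)
```

With `{"A": {"B"}, "B": {"C"}, "C": {"A"}}` this reports a cycle, while a transitive profile like `{"A": {"B", "C"}, "B": {"C"}}` does not; when no cycle exists, a Condorcet winner can be read off directly.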

 

Target metrics used and the associated dimensions’ results 

  • Correlation to Expert Benchmark: using Spearman and Kendall, two rank correlation metrics for ordinal data, we tried to understand the monotonic agreement between different rankings, i.e., whether they vary together (monotonically), so that increasing one would mean increasing the other. A perfect score is simply two rankings progressing together. The baseline was the official ranking produced by the DEEP experts, which didn’t build in any algorithm-specific bias. Pairwise had the best numbers on both correlation metrics, even though all measured values were really close to zero.

  • Mean Absolute Ranking Error (MARE) with Expert Benchmark: a bit self-explanatory; it measures how far a given evaluation’s ranking is from the experts’ benchmark. Pairwise had the lowest number, which makes it the best here.
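A minimal sketch of MARE, assuming (hypothetically) that each ranking is a map from proposal to rank position, with 1 being the best:

```python
def mare(method_rank, expert_rank):
    """Mean absolute difference between a method's rank positions and the experts'."""
    return sum(abs(method_rank[p] - expert_rank[p]) for p in expert_rank) / len(expert_rank)
```

If a method swaps the experts’ top two proposals but agrees everywhere else, each swapped proposal contributes an error of 1 and the rest contribute 0, so MARE stays small; larger displacements push it up.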

 

  • Intra-project Variance: measures how consistent a process is. If a really good proposal is evaluated by different processes and different reviewers, we should see a small variance. Here, Pairwise also stood out, but negatively: the highest value of the three.

 

  • LLM vs Human Correlation: lastly, the same correlation metrics were used to check whether the LLM-to-human agreement (and the other way around) was significant. Turns out it’s not (within our experiment). An interesting thing to note, though, is that the LLMs correlate strongly with the typical LLM (an aggregation of all 15), with an LLM-to-LLM Spearman of 0.9. Also, LLMs as a reviewer group have lower variance, which is something we could have mistakenly inferred from the correlation metrics alone (but in our case, it holds).

Not everything can be broken down by a metric, and some things remain as qualitative indicators. Despite having correlation metrics, mean absolute error, and variance to gauge consistency, we also wanted to understand whether an algorithm was easy to understand, how time-consuming it was, and what the reviewers’ personal preferences were. Here’s a more concise display of everything put together:


Demystifying famous metrics:

Spearman’s Rho (ρ): A rank correlation coefficient. Measures whether the ordering of items is preserved (monotonic relationship). Range: -1 (perfect inverse) → 0 (no correlation) → +1 (perfect match). Higher = better alignment with real rank. 

Kendall’s Tau (τ): Another rank correlation metric, based on pairwise concordance/discordance. Range: -1 (perfect inverse) → 0 (no agreement) → +1 (perfect match). More robust when ties or small ranking sets exist. Higher = better.

Conclusion

In the end, we don’t have an end. As mentioned, this is an experiment without any inferential analysis. What we set out to do was start a conversation about best practices when it comes to (again) choosing a voting process and LLM-aided evaluation.

With that said, here are some other not-to-generalize conclusions:

  1. Pragmatic considerations: Some methods may be easier to explain and apply, even if they are less precise on certain metrics. This should be taken into account: the learning curve can influence the results even when we supposedly have a more robust method.
  2. Behavioral outcomes: In our practice, reviewers’ choices show that different methods influence results differently. We don’t need metrics for that; just go ahead and visualize the distributions (the same proposals going through different processes are evaluated differently).
  3. Information trade-offs: Methods that capture intensity of preferences may yield richer insights, but at the cost of complexity. This held here, with Pairwise and Quadratic not being as preferred as Graded (Score) voting.
  4. Normative principles: No method perfectly satisfies all axiomatic criteria; paradoxes and trade-offs are inherent (e.g., Condorcet).

 

In light of this, the analysis suggested that the choice of a “best” voting method should not rest solely on quantitative performance (MARE, Spearman, Kendall), but must also consider contextual priorities: whether simplicity, fairness, robustness, or expressiveness is most valued. 

Ultimately, the dashboard developed here enables stakeholders to navigate these trade-offs: 

  • Reviewers can reflect on how their preferred methods align with objective benchmarks. 
  • Organizers can decide whether to prioritize accuracy, stability, or ease of use.

 

Also, going back to the second goal, LLMs are ultimately a tool. We found it hard to correlate them with human evaluations, but they sure are consistent. It will depend on numerous factors, such as how the model is trained, how it’s prompted to act, what criteria it’s using, …

We don’t want to blindly use LLMs that are just predicting the next word without strict, deterministic rules. But we also don’t want to ignore technological trends that could make our lives easier.

I definitely vouch for a hybrid approach, with LLMs helping humans by:

  • Enforcing clearer internal logic.
  • Reducing arbitrary rank inversions.
  • Making scoring and ranking more aligned.

And humans, in turn, helping LLMs by:

  • Providing contextual grounding.
  • Correcting over-regularized or overly literal reasoning.
  • Injecting expert intuition where formal criteria fall short.

We should talk, experiment, test, and collaborate. Not everything is set in stone, and procedures can and should be optimized and improved.

I hope that, with this little venture, we can start other, more defining challenges that will bring us closer to the creation of a democratic, decentralized, beneficial AGI by fulfilling ideas and empowering many builders across our big, blue world.

 

References

  • Stanford Encyclopedia of Philosophy — Voting Methods

  • 🔗 The tool