Reverse Engineering DUPR's Pickleball Rating Algorithm
2025-12-31
TL;DR: I built a machine learning tool that predicts DUPR rating changes with 86% accuracy using web scraping, Python, and Gradient Boosting. After scraping 6,844 player-match records from 35 players, I discovered DUPR uses a heavily modified ELO system with rating compression at the extremes and opponent strength as the dominant factor.

DUPR (Dynamic Universal Pickleball Rating) is the standard rating system for competitive pickleball, but it's not super clear how it gets calculated. For example, I could play somebody with a much lower DUPR than mine, and my DUPR could still go down if I don't absolutely crush them. Or I could play somebody with a higher DUPR and lose, and my DUPR could still go up. So I wanted to answer: can I predict my rating change before playing a match?

Existing solutions didn't help. There is some information on DUPR's website about match recency, volume, and reliability that is good conceptually but isn't specific. DUPR does have an app with a "forecast" tool where I can input 4 players (to simulate doubles), but it will only tell you what it predicts the score should be if you were to play. It won't tell you how ratings will change based on a hypothetical score you input.

## High-Level Architecture

Here's how the data flow worked.

1) Selenium scrapes player match histories from pickleball.com. It gets back HTML like this:

```html
<div class="hidden md:block">
  <table>
    <tr>
      <td>Dec 29, 2024</td>
      <td>Jessica Wang26 | F | WA, USA</td>
      <td>5.088 → 5.126 (+0.038)</td>
      <td>Partner: 5.353 → 5.383 (+0.030)</td>
      <td>Opponents: 5.148, 5.297</td>
      <td>Score: 15-1</td>
    </tr>
  </table>
</div>
```
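As a rough sketch of what that Selenium step can look like (the profile URL and the wait condition are placeholders, not the exact scraper from the repo):

```python
# Sketch of step 1: load a player's match-history page with Selenium so the
# JavaScript-rendered table exists in the page source.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def fetch_match_history_html(profile_url: str) -> str:
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(profile_url)
        # wait until the dynamically rendered match table shows up
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.hidden table"))
        )
        return driver.page_source
    finally:
        driver.quit()

# Hypothetical URL -- the real per-player profile URLs on pickleball.com differ.
html = fetch_match_history_html("https://pickleball.com/players/<player-id>")
```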
2) BeautifulSoup parses the rating data out of the dynamically rendered pages and returns records in this format:

```python
{
    'date': '2024-12-29',
    'team1_player1_rating_before': 5.088,
    'team1_player1_rating_after': 5.126,
    'team1_player1_rating_change': 0.038,
    'team1_player2_rating_before': 5.353,
    'team1_player2_rating_after': 5.383,
    'team1_player2_rating_change': 0.030,
    'team2_player1_rating_before': 5.148,
    'team2_player2_rating_before': 5.297,
    'team2_player1_rating_change': -0.093,  # Inferred (they lost)
    'team2_player2_rating_change': -0.175,
    'game1_team1_score': 15,
    'game1_team2_score': 1
}
```
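And a sketch of that parsing step, written against the exact markup in the snippet above; the live page layout may differ, and the opponents' rating changes still have to be inferred separately during processing:

```python
import re
from datetime import datetime
from bs4 import BeautifulSoup

RATING = re.compile(r"\d+\.\d+")

def parse_match_rows(html: str) -> list[dict]:
    """Turn match-history rows like the snippet above into flat records."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.select("div.hidden table tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) < 6:
            continue  # skip header or malformed rows
        date, _player, own, partner, opponents, score = cells[:6]
        own_before, own_after = map(float, RATING.findall(own)[:2])
        partner_before, partner_after = map(float, RATING.findall(partner)[:2])
        opp1_before, opp2_before = map(float, RATING.findall(opponents)[:2])
        team1_score, team2_score = map(int, re.findall(r"\d+", score)[:2])
        records.append({
            "date": datetime.strptime(date, "%b %d, %Y").strftime("%Y-%m-%d"),
            "team1_player1_rating_before": own_before,
            "team1_player1_rating_after": own_after,
            "team1_player1_rating_change": round(own_after - own_before, 3),
            "team1_player2_rating_before": partner_before,
            "team1_player2_rating_after": partner_after,
            "team1_player2_rating_change": round(partner_after - partner_before, 3),
            "team2_player1_rating_before": opp1_before,
            "team2_player2_rating_before": opp2_before,
            "game1_team1_score": team1_score,
            "game1_team2_score": team2_score,
        })
    return records
```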
3) pandas processes 6,844 player-match records into training data (the full processing code is in the GitHub repo linked under Resources).

4) A Gradient Boosting model is trained on 14 engineered features, such as score margin, surprise level, and opponent's average rating. For the example match above, the features for Team 1 Player 1 look like this:

```python
# For Team 1 Player 1 (rating 5.088)
features = {
    'won': 1,
    'rating_diff': 5.088 - 5.2225,                           # = -0.1345
    'score_margin': 14,
    'total_points': 14,
    'partner_diff': 5.088 - 5.353,                           # = -0.265
    'team_vs_opp': ((5.088 + 5.353) / 2) - 5.2225,           # = -0.002
    'won_x_rating_diff': 1 * -0.1345,                        # = -0.1345
    'won_x_score_margin': 1 * 14,                            # = 14
    'rating_squared': 5.088 ** 2,                            # = 25.888
    'surprise': 1 - (1 / (1 + 10**((5.2225 - 5.088)/4))),    # = 0.52
    'opp_spread': abs(5.148 - 5.297),                        # = 0.149
    'player_rating': 5.088,
    'partner_rating': 5.353,
    'opp_avg': 5.2225
}
```
Feature vector: `[1, -0.1345, 14, 14, -0.265, -0.002, -0.1345, 14, 25.888, 0.52, 0.149, 5.088, 5.353, 5.2225]`

Target: `0.038` (the rating change DUPR actually applied to this player)
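Putting steps 3 and 4 together, here's a minimal sketch of how a model like the deployed one can be trained on those rows. It assumes a list of engineered per-player dicts like the one above (here called `feature_rows`, a hypothetical name) and uses the "balanced" hyperparameters described later in the post; it is not the exact repo code:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

FEATURES = [
    "won", "rating_diff", "score_margin", "total_points", "partner_diff",
    "team_vs_opp", "won_x_rating_diff", "won_x_score_margin", "rating_squared",
    "surprise", "opp_spread", "player_rating", "partner_rating", "opp_avg",
]

# Hypothetical: `feature_rows` is a list of dicts like the one above, each also
# carrying its target under "rating_change" (the change DUPR actually applied).
df = pd.DataFrame(feature_rows)

X_train, X_test, y_train, y_test = train_test_split(
    df[FEATURES], df["rating_change"], test_size=0.2, random_state=42
)

model = GradientBoostingRegressor(
    n_estimators=100,      # "balanced" settings from the model comparison below
    max_depth=3,
    learning_rate=0.05,
    min_samples_leaf=10,
    random_state=42,
)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("R^2:", r2_score(y_test, pred))
print("MAE:", mean_absolute_error(y_test, pred))
```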
5) A Flask API serves predictions, and a static frontend sends match scenarios to the API. So I could send this hypothetical score:

```bash
curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{
    "team1_player1": 5.0,
    "team1_player2": 5.2,
    "team2_player1": 4.8,
    "team2_player2": 5.1,
    "team1_score": 11,
    "team2_score": 9
  }'
```

And the API would return this back:

```json
{
  "team1": {
    "player1": { "rating_before": 5.0, "rating_change": 0.022, "rating_after": 5.022 },
    "player2": { "rating_before": 5.2, "rating_change": 0.016, "rating_after": 5.216 }
  },
  "team2": {
    "player1": { "rating_before": 4.8, "rating_change": 0.042, "rating_after": 4.842 },
    "player2": { "rating_before": 5.1, "rating_change": 0.049, "rating_after": 5.149 }
  }
}
```
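The serving layer can be as small as the sketch below; the `build_features` helper and the model filename are illustrative assumptions, not the deployed code:

```python
# Minimal Flask sketch of the /predict endpoint. `build_features` is a
# hypothetical helper that computes the 14 features for one player's
# perspective, in the same order as the feature vector shown above.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("dupr_model.joblib")  # assumed filename

def build_features(player, partner, opp1, opp2, own_score, opp_score):
    opp_avg = (opp1 + opp2) / 2
    won = int(own_score > opp_score)
    expected = 1 / (1 + 10 ** ((opp_avg - player) / 4))
    return [
        won, player - opp_avg, abs(own_score - opp_score), own_score + opp_score,
        player - partner, (player + partner) / 2 - opp_avg,
        won * (player - opp_avg), won * abs(own_score - opp_score),
        player ** 2, won - expected, abs(opp1 - opp2), player, partner, opp_avg,
    ]

@app.route("/predict", methods=["POST"])
def predict():
    m = request.get_json()
    teams = {
        "team1": ((m["team1_player1"], m["team1_player2"]),
                  (m["team2_player1"], m["team2_player2"]),
                  m["team1_score"], m["team2_score"]),
        "team2": ((m["team2_player1"], m["team2_player2"]),
                  (m["team1_player1"], m["team1_player2"]),
                  m["team2_score"], m["team1_score"]),
    }
    out = {}
    for team, ((p1, p2), (o1, o2), own, opp) in teams.items():
        out[team] = {}
        for key, player, partner in (("player1", p1, p2), ("player2", p2, p1)):
            change = float(model.predict([build_features(player, partner, o1, o2, own, opp)])[0])
            out[team][key] = {
                "rating_before": player,
                "rating_change": round(change, 3),
                "rating_after": round(player + change, 3),
            }
    return jsonify(out)

if __name__ == "__main__":
    app.run(port=8080)
```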
## Model Types and Findings

Each graph shows how well a model predicts DUPR rating changes across all 13,832 training samples. The X-axis represents the actual rating change DUPR gave, while the Y-axis shows what each model predicted. The diagonal dashed line represents perfect prediction: if a model were 100% accurate, all dots would fall exactly on this line. You can visually see the improvement: poor models create horizontal clouds of dots, while good models create tight diagonal streams.

Before diving into the models, here's what the metrics mean:

👉 R² (R-squared): Measures how well the model explains the variance in rating changes. A score of 1.0 is perfect prediction; 0.0 means the model is no better than guessing the average. An R² of 0.86 means the model explains 86% of why ratings change the way they do.

👉 MAE (Mean Absolute Error): The average prediction error in rating points. An MAE of 0.11 means predictions are off by ±0.11 points on average.

Here are the 4 models we tested.

### 1) Linear Regression (Baseline)

What it is: A simple weighted sum of features. It assumes the relationship between inputs and rating change is a straight line.

Features used: won, rating_diff, score_margin, total_points

Why it failed: DUPR doesn't use linear math. A 4.0 player's rating change follows different rules than a 3.0 or 5.0 player's, and linear models can't capture these rating-level specific behaviors.

### 2) Linear Regression with Engineered Features
What it is: The same linear model, but with smarter features like an ELO-style "surprise" (how unexpected the outcome was) and interaction terms.

Features added: surprise (won - expected_outcome), won_x_rating_diff, partner_diff, rating_squared

Breakthrough moment: Adding the surprise feature (the ELO expected outcome) nearly 5x'd the R². This confirmed DUPR uses ELO-like logic. It still wasn't good enough, though, since linear models fundamentally can't handle DUPR's more complex rules.

### 3) Gradient Boosting - Aggressive

What it is: Gradient Boosting builds predictions by training a sequence of decision trees, where each tree learns to fix the mistakes of all previous trees. The first tree makes initial predictions based on the training data, but it won't be perfect; some predictions will be too high, others too low. The second tree doesn't try to predict the rating changes directly; instead, it predicts the errors (residuals) from the first tree. This correction is added to the first tree's prediction, creating a combined model that's more accurate. The process repeats: tree 3 predicts the remaining errors after trees 1 and 2, tree 4 fixes what tree 3 missed, and so on. (A toy illustration of this residual-fitting idea follows the model list below.)

Each tree is weighted by the learning rate (0.05 in our case), which controls how aggressively each new tree corrects previous mistakes. Lower learning rates make the model train more slowly but generalize better. After 100 trees, the final prediction is the sum of all the individual tree predictions, creating a powerful non-linear model that can capture complex patterns like "4.5+ players follow different rules than 3.0 players" without being explicitly programmed with those rules.

Hyperparameters: 150 estimators, max_depth=5, learning_rate=0.1

Why we didn't deploy it: It was too accurate on the training data, which means it is likely overfitting and would probably perform worse on unseen matches.

### 4) Gradient Boosting - Balanced ✅ (Deployed)

Hyperparameters: 100 estimators, max_depth=3, learning_rate=0.05, min_samples_leaf=10

This is the deployed model; its full feature set (14 features) and the rest of the stack are listed under Final Architecture below.
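If the residual-fitting process is hard to picture, here's a tiny toy example on synthetic data (not DUPR data, and a simplification of what scikit-learn actually does) showing how summing many shallow trees trained on residuals drives the error down:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, size=500)

learning_rate = 0.05
pred = np.zeros_like(y)              # start from a zero prediction
for i in range(100):                 # 100 "estimators"
    residual = y - pred              # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    pred += learning_rate * tree.predict(X)   # each tree nudges the prediction
    if i in (0, 9, 99):
        print(f"after {i + 1:3d} trees, training MSE = {np.mean((y - pred) ** 2):.4f}")
```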
## Key Findings About DUPR's Algorithm

After analyzing 6,844 player-match records with the trained model, here's what actually matters:

- Opponent strength relative to yours (correlation: -0.329)
  - Playing up (underdog): you gain rating even when losing (+0.049 avg)
  - Playing down (favorite): you lose rating even when winning (-0.148 avg)
  - Rating differential is THE dominant factor
- Your absolute rating level (correlation: -0.368)
  - Higher-rated players experience more deflation
  - 4.5+ players: wins are worth LESS than losses (-0.006 difference)
  - <3.0 players: heavy inflation regardless of outcome
- Expected outcome vs. actual outcome
  - DUPR uses an ELO-style surprise calculation (see the sketch after this list)
  - Upset wins are rewarded heavily (+0.400 for underdogs)
  - Expected wins are barely rewarded (-0.148 for favorites)

And here's what doesn't matter:

- Score margin (correlation: +0.018)
  - Winning 11-0 vs 11-9 changes your rating by only ~0.05 points
  - Blowouts don't give bonus points
- Win/loss alone (+0.040 difference)
  - Winning helps, but barely
  - WHO you beat matters 50× more than IF you beat them
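For reference, this is the ELO-style expectation behind the surprise feature, using the same scale divisor (4) as the worked example earlier; DUPR's actual internal constants are unknown, so treat the numbers as illustrative:

```python
def expected_outcome(player_rating: float, opp_avg: float) -> float:
    """ELO-style win expectation, with the scale divisor used in the features above."""
    return 1 / (1 + 10 ** ((opp_avg - player_rating) / 4))

def surprise(won: int, player_rating: float, opp_avg: float) -> float:
    return won - expected_outcome(player_rating, opp_avg)

# The 5.088 underdog beating opponents averaging 5.2225 (the example match):
print(round(surprise(1, 5.088, 5.2225), 2))   # 0.52 -> big positive surprise, big reward
# A hypothetical 5.5 favorite losing to opponents averaging 4.8:
print(round(surprise(0, 5.5, 4.8), 2))        # -0.6 -> big negative surprise, big penalty
```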
The Zero-Change Mystery:

- 10.3% of all matches result in exactly 0.000 rating change
- Happens at all rating levels, for both wins and losses
- Theory: DUPR applies a threshold rule when a match provides "no new information"
- More common in balanced matchups (±0.3 rating diff)
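That 10.3% figure is straightforward to measure from the training table. A small sketch, assuming the player-match DataFrame `df` from the training snippet earlier (with `rating_change` and `rating_diff` columns):

```python
zero = df["rating_change"] == 0.0
print(f"{zero.mean():.1%} of player-match records have exactly 0.000 change")  # ~10.3%

# Zero-change rate for balanced vs. lopsided matchups (within ±0.3 rating)
balanced = df["rating_diff"].abs() <= 0.3
print(df.groupby(balanced)["rating_change"].apply(lambda s: (s == 0.0).mean()))
```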
Strategic Implications:

- ✅ Play against higher-rated opponents (maximize rating gain)
- ✅ Win close matches as an underdog (+0.400 avg boost)
- ❌ Don't worry about score margin (negligible impact)
- ❌ Avoid being heavily favored (a >0.3 rating advantage means deflation)

Bottom line: DUPR is a heavily modified ELO system with rating compression at the extremes and aggressive deflation for favorites. The model achieved R² = 0.86, meaning it captures 86% of DUPR's logic; the missing 14% is likely hidden factors like match recency or tournament context.

Live demo: dupr-predictor.vercel.app
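If you want to sanity-check the "play up, don't play down" implication yourself, you can send two hypothetical scenarios to the prediction API from step 5 (the ratings here are made up, and the sketch assumes the local server from the curl example):

```python
import requests

def team1_player1_change(team1, team2, score):
    payload = {
        "team1_player1": team1[0], "team1_player2": team1[1],
        "team2_player1": team2[0], "team2_player2": team2[1],
        "team1_score": score[0], "team2_score": score[1],
    }
    r = requests.post("http://localhost:8080/predict", json=payload, timeout=10)
    return r.json()["team1"]["player1"]["rating_change"]

# Same 4.5 player, same 11-9 win -- against stronger vs. weaker opponents:
print(team1_player1_change((4.5, 4.5), (4.8, 4.8), (11, 9)))  # playing up
print(team1_player1_change((4.5, 4.5), (4.2, 4.2), (11, 9)))  # playing down
```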
## Goals & Non-Goals

Goals:

- Scrape enough real-life match data to reverse engineer the algorithm
- Build a predictive model based on that data with a "good enough" accuracy (good enough means I look at it and think it looks good enough)
- Deploy a web interface for predictions

Non-goals:

- Perfect accuracy (probably always off by a teeny bit)
- Singles ratings (focused on doubles only)
- Real-time scraping (one-time data collection was sufficient)

## Final Architecture

Tech stack:

- Scraper: Selenium + BeautifulSoup (Python)
- Data processing: pandas + numpy
- ML model: scikit-learn (Gradient Boosting Regressor)
- Backend: Flask API
- Frontend: Vanilla HTML/CSS/JS
- Deployment: Vercel (frontend) + Render (API)

Model results (MAE, lower is better):

- 1) Linear Regression (baseline): MAE = 0.389
- 2) Linear Regression with engineered features: MAE = 0.287
- 3) Gradient Boosting (aggressive): MAE = 0.089
- 4) Gradient Boosting (balanced, deployed): MAE = 0.114

Why the balanced Gradient Boosting model was deployed:

- 10× better R² than the baseline (0.12 → 0.86)
- Handles rating-level specific rules (3.0 players vs 5.0 players behave differently)
- Doesn't overfit

Feature set (14 features):

- Basic: won, score_margin, total_points
- Ratings: player_rating, partner_rating, opp_avg
- Derived: rating_diff, partner_diff, team_vs_opp, opp_spread
- Interactions: won_x_rating_diff, won_x_score_margin
- Non-linear: rating_squared
- ELO-style: surprise (won - expected_outcome)

## Resources

- GitHub: https://github.com/DaRubberDuckieee/pickleball-dupr-predictor
- Live Demo: dupr-predictor.vercel.app