Python Logistic Regression: A Practical Guide with scikit-learn and statsmodels (p-values, Odds Ratios, and ROC)
2026-01-23
admin
Whether you're predicting the probability of an event or analyzing how specific factors influence an outcome, logistic regression remains one of the most fundamental tools in data science. In business settings, it's frequently used for binary classification problems, such as "Will a customer buy this product?" or "Is this email spam?"

When implementing it in Python, your choice of library depends on your goal: scikit-learn when prediction is the priority, statsmodels when statistical inference is.

In this guide, we'll walk through implementation using both libraries and cover essential interpretation techniques such as odds ratios and ROC curves.

## What is Logistic Regression?

Despite its name, logistic regression is primarily used for classification, not for predicting a continuous value. It predicts the probability that an observation belongs to one of two classes (0 or 1).

## Difference from Linear Regression

While linear regression predicts a continuous numerical value, logistic regression passes a linear combination of the features, z, through the sigmoid function 1 / (1 + e^(-z)), which squashes the output into the range between 0 and 1. If the output exceeds a threshold (typically 0.5), the observation is classified as "1" (event occurred); otherwise, it is "0" (event did not occur).

## Preparation: Loading and Scaling Data

We'll use the "Breast Cancer Wisconsin" dataset, a classic binary classification problem included in scikit-learn. In this dataset, a target of 0 means malignant and 1 means benign.

### Why Standardize?

Logistic regression is sensitive to the scale of input features, particularly when regularization is applied, as it is by default in scikit-learn. Using StandardScaler to rescale each feature to a mean of 0 and a variance of 1 helps the solver converge faster and makes the resulting coefficients comparable. The scaler is fitted on the training set only and then applied to the test set, so no information from the test set leaks into training.

## Implementation 1: scikit-learn (Machine Learning Focus)

scikit-learn is the go-to library for building predictive models. It's concise and follows a standardized workflow: initialize, fit, predict, evaluate.

The output report provides not just accuracy but also precision, recall, and the F1-score for each class, giving a fuller picture of the model's performance.

## Implementation 2: statsmodels (Statistical Focus)

If you need to know which variables are statistically significant, rather than only what the model predicts, statsmodels is the better choice.

In the resulting summary, look for the P>|z| column. A p-value below 0.05 is the conventional cutoff for calling a coefficient statistically significant, meaning it is distinguishable from zero given the other features in the model.

## Interpreting Results: Odds Ratios

In business, explaining raw coefficients can be difficult. Odds ratios are much more intuitive. The odds of an event are the ratio of the probability of it happening to the probability of it not happening, p / (1 - p). An odds ratio compares the odds after a one-unit increase in a feature with the odds before it. Since logistic regression coefficients are on the log-odds scale, we convert them to odds ratios with the exponential function.

If an odds ratio is greater than 1, an increase in that feature increases the probability of the target being "1"; if it is less than 1, the probability decreases. For example, an odds ratio of 2.5 means that for every one-unit increase in the feature, the odds of the event are multiplied by 2.5. Because the features were standardized, "one unit" here is one standard deviation of the original feature.

## Visualizing Accuracy: ROC Curve and AUC

To evaluate how well your model distinguishes between the two classes, we plot the ROC curve and calculate the AUC (Area Under the Curve).

AUC ranges from 0 to 1: 0.5 corresponds to random guessing and 1.0 to a perfect model, while a score below 0.5 means the model ranks the classes worse than chance. The closer the curve is to the top-left corner, the better the model separates the classes.

## Conclusion

Building a logistic regression model is straightforward, but interpreting it correctly is where the real value lies. Use scikit-learn for quick predictive modeling, and turn to statsmodels when you need to justify your findings with statistical rigor.
Originally published at: https://code-izumi.com/python/logistic-regression/
Code for "Preparation: Loading and Scaling Data":

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name='target')

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Standardization: crucial for logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
```
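As a quick sanity check on the standardization step, the sketch below confirms that the scaled training features end up with a mean of roughly 0 and a standard deviation of roughly 1, while the test features, which are transformed with the training statistics, only come close. It reloads the data so that it runs on its own; the variable names `train_means`, `train_stds`, and `test_means` are illustrative and not part of the original walkthrough.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data and split settings as in the preparation step
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Per-feature statistics after scaling
train_means = X_train_scaled.mean(axis=0)
train_stds = X_train_scaled.std(axis=0)
test_means = X_test_scaled.mean(axis=0)

print("Largest |mean| on train:", np.abs(train_means).max())
print("Std range on train:", train_stds.min(), train_stds.max())
print("Largest |mean| on test:", np.abs(test_means).max())
```

The test-set means are not exactly zero because the scaler never saw the test rows. That is the intended behavior, not a bug.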
Code for "Implementation 1: scikit-learn (Machine Learning Focus)":

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Initialize model
# 'C' is the inverse regularization strength (smaller means stronger regularization)
model = LogisticRegression(C=1.0, random_state=42)

# Training
model.fit(X_train_scaled, y_train)

# Prediction
y_pred = model.predict(X_test_scaled)

# Evaluation
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))
```
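To make the sigmoid-and-threshold description concrete, here is a minimal sketch that recomputes the model's predictions by hand. It assumes the same data, split, and model settings as above. The helper `sigmoid` and the names `manual_proba` and `manual_pred` are introduced only for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(C=1.0, random_state=42)
model.fit(X_train_scaled, y_train)

# Linear score: intercept + sum(coefficient * feature value)
z = X_test_scaled @ model.coef_[0] + model.intercept_[0]

# The sigmoid of the linear score is the predicted probability of class 1
manual_proba = sigmoid(z)
sklearn_proba = model.predict_proba(X_test_scaled)[:, 1]

# Applying the 0.5 threshold turns probabilities into class labels
manual_pred = (manual_proba > 0.5).astype(int)

print("Probabilities match:", np.allclose(manual_proba, sklearn_proba))
print("Labels match:", np.array_equal(manual_pred, model.predict(X_test_scaled)))
```

If the costs of false positives and false negatives differ, the same `manual_proba` array can be compared against a threshold other than 0.5.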
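Code for "Implementation 2: statsmodels (Statistical Focus)":

```python
import statsmodels.api as sm

# statsmodels requires adding a constant term (intercept) manually.
# Wrapping the scaled array in a DataFrame keeps the feature names in the summary.
X_train_df = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_train_const = sm.add_constant(X_train_df)

# Build and train the Logit model.
# Note: with all 30 strongly correlated features, the fit may fail to converge
# or report perfect separation; if so, try fitting on a smaller feature subset.
logit_model = sm.Logit(y_train, X_train_const)
result = logit_model.fit()

# Display the summary report
print(result.summary())
```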
Code for "Interpreting Results: Odds Ratios":

```python
# Extract coefficients and calculate Odds Ratios
# (these come from the scikit-learn model fitted on standardized features)
coefficients = model.coef_[0]
coef_df = pd.DataFrame({
    'Feature': data.feature_names,
    'Coefficient': coefficients,
    'Odds_Ratio': np.exp(coefficients)
})

print(coef_df.sort_values(by='Odds_Ratio', ascending=False).head())
```
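An odds ratio multiplies the odds, not the probability, and the difference is easy to miss. The short sketch below works through the arithmetic for an odds ratio of 2.5 at three illustrative starting probabilities. The helper functions `prob_to_odds` and `odds_to_prob` are hypothetical names used only here.

```python
def prob_to_odds(p):
    """Odds = probability of the event / probability of no event."""
    return p / (1.0 - p)


def odds_to_prob(odds):
    """Invert the odds back into a probability."""
    return odds / (1.0 + odds)


odds_ratio = 2.5  # effect of a one-unit increase in some feature

for p_before in (0.10, 0.50, 0.90):
    # A one-unit increase multiplies the odds by the odds ratio
    odds_after = prob_to_odds(p_before) * odds_ratio
    p_after = odds_to_prob(odds_after)
    print(f"p = {p_before:.2f} -> odds x {odds_ratio} -> p = {p_after:.3f}")
```

The same odds ratio moves a probability of 0.10 to about 0.22, 0.50 to about 0.71, and 0.90 to about 0.96. The change in probability therefore depends on where you start.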
Code for "Visualizing Accuracy: ROC Curve and AUC":

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Get predicted probabilities for class 1
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

# Calculate FPR, TPR, and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc_score = roc_auc_score(y_test, y_pred_proba)

# Plotting
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.4f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Guess')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve for Logistic Regression')
plt.legend()
plt.grid()
plt.show()
```
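One way to read the AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below checks this interpretation against `roc_auc_score` by comparing every positive score with every negative score. It assumes the same data, split, and model settings as above, and the name `pairwise_auc` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(C=1.0, random_state=42)
model.fit(X_train_scaled, y_train)
scores = model.predict_proba(X_test_scaled)[:, 1]

# Split the scores by true class
pos = scores[y_test == 1]
neg = scores[y_test == 0]

# Compare every positive score with every negative score
diff = pos[:, None] - neg[None, :]
pairwise_auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()

sklearn_auc = roc_auc_score(y_test, scores)
print(f"Pairwise AUC: {pairwise_auc:.6f}")
print(f"sklearn AUC:  {sklearn_auc:.6f}")
```

The pairwise comparison is quadratic in the number of samples, so it is meant for building intuition on small test sets rather than as a replacement for `roc_auc_score`.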
Choosing between the two libraries:

- Use scikit-learn if you prioritize predictive accuracy and machine learning workflows.
- Use statsmodels if you need detailed statistical summaries, such as p-values and confidence intervals.