✳ Data Analytics · Predictive Modeling

Predicting Walmart Store Underperformance

A segmentation & early warning approach — analyzing 143 weeks of sales data across 45 stores and 81 departments to flag store-department pairs at risk of underperforming before the decline shows up in reporting.

421K+
Observations
45
Stores Analyzed
3,331
Store × Dept Pairs
19.7%
Flagged at Risk

What I Contributed

As part of a 5-person analytics team, I led the underperformance definition framework and the feature engineering pipeline — designing the logic that separates true performance failure from structural or seasonal noise. My focus was building a model that gives Walmart store managers a meaningful early warning signal, not just a ranking of who sold the least.

I also drove the segmentation analysis to distinguish localized department-level issues from store-wide declines, and contributed to the final stakeholder recommendations.

Tools & Methods
Python Pandas NumPy Scikit-learn Matplotlib Seaborn CART (Decision Tree) Feature Engineering Binary Classification Rolling Statistics Time-Series Splitting

Ranking Isn't Diagnosis

Walmart's existing approach identified bottom-performing stores by raw sales rank — but "who sold the least" is not the same as underperformance. That method had three critical blind spots.

📊
Always Excludes Some Stores
Someone is always in the bottom 20%, even if every store is healthy. It's just a rank, not a signal.
🌨️
Ignores Seasonality & Context
A Neighborhood Market will almost always fall below a Supercenter in absolute sales — that's structural, not poor performance.
📉
Blind to Momentum
A store in the bottom 20% but trending up looks identical to one that has been declining for 12 months.
Core Gap: No baseline. No context. No way to separate expected from unexpected.

Explore. Predict. Intervene.

We designed a three-phase framework to move Walmart from reactive reporting to proactive, context-aware flagging of store-department combinations at risk.

Phase 01
Explore
Analyzed 143 weeks of historical sales data across 45 stores to map seasonality patterns, store type differences, and department-level variance. Identified an $80.9M peak holiday week and structural size gaps between store types.
Phase 02
Predict
Engineered 11 custom features capturing momentum, volatility, peer-relative positioning, and external shocks. Trained a CART binary classifier on an 80/20 temporal split to predict next-week underperformance for each store × department pair.
Phase 03
Intervene
Translated predictions into a ranked risk heatmap across the top 10 stores × 10 departments, enabling managers to investigate early and adjust inventory and promotional strategies before declines worsen.

About Our Dataset

Sourced from Kaggle's Walmart Recruiting: Store Sales Forecasting competition — covering nearly 3 years of anonymized weekly sales, store characteristics, and macro-economic features.

sales.csv: Weekly sales by store and department. 421,570 observations · Feb 2010 – Oct 2012
stores.csv: Store type (A/B/C) and physical size. 45 stores · Types: Supercenter, Discount, Neighborhood
features.csv: Promotional markdowns and macro-economic indicators (CPI, fuel prices, temperature). 5 markdown fields · Holiday flags · Macro indicators
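
To make the join concrete, here is a minimal pandas sketch, assuming the file names above and the standard Kaggle column schema (Store, Dept, Date, Weekly_Sales, IsHoliday; Type, Size; Temperature, Fuel_Price, MarkDown1–5, CPI, Unemployment):

```python
import pandas as pd

# Load the three source files (column names follow the Kaggle schema)
sales = pd.read_csv("sales.csv", parse_dates=["Date"])
stores = pd.read_csv("stores.csv")
features = pd.read_csv("features.csv", parse_dates=["Date"])

# One row per store x dept x week, enriched with store attributes and macro features
df = (sales
      .merge(stores, on="Store", how="left")
      .merge(features.drop(columns=["IsHoliday"]), on=["Store", "Date"], how="left")
      .sort_values(["Store", "Dept", "Date"])
      .reset_index(drop=True))
```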

Building the Early Warning Signal

We engineered 11 features across four conceptual groups — each designed to capture a different dimension of underperformance risk that raw sales data misses.

Momentum & Trend Deterioration
Rolling Stats (4w Mean, 13w Std/CV)
Captures short-term momentum and scales volatility across departments.
Sales MoM Growth & Acceleration
Identifies slowing growth rates as an early warning signal before absolute declines appear.
Drop_4w
Flags sudden, sharp declines — stronger predictors of failure than gradual shifts.
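
A minimal pandas sketch of the momentum group, continuing from the merged frame df above; the 20% drop cutoff in Drop_4w is an illustrative assumption, not the exact production rule:

```python
# Rolling statistics per store x dept series
g = df.groupby(["Store", "Dept"])["Weekly_Sales"]
df["Rolling_Mean_4w"] = g.transform(lambda s: s.rolling(4, min_periods=4).mean())
df["Rolling_Mean_13w"] = g.transform(lambda s: s.rolling(13, min_periods=13).mean())
df["Rolling_Std_13w"] = g.transform(lambda s: s.rolling(13, min_periods=13).std())
df["CV_13w"] = df["Rolling_Std_13w"] / df["Rolling_Mean_13w"]  # scale-free volatility

# Growth over a 4-week lag and its week-to-week change (acceleration)
df["MoM_Growth"] = g.transform(lambda s: s.pct_change(4))
df["MoM_Accel"] = df.groupby(["Store", "Dept"])["MoM_Growth"].diff()

# Drop_4w: a sudden fall below the recent baseline (illustrative 20% cutoff)
df["Drop_4w"] = (df["Weekly_Sales"] < 0.8 * df["Rolling_Mean_4w"]).astype(int)
```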
Relative Positioning
Sales vs. Peer (Residual_Z)
Isolates local issues (staffing, inventory) from broader market trends using z-scored residuals.
Dept_Sales_Share
Distinguishes localized departmental weakness from general store-wide decline.
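
A sketch of the positioning features; Residual_Z is shown here as a simple z-score against department peers in the same week, one plausible reading of "sales vs. peer" (a residual from a fitted baseline would slot in the same way):

```python
# Residual_Z: how far this store-dept's sales sit from its department peers this week
peer = df.groupby(["Dept", "Date"])["Weekly_Sales"]
df["Residual_Z"] = (df["Weekly_Sales"] - peer.transform("mean")) / peer.transform("std")

# Dept_Sales_Share: the department's share of its store's total sales that week
store_week = df.groupby(["Store", "Date"])["Weekly_Sales"]
df["Dept_Sales_Share"] = df["Weekly_Sales"] / store_week.transform("sum")
```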
External Shocks
Macro Shocks (Fuel & CPI Deviation)
Captures inflation and fuel price spikes that suppress disposable income and purchasing power.
Context (Weeks_to_Holiday, Temp_Deviation)
Adjusts expectations for high-demand windows and weather-sensitive seasonal departments.
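
A sketch of the shock and context features; the 13-week baseline window and the store-level temperature norm are assumptions:

```python
# Macro shocks: deviation from a 13-week rolling baseline (prices are store-level,
# so a per-dept rolling window recovers the store's weekly baseline)
for col in ["Fuel_Price", "CPI"]:
    base = df.groupby(["Store", "Dept"])[col].transform(
        lambda s: s.rolling(13, min_periods=4).mean())
    df[f"{col}_Deviation"] = df[col] - base

# Weeks_to_Holiday: weeks until the next flagged holiday week
holidays = df.loc[df["IsHoliday"], "Date"].drop_duplicates().sort_values().tolist()
df["Weeks_to_Holiday"] = df["Date"].map(
    lambda d: min(((h - d).days // 7 for h in holidays if h >= d), default=0))

# Temp_Deviation: gap vs. the store's long-run average temperature
# (a month-of-year norm would be a finer seasonal baseline)
df["Temp_Deviation"] = df["Temperature"] - df.groupby("Store")["Temperature"].transform("mean")
```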
Operational Signals
Markdown Intensity (Has_Markdown, Ratio)
Heavy discounts relative to baseline sales signal potential inventory distress or weak organic demand.
Revenue Efficiency (Sales_Per_SqFt)
Normalizes performance across different store footprints, enabling fair cross-store comparison.
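
A sketch of the operational group, where Size is the store's square footage from stores.csv:

```python
import numpy as np

# Markdown intensity: any active markdown, and total markdown relative to recent sales
md_cols = [f"MarkDown{i}" for i in range(1, 6)]
md_total = df[md_cols].fillna(0).sum(axis=1)
df["Has_Markdown"] = (md_total > 0).astype(int)
df["Markdown_Ratio"] = md_total / df["Rolling_Mean_4w"].replace(0, np.nan)

# Revenue efficiency: normalize sales by store footprint for fair cross-store comparison
df["Sales_Per_SqFt"] = df["Weekly_Sales"] / df["Size"]
```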

What the Data Revealed

Before modeling, the exploratory analysis surfaced three critical patterns that shaped our entire prediction strategy.

📅
Seasonality Dominates
Sales are heavily dominated by year-end holiday peaks. Any underperformance signal must account for this seasonal baseline to avoid false positives during off-peak periods.
$80.9M peak week · Dec 2010
🏪
Store Type Creates Structural Gaps
Type A Supercenters (150k+ sq ft) generate fundamentally different sales volumes than Type C Neighborhood Markets. Comparing raw sales across types is misleading — size-normalized metrics are essential.
3 distinct store formats identified
⚠️
Underperformance is a Minority Signal
~19.7% of store-department combinations were flagged as underperforming in any given week, validating that our 20% threshold is selective rather than overly aggressive.
1 in 5 combinations at risk weekly
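
The exact flagging logic was part of our definition framework; purely to illustrate how a roughly 20% weekly base rate can arise, one simple construction flags the bottom quintile of peer-adjusted residuals each week, then shifts the flag to form a next-week prediction target:

```python
# Illustrative label only, not the exact framework definition: bottom 20% of
# peer-adjusted residuals (Residual_Z) within each week
cutoff = df.groupby("Date")["Residual_Z"].transform(lambda s: s.quantile(0.20))
df["Underperf"] = (df["Residual_Z"] <= cutoff).astype(int)

# Target: does this store x dept underperform NEXT week? (shift within each series)
df["Target_Next_Week"] = df.groupby(["Store", "Dept"])["Underperf"].shift(-1)
```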

Binary Classification with CART

We trained a Decision Tree classifier (CART) on a temporal 80/20 split — preserving the time sequence rather than randomizing — to predict whether each store × department pair would underperform the following week.
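
A sketch of the split and base model, assuming the engineered columns and the Target_Next_Week label from the sketches above; the feature list and tree depth are illustrative:

```python
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["Rolling_Mean_4w", "Rolling_Std_13w", "CV_13w", "MoM_Growth", "MoM_Accel",
            "Drop_4w", "Residual_Z", "Dept_Sales_Share", "Fuel_Price_Deviation",
            "CPI_Deviation", "Weeks_to_Holiday", "Temp_Deviation", "Has_Markdown",
            "Markdown_Ratio", "Sales_Per_SqFt"]

# Temporal split: train on the first ~80% of weeks, test on the rest (no shuffling)
model_df = df.dropna(subset=FEATURES + ["Target_Next_Week"])
weeks = model_df["Date"].sort_values().unique()
cut = weeks[int(len(weeks) * 0.8)]
train = model_df[model_df["Date"] < cut]
test = model_df[model_df["Date"] >= cut]

cart = DecisionTreeClassifier(max_depth=6, random_state=42)  # depth is an assumption
cart.fit(train[FEATURES], train["Target_Next_Week"])
```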

Base Model Performance

Accuracy 81.5%
Precision 62.6%
Recall 5.95%
F1 Score 10.9%
Class imbalance caused the base model to miss ~94% of actual underperformers. Threshold tuning and class weighting were applied to improve recall.
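
Both remedies are standard in scikit-learn; a sketch, where the 0.30 decision threshold is an assumption that would be tuned on a validation window:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Re-weight the minority class, then lower the threshold to trade precision for recall
cart_w = DecisionTreeClassifier(max_depth=6, class_weight="balanced", random_state=42)
cart_w.fit(train[FEATURES], train["Target_Next_Week"])

proba = cart_w.predict_proba(test[FEATURES])[:, 1]
preds = (proba >= 0.30).astype(int)  # 0.30 is illustrative, not the tuned value
print(classification_report(test["Target_Next_Week"], preds, digits=3))
```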
🎯 Top Predictors

Residual_Z — highest importance (0.30). Detects when sales fall below peer expectations, flagging localized failure vs. market trends.

Rolling_Std_13w / CV_13w — the second- and third-ranked features. Volatile departments face significantly higher underperformance risk.

Drop_4w — sudden sharp declines proved stronger predictors than gradual drift.

🗺️ Risk Heatmap Output

The final model outputs a store × department risk probability matrix. Store 43 – Dept 52 showed a 0.96 predicted probability — the highest in the test period, enabling proactive management intervention.
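
Mechanically, the heatmap is a pivot of predicted probabilities; a sketch continuing from the weighted model above:

```python
# Store x department risk matrix for the most recent test week
scored = test.assign(Risk=proba)
latest = scored[scored["Date"] == scored["Date"].max()]
risk_matrix = latest.pivot_table(index="Store", columns="Dept", values="Risk")

# Ranked list of the riskiest cells, e.g. for a manager-facing top-10 view
top10 = latest.nlargest(10, "Risk")[["Store", "Dept", "Risk"]]
```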

What This Means for Stakeholders

For Store Managers

A weekly early warning list of their highest-risk store-department combinations, enabling proactive inventory adjustments and targeted promotions before sales decline becomes visible in reporting.

For Regional Operations

Portfolio-wide risk visibility that distinguishes systemic regional issues from isolated store problems — enabling smarter resource allocation and escalation decisions.

For Analytics Teams

A reproducible, extensible feature engineering framework that can incorporate new data signals (geographic enrichment, department name mapping) to further improve prediction quality.

Key Lesson

Moving from reactive ranking to proactive classification requires context-aware baselines. Without accounting for store type, seasonality, and momentum, you're not measuring performance — you're measuring size.

What We'd Do With More Data

Gap: No rural/suburban/urban classification in the dataset
Impact: Can't determine whether underperformance is location-driven.
Want: Census / geo enrichment

Gap: Store type labels (A, B, C) are inferred; the dataset has no official definitions
Impact: The model can't use store format as a meaningful signal.
Want: Additional store attributes

Gap: Department IDs exist, but names are missing
Impact: Can't pinpoint which merchandise category is hurting store performance.
Want: Department name mapping

Gap: Markdown fields lack event-level context
Impact: Can't explain why Markdown 1 and Markdown 4 correlate positively; there's no causal story.
Want: Markdown event logs