# Nairobi Property Listings: Scraping 500+ Listings with Smart Size Extraction

2026-02-18 · admin
#Python #WebScraping #BeautifulSoup #DataScience #MachineLearning #Kenya #OpenSource #Regex

## The Challenge

Building a house price prediction model for Nairobi requires a dataset that simply doesn't exist in the open. You have to build it yourself by scraping property portals. But real estate data is messy: listings are inconsistent, poorly structured, and often contradictory. Take property size, for instance. One listing gives size as "4,350 sq. ft." Another says "Approx. 350 – 400 sqm." A third buries three villa sizes inside a paragraph. So how do you extract clean, usable data from this chaos?

In this post, I'll walk you through the architecture of a scraper that collected 528 listings with 10 structured fields, focusing on the size extraction logic that handles ranges, commas, and mixed units.

## Core Scraping Logic

The scraper iterates over pages, extracts the needed data from each listing card, and then fetches the detail page for richer information. I used requests with retries and BeautifulSoup for parsing.

## The Size Extraction Problem

Property size appears in many forms on the site. Some listings even contain multiple size mentions (built‑up area, terrace, garden). We need to extract the first meaningful built‑up size. With all that chaos, the `extract_size_from_text()` function handles it all.

## Why This Approach?

- **Unit normalization** allows fair comparison across different measurement systems.
- **Thresholds** (30 sqm, 300 sqft, 0.05 acre) eliminate noise from tiny areas that are clearly not the main house.
- **Max selection** handles listings that describe multiple areas (e.g., terraces, gardens): we take the largest, which typically corresponds to the built‑up area.
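In miniature, the normalize-then-pick-max idea looks like this (the values are made up for illustration; the conversion factors are the standard ones the scraper uses):

```python
# Toy illustration of unit normalization + max selection (made-up values).
SQFT_TO_SQM = 0.092903
ACRE_TO_SQM = 4046.86

# Size mentions as they might be pulled from a single listing's text
mentions = [(4350, "sqft"), (40, "sqm"), (0.125, "acre")]

def to_sqm(value, unit):
    """Normalize a (value, unit) pair to square metres."""
    if unit == "sqft":
        return value * SQFT_TO_SQM
    if unit == "acre":
        return value * ACRE_TO_SQM
    return value  # already sqm

candidates = [to_sqm(v, u) for v, u in mentions]
plausible = [c for c in candidates if c >= 30]   # drop implausibly tiny areas
best = max(plausible)                            # largest ≈ built-up area
print(round(best, 1))  # 505.9
```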
The regex patterns are flexible enough to catch variations like "sqm", "m²", "square meters", "sq. ft.", "sqft", and acre fractions.

## Integration with the Scraper

The scraper collects size from two places:

1. **Listing card** – a quick size from the swiper slides or the truncated description.
2. **Detail page** – the full description, which usually contains a more accurate size. If found, it overrides the card size.

## Error Handling & Resilience

- **Retry logic with `fetch_page`** – requests up to 3 times with a 2‑second delay between attempts.
- **Fallback data** – if the detail page fails, return the basic info (size from the card).
- **Checkpointing** – the main loop stops when `max_listings` (800) is reached to avoid over‑scraping.
- **Polite delays** – 1s between detail requests, 2s between pages to prevent IP blocking.

## Data Dictionary: Schema as Code

I defined the schema upfront to ensure consistency. It is saved as `data_dictionary.json`, serving as documentation for collaborators and future‑me.

## Lessons for Production Scrapers

- **Always normalise units** – you can't compare apples and oranges.
- **Filter implausible values** – they corrupt your model.
- **Have a fallback** – if the detail page fails, keep the card data.
- **Document your schema** – you'll thank yourself later.
- **Respect the source** – rate limiting isn't optional.

All code is available on GitHub: github. Contributions welcome! Feel free to open issues or PRs.
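The retry behaviour can be sketched transport-agnostically like this (in the scraper itself, `fetch_page` wraps `requests.get`; this standalone structure is my assumption, not the repo's code):

```python
import time

def fetch_with_retries(fetch, url, retries=3, delay=2):
    """Call fetch(url) up to `retries` times, sleeping `delay` seconds between tries.

    Returns None after the last failure so the caller can fall back to card data.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                return None
            time.sleep(delay)
```

In the real scraper, `fetch` would be a thin wrapper around `requests.get` that raises on bad status codes, so network errors and HTTP errors are retried the same way.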
Project structure:

```
nairobi-house-price-prediction/
├── data/
│   └── raw_listings.csv        # Raw CSV output
├── notebooks/
│   └── extraction.ipynb        # Scraping logic
├── src/
│   ├── scraper.py              # Core scraping functions
│   └── utils.py                # Helpers (parsers, cleaners)
├── data_dictionary.json        # Schema definition
└── requirements.txt
```
```python
import re

def extract_size_from_text(text):
    """
    Extract built-up/property size from messy real estate descriptions.
    Returns original string of the most plausible size, or "N/A".
    """
    if not text:
        return "N/A"

    text = text.replace(",", "")
    candidates = []  # (size_in_sqm, original_text)

    # 1. Ranges in sqm
    range_matches = re.findall(
        r'(\d+(\.\d+)?)\s*(?:–|-|to)\s*(\d+(\.\d+)?)\s*(sqm|m²|square meters?)',
        text, re.IGNORECASE
    )
    for match in range_matches:
        low = float(match[0])
        high = float(match[2])
        if high >= 30:
            candidates.append((high, f"{match[0]}–{match[2]} sqm"))

    # 2. Single sqm
    sqm_matches = re.findall(
        r'(\d+(\.\d+)?)\s*(sqm|m²|square meters?)',
        text, re.IGNORECASE
    )
    for match in sqm_matches:
        val = float(match[0])
        if val >= 30:
            candidates.append((val, f"{match[0]} sqm"))

    # 3. Square feet (convert to sqm)
    sqft_matches = re.findall(
        r'(\d+(\.\d+)?)\s*(sq\.?\s*ft\.?|sqft)',
        text, re.IGNORECASE
    )
    for match in sqft_matches:
        sqft = float(match[0])
        if sqft >= 300:
            sqm = sqft * 0.092903
            candidates.append((sqm, f"{match[0]} sq ft"))

    # 4. Acres (including fractions)
    acre_matches = re.findall(
        r'(\d+/\d+|\d+(\.\d+)?)\s*-?\s*(acre)',
        text, re.IGNORECASE
    )
    for match in acre_matches:
        raw = match[0]
        if '/' in raw:
            num, den = raw.split('/')
            acres = float(num) / float(den)
        else:
            acres = float(raw)
        if acres >= 0.05:
            sqm = acres * 4046.86
            candidates.append((sqm, f"{raw} acre"))

    if candidates:
        plausible = [c for c in candidates if c[0] >= 30]
        if plausible:
            # Return the largest (most likely built-up area)
            return max(plausible, key=lambda x: x[0])[1]
    return "N/A"
```
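The acre branch is the fiddliest of the four patterns, since it has to cope with fractions like "1/8". Pulled out into a standalone helper (hypothetical name, for a quick sanity check):

```python
import re

# The acre pattern from extract_size_from_text, isolated for testing
ACRE_RE = re.compile(r'(\d+/\d+|\d+(\.\d+)?)\s*-?\s*(acre)', re.IGNORECASE)

def parse_acres(text):
    """Return the first acreage found in `text` as a float, or None."""
    m = ACRE_RE.search(text)
    if not m:
        return None
    raw = m.group(1)
    if '/' in raw:
        num, den = raw.split('/')
        return float(num) / float(den)
    return float(raw)

print(parse_acres("prime 1/8-acre plot in Karen"))  # 0.125
print(parse_acres("sits on 0.5 acres"))             # 0.5
```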
```python
def extract_bedrooms_bathrooms_size(listing):
    bedrooms = bathrooms = "N/A"
    size_from_swiper = "N/A"
    # ... extract from swiper slides ...

    # Extract from description on the card; default to the swiper size
    # so `size` is defined even when the description div is missing
    size = size_from_swiper
    desc_div = listing.find('div', id='truncatedDescription')
    if desc_div:
        desc_text = desc_div.get_text(" ", strip=True)
        size_from_desc = extract_size_from_text(desc_text)
        size = size_from_desc if size_from_desc != "N/A" else size_from_swiper
    return bedrooms, bathrooms, size
```
Then, in `scrape_listing`, the detail-page size takes precedence:

```python
if size_from_detail != "N/A":
    size = size_from_detail  # override card size
```
```python
data_dictionary = [
    {"Column": "Title", "Type": "String", "Description": "Property name"},
    {"Column": "Property Type", "Type": "String", "Description": "Apartment, Townhouse, etc."},
    # ... etc.
]
```
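Persisting the schema is then a single `json.dumps` call (a sketch, truncated to one column here; the real list covers all 10 fields):

```python
import json

# Truncated schema for illustration; the real list covers all 10 columns
data_dictionary = [
    {"Column": "Title", "Type": "String", "Description": "Property name"},
]

payload = json.dumps(data_dictionary, indent=2)
with open("data_dictionary.json", "w", encoding="utf-8") as f:
    f.write(payload)
```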
```python
START_PAGE = 1
END_PAGE = 40
MAX_LISTINGS = 800  # max

# Scrape
df = scrape_pages(START_PAGE, END_PAGE, MAX_LISTINGS)
print(len(df))     # 800
print(df.columns)
# ['Title', 'Property Type', 'Price', 'Location', 'Bedrooms', 'Bathrooms',
#  'Size', 'Amenities', 'Surroundings', 'Created At']
```
Results after running pages 1–40:

- 800 listings from pages 1–40
- 10 fields per listing
- Size captured in listings if available
- 0 critical failures – thanks to retries and fallbacks
- Raw CSV saved to `data/raw_listings.csv`

## Next Steps: Data Cleaning

- Load `raw_listings.csv`
- Remove duplicates
- Standardize location strings
- Convert price to integer (remove “KSh”, commas)
- Convert size to numeric (extract first number from ranges, convert acres to sqm)
- Create features: price_per_sqft, amenity_score (count of amenities), month from Created At
- Basic EDA and save as clean_listings.csv
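As a preview of the price step, a minimal cleaner could look like this (hypothetical helper; assumes prices arrive as strings such as "KSh 25,000,000"):

```python
import re

def clean_price(raw):
    """Strip currency symbols and commas from a price string; return int or None."""
    digits = re.sub(r"\D", "", raw or "")
    return int(digits) if digits else None

print(clean_price("KSh 25,000,000"))        # 25000000
print(clean_price("Price on application"))  # None
```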