Alright, so yesterday I was messing around with some data about dim sum. Yeah, dim sum! I know, random, right?

Basically, I started by scraping some info off a couple of different websites that list dim sum dishes. It was a pain, let me tell you. Websites are never formatted the same way, so I had to write different scrapers for each one. Used Python with Beautiful Soup – good ol’ reliable.
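To give a flavor of what one of those scrapers can look like, here’s a minimal sketch. The HTML snippet, the `dish`, `name`, and `desc` class names, and the `parse_dishes` function are all made up for illustration; every real site has its own markup, which is exactly why each one needs its own scraper.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one site's dish listing.
SAMPLE_HTML = """
<ul class="menu">
  <li class="dish"><span class="name">Har Gow</span><span class="desc">shrimp, wheat starch</span></li>
  <li class="dish"><span class="name">Siu Mai</span><span class="desc">pork, shrimp, mushroom</span></li>
</ul>
"""


def parse_dishes(html):
    """Return a list of {'name', 'ingredients'} dicts from one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    dishes = []
    for item in soup.find_all("li", class_="dish"):
        name = item.find("span", class_="name").get_text(strip=True)
        desc = item.find("span", class_="desc").get_text(strip=True)
        # Assumes ingredients are a comma-separated list in the description.
        ingredients = [part.strip() for part in desc.split(",") if part.strip()]
        dishes.append({"name": name, "ingredients": ingredients})
    return dishes


dishes = parse_dishes(SAMPLE_HTML)
```

In practice the HTML would come from a downloaded page rather than a string, and the tag and class lookups would change per site.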
Next, I had all this messy text data. Time to clean it up! This involved a lot of regex, removing weird characters, and standardizing the names of the dishes. For example, some sites called it “Har Gow,” others “Har Gau,” and still others “Shrimp Dumpling.” Had to make them all consistent. I used Pandas for this part. So much easier to deal with data in a table format.
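The name standardization can be sketched roughly like this. The `ALIASES` table, the `normalize_name` helper, and the sample rows are assumptions for illustration, not the actual mapping; the idea is just regex cleanup followed by a lookup, applied to a Pandas column.

```python
import re

import pandas as pd

# Hypothetical alias table: every cleaned spelling -> one canonical name.
ALIASES = {
    "har gow": "Har Gow",
    "har gau": "Har Gow",
    "shrimp dumpling": "Har Gow",
    "siu mai": "Siu Mai",
    "shumai": "Siu Mai",
}


def normalize_name(raw):
    # Drop everything except ASCII letters, digits, whitespace, and hyphens.
    # (ASCII-only on purpose for this sketch; it would strip Chinese characters.)
    cleaned = re.sub(r"[^A-Za-z0-9\s-]", "", raw)
    # Collapse runs of whitespace, trim the ends, and lowercase for the lookup.
    cleaned = re.sub(r"\s+", " ", cleaned).strip().lower()
    # Unknown names fall back to a title-cased version of the cleaned text.
    return ALIASES.get(cleaned, cleaned.title())


df = pd.DataFrame(
    {"raw_name": ["Har Gow", "  har gau* ", "Shrimp  Dumpling", "Shumai", "Cheung Fun"]}
)
df["dish"] = df["raw_name"].map(normalize_name)
```

The alias table is the part that takes real effort: it has to be built by hand as new spellings turn up.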
After cleaning, I started analyzing it. Simple stuff, like finding the most common ingredients (shrimp, pork, obviously). Then I tried to group the dishes by type – dumplings, rolls, buns, etc. This was tricky because there’s a lot of overlap. Is siu mai a dumpling or an “open-topped dumpling”? Decisions, decisions…
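Counting ingredients is about as simple as analysis gets. Here’s a small sketch using `collections.Counter` from the standard library (plain Python rather than Pandas, just to keep it short). The three dishes and their ingredient lists are sample data made up for the example.

```python
from collections import Counter

# Made-up sample records in the shape a scraper might produce.
dishes = [
    {"name": "Har Gow", "ingredients": ["shrimp", "bamboo shoots"]},
    {"name": "Siu Mai", "ingredients": ["pork", "shrimp", "mushroom"]},
    {"name": "Char Siu Bao", "ingredients": ["pork", "flour"]},
]

# Count each ingredient once per dish (set() guards against repeats
# within a single dish's list).
ingredient_counts = Counter(
    ingredient for dish in dishes for ingredient in set(dish["ingredients"])
)

# The two most frequent ingredients as (ingredient, count) pairs.
top_two = ingredient_counts.most_common(2)
```

With this sample data, shrimp and pork each show up in two dishes and everything else in one.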
Then, I got a wild hair and decided to visualize some of the data. Nothing fancy, just some basic bar charts and pie charts using Matplotlib. Showed things like the distribution of dish types and the most frequent ingredients. It looked kinda cool, actually.
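A bar chart of dish types takes only a few lines of Matplotlib. This is a sketch with made-up counts; the numbers, the `type_counts` name, and the output file name are placeholders, not results from the actual data.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # render straight to a file; no display needed
import matplotlib.pyplot as plt

# Made-up counts, purely for illustration.
type_counts = {"dumplings": 12, "buns": 7, "rolls": 5}

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(list(type_counts.keys()), list(type_counts.values()))
ax.set_xlabel("Dish type")
ax.set_ylabel("Number of dishes")
ax.set_title("Dim sum dishes by type")
fig.tight_layout()

# Save to the system temp directory so the sketch has no side effects elsewhere.
chart_path = os.path.join(tempfile.gettempdir(), "dish_types.png")
fig.savefig(chart_path)
plt.close(fig)
```

Swapping `ax.bar(...)` for `ax.pie(list(type_counts.values()), labels=list(type_counts.keys()))` and dropping the axis labels gives the pie chart version.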
Finally, I put it all together in a simple report. Just a Markdown file with some text and the charts embedded in it. Could probably make it look nicer, but hey, it gets the job done. I even added some random facts about dim sum that I found online, just for fun.
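Generating a Markdown report like that is mostly string building. Here’s a minimal sketch; `build_report`, the title, the file name, and the placeholder fact are all invented for the example.

```python
def build_report(title, intro, charts, facts):
    """Assemble a Markdown report from text, chart image paths, and facts."""
    lines = [f"# {title}", "", intro, ""]
    # Each chart becomes a Markdown image: ![caption](path)
    for caption, path in charts:
        lines.append(f"![{caption}]({path})")
        lines.append("")
    if facts:
        lines.append("## Random facts")
        lines.append("")
        lines.extend(f"- {fact}" for fact in facts)
        lines.append("")
    return "\n".join(lines)


report = build_report(
    title="Dim Sum Data Notes",
    intro="A quick look at dish types and ingredients.",
    charts=[("Dishes by type", "dish_types.png")],
    facts=["(placeholder fact)"],
)
```

Writing `report` to a `.md` file next to the chart images is then enough for most Markdown viewers to render the charts inline.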

Here’s a quick rundown of the tools I used:
- Python: The workhorse for everything.
- Beautiful Soup: For scraping web data.
- Pandas: For cleaning and manipulating data.
- Matplotlib: For creating visualizations.

It wasn’t anything groundbreaking, but it was a fun little project. I learned a bit more about dim sum (and web scraping), and I got to practice my data analysis skills. Plus, now I have a handy reference guide for ordering dim sum next time I go!
If I were to do it again, I’d probably spend more time on the data cleaning part. It’s always the most tedious, but it makes a huge difference in the quality of the analysis. Also, I might try to find a way to automatically categorize the dishes. That would save a lot of time and effort.
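One simple starting point for automatic categorization would be keyword rules. This is only a sketch of that idea: the `CATEGORY_KEYWORDS` list and the `categorize` function are hypothetical, and the keywords are guesses rather than a tested taxonomy.

```python
# Hypothetical keyword rules, checked in order; the first match wins.
CATEGORY_KEYWORDS = [
    ("buns", ("bao", "bun")),
    ("rolls", ("roll", "cheung fun")),
    ("dumplings", ("dumpling", "gow", "siu mai")),
]


def categorize(dish_name, description=""):
    """Return the first category whose keyword appears in the name or description."""
    text = f"{dish_name} {description}".lower()
    for category, keywords in CATEGORY_KEYWORDS:
        # Plain substring matching: crude, and it can misfire on longer words.
        if any(keyword in text for keyword in keywords):
            return category
    return "uncategorized"


labels = [categorize(name) for name in ["Har Gow", "Char Siu Bao", "Spring Roll", "Egg Tart"]]
```

Rule order matters here, since a dish that matches two categories gets whichever comes first, so the overlap problem doesn’t go away; it just moves into the rule list. Anything that lands in `uncategorized` still needs a manual look.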