Here are some popular blogs on building large language models:
Building a Large Language Model (LLM) from scratch is a multi-stage engineering process that spans everything from data preparation to implementing the neural network architecture. One of the most comprehensive resources on this topic is the book "Build a Large Language Model (From Scratch)".
Stripping HTML tags, fixing encoding issues, and removing "garbage" text.
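The cleaning steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the method of any particular book or pipeline: the helper names (`strip_html`, `fix_encoding`, `looks_like_garbage`, `clean_document`) and the thresholds (`min_chars=20`, `min_alpha_ratio=0.6`) are assumptions chosen for the example, and production pipelines typically use more robust heuristics.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes entities such as &amp;
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return " ".join(self._chunks)


def strip_html(raw):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


def fix_encoding(text):
    """Normalize Unicode and drop control and replacement characters."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in text
        if ch in "\n\t"
        or (unicodedata.category(ch)[0] != "C" and ch != "\ufffd")
    )


def looks_like_garbage(line, min_chars=20, min_alpha_ratio=0.6):
    """Flag lines that are very short or mostly non-letters (assumed heuristic)."""
    stripped = line.strip()
    if len(stripped) < min_chars:
        return True
    alpha = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return alpha / len(stripped) < min_alpha_ratio


def clean_document(raw):
    """Strip tags, repair the text, then keep only lines that pass the filter."""
    text = fix_encoding(strip_html(raw))
    lines = [re.sub(r"\s+", " ", ln).strip() for ln in text.splitlines()]
    return "\n".join(ln for ln in lines if not looks_like_garbage(ln))


if __name__ == "__main__":
    sample = (
        "<html><body><p>Large language models learn from text.</p>\n"
        "<script>var x = 1;</script>\n"
        "<p>$$$ ### 12345 @@@ %%% ^^^ ***</p></body></html>"
    )
    print(clean_document(sample))
```

The filter here works line by line, so a short but legitimate line (a heading, for instance) is also dropped; whether that trade-off is acceptable depends on the corpus.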