Web Scraping Tutorial: Extracting Metal Festival Data with TypeScript, Playwright & AI
Problem
Ever wanted to extract data from a complex website automatically, only to be stopped by dynamically rendered HTML? Need structured data from thousands of festivals without manual work? In this tutorial, I’ll show you how to build an intelligent scraper that extracts data from https://www.metalfestivals.top using modern technologies.
What You’ll Learn
In this article, we detail step-by-step how we built a web scraper that:
- Automatically navigates sites with dynamic JavaScript content
- Extracts structured data using AI (Google Gemini)
- Validates and structures data with TypeScript
- Handles errors and provides realistic User-Agents
- Saves everything in organized JSON
Why It’s Useful:
- Automates repetitive data collection tasks
- Avoids writing complicated CSS selectors
- Uses AI to “understand” the page automatically
- Simple and straightforward implementation
Prerequisites
- Node.js 18+ installed on your machine
- npm or pnpm for dependency management
- A valid API key from Google Gemini: https://aistudio.google.com/app/api-keys
- Basic understanding of JavaScript/TypeScript
1/5 Step 1: Project Setup and Dependencies
The first step is creating the base structure. Our project needs:
Main Dependencies:
- playwright: navigates sites with JavaScript rendering
- llm-scraper: library combining Playwright + LLM
- zod: schema validation with types
- @ai-sdk/google: Google Gemini AI integration
- dotenv: manages environment variables
package.json:
{
"name": "festivals-scraper",
"version": "1.0.0",
"type": "module",
"description": "Web scraper for extracting festival data using LLM",
"main": "dist/index.js",
"scripts": {
"start": "node dist/scraper/index.js",
"build": "tsc",
"dev": "npm run build && node dist/scraper/index.js"
},
"dependencies": {
"@ai-sdk/google": "^1.x",
"dotenv": "^16.3.1",
"llm-scraper": "latest",
"playwright": "^1.48.1",
"zod": "^3.22.4"
},
"devDependencies": {
"@types/node": "^22.7.5",
"typescript": "^5.6.3"
}
}

Commands:
# 1. Install dependencies
npm install
# 2. Install Playwright browsers
npx playwright install
# 3. Create .env file
cp .env.example .env

Folder Structure:
festivals-scraper/
├── src/
│ └── scraper/
│ ├── index.ts ← Main script
│ ├── schema.ts ← Zod validation
│ └── llmProvider.ts ← Gemini configuration
├── dist/ ← Compiled (generated)
├── package.json
├── tsconfig.json
└── .env ← Credentials (don't push)
Pro Tip: Use npm run dev -- "URL", which automatically compiles the TypeScript and runs the scraper in one step.
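Bonus: since .env holds your API key and dist/ is generated, it’s worth creating a .gitignore before your first commit. A minimal suggestion (file names match the structure above):

# .gitignore
node_modules/
dist/
.env
festival-data.json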
2/5 Step 2: Define Data Schema with Zod
Before scraping, you MUST define WHAT data you want to extract. Here’s where Zod shines.
The Problem: Without a clear schema, the LLM doesn’t know what to extract. With Zod, we give precise instructions.
src/scraper/schema.ts:
import { z } from 'zod';
/**
* Band schema with key information
*/
const bandSchema = z.object({
name: z.string()
.describe("Band name."),
origin: z.string()
.describe("Band's country of origin."),
subgenres: z.array(z.string())
.describe("List of music subgenres."),
});
/**
* MAIN schema: Defines entire festival structure
* LLM will use this to know exactly what to extract
*/
export const festivalSchema = z.object({
festivalName: z.string()
.describe("Complete official festival name."),
location: z.object({
city: z.string()
.describe("Festival city."),
country: z.string()
.describe("Festival country."),
}).describe("Geographic location."),
bands: z.array(bandSchema)
.describe("List of performing bands."),
});
// TypeScript infers type automatically
export type FestivalData = z.infer<typeof festivalSchema>;

Expected Result:
{
"festivalName": "Samaïn Fest",
"location": {
"city": "La Mézière",
"country": "France"
},
"bands": [
{
"name": "Septicflesh",
"origin": "Greece",
"subgenres": ["Symphonic Death Metal"]
},
{
"name": "Ufomammut",
"origin": "Italy",
"subgenres": ["Sludge Metal"]
}
]
}

Troubleshooting: If the LLM doesn’t extract certain fields, add clearer .describe() instructions.
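To sanity-check the schema before wiring up the scraper, you can run any object through Zod's safeParse. A quick sketch; the sample object below is made up for illustration:

import { festivalSchema } from './schema.js';

// Made-up sample, just to exercise the schema locally.
const sample = {
  festivalName: 'Example Fest',
  location: { city: 'Berlin', country: 'Germany' },
  bands: [{ name: 'Example Band', origin: 'Sweden', subgenres: ['Doom Metal'] }],
};

const result = festivalSchema.safeParse(sample);
if (result.success) {
  console.log('Valid:', result.data.festivalName);
} else {
  console.error('Validation errors:', result.error.issues);
}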
3/5 Step 3: Configure Google Gemini
We’re using Google Gemini as our LLM provider: it has a generous free tier, and it’s fast and capable.
src/scraper/llmProvider.ts:
import { google } from '@ai-sdk/google';
import dotenv from 'dotenv';
dotenv.config();
/**
* Get Google Gemini LLM provider
*/
export function getLLMProvider(): any {
console.log('Using provider: Google Gemini');
return google('gemini-2.0-flash');
}

.env (configuration):
# Google Gemini API Key
# Get your free API key at: https://aistudio.google.com/app/api-keys
GOOGLE_GENERATIVE_AI_API_KEY="YOUR_API_KEY_HERE"
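Want to try a different Gemini model? Only the provider call changes. Here’s a sketch of a configurable variant; the GEMINI_MODEL environment variable is my own addition, not part of the original project:

import { google } from '@ai-sdk/google';

// Variant of getLLMProvider() with a configurable model.
// GEMINI_MODEL is a hypothetical env var; falls back to the tutorial's model.
export function getLLMProvider() {
  const model = process.env.GEMINI_MODEL ?? 'gemini-2.0-flash';
  console.log(`Using provider: Google Gemini (${model})`);
  return google(model);
}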
4/5 Step 4: Implement Main Scraper
This is the heart of the project. Here we bring everything together.
src/scraper/index.ts:
import { chromium } from 'playwright';
import LLMScraper from 'llm-scraper';
import { festivalSchema, FestivalData } from './schema.js';
import { getLLMProvider } from './llmProvider.js';
import dotenv from 'dotenv';
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';
dotenv.config();
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);
/**
* STEP 1: Validate and get festival URL
* User passes it like: npm run dev -- "https://example.com"
*/
function getFestivalUrl(): string {
const args = process.argv.slice(2);
if (args.length === 0) {
console.error('❌ Error: You must provide a festival URL as argument');
console.error('\nUsage:');
console.error(' npm run dev -- "https://www.metalfestivals.top/festivals/{festival_id}/lineup"');
process.exit(1);
}
const url = args[0];
// Validate it's a valid URL
try {
new URL(url);
return url;
} catch (error) {
console.error(`❌ Error: URL is not valid: ${url}`);
process.exit(1);
}
}
/**
* MAIN FUNCTION: Orchestrate entire scraping process
*/
async function main() {
const FESTIVAL_URL = getFestivalUrl();
console.log('🎵 Festival Scraper - Extracting data...\n');
console.log(`🔗 URL: ${FESTIVAL_URL}\n`);
// STEP 2: Initialize LLM provider
const llm = getLLMProvider();
// STEP 3: Create Scraper instance
const scraper = new LLMScraper(llm);
// STEP 4: Launch browser
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage({
// Realistic User-Agent to avoid blocking
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
});
try {
console.log(`📄 Navigating to: ${FESTIVAL_URL}`);
// STEP 5: Set headers to look like a real browser
await page.setExtraHTTPHeaders({
'Accept-Language': 'en-US,en;q=0.9',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Referer': 'https://www.google.com/',
});
// STEP 6: Navigate to URL (wait for JavaScript)
await page.goto(FESTIVAL_URL, {
waitUntil: 'domcontentloaded',
timeout: 30000
});
// STEP 7: Wait for everything to load (JS can be slow)
await page.waitForTimeout(3000);
console.log('✅ Page loaded. Extracting data with AI...\n');
// STEP 8: MAGIC - LLM extracts data per our schema
const { data } = await scraper.run(page, festivalSchema, {
format: 'html', // Pass full HTML to LLM for context
});
console.log('✅ Extraction completed!\n');
console.log('📊 Extracted data:');
console.log(JSON.stringify(data, null, 2));
// STEP 9: Save results to JSON
const outputPath = path.join(process.cwd(), 'festival-data.json');
fs.writeFileSync(outputPath, JSON.stringify(data, null, 2));
console.log(`\n💾 Saved to: ${outputPath}`);
} catch (error) {
console.error('❌ Error during scraping:', error);
process.exit(1);
} finally {
// STEP 10: Always close browser
await browser.close();
}
}
main();

Step by Step Flow:
1. Validate URL ✓
↓
2. Initialize Gemini ✓
↓
3. Launch Playwright ✓
↓
4. Set User-Agent + Headers ✓
↓
5. Navigate to URL ✓
↓
6. Wait for JavaScript ✓
↓
7. Pass HTML + Schema to Gemini ✓
↓
8. Gemini extracts data per schema ✓
↓
9. Save JSON ✓
↓
10. Close browser ✓
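One assumption the script makes is that the API key is actually set. A small fail-fast guard near the top of main() (my addition, not in the original code) avoids a confusing error later:

// Optional guard: fail fast if the Gemini key is missing.
if (!process.env.GOOGLE_GENERATIVE_AI_API_KEY) {
  console.error('❌ Missing GOOGLE_GENERATIVE_AI_API_KEY. Check your .env file.');
  process.exit(1);
}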
Troubleshooting:
- “Timeout waiting for page”: increase the timeout in goto() to 60000ms
- “Chromium error”: run npx playwright install
- “Incomplete data”: increase waitForTimeout to 5000ms or add more .describe() hints to schema fields
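If timeouts keep biting even at 60000ms, one option is to retry the navigation a couple of times. A sketch of a hypothetical helper, not part of the original script:

import type { Page } from 'playwright';

// Hypothetical helper: retry page.goto() a few times before giving up.
async function gotoWithRetry(page: Page, url: string, attempts = 3): Promise<void> {
  for (let i = 1; i <= attempts; i++) {
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 60_000 });
      return;
    } catch (err) {
      console.warn(`⚠️ Navigation attempt ${i}/${attempts} failed`);
      if (i === attempts) throw err;
    }
  }
}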
5/5 Step 5: Compile and Run
Now that we have all the code, it’s time to execute.
tsconfig.json (TypeScript configuration):
{
"compilerOptions": {
"target": "ES2020",
"module": "ES2020",
"moduleResolution": "node",
"lib": ["ES2020"],
"outDir": "./dist",
"rootDir": "./src",
"strict": true,
"esModuleInterop": true,
"skipLibCheck": true,
"forceConsistentCasingInFileNames": true
},
"include": ["src/**/*"],
"exclude": ["node_modules", "dist"]
}

Compile and Execute:
# OPTION 1: Compile + Execute in one command (recommended)
npm run dev -- "https://www.metalfestivals.top/festivals/1/lineup"
# OPTION 2: Compile first, then execute
npm run build
npm start -- "https://www.metalfestivals.top/festivals/1/lineup"
# OPTION 3: Execute directly (if already compiled)
node dist/scraper/index.js "https://www.metalfestivals.top/festivals/1/lineup"

Expected Output:
🎵 Festival Scraper - Extracting data...
🔗 URL: https://www.metalfestivals.top/festivals/1/lineup
Using provider: Google Gemini
📄 Navigating to: https://www.metalfestivals.top/festivals/1/lineup
✅ Page loaded. Extracting data with AI...
✅ Extraction completed!
📊 Extracted data:
{
"festivalName": "Samaïn Fest",
"location": {
"city": "La Mézière",
"country": "France"
},
"bands": [
{
"name": "Aetheria Conscientia",
"origin": "France",
"subgenres": ["Atmospheric Black Metal"]
},
// ... 16 more bands ...
]
}
💾 Saved to: /path/festival-data.json
✅ Done! Your data is in festival-data.json
Final Result
We’ve built a simple, effective, professional scraper that:
✅ Navigates sites with JavaScript rendering (Playwright)
✅ Automatically extracts structured data (Gemini AI)
✅ Robustly validates types (Zod + TypeScript)
✅ Handles errors gracefully
✅ Saves results in organized JSON
✅ Takes just 5 minutes to set up
Final Project Structure:
festivals-scraper/
├── src/scraper/
│ ├── index.ts # Main script (108 lines)
│ ├── schema.ts # Zod validation (40 lines)
│ └── llmProvider.ts # Gemini config (16 lines)
├── dist/ # Compiled (generated)
├── festival-data.json # ← RESULT
├── .env # Environment variables
└── package.json # Dependencies
Quick Start:
# 1. Install
npm install && npx playwright install
# 2. Configure .env with your Gemini API key
echo 'GOOGLE_GENERATIVE_AI_API_KEY="YOUR_API_KEY"' > .env
# 3. Run
npm run dev -- "https://www.metalfestivals.top/festivals/1/lineup"
# 4. See results
cat festival-data.json

Pro Tips
- Scrape multiple URLs in parallel (see the TypeScript sketch after this list):
npm start -- "url1" &
npm start -- "url2" &
npm start -- "url3" &
wait
- Adding new fields is trivial: just edit schema.ts:
// Add ticket price
export const festivalSchema = z.object({
// ... other fields ...
ticketPrice: z.string().optional()
.describe("Average ticket price"),
});
- Use different URLs easily:
npm run dev -- "https://www.metalfestivals.top/festivals/1/lineup"
npm run dev -- "https://www.metalfestivals.top/festivals/2/lineup"
npm run dev -- "https://www.metalfestivals.top/festivals/3/lineup"Common Pitfalls
Common Pitfalls
- ❌ Forget the User-Agent → you get blocked. ✅ Always use a realistic User-Agent
- ❌ Don’t wait for JavaScript → empty data. ✅ Use waitForTimeout(3000)
- ❌ API key in the repository → security risk. ✅ Keep .env in .gitignore
- ❌ No error handling → the script fails silently. ✅ Always use try/catch
#TypeScript #Playwright #WebScraping #LLM #AI #Automation #Tutorial
Try this tutorial and share your results!
Did you manage to extract data? Adapted it to another site? Share your results and let us know how you did it!
Frequently Asked Questions:
- Can I use it with another site? Yes, just change the URL and schema
- Cost? Free for typical use: Google Gemini’s free tier includes a per-minute request quota (limits vary by model)
- How do I extract more fields? Add them to the schema with .describe() and Gemini will extract them
Complete Repository
You can review the complete code and run it yourself in the demo-scraper repository:
GitHub Repository: https://github.com/juliocabrera820/demo-scraper
Clone it and start scraping!