Web Scraping Tutorial: Extracting Metal Festival Data with TypeScript, Playwright & AI

2025-11-13 · 13 min read · TypeScript

Problem

Ever wanted to extract data from a complex website automatically, only to be stopped by dynamically rendered HTML? Need structured data from thousands of festivals without manual work? In this tutorial, I’ll teach you how to build an intelligent scraper that extracts data from https://www.metalfestivals.top using modern technologies.

What You’ll Learn

In this article, I’ll show you, step by step, how I built a web scraper that:

  • Automatically navigates sites with dynamic JavaScript content
  • Extracts structured data using AI (Google Gemini)
  • Validates and structures data with TypeScript
  • Handles errors and provides realistic User-Agents
  • Saves everything in organized JSON

Why It’s Useful:

  • Automates repetitive data collection tasks
  • Avoids writing complicated CSS selectors
  • Uses AI to “understand” the page automatically
  • Simple and straightforward implementation

Prerequisites

  • Node.js (v18 or later) and npm
  • A free Google Gemini API key (get one at https://aistudio.google.com/app/api-keys)
  • Basic TypeScript knowledge

1/5 Step 1: Project Setup and Dependencies

The first step is creating the base structure. Our project needs:

Main Dependencies:

  • playwright: Navigate sites with JavaScript rendering
  • llm-scraper: Library combining Playwright + LLM
  • zod: Schema validation with types
  • @ai-sdk/google: Google Gemini AI integration
  • dotenv: Manage environment variables

package.json:

{
  "name": "festivals-scraper",
  "version": "1.0.0",
  "type": "module",
  "description": "Web scraper for extracting festival data using LLM",
  "main": "dist/index.js",
  "scripts": {
    "start": "node dist/scraper/index.js",
    "build": "tsc",
    "dev": "npm run build && node dist/scraper/index.js"
  },
  "dependencies": {
    "@ai-sdk/google": "^1.x",
    "dotenv": "^16.3.1",
    "llm-scraper": "latest",
    "playwright": "^1.48.1",
    "zod": "^3.22.4"
  },
  "devDependencies": {
    "@types/node": "^22.7.5",
    "typescript": "^5.6.3"
  }
}

Commands:

# 1. Install dependencies
npm install
 
# 2. Install Playwright browsers
npx playwright install
 
# 3. Create .env file
cp .env.example .env

Folder Structure:

festivals-scraper/
├── src/
│   └── scraper/
│       ├── index.ts          ← Main script
│       ├── schema.ts         ← Zod validation
│       └── llmProvider.ts    ← Gemini configuration
├── dist/                     ← Compiled (generated)
├── package.json
├── tsconfig.json
└── .env                      ← Credentials (don't push)

Pro Tip: Use npm run dev -- "URL" which automatically compiles TypeScript and runs the scraper.


2/5 Step 2: Define Data Schema with Zod

Before scraping, you MUST define WHAT data you want to extract. Here’s where Zod shines.

The Problem: Without a clear schema, the LLM doesn’t know what to extract. With Zod, we give precise instructions.

src/scraper/schema.ts:

import { z } from 'zod';
 
/**
 * Band schema with key information
 */
const bandSchema = z.object({
  name: z.string()
    .describe("Band name."),
  origin: z.string()
    .describe("Band's country of origin."),
  subgenres: z.array(z.string())
    .describe("List of music subgenres."),
});
 
/**
 * MAIN schema: Defines entire festival structure
 * LLM will use this to know exactly what to extract
 */
export const festivalSchema = z.object({
  festivalName: z.string()
    .describe("Complete official festival name."),
  location: z.object({
    city: z.string()
      .describe("Festival city."),
    country: z.string()
      .describe("Festival country."),
  }).describe("Geographic location."),
  bands: z.array(bandSchema)
    .describe("List of performing bands."),
});
 
// TypeScript infers type automatically
export type FestivalData = z.infer<typeof festivalSchema>;

Expected Result:

{
  "festivalName": "Samaïn Fest",
  "location": {
    "city": "La Mézière",
    "country": "France"
  },
  "bands": [
    {
      "name": "Septicflesh",
      "origin": "Greece",
      "subgenres": ["Symphonic Death Metal"]
    },
    {
      "name": "Ufomammut",
      "origin": "Italy",
      "subgenres": ["Sludge Metal"]
    }
  ]
}
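Because the schema is plain Zod, you can also reuse it later to re-validate data you read back from festival-data.json. A minimal sketch (a hypothetical validate.ts helper, not part of the project above):

import fs from 'fs';
import { festivalSchema } from './schema.js';

// Re-validate previously saved output; safeParse reports issues instead of throwing.
const raw = JSON.parse(fs.readFileSync('festival-data.json', 'utf-8'));
const result = festivalSchema.safeParse(raw);

if (result.success) {
  console.log(`✅ ${result.data.bands.length} bands validated`);
} else {
  console.error('❌ Invalid data:', result.error.issues);
}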

Troubleshooting: If the LLM doesn’t extract certain fields, add clearer .describe() instructions.
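For example, sharpening a vague .describe() often fixes missing fields. A hypothetical before/after on the origin field:

// Vague: the LLM may guess or skip the field.
origin: z.string().describe("Origin."),

// Specific: tells the LLM exactly what to return.
origin: z.string()
  .describe("Band's country of origin as a full country name, e.g. 'Greece'."),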


3/5 Step 3: Configure Google Gemini

We’re using Google Gemini as our LLM provider - it’s free, fast, and powerful.

src/scraper/llmProvider.ts:

import { google } from '@ai-sdk/google';
import dotenv from 'dotenv';
 
dotenv.config();
 
/**
 * Get Google Gemini LLM provider
 */
export function getLLMProvider() {
  console.log('Using provider: Google Gemini');
  return google('gemini-2.0-flash');
}

.env (configuration):

# Google Gemini API Key
# Get your free API key at: https://aistudio.google.com/app/api-keys
GOOGLE_GENERATIVE_AI_API_KEY="YOUR_API_KEY_HERE"
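@ai-sdk/google reads GOOGLE_GENERATIVE_AI_API_KEY from the environment automatically, so no explicit key wiring is needed. Still, a small guard fails fast with a clear message if the key is missing; a minimal sketch you could add to llmProvider.ts:

// In llmProvider.ts, after dotenv.config():
if (!process.env.GOOGLE_GENERATIVE_AI_API_KEY) {
  throw new Error(
    'GOOGLE_GENERATIVE_AI_API_KEY is not set. Add it to your .env file.'
  );
}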

4/5 Step 4: Implement Main Scraper

This is the heart of the project. Here we bring everything together.

src/scraper/index.ts:

import { chromium } from 'playwright';
import LLMScraper from 'llm-scraper';
import { festivalSchema, FestivalData } from './schema.js';
import { getLLMProvider } from './llmProvider.js';
import dotenv from 'dotenv';
import fs from 'fs';
import path from 'path';
 
dotenv.config();
 
/**
 * STEP 1: Validate and get festival URL
 * User passes it like: npm run dev -- "https://example.com"
 */
function getFestivalUrl(): string {
  const args = process.argv.slice(2);
  
  if (args.length === 0) {
    console.error('❌ Error: You must provide a festival URL as argument');
    console.error('\nUsage:');
    console.error('  npm run dev -- "https://www.metalfestivals.top/festivals/{festival_id}/lineup"');
    process.exit(1);
  }
 
  const url = args[0];
  
  // Validate it's a valid URL
  try {
    new URL(url);
    return url;
  } catch (error) {
    console.error(`❌ Error: URL is not valid: ${url}`);
    process.exit(1);
  }
}
 
/**
 * MAIN FUNCTION: Orchestrate entire scraping process
 */
async function main() {
  const FESTIVAL_URL = getFestivalUrl();
 
  console.log('🎵 Festival Scraper - Extracting data...\n');
  console.log(`🔗 URL: ${FESTIVAL_URL}\n`);
 
  // STEP 2: Initialize LLM provider
  const llm = getLLMProvider();
 
  // STEP 3: Create Scraper instance
  const scraper = new LLMScraper(llm);
 
  // STEP 4: Launch browser
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage({
    // Realistic User-Agent to avoid blocking
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  });
 
  try {
    console.log(`📄 Navigating to: ${FESTIVAL_URL}`);
    
    // STEP 5: Set headers to look like a real browser
    await page.setExtraHTTPHeaders({
      'Accept-Language': 'en-US,en;q=0.9',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
      'Referer': 'https://www.google.com/',
    });
 
    // STEP 6: Navigate to URL (wait for JavaScript)
    await page.goto(FESTIVAL_URL, { 
      waitUntil: 'domcontentloaded', 
      timeout: 30000 
    });
 
    // STEP 7: Wait for everything to load (JS can be slow)
    await page.waitForTimeout(3000);
 
    console.log('✅ Page loaded. Extracting data with AI...\n');
 
    // STEP 8: MAGIC - LLM extracts data per our schema
    const { data } = await scraper.run(page, festivalSchema, {
      format: 'html', // Pass full HTML to LLM for context
    });
 
    console.log('✅ Extraction completed!\n');
    console.log('📊 Extracted data:');
    console.log(JSON.stringify(data, null, 2));
 
    // STEP 9: Save results to JSON
    const outputPath = path.join(process.cwd(), 'festival-data.json');
    fs.writeFileSync(outputPath, JSON.stringify(data, null, 2));
    console.log(`\n💾 Saved to: ${outputPath}`);
 
  } catch (error) {
    console.error('❌ Error during scraping:', error);
    process.exit(1);
  } finally {
    // STEP 10: Always close browser
    await browser.close();
  }
}
 
main();

Step by Step Flow:

1. Validate URL ✓
   ↓
2. Initialize Gemini ✓
   ↓
3. Launch Playwright ✓
   ↓
4. Set User-Agent + Headers ✓
   ↓
5. Navigate to URL ✓
   ↓
6. Wait for JavaScript ✓
   ↓
7. Pass HTML + Schema to Gemini ✓
   ↓
8. Gemini extracts data per schema ✓
   ↓
9. Save JSON ✓
   ↓
10. Close browser ✓

Troubleshooting:

  • “Timeout waiting for page”: Increase the goto() timeout to 60000ms (see the sketch below)
  • “Chromium error”: Run npx playwright install
  • “Incomplete data”: Increase waitForTimeout to 5000ms, wait for a concrete load signal (sketch below), or add more .describe() detail to schema fields
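
A fixed waitForTimeout works, but waiting for an explicit signal that the content has arrived is more reliable. A sketch using Playwright's built-in waits (the .lineup selector is hypothetical; replace it with a real element from the target page):

// Give slow pages more time and wait for an explicit load signal.
await page.goto(FESTIVAL_URL, {
  waitUntil: 'domcontentloaded',
  timeout: 60000, // bumped from 30000 for slow pages
});

// Wait until network activity settles (all XHR/fetch requests finished)...
await page.waitForLoadState('networkidle');

// ...or wait for a concrete element that only appears once data is rendered:
// await page.waitForSelector('.lineup', { timeout: 10000 });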

5/5 Step 5: Compile and Run

Now that we have all the code, it’s time to execute.

tsconfig.json (TypeScript configuration):

{
  "compilerOptions": {
    "target": "ES2020",
    "module": "ES2020",
    "moduleResolution": "node",
    "lib": ["ES2020"],
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules", "dist"]
}

Compile and Execute:

# OPTION 1: Compile + Execute in one command (recommended)
npm run dev -- "https://www.metalfestivals.top/festivals/1/lineup"
 
# OPTION 2: Compile first, then execute
npm run build
npm start -- "https://www.metalfestivals.top/festivals/1/lineup"
 
# OPTION 3: Execute directly (if already compiled)
node dist/scraper/index.js "https://www.metalfestivals.top/festivals/1/lineup"

Expected Output:

🎵 Festival Scraper - Extracting data...

🔗 URL: https://www.metalfestivals.top/festivals/1/lineup

Using provider: Google Gemini
📄 Navigating to: https://www.metalfestivals.top/festivals/1/lineup
✅ Page loaded. Extracting data with AI...

✅ Extraction completed!

📊 Extracted data:
{
  "festivalName": "Samaïn Fest",
  "location": {
    "city": "La Mézière",
    "country": "France"
  },
  "bands": [
    {
      "name": "Aetheria Conscientia",
      "origin": "France",
      "subgenres": ["Atmospheric Black Metal"]
    },
    // ... 16 more bands ...
  ]
}

💾 Saved to: /path/festival-data.json

Done! Your data is in festival-data.json


Final Result

We’ve built a simple, effective, professional scraper that:

✅ Navigates sites with JavaScript rendering (Playwright)
✅ Automatically extracts structured data (Gemini AI)
✅ Robustly validates types (Zod + TypeScript)
✅ Handles errors gracefully
✅ Saves results in organized JSON
✅ Takes just 5 minutes to set up

Final Project Structure:

festivals-scraper/
├── src/scraper/
│   ├── index.ts          # Main script (108 lines)
│   ├── schema.ts         # Zod validation (40 lines)
│   └── llmProvider.ts    # Gemini config (16 lines)
├── dist/                 # Compiled (generated)
├── festival-data.json    # ← RESULT
├── .env                  # Environment variables
└── package.json          # Dependencies

Quick Start:

# 1. Install
npm install && npx playwright install
 
# 2. Configure .env with your Gemini API key
echo 'GOOGLE_GENERATIVE_AI_API_KEY="YOUR_API_KEY"' > .env
 
# 3. Run
npm run dev -- "https://www.metalfestivals.top/festivals/1/lineup"
 
# 4. See results
cat festival-data.json

Pro Tips

  1. Scrape multiple URLs in parallel with shell jobs (for an in-process alternative, see the TypeScript sketch after this list):
npm start -- "url1" &
npm start -- "url2" &
npm start -- "url3" &
wait
  2. Adding new fields is trivial - just edit schema.ts:
// Add ticket price
export const festivalSchema = z.object({
  // ... other fields ...
  ticketPrice: z.string().optional()
    .describe("Average ticket price"),
});
  3. Use different URLs easily:
npm run dev -- "https://www.metalfestivals.top/festivals/1/lineup"
npm run dev -- "https://www.metalfestivals.top/festivals/2/lineup"
npm run dev -- "https://www.metalfestivals.top/festivals/3/lineup"

Common Pitfalls

  • ❌ Forgetting the User-Agent → you get blocked. ✅ Always set a realistic User-Agent
  • ❌ Not waiting for JavaScript → empty data. ✅ Use waitForTimeout(3000) or wait for a load signal
  • ❌ Committing the API key to the repository → security risk. ✅ Keep it in .env and add .env to .gitignore
  • ❌ No error handling → the script fails silently. ✅ Always use try/catch

#TypeScript #Playwright #WebScraping #LLM #AI #Automation #Tutorial

Try this tutorial and share your results!

Did you manage to extract data? Did you adapt it to another site? Let us know how you did it!

Frequently Asked Questions:

  • Can I use it with another site? Yes, just change the URL and schema
  • Cost? Free! Google Gemini gives you 60 requests/minute for free
  • How do I extract more fields? Add them to the schema with .describe() and Gemini will extract them

Complete Repository

You can review the complete code and run it yourself in the demo-scraper repository:

GitHub Repository: https://github.com/juliocabrera820/demo-scraper

Clone it and start scraping!