Extracting some data from a website? /u/agent_wolfe Python Education

I’m having some difficulty with this python program to extract some data from a website.

    import requests
    from bs4 import BeautifulSoup
    import re
    import json
    import csv

    def extract_data_from_url(url):
        try:
            response = requests.get(url)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')
            script_tags = soup.find_all('script')
            print(f"Number of script tags found: {len(script_tags)}")

            prompt = "Not Found"
            date = "Not Found"

            for index, script in enumerate(script_tags):
                print(f"Checking script {index+1}...")
                if "metadata" in script.text:  # Look for the presence of 'metadata'
                    print(script.text)  # PRINTS OUT SCRIPT CHUNK HERE
                else:
                    print(f"Script {index+1} does not contain 'metadata'.")

            return {
                'URL': url,
                'Prompt': prompt,
                'Date': date,
                'Lyrics': prompt
            }
        except Exception as e:
            print(f"Error processing {url}: {e}")
            return {
                'URL': url,
                'Prompt': 'Error',
                'Date': 'Error',
                'Lyrics': 'Error'
            }

    def main():
        # Input: List of URLs
        input_file = 'urls.txt'  # File containing one URL per line
        output_file = 'extracted_data.csv'  # File to save the extracted data

        try:
            with open(input_file, 'r') as file:
                urls = [line.strip() for line in file if line.strip()]

            # Extract data from each URL
            extracted_data = []
            for url in urls:
                print(f"Processing: {url}")
                data = extract_data_from_url(url)
                extracted_data.append(data)

            # Save the extracted data to a CSV file
            with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
                fieldnames = ['URL', 'Title', 'Prompt', 'Date', 'Lyrics']
                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerows(extracted_data)

            print(f"Data extraction complete. Saved to {output_file}")
        except FileNotFoundError:
            print(f"Error: The file {input_file} was not found.")
        except Exception as e:
            print(f"An unexpected error occurred: {e}")

    if __name__ == '__main__':
        main()

What’s happening:

  • I have each URL in the urls.txt on a separate line.
  • I am looking for “Title”, “Prompt”, “Date”, and “Tags” which are stored in a <script> as "metadata"
  • When I run the python program, it goes to each webpage and tries to locate each <script> section. There are 49 of them.
  • (In the browser the data ends up rendered as HTML, but I think the program fetches the page before that happens.)
  • The script with the metadata is Script 46.
  • When it finds the correct one, it # PRINTS OUT SCRIPT CHUNK HERE

That’s where I’ve gotten stuck. I’m not sure how to get the correct details out of this chunk of data. Here’s what it’s printing out:

    Checking script 46...
    self.__next_f.push([1,"16:I[some stuff...,[(some stuff.....],some stuff...]n6:[some stuff..."metadata":{"tags":"TAGS","prompt":"PROMPT",some stuff...},some stuff...,"created_at":"DATE"])

(It contained 2250 characters but I tried to simplify it to just the parts I need.)
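
One possible way to pull the three values out of a chunk shaped like the one above is sketched here. It assumes the argument to self.__next_f.push(...) is a JSON-compatible array whose second element is one long string, and that "tags", "prompt" and "created_at" each appear in that string as a plain "key":"value" pair. The helper names decode_next_chunk and find_string_field are made up for this illustration, and the sample string is invented because the real 2250-character chunk isn't shown.

```python
import json
import re


def decode_next_chunk(script_text):
    """Return the string carried by self.__next_f.push([...]), or the raw text.

    Assumption: the pushed array parses as JSON and its second item is a string.
    If that does not hold, the original text is returned unchanged.
    """
    match = re.search(r'self\.__next_f\.push\((\[.*\])\)', script_text, re.DOTALL)
    if match:
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            return script_text
        if isinstance(payload, list) and len(payload) > 1 and isinstance(payload[1], str):
            return payload[1]
    return script_text


def find_string_field(text, key, default="Not Found"):
    """Find the first "key":"value" pair in text and return the value."""
    # The value pattern accepts escaped characters such as \" inside the string.
    pattern = r'"%s"\s*:\s*("(?:[^"\\]|\\.)*")' % re.escape(key)
    match = re.search(pattern, text)
    if not match:
        return default
    try:
        return json.loads(match.group(1))  # turns \" and \n back into real characters
    except json.JSONDecodeError:
        return match.group(1)[1:-1]  # fall back to the raw text between the quotes


if __name__ == "__main__":
    # Invented sample that only imitates the shape of the real chunk.
    inner = '6:{"metadata":{"tags":"TAGS","prompt":"PROMPT"},"created_at":"DATE"}'
    sample = "self.__next_f.push(" + json.dumps([1, inner]) + ")"
    text = decode_next_chunk(sample)
    for key in ("tags", "prompt", "created_at"):
        print(key, "=", find_string_field(text, key))
```

Inside the loop, the line that prints the chunk could instead call decode_next_chunk(script.text) and then find_string_field(...) once per key. If the real chunk keeps its quotes escaped and the JSON decoding step fails, the lookups fall back to "Not Found", so the pattern would need adjusting for that case.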

TLDR:

The program is finding the right script section with Metadata. I’m just not sure how to get the Tags, Prompt, and Date so they can be saved to the CSV file.
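
On the CSV side, one thing to watch once a 'Tags' value is added to the returned dictionary: csv.DictWriter raises a ValueError for any key that is not listed in fieldnames, while keys that are missing from a row are filled with restval. A minimal sketch of a writer set up for the fields mentioned above follows. The column list and the rows_to_csv helper are assumptions for illustration, and it writes to an in-memory buffer instead of extracted_data.csv so it can be run on its own.

```python
import csv
import io

# Assumed column list: the fields the post says it wants, plus the URL.
FIELDNAMES = ["URL", "Title", "Prompt", "Date", "Tags"]


def rows_to_csv(rows, fieldnames=FIELDNAMES):
    """Write a list of dicts as CSV text and return it."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        restval="Not Found",      # used when a row has no value for a column
        extrasaction="ignore",    # silently drop keys that are not columns
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    example_rows = [
        {"URL": "https://example.com/a", "Prompt": "PROMPT", "Date": "DATE", "Tags": "TAGS"},
    ]
    print(rows_to_csv(example_rows))
```

The same DictWriter arguments work with the open(output_file, 'w', newline='', encoding='utf-8') call already in main().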

submitted by /u/agent_wolfe
