I am interested in retrieving data from a website through an endpoint that takes game IDs as a comma-separated query parameter. Testing the endpoint, it seems to work with both a single ID and multiple IDs.
I am scraping football games, so I perform requests to multiple bookmakers asynchronously. For five game IDs, the returned JSON is about 500,000 lines and about 16 to 18 MB in memory; divide those numbers by 5 for a single game ID.
I have two options:
- Construct a request for each game and then gather the tasks associated with querying them.
- Use a single request with all the game IDs and map the returned events to the games internally in my script, as I already keep the `game_id` for each game from earlier in the script flow.
Is there any performance difference (a significant one, at least) between these two approaches? Which one should I implement? In case it plays a role: the above query runs in parallel with other queries. I have about 6 scrapers, and each of them is running requests querying the same games at other bookmakers at the same moment.
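For reference, here is a minimal sketch of how I think about the two options. The base URL, the `gameIds` parameter name, and the JSON shape are placeholders I made up, not the site's real API; the actual requests would be fired with `aiohttp` and `asyncio.gather`, which I've left out to keep the sketch self-contained.

```python
from collections import defaultdict

# Hypothetical endpoint; the real URL and parameter name differ.
BASE_URL = "https://example-bookmaker.com/api/events"

def single_urls(game_ids):
    """Option 1: build one request URL per game ID (one task each, gathered later)."""
    return [f"{BASE_URL}?gameIds={g}" for g in game_ids]

def batched_url(game_ids):
    """Option 2: build one request URL carrying all IDs, comma-separated."""
    return f"{BASE_URL}?gameIds={','.join(str(g) for g in game_ids)}"

def map_events_by_game(events):
    """Option 2 post-processing: group the events from the combined
    response by the game_id I already track earlier in the script."""
    by_game = defaultdict(list)
    for event in events:
        by_game[event["game_id"]].append(event)
    return dict(by_game)
```

Option 1 trades more HTTP round trips for responses that arrive pre-separated per game; option 2 is one round trip plus the cheap in-memory grouping above.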
submitted by /u/WinterDazzling
r/learnpython