Hello everyone,
I’m working on a project where I need to connect to an API to download 900 million records and upload them to a BigQuery table. I already have working code, but it’s too slow to handle this volume of data in a reasonable amount of time.
Here’s what my code does:
- Makes a request to the API, which returns a JSON with 200 records per request.
- Uses Python to concatenate the next-page token onto the request and continues fetching data page by page.
- After a certain number of requests, the data is grouped into a DataFrame and uploaded to BigQuery (roughly as in the sketch after this list).
- Currently, I manage to process about 1 million records per hour, making 1 request every 0.6 seconds. This means the entire process would take over a month.
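Here is a minimal sketch of that flow. The endpoint URL, the "records" and "nextPageToken" fields, the table ID, and the batch size are all placeholders, since I can't share the real API details:

```python
import pandas as pd
import requests
from google.cloud import bigquery

API_URL = "https://api.example.com/records"   # placeholder endpoint
TABLE_ID = "my-project.my_dataset.my_table"   # placeholder BigQuery table
BATCH_PAGES = 5000                            # placeholder: pages per upload

client = bigquery.Client()
rows, page_token = [], None

while True:
    # The real API returns ~200 records per page plus a next-page token.
    params = {"pageToken": page_token} if page_token else {}
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    payload = resp.json()

    rows.extend(payload["records"])
    page_token = payload.get("nextPageToken")

    # After a certain number of pages (or at the end), group the rows into
    # a DataFrame and run a load job into BigQuery.
    if len(rows) >= BATCH_PAGES * 200 or not page_token:
        df = pd.DataFrame(rows)
        client.load_table_from_dataframe(df, TABLE_ID).result()
        rows.clear()

    if not page_token:
        break
```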
I’ve tried:
- Saving DataFrames into dictionaries and stacking them into lists before uploading, but this didn’t make a noticeable difference (see the sketch after this list).
- Asynchronous loading, but the API does not support asynchronous operations.
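For reference, the batching variant I tried looks roughly like this. It assumes a hypothetical fetch_page() helper wrapping the request from the sketch above, and reuses the client and TABLE_ID defined there:

```python
import pandas as pd

frames = []
page_token = None
for _ in range(5000):                                # placeholder batch size
    records, page_token = fetch_page(page_token)     # hypothetical helper
    frames.append(pd.DataFrame(records))
    if not page_token:
        break

# Stack the small per-page DataFrames and upload them in one load job.
batch_df = pd.concat(frames, ignore_index=True)
client.load_table_from_dataframe(batch_df, TABLE_ID).result()
```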
I’m looking for ways to optimize this process, whether by improving my current code or using advanced data handling strategies. I understand it won’t be possible to reduce the time to just a few hours, but any improvement would be greatly appreciated.
Any ideas, suggestions, or similar experiences would be incredibly helpful. Thank you!
submitted by /u/OldWanderingEngineer