DNC Tech Choices: Why we chose Google BigQuery

Migrating to a new data warehouse

This blog post walks through how we chose and migrated to Phoenix, our data warehouse. Thanks to a number of hard-working colleagues, this work was already underway when I joined the DNC in the summer of 2019. Thank you to Ben Matasar for his collaboration on this post.

Where we were in 2018

The data warehouse that served as our foundation during the 2016 and 2018 election cycles dated back to Obama's 2012 run. It was fed by ETL pipelines custom-built by previous generations of the team. As people came and went with each election cycle, maintenance suffered and tech debt piled up. This is a familiar challenge in political tech: data, analytics, and digital organizing tools are vital to winning an election, but they often languish after the election is over. While the warehouse had served us well in earlier victories, we needed a new data warehouse that would support Democrats up and down the ballot for the 2020 cycle and future election cycles.

Why build vs. buy? And why Google BigQuery?

When we started evaluating data warehouse options, we had a few objectives in mind:

- Serve not just the next presidential candidate, but all Democratic candidates across all 50 states and D.C. Solution: prioritize multi-tenancy and a high degree of concurrency.
- Easily scale up for the frenetic intensity of an election, but also scale down with a limited team for its aftermath. Solution: buy a managed service rather than building and operating our own instance.
- Reduce the number of custom solutions that are susceptible to documentation rot and knowledge loss. Solution: privilege cloud services or open-source frameworks that are actively maintained outside of the DNC.
- Build tools for users with varying levels of sophistication. Solution: create an ecosystem with a variety of services for power users and laypeople.
- Onboard and offboard hundreds of staff members across dozens of organizations in a secure and scalable way. Solution: prioritize a platform that integrates with single sign-on, centralized provisioning, and role-based access, and buy into a platform with best-in-industry security practices.

After evaluating Amazon Redshift, Azure, Snowflake, Google BigQuery, and the option of simply upgrading our legacy platform, we chose to adopt Google BigQuery. It satisfied our major goals for multi-tenancy, usage concurrency, and being a service that a relatively small team could manage. A big part of the appeal of moving to Google Cloud Platform was that scaling our infrastructure up and down was a natural part of the product, and we would pay only for what we used.

From bursts of building to continuous stewardship

In December 2018, we started building a proof of concept. By early 2019, we kicked off our MVP. In the fallow period between elections, we prioritized rebuilding syncs for the data that mattered most to our users at state parties and sister committees: voter file data and data from VoteBuilder. It was essential that we migrate users at state parties to "Phoenix," the new warehouse, as early as possible. This let our users get started in the new data warehouse and reduced our need to support the legacy warehouse. By midsummer 2019, only internal DNC data teams were using the old platform. By the fall, we had finished migrating the archives. We said goodbye to the old warehouse with a Viking funeral.

Where we are in 2020

In the year and a half since we began the migration, we have been able to do so much more now that we're on BigQuery and Google Cloud Platform. State party data teams, sister committees, and presidential primary campaigns leverage Phoenix to manage, access, and organize data.
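The multi-tenant, role-based access model described above maps naturally onto GCP IAM policy bindings. As a hypothetical sketch (the group names here are illustrative, not the DNC's actual configuration), each partner organization can be granted the ability to run queries plus read access to the datasets it owns:

```json
{
  "bindings": [
    {
      "role": "roles/bigquery.jobUser",
      "members": [
        "group:state-party-analysts@example.org",
        "group:sister-committee-analysts@example.org"
      ]
    },
    {
      "role": "roles/bigquery.dataViewer",
      "members": [
        "group:state-party-analysts@example.org"
      ]
    }
  ]
}
```

In practice, `roles/bigquery.dataViewer` would be bound at the dataset level rather than project-wide so that each tenant sees only its own data, and binding groups rather than individual users is what makes onboarding and offboarding hundreds of staff tractable.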
State parties, sister committees, and campaigns up and down the ballot use Phoenix, the DNC's data warehouse.

The reduced support overhead has allowed us to go deep on projects like automating and streamlining our voter file updates, as well as building out new data pipelines with other progressive data partners. Seamless integration with Google Sheets has made our data more accessible to users, while more sophisticated tools like MLflow have become an essential part of our data modeling infrastructure.

Beyond 2020

Phoenix will be a reliable and secure warehouse for Democratic data beyond the 2020 election cycle. This long-term infrastructure lets our team and Democratic campaigns focus on building tools and talking to voters rather than managing infrastructure. Across the progressive ecosystem, our teams are small, our funding is tight, and our deadlines are tighter. We are in the business of winning elections, not husbanding a database cluster. By using Google BigQuery, we are able to focus on building tools that make a difference in 2020 and beyond. And that's something that excites us all.