Obtaining and Validating Big Data
In this video, Konstantinos Pouliasis presents a toolkit that combines bash scripting, a carefully designed Postgres database, and Node.js libraries for efficiently fetching, validating, and storing big data made publicly available by governmental sources.
Basic Unix shell scripting constructs and the relevant commands are introduced. The curl command fetches the zipped CSV files, and unzip is combined with sed to extract the data and perform an initial cleanup. The talk then demonstrates database-side validation of CSV data integrity, followed by a database design technique based on Postgres table partitioning, which is well suited to large seasonal data sets. Finally, the pg-pool library, with its pooled-connection capabilities, is presented as an ideal complement for automating data insertion with Node.js.
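One common way to validate CSV integrity on the database side is to load the raw text into an untyped staging table and let casts and constraints reject malformed rows on the way into the real table. The table and column names below are assumptions for illustration, not the presenter's schema.

```sql
-- Staging table: everything is text, so COPY accepts any well-formed CSV row.
CREATE TABLE staging_readings (
    station_id  text,
    reading_ts  text,
    value       text
);

-- In psql, \copy loads the cleaned file client-side:
--   \copy staging_readings FROM 'clean.csv' WITH (FORMAT csv, HEADER true);

-- Target table: types and constraints encode the integrity rules.
CREATE TABLE readings (
    station_id  integer     NOT NULL,
    reading_ts  timestamptz NOT NULL,
    value       numeric     NOT NULL CHECK (value >= 0)
);

-- Validation happens here: a row that cannot be cast, or that violates
-- a constraint, aborts the insert and surfaces the bad data.
INSERT INTO readings
SELECT station_id::integer,
       reading_ts::timestamptz,
       value::numeric
FROM   staging_readings;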
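For seasonal data, Postgres declarative range partitioning (available since Postgres 10) splits one logical table into per-season physical tables. The table and bound values below are illustrative assumptions; the benefit is that queries constrained to a season scan only the relevant partition, and old seasons can be detached or dropped cheaply.

```sql
-- Parent table partitioned by the timestamp column (illustrative schema).
CREATE TABLE readings (
    station_id  integer     NOT NULL,
    reading_ts  timestamptz NOT NULL,
    value       numeric     NOT NULL
) PARTITION BY RANGE (reading_ts);

-- One partition per quarter; inserts are routed automatically.
CREATE TABLE readings_2023_q1 PARTITION OF readings
    FOR VALUES FROM ('2023-01-01') TO ('2023-04-01');
CREATE TABLE readings_2023_q2 PARTITION OF readings
    FOR VALUES FROM ('2023-04-01') TO ('2023-07-01');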
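The fetch-and-clean step might look like the following sketch. The URL, file names, and sed expressions are assumptions for illustration, not the presenter's actual commands; a small sample CSV is fabricated locally so the cleanup stage can run self-contained.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the curl + unzip + sed step described in the talk.
set -euo pipefail

# In the real pipeline, curl fetches a zipped CSV from a public endpoint
# and unzip streams its contents (URL and names are assumptions):
#   curl -sSL -o data.zip "https://example.gov/open-data/2023.zip"
#   unzip -p data.zip > raw.csv

# For a self-contained demo, fabricate a "raw" CSV with the kinds of
# problems an initial cleanup targets: CRLF line endings and blank lines.
printf 'id,name,value\r\n1,alpha,10\r\n\r\n2,beta,20\r\n' > raw.csv

# sed-based cleanup: strip trailing carriage returns, drop empty lines.
sed -e 's/\r$//' -e '/^$/d' raw.csv > clean.csv

cat clean.csv   # prints the three cleaned CSV lines
```

Piping `unzip -p` straight into `sed` would avoid the intermediate file entirely for larger archives.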
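The insertion side can be sketched with node-postgres, whose `Pool` is backed by the pg-pool library. Node.js itself is single-threaded; what the pool provides is a set of concurrent pooled connections, so batches of inserts proceed in parallel without exhausting the server. Connection details, table, and data below are assumptions for illustration.

```javascript
// Hypothetical sketch of pooled inserts with node-postgres ("pg").
const { Pool } = require('pg');

const pool = new Pool({
  connectionString: process.env.DATABASE_URL, // assumed env var
  max: 10, // at most 10 concurrent connections
});

async function insertRows(rows) {
  // Each query checks a client out of the pool and returns it when done,
  // so the inserts run concurrently up to the pool's limit.
  await Promise.all(
    rows.map((r) =>
      pool.query(
        'INSERT INTO readings (station_id, reading_ts, value) VALUES ($1, $2, $3)',
        [r.stationId, r.readingTs, r.value]
      )
    )
  );
}

// Example usage (requires a running Postgres and the readings table):
// insertRows([{ stationId: 1, readingTs: '2023-01-15', value: 12.3 }])
//   .then(() => pool.end());
```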
Project Members: Konstantinos Pouliasis