PJ Tatlow and Faculty Mentor Stephen Piccolo, Department of Biology
Over the course of the past year I have been able to put a lot of work into creating a tool for scientists, those with computational background and without, that provides a simple web interface for downloading data from large, publically available datasets. It allows users to select a dataset that we have pre-processed, enter filters based on the samples they want to research, and download only that portion of the data. This will save researchers a lot of time, as well as reduce their reliance on bioinformaticians. Because of the increasing number of biological datasets that are becoming publically available, we hope this tool will facilitate more scientists taking advantage of the wealth of data that is now at their fingertips.
Once they’ve selected filters that give them the samples they’re interested in, they go to the downloads page where they can select the genes that they want to study, the sample variables to include in their download, and finally download the data. The Python server then downloads just the samples and variables they requested in the file format of their choosing (TSV, CSV, or JSON). We have also integrated Geney with a third-party visualization tool called Plot.ly (https://plot.ly), a very powerful plotting tool that can create pretty much any type of chart and export images. This pairing should allow users to not only analyze data easier, but also show their discoveries through high quality and easy to create plots.
But not only are people going to be able to use Geney on our servers with the datasets we’ve prepared, but I’ve developed it in a way that will make it easy for others to bring up their own Geney server. Many applications that focus on working with big data require large-scale distributed databases that require lots of management and large servers. We decided to take a federated approach, so that the load can be shared across a variety of web-servers hosted by different groups rather than one very expensive site. Additionally, if scientists have private data that they cannot share they can host their own private version of Geney to allow easy internal access. Because of this federated architecture, we needed the datasets to be highly portable. To do this, we researched many different embedded data formats that allow fast searching access, as well as created a pipeline which allows datasets to be created and put into the correct Geney format. We call this pipeline WishBuilder (https://srp33.github.io/WishBuilder), and we’ve already prepared over 50 datasets for people to use with Geney, with many more to come!