Working with csv/tabbed files in Python

Hello everyone! I’ve recently been tasked to work with big data and, in particular, Hadoop and Yarn.
The journey is starting off with writing some mapreduce code and I’ve chosen Python to write it since Java is much harder for me and I don’t know much about it.

Onto my issue now: I need to extract from the map function all the rows that correspond to a specific pattern. I’m working on this data, and what I need for my reduce function is to extract the row with the highest value per location and sum the values of all the locations that belong to the same continent. And, in the end, order everything from highest to lowest value.

I’ve been looking around the internet but I can’t seem to find a solution for it. Using python is already a stretch and I can’t use things like Pandas that require dependencies.

Thanks in advance for the help!

I wish I could help, but I do know this is possible. I had to write a similar program that had to read a file and separate by space, comma, tab, etc. I chose a semicolon, but I don’t have enough knowledge to truly help you out. When I’m off work I’ll sit down with my program and see if I can assist you further!

1 Like

what other dependencies does pandas need?

also if you do get pandas working this is a pretty good cheatsheet

1 Like

Thanks for getting back to me. Reading the file has not been an issue, luckily. Using with open and csv.reader or csv.DictReader, I’ve been able to open the csv file, select the columns I need, and select a range between two dates.
I hope this helps you narrow down what I need help for. I appreciate it.

1 Like
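
For anyone following along, here is a rough sketch of that reading step using only the standard library. The column names (iso_code, continent, location, date, total_cases), the sample rows, and the helper name rows_between are all made up for illustration, since the real file isn’t shown in the thread, so treat it as a starting point rather than the actual code.

```python
import csv
import datetime
import io

# Made-up sample rows; the column names are assumptions, not the real dataset's.
SAMPLE = (
    "iso_code,continent,location,date,total_cases\n"
    "AAA,Asia,Alphaland,2020-03-31,10\n"
    "AAA,Asia,Alphaland,2020-04-15,25\n"
    "BBB,Europe,Betaland,2020-04-30,40\n"
    "BBB,Europe,Betaland,2020-05-01,45\n"
)

def rows_between(csv_file, start, end, columns):
    """Yield the chosen columns of every row whose date falls in [start, end]."""
    for row in csv.DictReader(csv_file):
        day = datetime.date.fromisoformat(row["date"])
        if start <= day <= end:
            yield {name: row[name] for name in columns}

# io.StringIO stands in for a real file opened with open(..., newline="").
selected = list(rows_between(
    io.StringIO(SAMPLE),
    datetime.date(2020, 4, 1),
    datetime.date(2020, 4, 30),
    ["continent", "location", "date", "total_cases"],
))
for row in selected:
    # Tab-separated output, one row per line.
    print("\t".join(row.values()))
```

With a real file you would pass the object returned by open() instead of the io.StringIO wrapper; the rest stays the same.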

What I meant to say is that Pandas needs to be installed to be used and it’s not part of python so I’d like to avoid it.

I mean that’s a valid concern, but it’s a really useful library for analyzing big data. Dataframes are huge, the functions really save a lot of effort writing code to loop through data. If you plan on doing visualizations, machine learning, or any future data analysis, it’s a tool used by almost all data analysts and data scientists.

Unfortunately I’m really stretching what I can use for this assignment. I should’ve been using java but I’m ass at it and I also have some hate towards it. I don’t expect my teacher to install something on top of python to review what I did.

it’s not a huge issue to install an additional python library, but it really depends on what you need to show with python.

If you use Jupyter notebook, you can host it on github, and it will show the code as well as the output. Your teacher wouldn’t need to install python to view the outputs.

this can get the nlargest for a specific column
https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.nlargest.html

you are gonna want a groupby for this one
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

you are gonna want a sort
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html

1 Like

Thanks for providing all these sources to check out, I really appreciate it.
What I need to do is select the highest value in a group of rows.
But what you just posted just made me think of a possible solution: grouping all the values using the ISO_code, then selecting the highest values for each group and then add all the values that have the same continent in the row. It might work.

2 Likes
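
The grouping idea above can be sketched without pandas, using plain dictionaries. This is only a sketch under assumptions: the records are made-up (iso_code, continent, value) tuples, and the real rows would come from the csv reader instead.

```python
from collections import defaultdict

# Made-up (iso_code, continent, total_cases) records, for illustration only.
records = [
    ("AAA", "Asia", 10),
    ("AAA", "Asia", 25),
    ("CCC", "Asia", 5),
    ("BBB", "Europe", 40),
    ("BBB", "Europe", 45),
]

# Step 1: keep the highest value seen for each ISO code.
highest = {}  # iso_code -> (continent, value)
for iso_code, continent, value in records:
    if iso_code not in highest or value > highest[iso_code][1]:
        highest[iso_code] = (continent, value)

# Step 2: add up those per-country highs by continent.
continent_totals = defaultdict(int)
for continent, value in highest.values():
    continent_totals[continent] += value

# Step 3: order from highest to lowest total.
ranked = sorted(continent_totals.items(), key=lambda item: item[1], reverse=True)
for continent, total in ranked:
    print(continent, total)
```

The dictionary keyed by ISO code plays the role of a groupby, and the second dictionary does the per-continent sum, so nothing has to be installed on top of python.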

Is this a school assignment?? :thinking:

you could also use sql to group by and sum
https://www.w3schools.com/sql/sql_groupby.asp

it really depends on what the class is about

  • if it’s a stats class, they probably care more about the final value
  • if it’s a Computer Science class, they probably care more about the programming
  • If it’s a Data Science class, they probably care about the method of how you got the final values

if you want to get into the field of data science, picking up python is more useful, but if you are doing more business and databases, sql would probably be more useful in the future. Java is good to learn if you are trying to get into programming

University assignment. But I suck big time at programming, never truly got into it, so that’s why I’m here.

This one.

I know SQL, but that’s a bit too much for this. What should do the job is mostly Hadoop. Using SQL would defeat the purpose of the assignment.

I don’t really understand why a data science class would use java. Outside of intro programming classes (and hell even those are shifting to kotlin) no stats or data science classes really uses java.

The majority of the industry uses either R or Python. A couple of companies use Tableau for visualizations, but most exploratory data analysis is done with R or Python. On the business side of data analysis you might see some Excel.

Scientific Research prefers matlab or R. Working in matlab without the live editor is like working in java programming.

1 Like

We livin’ in the 1500s here. I don’t understand either why they wanted us to use java, and that’s why I’m ignoring the way it was explained to us and switched to python.

1 Like

if the whole data science program forces you to learn java, I dunno if I’d stay with them.
Unless it was a branch of the computer science department, where their hope is you learn more of the computer science than the statistics aspect of data science.

Even then it still is a stretch, because most DS programs will use what the industry uses

@WolfTech716 @bedHedd

This is as far as I went with the reducer function. For now I’m passing it an output.txt file that comes from the mapper function. What I need to do now is sum all the numeric values in the second column if they belong to the same continent.
So in the first column I have Asia beside some numeric entries. I need to add together all the numbers that sit beside Asia.
I really don’t know how to do it, and I don’t even have an intuition for how to approach it.

import csv
import datetime

# Keep only the rows for the last date of the range.
cases_date = datetime.date(2020, 4, 30)
with open("output.txt", "r", newline="") as reduced:
    reduced_reader = csv.reader(reduced, delimiter='\t')
    for line in reduced_reader:
        # Column 6 holds the ISO date; columns 0 and 1 are the ones printed.
        current = datetime.date.fromisoformat(line[6])
        if line[0] != '' and line[1] != '' and current == cases_date:
            print(line[0], line[1])
1 Like
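
One possible way to do the summing, sketched with only the standard library: keep a dictionary keyed by continent and add each row’s number to it. The function name sum_by_continent and the sample lines are hypothetical, and the sketch assumes the first column is the continent and the second is a whole number, as described above.

```python
import csv
from collections import defaultdict

def sum_by_continent(lines):
    """Sum the second tab-separated column for each value of the first."""
    totals = defaultdict(int)
    for fields in csv.reader(lines, delimiter="\t"):
        if len(fields) < 2 or fields[0] == "" or fields[1] == "":
            continue  # skip rows with a missing continent or number
        totals[fields[0]] += int(fields[1])
    return dict(totals)

# Made-up mapper output, "continent<TAB>value" per line, for illustration only.
sample_lines = ["Asia\t100", "Europe\t250", "Asia\t50", "\t7", "Europe\t"]
totals = sum_by_continent(sample_lines)

# Print from highest to lowest total.
for continent, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(continent, total, sep="\t")
```

Since csv.reader accepts any iterable of lines, the same function should work on an open file, or on sys.stdin if the reducer is run as a streaming script.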

I’m confused. Are you still working with pandas or did you just implement the groupby function with python?

No Pandas for now. I just realized that to select the highest value in the list, since they’re total covid cases, I just needed to select the last date of the range of dates I’m working on.