Amazon Interview Coding Challenge


My Amazon interview

Quite some time ago, I was contacted by Amazon for a position in the AWS DNS team. I was curious to hear more about it and agreed to enter their recruitment process. The 1st step was a coding challenge.

In this post I will describe the challenge and my answer.

The challenge

Here is the challenge I received through email:

From: <redacted@amazon.com>
To: <benoit.ganne@gmail.com>
Subject: Amazon Coding Challenge

Hi Benoit,

Thank you for taking the time to complete our question. We take this step to see an example of your software development skills, you are free to use any programming language and tools you feel comfortable with. You have 24 hours (starting from 15:00 Dublin Time) to reply with your answer but we expect this question should take no more than 2 to 3 hours. How soon you respond is not considered when we evaluate your submission (2 hours or 22 hours are considered equal). Please submit the best production-level code you can and confirm the reception of this message at your earliest convenience.

Please just attach the source files, archives (e.g. .zip, .rar) will be removed by our mail servers.

We will evaluate your answer on:

  *   Correctness, does it work?
  *   Is it tested?
  *   Is it well designed?
  *   Is the code easy to comprehend?
  *   Would this code be easy to extend or maintain?

Your implementation must be your own, making use of your standard library. Please feel free to consult any relevant documentation. Please provide a complete solution including any instructions an engineer would need to build and run your software, e.g. a README, Makefile, setup.py, build.xml, etc. You can use any programming language you see fit. The question is written in Python for simplicity.

Implement a class to generate statistics that provides the following methods:

  *   add(val) # Store a new floating point value with a timestamp. The class should store values for up to 60 seconds of age.
  *   get_all() # Return a time-sorted list of (timestamp, value)
  *   get_p70() # Return the 70 percentile of the stored values

The class should be memory and cpu-efficient (imagine running 50000 add() and 10000 get_p70() every hour). Explain the performance in Big-O notation for N add() and K get_p70(). Give some considerations about the scalability of the solution. Please provide appropriate unit-testing as you see fit.


Examples:

for x in (9, 3, 3, 4, 5, 4.9, 8, 3.3, 2, 0.1):
    stats.add(x)

print stats.get_p70()  # 5

for x in (9, 3, 3, 4, 5, 4.9, 8, 3.3, 2, 0.1):
    sleep(9)
    stats.add(x)

print stats.get_p70()  # 4.9


Kind regards,
<redacted>

The problem is pretty straightforward, but as usual the devil is in the details. I personally think that part of the job of a good software engineer is to ask for clarification when the specs are too vague. They always are anyway. So I asked for details on the points that looked important to me. It turns out they did not seem to care much (surprise surprise, it is a made-up problem anyway):

From: <redacted@amazon.com>
To: <benoit.ganne@gmail.com>
Subject: Amazon Coding Challenge

Great to see your interest here. Reply is inline

> Hi <redacted>,

> Thanks for the problem, I will work on it this evening, comfortably from
> home :) . I have some questions though, in order to be sure to have
> correctly understood the requirements:
> *   add(val) # Store a new floating point value with a timestamp. The
>     class should store values for up to 60 seconds of age.
> - I suppose timestamp should be stored as UNIX timestamp (seconds since
> the Epoch)
> - I suppose floating points should be stored as IEEE-754 double precision
> (as 'float' type in Python)
> - when speaking about values of 60s of age, I understand that the stored
> values strictly older than 60s should no longer be considered (and
> probably discarded / overwritten for efficiency reasons) for get_all()
> and get_p70().

It is up for you to decide which type of float to use. You may discard values older than
60 seconds when returning get_all() and get_p70()

> *   get_all() # Return a time-sorted list of (timestamp, value)
> In case of equal timestamps, should the value be considered as a
> secondary sorting key, should I preserve the insertion order, or is it
> irrelevant (arbitrary)?

It is better to preserve insertion order

> *   get_p70() # Return the 70 percentile of the stored values
> I understand from examples that the definition of percentile is of the
> nearest rank type: https://en.wikipedia.org/wiki/Percentile#Nearest_rank

This is one of possible ways to calculate the percentile. It is fine for you to implement percentile this way

> Could you confirm that my understanding is correct?
> Regards,

With all the information I needed, I started working on it once I got back home.
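Before getting into the language choice, here is a minimal sketch in Python of what the clarified spec asks for. This is my illustration only, not the solution I submitted: the Stats class name, the injectable now parameter (so the 60-second window can be tested without sleeping) and the exact percentile indexing are my own choices, picked so that both examples from the email come out right.

```python
import time
from collections import deque


class Stats:
    """Rolling 60-second window of (timestamp, value) samples.

    Sketch only: a deque keeps samples in insertion (= time) order, and
    samples strictly older than WINDOW seconds are evicted lazily on
    every call. For N add() calls the total eviction work is O(N)
    amortized (each sample enters and leaves the deque exactly once);
    each get_p70() costs O(W log W) for the W samples still in the
    window, so K calls cost O(K * W log W).
    """

    WINDOW = 60.0

    def __init__(self):
        self._samples = deque()  # (timestamp, value), oldest first

    def add(self, val, now=None):
        # `now` is an injectable clock for testing; defaults to wall time.
        now = time.time() if now is None else now
        self._evict(now)  # keeps memory bounded to the window
        self._samples.append((now, float(val)))  # O(1) amortized

    def get_all(self, now=None):
        now = time.time() if now is None else now
        self._evict(now)
        # Already time-sorted, and insertion order is preserved for ties.
        return list(self._samples)

    def get_p70(self, now=None):
        now = time.time() if now is None else now
        self._evict(now)
        if not self._samples:
            raise ValueError("no samples in the last 60 seconds")
        vals = sorted(v for _, v in self._samples)  # O(W log W)
        # 0-based index floor(0.7 * n), computed with exact integer
        # arithmetic; this indexing reproduces both examples from the email.
        return vals[70 * len(vals) // 100]

    def _evict(self, now):
        # Drop samples strictly older than WINDOW seconds: O(#evicted).
        while self._samples and now - self._samples[0][0] > self.WINDOW:
            self._samples.popleft()
```

Replaying the two examples from the email with a fake clock (adds spaced 9 seconds apart for the second one) yields 5 and 4.9 respectively. A production variant would probably drive the window from a monotonic clock (time.monotonic) and keep the wall-clock timestamp only for reporting.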

My answer to the challenge

The 1st choice I had to make was about the language I was going to use.

My languages of choice are C, C++ and Python. I am very proficient in C and Python; I have to confess I am a bit rustier in C++.

For coding-efficiency reasons, I first considered Python. I really like Python, but the subject highlighted that they wanted a complexity analysis of the different operations, and that is very hard to do in Python, because:

  • you need to know the implementation of all the data structures/objects you are using
  • the memory management is dynamic, making it hard to know when memory will be allocated/reclaimed
  • the memory allocator is the standard C memory allocator (malloc(3p)), whose complexity is implementation-dependent (and unpredictable much of the time)

All in all, to do a proper complexity analysis you would need to dig into the Python internals and replace the memory allocator, among other things. This is far from uninteresting, but it is not the main point of the challenge, and there were less than 24 hours remaining.

This leaves us with C or C++. C offers little in the way of data structures and algorithms, while C++ is extremely appealing to me (meta-programming! algorithm libraries! ...), even more so since C++11 (a standardized concurrent memory model *! lambdas! ...).

As I already pointed out, I use C++ much less than C or Python, but I decided to go with C++ anyway. After all, I was doing this in my spare time: I wanted it to be fun and interesting.

You can find my solution on GitHub. I am pretty happy with it, especially considering:

  • it was done in a few hours in my spare time
  • I was a bit rusty on C++, so I had to look up docs quite a bit

I think I designed an efficient implementation and a clean interface (objects, template for compile-time specialization, iterator, standard algorithm, etc.) with tests and documentation. Of course it can definitely be improved, especially the iterator interface.

Epilogue

After successfully passing the coding challenge, I had two phone interviews, but it stopped there.

I have to confess I had a lot of fun with the coding challenge. My only regret is that, despite asking for it, I never received any feedback about my solution.


*. OK, this one has also been imported into C11. But anyway, it was standardized by C++.

Questions? Remarks? Do not hesitate to contact me!