Wednesday, September 14, 2016

Hadoop 2.7.3 on Windows 7


1. Download Hadoop 2.7.3 from Apache
2. Decompress to C:\env, thus HADOOP_HOME=C:\env\hadoop-2.7.3
3. Download the Windows-specific binaries (https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin) and copy them to %HADOOP_HOME%\bin
4. Set the environment variable HADOOP_HOME=C:\env\hadoop-2.7.3 and update PATH=%PATH%;%HADOOP_HOME%\bin
5. Update Hadoop conf (single node cluster setup instructions)
6. Execute %HADOOP_HOME%\sbin\start-dfs.cmd and %HADOOP_HOME%\sbin\start-yarn.cmd

Note: If you followed the instructions as is, the NameNode port is 9000.
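
Before running HDFS commands, it can help to confirm that something is actually listening on that port. The following is only a minimal sketch using plain Java sockets; the class name NameNodePortCheck and the 2-second timeout are my own choices, and the check says nothing about whether the listener is really a NameNode.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class NameNodePortCheck {
    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, or unresolved host all count as "not listening".
            return false;
        }
    }

    public static void main(String[] args) {
        // 9000 is the port from the note above; change it if your configuration differs.
        boolean up = isListening("localhost", 9000, 2000);
        System.out.println(up
                ? "Port 9000 is reachable"
                : "Nothing is listening on port 9000");
    }
}
```

If the check fails, the daemons started in step 6 are the first thing to look at.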

Now try a few HDFS commands, such as:

hdfs dfs -put test.txt hdfs://localhost:9000/sample.text
hdfs dfs -ls hdfs://localhost:9000/

That's it!

Wednesday, February 19, 2014

Friday, January 31, 2014

REST API Design Rulebook

Disclaimer: The following content is a modified index of REST API Design Rulebook.

Identifier Design with URIs


URI Format

  • Rule: Forward slash separator (/) must be used to indicate a hierarchical relationship
  • Rule: A trailing forward slash (/) should not be included in URIs
  • Rule: Hyphens (-) should be used to improve the readability of URIs
  • Rule: Underscores (_) should not be used in URIs
  • Rule: Lowercase letters should be preferred in URI paths
  • Rule: File extensions should not be included in URIs
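
Several of the URI format rules above are mechanical enough to sketch in code. The helper below is a hypothetical illustration (UriPathRules and normalize are names I made up, not something from the book): it lowercases a path, swaps underscores for hyphens, drops trailing slashes, and strips a file extension from the last segment.

```java
import java.util.Locale;

class UriPathRules {
    // Applies four of the URI format rules to a raw path string.
    static String normalize(String path) {
        String p = path.toLowerCase(Locale.ROOT)   // lowercase letters preferred
                       .replace('_', '-');         // hyphens instead of underscores

        // A trailing forward slash should not be included (keep a bare "/").
        while (p.length() > 1 && p.endsWith("/")) {
            p = p.substring(0, p.length() - 1);
        }

        // File extensions should not be included: strip ".ext" from the last segment.
        int lastSlash = p.lastIndexOf('/');
        int dot = p.lastIndexOf('.');
        if (dot > lastSlash + 1) {
            p = p.substring(0, dot);
        }
        return p;
    }

    public static void main(String[] args) {
        System.out.println(normalize("/Customer_Accounts/Recent_Orders.json/"));
    }
}
```

In a real API you would design the URIs this way from the start rather than rewrite them after the fact; the code only makes the rules concrete.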

URI Authority Design

  • Rule: Consistent subdomain names should be used for your APIs
  • Rule: Consistent subdomain names should be used for your client developer portal

URI Path Design

  • Rule: A singular noun should be used for document names
  • Rule: A plural noun should be used for collection names
  • Rule: A plural noun should be used for store names
  • Rule: A verb or verb phrase should be used for controller names
  • Rule: Variable path segments may be substituted with identity-based values
  • Rule: CRUD function names should not be used in URIs

URI Query Design

  • Rule: The query component of a URI may be used to filter collections or stores
  • Rule: The query component of a URI should be used to paginate collection or store results
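
As a rough sketch of the two query rules, the snippet below builds a collection URI with filters plus pagination. The parameter names pageSize and pageStartIndex, the role filter, and the CollectionQuery class are assumptions chosen for illustration; use whatever names your API defines.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class CollectionQuery {
    // Builds "<path>?<filters>&pageSize=..&pageStartIndex=.." (parameter names are assumed).
    static String build(String collectionPath, Map<String, String> filters,
                        int pageSize, int pageStartIndex) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> filter : filters.entrySet()) {
            query.add(encode(filter.getKey()) + "=" + encode(filter.getValue()));
        }
        query.add("pageSize=" + pageSize);
        query.add("pageStartIndex=" + pageStartIndex);
        return collectionPath + "?" + query;
    }

    private static String encode(String value) {
        // The Charset overload of URLEncoder.encode requires Java 10 or later.
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> filters = new LinkedHashMap<>();
        filters.put("role", "admin");
        System.out.println(build("/users", filters, 25, 50));
    }
}
```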


Interaction Design with HTTP

Request Methods

  • Rule: GET and POST must not be used to tunnel other request methods
  • Rule: GET must be used to retrieve a representation of a resource
  • Rule: HEAD should be used to retrieve response headers
  • Rule: PUT must be used to both insert and update a stored resource
  • Rule: PUT must be used to update mutable resources
  • Rule: POST must be used to create a new resource in a collection
  • Rule: POST must be used to execute controllers
  • Rule: DELETE must be used to remove a resource from its parent
  • Rule: OPTIONS should be used to retrieve metadata that describes a resource’s available interactions

Response Status Codes

  • Rule: 200 (“OK”) should be used to indicate nonspecific success
  • Rule: 200 (“OK”) must not be used to communicate errors in the response body
  • Rule: 201 (“Created”) must be used to indicate successful resource creation
  • Rule: 202 (“Accepted”) must be used to indicate successful start of an asynchronous action
  • Rule: 204 (“No Content”) should be used when the response body is intentionally empty
  • Rule: 301 (“Moved Permanently”) should be used to relocate resources
  • Rule: 302 (“Found”) should not be used
  • Rule: 303 (“See Other”) should be used to refer the client to a different URI
  • Rule: 304 (“Not Modified”) should be used to preserve bandwidth
  • Rule: 307 (“Temporary Redirect”) should be used to tell clients to resubmit the request to another URI
  • Rule: 400 (“Bad Request”) may be used to indicate nonspecific failure
  • Rule: 401 (“Unauthorized”) must be used when there is a problem with the client’s credentials
  • Rule: 403 (“Forbidden”) should be used to forbid access regardless of authorization state
  • Rule: 404 (“Not Found”) must be used when a client’s URI cannot be mapped to a resource
  • Rule: 405 (“Method Not Allowed”) must be used when the HTTP method is not supported
  • Rule: 406 (“Not Acceptable”) must be used when the requested media type cannot be served
  • Rule: 409 (“Conflict”) should be used to indicate a violation of resource state
  • Rule: 412 (“Precondition Failed”) should be used to support conditional operations
  • Rule: 415 (“Unsupported Media Type”) must be used when the media type of a request’s payload cannot be processed
  • Rule: 500 (“Internal Server Error”) should be used to indicate API malfunction

Metadata Design

HTTP Headers

  • Rule: Content-Type must be used
  • Rule: Content-Length should be used
  • Rule: Last-Modified should be used in responses
  • Rule: ETag should be used in responses
  • Rule: Stores must support conditional PUT requests
  • Rule: Location must be used to specify the URI of a newly created resource
  • Rule: Cache-Control, Expires, and Date response headers should be used to encourage caching
  • Rule: Cache-Control, Expires, and Pragma response headers may be used to discourage caching
  • Rule: Caching should be encouraged
  • Rule: Expiration caching headers should be used with 200 (“OK”) responses
  • Rule: Expiration caching headers may optionally be used with 3xx and 4xx responses
  • Rule: Custom HTTP headers must not be used to change the behavior of HTTP methods
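
The ETag and conditional PUT rules are easier to follow with a toy example. This is an in-memory sketch under my own assumptions, not the book's implementation: ConditionalPutStore is a made-up class, the ETag is a simple version counter, and only If-Match is handled. The status codes reflect one reading of the rules: 201 for an insert, 409 for an unconditional PUT to an existing resource, 412 for a stale ETag, and 200 for a successful update.

```java
import java.util.HashMap;
import java.util.Map;

class ConditionalPutStore {
    private final Map<String, String> bodies = new HashMap<>();
    private final Map<String, Integer> versions = new HashMap<>();

    // Current entity tag for a URI, or null if nothing is stored there.
    String etag(String uri) {
        Integer version = versions.get(uri);
        return version == null ? null : "\"" + version + "\"";
    }

    // Returns the status code for a PUT; ifMatch is the If-Match header value or null.
    int put(String uri, String body, String ifMatch) {
        String current = etag(uri);
        if (current == null) {
            if (ifMatch != null) {
                return 412; // client expected an existing representation
            }
            bodies.put(uri, body);
            versions.put(uri, 1);
            return 201;     // inserted a new resource
        }
        if (ifMatch == null) {
            return 409;     // unconditional PUT would overwrite existing state
        }
        if (!ifMatch.equals(current)) {
            return 412;     // client's copy is stale
        }
        bodies.put(uri, body);
        versions.put(uri, versions.get(uri) + 1);
        return 200;         // updated; the ETag changes with the new version
    }

    public static void main(String[] args) {
        ConditionalPutStore store = new ConditionalPutStore();
        System.out.println(store.put("/favorites/alonso", "v1", null));
        System.out.println(store.put("/favorites/alonso", "v2", store.etag("/favorites/alonso")));
    }
}
```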


Media Types

  • Rule: Application-specific media types should be used
  • Rule: Media type negotiation should be supported when multiple representations are available
  • Rule: Media type selection using a query parameter may be supported


Representation Design

Message Body Format

  • Rule: JSON should be supported for resource representation
  • Rule: JSON must be well-formed
  • Rule: XML and other formats may optionally be used for resource representation
  • Rule: Additional envelopes must not be created

Hypermedia Representation

  • Rule: A consistent form should be used to represent links
  • Rule: A consistent form should be used to represent link relations
  • Rule: A consistent form should be used to advertise links
  • Rule: A self link should be included in response message body representations
  • Rule: Minimize the number of advertised “entry point” API URIs
  • Rule: Links should be used to advertise a resource’s available actions in a state-sensitive manner

Media Type Representation

  • Rule: A consistent form should be used to represent media type formats
  • Rule: A consistent form should be used to represent media type schemas

Error Representation

  • Rule: A consistent form should be used to represent errors
  • Rule: A consistent form should be used to represent error responses
  • Rule: Consistent error types should be used for common error conditions


Client Concerns

Versioning

  • Rule: New URIs should be used to introduce new concepts
  • Rule: Schemas should be used to manage representational form versions
  • Rule: Entity tags should be used to manage representational state versions

Security

  • Rule: OAuth may be used to protect resources
  • Rule: API management solutions may be used to protect resources

Response Representation Composition

  • Rule: The query component of a URI should be used to support partial responses
  • Rule: The query component of a URI should be used to embed linked resources


JavaScript Clients

  • Rule: JSONP should be supported to provide multi-origin read access from JavaScript
  • Rule: CORS should be supported to provide multi-origin read/write access from JavaScript

Uniform Implementation

  • Principle: A REST API should be designed, not coded
  • Principle: Programmers and their organizations benefit from consistency

Thursday, January 16, 2014

A poor man's lesson plan to Data Science

Basic

Data Structure & Algorithms (Coursera Part I : Princeton University, Coursera Part II : Princeton University)
Analysis of Algorithms (Coursera Stanford University, Coursera Princeton University)
Algorithm Design (Slides: Princeton University)

Statistics

Statistics: Making Sense of Data (Coursera University of Toronto)
Probabilistic Graphical Models (Coursera Stanford University)

Data Mining/Data Science/Machine Learning

Introduction to Data Mining (Book site, Instructor Solution Manual)
Introduction to Data Science (Coursera University of Washington)
Core Concepts of Data Analysis (Coursera)
Machine Learning (Coursera Stanford University, Coursera University of Washington)
Web Intelligence and Big Data (Coursera)
Introduction to Recommender Systems (Coursera)

Technologies

Hadoop, Hive, R (http://bigdatauniversity.com/, http://www.statmethods.net/index.html)

Sunday, October 21, 2012

Embedded MySQL in Java With Connector/MXJ and 64-bit Linux

http://blog.palominolabs.com/2011/10/03/embedded-mysql-on-java-with-connectormxj-and-64-bit-linux/


This really helped me: I had to write some integration tests that use the MySQL import feature, so using another in-memory DB was not a good option.

Another tip: if you need to wait in an integration/unit test, you can use CountDownLatch.
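
Here is a minimal sketch of that tip. The class and method names (LatchWaitExample, runAndAwait) are just for illustration: the work runs on a background thread, and the caller blocks on a CountDownLatch with a timeout instead of sleeping for a fixed interval.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class LatchWaitExample {
    // Runs the task on a background thread and waits up to timeoutMillis for it to finish.
    // Returns true if the task completed in time, false on timeout or interruption.
    static boolean runAndAwait(Runnable task, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                task.run();
            } finally {
                done.countDown(); // release the waiter even if the task throws
            }
        });
        worker.setDaemon(true);   // do not keep the JVM alive for a stuck task
        worker.start();
        try {
            return done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        boolean finished = runAndAwait(() -> System.out.println("async work done"), 1000);
        System.out.println("finished in time: " + finished);
    }
}
```

In a test you would typically assert on the returned boolean, so a hung background task fails the test after the timeout instead of blocking the build.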

Thursday, October 4, 2012

ICONIX Process for OOAD

The ICONIX Process is an open, free-to-use object modeling process. It’s minimal, use case driven, and agile. The process focuses on the area that lies in between use cases and code. Its emphasis is on what needs to happen at that point in the life cycle where you’re starting out: you have a start on some use cases, and now you need to do good analysis and design.

See more at http://iconixprocess.com/iconix-process/


Wednesday, April 11, 2012

Guava Event Bus Example


package guava;

import java.io.File;

import org.apache.log4j.Logger;

import com.google.common.eventbus.EventBus;
import com.google.common.eventbus.Subscribe;

/**
 * Computes the total size of a file or directory tree by passing events
 * through a Guava EventBus instead of recursing directly.
 */
public class FileSizeApp {
    private static final Logger LOG = Logger.getLogger(FileSizeApp.class);

    private final EventBus eventBus = new EventBus("FileSizeEventBus");

    // Entries announced via ProcessFileEvent whose FileSizeEvent has not arrived yet.
    private long filesPending;
    private long totalSize;

    private final long start = System.nanoTime();

    private void process(File file) {
        // Register this object so its @Subscribe methods receive events.
        eventBus.register(this);

        eventBus.post(new ProcessFileEvent(file));
    }

    @Subscribe
    public void processFile(ProcessFileEvent e) {
        filesPending++;
        eventBus.post(new CalculateSizeEvent(e.getFile()));

        if (LOG.isDebugEnabled()) {
            LOG.debug(filesPending + ": " + e.getFile().getAbsolutePath());
        }
    }

    @Subscribe
    public void calculateSize(CalculateSizeEvent e) {
        long size = 0;
        File file = e.getFile();

        if (file.isFile()) {
            size = file.length();
        } else {
            // listFiles() returns null if the directory cannot be read.
            File[] children = file.listFiles();

            if (children != null) {
                for (File child : children) {
                    if (child.isFile()) {
                        size += child.length();
                    } else {
                        // Subdirectories are handled by their own round of events.
                        eventBus.post(new ProcessFileEvent(child));
                    }
                }
            }
        }

        eventBus.post(new FileSizeEvent(size));
    }

    @Subscribe
    public void fileSize(FileSizeEvent e) {
        totalSize += e.getSize();
        filesPending--;

        LOG.info(filesPending + ": " + e.getSize() + ", " + totalSize);

        // Nothing left to process: report the result and stop.
        if (filesPending == 0) {
            System.out.println("Total size: " + totalSize);
            System.out.println("Time taken (s): " + (System.nanoTime() - start) / 1.0e9);

            System.exit(0);
        }
    }

    private static class ProcessFileEvent {
        private final File file;

        public ProcessFileEvent(File file) {
            this.file = file;
        }

        public File getFile() {
            return file;
        }
    }

    private static class CalculateSizeEvent {
        private final File file;

        public CalculateSizeEvent(File file) {
            this.file = file;
        }

        public File getFile() {
            return file;
        }
    }

    private static class FileSizeEvent {
        private final long size;

        public FileSizeEvent(long size) {
            this.size = size;
        }

        public long getSize() {
            return size;
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: FileSizeApp <file-or-directory>");
            System.exit(1);
        }

        final String fileName = args[0];
        final FileSizeApp app = new FileSizeApp();

        System.out.println("Calculating file size for: " + fileName);
        app.process(new File(fileName));
    }
}