Pass-by-value vs Pass-by-reference in Java and C++

In this article I illustrate what it means to pass by value as opposed to pass by reference, with a focus on Java versus C++.

The question often asked is this: Is Java pass-by-reference?
A common but erroneous answer is: Java is pass-by-reference for objects and pass-by-value for primitives.
This is WRONG. To illustrate why this is so, let me refer you to this quote by the father
of Java himself, James Gosling:

Some people will say incorrectly that objects are passed “by reference.” In programming language design, the term pass by reference properly means that when an argument is passed to a function, the invoked function gets a reference to the original value, not a copy of its value. If the function modifies its parameter, the value in the calling code will be changed because the argument and parameter use the same slot in memory…. The Java programming language does not pass objects by reference; it passes object references by value. Because two copies of the same reference refer to the same actual object, changes made through one reference variable are visible through the other. There is exactly one parameter passing mode — pass by value — and that helps keep things simple.

— James Gosling, et al., The Java Programming Language, 4th Edition

The above clearly states that Java passes object references by value: when a reference is passed to a method, a copy of that reference (which is effectively an address) is passed. Since the copy and the original refer to the same object, a method that calls a modifying method on its parameter does modify the underlying object; hence the line “Because two copies of the same reference refer to the same actual object, changes made through one reference variable are visible through the other”.

I will now illustrate what pass-by-reference means via a clear example in C++.

Let us create the following files in an appropriate directory:

PassByReference.hpp:
#ifndef PassByReference_hpp
#define PassByReference_hpp

// Swaps its arguments; the parameters are references to the caller's variables.
void swapIntByRef(int& iParam, int& jParam);

// Attempts a swap; the parameters are copies, so the caller's variables are untouched.
void swapIntByVal(int iParam, int jParam);

#endif

PassByReference.cpp:
#include <iostream>
#include "PassByReference.hpp"
using namespace std;

int main()
{
    int i = 1000;
    int j = 2300;

    cout << "Illustration of Pass By Reference:\n";
    cout << "Before: i=" << i << " j=" << j << "\n";
    swapIntByRef(i, j);                  // passes references to i and j
    cout << "After: i=" << i << " j=" << j << "\n";

    cout << "\nIllustration of Pass By Value:\n";

    i = 1100;
    j = 2500;

    cout << "Before: i=" << i << " j=" << j << "\n";
    swapIntByVal(i, j);                  // passes copies of i and j
    cout << "After: i=" << i << " j=" << j << "\n";

    return 0;
}

// The parameters are references: iParam and jParam are aliases for the
// caller's variables, so the swap is visible in the caller.
void swapIntByRef(int& iParam, int& jParam)
{
    int temp(iParam);
    iParam = jParam;
    jParam = temp;
}

// The parameters are copies: the swap happens on local copies and the
// caller's variables are left unchanged.
void swapIntByVal(int iParam, int jParam)
{
    int temp(iParam);
    iParam = jParam;
    jParam = temp;
}

We now compile and run the code (assuming you have the g++ compiler):

g++ -o PassByReference PassByReference.cpp
./PassByReference
Illustration of Pass By Reference:
Before: i=1000 j=2300
After: i=2300 j=1000

Illustration of Pass By Value:
Before: i=1100 j=2500
After: i=1100 j=2500

The results above perfectly illustrate the difference between pass-by-reference and pass-by-value, at least from the C++ point of view.
Because the parameters of swapIntByRef are declared as references with the & operator, the function operates directly on the caller's actual parameters. When the function returns to the main() function that called it and the values of i and j are printed out, the values of i and j have been swapped.

In the latter case of pass-by-value, copies of i and j are passed, not references declared via the & operator.
As a result, even though the swapIntByVal function attempts to swap the values, the original actual parameters remain unchanged, and this is what we see in the output.

The latter case is what prevails in Java in all cases, even for objects.

Here is an illustration in Java for both primitives and object references:

Create the file PassByValueDemo.java:

public class PassByValueDemo {

    public static void main(String[] args) {
        int i = 1000;
        int j = 2300;
        System.out.println("Primitives Case");
        System.out.println("----------------");
        System.out.println(" Before: i=" + i + " j=" + j);
        swapInt(i, j);
        System.out.println(" After: i=" + i + " j=" + j + "\n");

        System.out.println("Wrapper Case");
        System.out.println("--------------");
        Integer iw = 1000;
        Integer jw = 2300;
        System.out.println(" Before: iw=" + iw + " jw=" + jw);
        swapInteger(iw, jw);
        System.out.println(" After: iw=" + iw + " jw=" + jw);
    }

    // The parameters are copies of the caller's ints; swapping them
    // has no effect on the caller's variables.
    static void swapInt(int iParam, int jParam) {
        int temp = jParam;
        jParam = iParam;
        iParam = temp;
        System.out.println(" iParam=" + iParam + " jParam=" + jParam);
    }

    // The parameters are copies of the caller's references; reassigning
    // them rebinds the local copies only.
    static void swapInteger(Integer iParam, Integer jParam) {
        Integer temp = jParam;
        jParam = iParam;
        iParam = temp;
        System.out.println(" iParam=" + iParam + " jParam=" + jParam);
    }
}

We now compile and run the code:

javac PassByValueDemo.java

java PassByValueDemo

which produces:

Primitives Case
----------------
Before: i=1000 j=2300
iParam=2300 jParam=1000
After: i=1000 j=2300

Wrapper Case
--------------
Before: iw=1000 jw=2300
iParam=2300 jParam=1000
After: iw=1000 jw=2300

Thus we can see that in both the primitive and the wrapper case, the values of the actual parameters remain unchanged in the calling routine main. There is no way in Java to achieve the effect we observed in the C++ function swapIntByRef, where the original actual parameters are changed. The underlying object that a reference refers to can be changed via a call to a modifying method on that object, but the reference parameter itself is always a copy of the original actual parameter.
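To make that last point concrete, here is a minimal sketch (the class and method names are mine, purely illustrative): swapping the contents of an array inside a method works, because the parameter is a copy of the caller's reference to the one array object, while reassigning the parameter itself is invisible to the caller.

public class MutateVsReassign {

    public static void main(String[] args) {
        int[] pair = {1000, 2300};
        swapContents(pair);
        // Prints 2300 1000: the single array object was mutated
        // through a copy of the reference.
        System.out.println(pair[0] + " " + pair[1]);

        reassign(pair);
        // Still prints 2300 1000: rebinding the parameter inside
        // reassign() had no effect on the caller's reference.
        System.out.println(pair[0] + " " + pair[1]);
    }

    // Mutates the object that the (copied) reference points to.
    static void swapContents(int[] a) {
        int temp = a[0];
        a[0] = a[1];
        a[1] = temp;
    }

    // Rebinds only the local copy of the reference.
    static void reassign(int[] a) {
        a = new int[] { -1, -1 };
    }
}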

Summary

  • C++ supports both pass-by-value and pass-by-reference, the latter via reference parameters declared with &.

  • Java supports pass-by-value ONLY. What is erroneously thought of as pass-by-reference is really pass-by-value of an object reference.

Future of 21st Century Databases meetup

Some takeaways from the Future of 21st Century Databases meetup hosted by AppNexus in NYC, featuring a roundtable of NoSQL database heavyweights:

  • Eliot Horowitz, CTO and Co-Founder, 10gen / MongoDB
  • Barry Morris, Founder and CEO, NuoDB
  • Bob Wiederhold, President and CEO, Couchbase

NuoDB

  • SQL interface, but NoSQL underneath
  • Doesn’t have a document model as yet
  • The “how to elastically scale SQL” problem has been solved
  • NuoDB isn’t open source yet
  • Largest known installation: 400 nodes w/ sharding

Couchbase

  • Many travel sites use Couchbase – e.g. Orbitz migrated its cache from Oracle Coherence to Couchbase. Full presentation here.
  • Couchbase doesn’t consider NuoDB a competitor, since NuoDB isn’t really a document database
  • Couchbase DevDays – ways to build up skills
  • Largest known installation: 80 nodes w/ no application sharding
  • SQL databases will still be around – but the growth area is NoSQL

MongoDB

  • Mongo has more document features than Couchbase
  • Largest known installation: 100 nodes w/ no application sharding

Installing Hortonworks Hadoop on Amazon EC2

After viewing this YouTube video and following the article How to Hadoop, I was able to successfully install Hortonworks Hadoop on a 4-node cluster on Amazon EC2. However, I needed to make the following additions to make it work properly:

1. Make sure that you install postgresql-8.4, not postgresql-9.1

If you did yum install postgresql, it may have installed postgresql-9.1.
If that is the case, then you need to erase it first:

yum erase postgresql

Download and install the 8.4 repository RPM (installing the repo RPM is needed before yum can find the postgresql84 packages), then install the packages:

curl -O http://yum.postgresql.org/8.4/redhat/rhel-5-x86_64/pgdg-centos-8.4-3.noarch.rpm
rpm -ivh pgdg-centos-8.4-3.noarch.rpm

yum install postgresql84
yum install postgresql84-server

2. Edit the file /usr/sbin/ambari-server.py as follows:

Change

PG_HBA_DIR = "/var/lib/pgsql/data/"

to

PG_HBA_DIR = "/var/lib/pgsql/8.4/data/"

3. Make sure that the bindir for postgresql is added to the PATH

export PATH=$PATH:/usr/pgsql-8.4/bin

4. Make sure that the “ambari-server” PostgreSQL user is created, and adjust authentication as shown below

Otherwise, you will obtain the following error:
...
internal exception: org.postgresql.util.PSQLException: FATAL: Ident authentication failed for user "ambari-server"

Also, edit the /var/lib/pgsql/8.4/data/pg_hba.conf file
and add the following line:

host    all         all         127.0.0.1/32          md5

Solution referenced from: http://stackoverflow.com/questions/4562471/connecting-to-local-instance-of-postgresql-with-jdbc
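Note that PostgreSQL only re-reads pg_hba.conf on a reload or restart. With the PGDG 8.4 packages the init script is typically named postgresql-8.4 (an assumption on my part; check /etc/init.d for the exact name on your system), so a restart would look like:

service postgresql-8.4 restart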

MapReduce and Hadoop – an Introduction

The Data Problem

The advent of the Web 2.0 age has enabled the collection of massive amounts of data from web logs, GPS tracking data, sensor data, Twitter feeds, mobile phone logs, etc.

The nature of the data problem we face is not the collection of data, but rather how to process such huge amounts of it (in petabytes), often on a daily basis, so as to obtain meaningful intelligence on which to base decisions. The data processing problem has coincided with the gradual erosion of Moore’s law, in which the limits of single-processor performance have gradually been reached while the cost of hardware has kept falling.

Enter MapReduce

MapReduce is a programming paradigm used to process large datasets by parallelizing and distributing the work among independent processes or tasks. It is derived from a functional programming model. MapReduce was proposed as an algorithm to help address the data processing problem by making use of the latest trends in hardware computing: cheap hardware and distributed computing.
The basic idea is to have cheap commodity machines arranged in clusters running independent tasks among which large data sets could be distributed and processed.

MapReduce provides the following advantages:

  • Clean abstraction for developers
  • Fault tolerance and recovery
  • Monitoring and status updates
  • Automatic parallelization and distribution
  • Input/output scheduling

Basic Idea

It consists of 2 main functions:

Map([key,value])
A list of key-value pairs is provided to the map function, which applies a function f to each element of the list and returns a list of intermediate key-value pairs.

Reduce([key,value])
Reduce receives as input the list of intermediate key-value pairs produced by map and combines these together to produce the final output. All values corresponding to the same key will be processed at the same reducer node.
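To make the two functions concrete, here is a minimal word-count sketch in plain Java (my own toy code, not any framework’s API) that mimics the three stages: a map step that emits intermediate (word, 1) pairs, a shuffle step that groups the pairs by key, and a reduce step that sums the values for each key:

import java.util.*;

public class MapReduceSketch {

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "the lazy dog");

        // Map phase: emit an intermediate (word, 1) pair for each word.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle phase: group all values belonging to the same key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce phase: combine the values for each key into the final output.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) sum += v;
            System.out.println(entry.getKey() + "\t" + sum);
        }
    }
}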

Applications of MapReduce

  • Inverted Index Construction
  • Document Clustering
  • Web link-graph traversal
  • Statistical Machine Translation
  • Machine Learning

Hadoop

Hadoop provides an open-source implementation of MapReduce.
It is a highly scalable distributed computing framework that enables distributed processing of huge data sets across a cluster of machines via MapReduce.
It provides failure detection and failover at the application layer, as well as high availability.
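For comparison with the toy sketch above, here is the same word count written against the Hadoop MapReduce Java API. This follows the well-known WordCount example from the Hadoop tutorials; it assumes the Hadoop client libraries are on the classpath and that input and output paths are supplied as command-line arguments:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit an intermediate (word, 1) pair for every token in the line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: all counts for the same word arrive at the same reducer,
    // which sums them into the final total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that the reducer doubles as a combiner, which pre-aggregates counts on the map side and cuts down the amount of data shuffled to the reducers.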

Hadoop Environment Setup

The key components of a Hadoop setup are as follows:

  • DataNode – provides data storage services for the shared file system
  • NameNode – manages the metadata of the cluster and the DataNodes
  • Secondary NameNode – performs periodic checkpoints of the NameNode metadata (commonly, though inaccurately, described as a backup for the NameNode)
  • JobTracker – manages jobs and resources in the cluster
  • TaskTracker – manages tasks on an individual node
  • Balancer – rebalances data across the cluster

Data Science Reading List

Based on a discussion on the LinkedIn Data Science group and my own experience, here are some articles and books that should be included in a good Data Science reading list for programmers and developers, many of them available on O’Reilly’s Safari website:

Introduction

Big Data and Hadoop

Data Mining, Analysis and Statistics

Machine Learning

Data Visualization

Data Science, Big Data and MapReduce

To start one’s journey into the world of Data Science and Analytics, it behooves one to gain an understanding of the most touted terms in this field. The paper Data Science and Distributed Intelligence provides a good explanation of the often-used terms Big Data, Data Science and MapReduce. Here is my synopsis of these terms, based on a reading of this paper:

Big Data refers to the proliferation of large amounts of data, both at rest in huge databases and in the very-high-rate streams that produce it. Examples are: web server logs due to user clicks, cell phone logs, Twitter feeds, Facebook comments, etc. An interesting fact is that 90% of all Big Data was produced in the last 2 years alone.

Data Science is a term that describes the process of gathering data, analyzing it and extracting information from it to produce a “data product”. Such data may involve large unstructured data sets, hence the term “Data Science” is often used in conjunction with Big Data. Since the processing techniques for Big Data often involve the functional programming primitives map and reduce, this has given rise to the MapReduce algorithm, popularized by Google.

MapReduce is an algorithm that helps mitigate the problem of processing Big Data and large data sets by breaking the processing down into smaller, more manageable chunks.
Hadoop is the popular open-source implementation of MapReduce.