Sandbox Hortonworks

http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/

Chargement de données dans HDFS

données compressées

https://s3.amazonaws.com/hw-sandbox/tutorial1/NYSE-2000-2001.tsv.gz

Select the ‘Upload’ button
Select ‘Files’ and a pop-up window will appear.
Click the button which says, ‘Upload a file’.
Locate the example data file you downloaded and select it.
A progress meter will appear. The upload may take a few moments.

Visualisation dans HCatalog

Définition du métamodèle (type et caractéristique de chaque colonne)

Requêtage dans Hive

Les bases HCatalog sont directement accessibles.

On voit la décomposition en jobs Map Reduce dans les logs

select count(*) from nyse2001

select * from nyse2001 where stock_symbol= "IBM"

Pig

Step 1: Create and name the script
Step 2: Loading the data

a = LOAD ‘default.nyse_stocks’ USING org.apache.hive.hcatalog.pig.HCatLoader();

Step 3: Select all records starting with IBM

b = filter a by stock_symbol == 'IBM';
c = group b all;

Step 4: iterate and average

d = FOREACH c GENERATE AVG(b.stock_volume);
DUMP d;

Step 5: save the script and execute it

paramètre de pig :

-useHCatalog (entrée)

Tutorial 2 PIG

http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/

batting = load 'Batting.csv' using PigStorage(',');
runs = FOREACH batting GENERATE $0 as playerID, $1 as year, $8 as runs;
grp_data = GROUP runs BY (year);
max_runs = FOREACH grp_data GENERATE group as grp,MAX(runs.runs) as max_runs;
join_max_run = JOIN max_runs by ($0, max_runs), runs by (year,runs);  
join_data = FOREACH join_max_run GENERATE $0 as year, $2 as playerID, $1 as runs;  
dump join_data;

Tutorial 3 Hive

http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/

create table temp_batting (col_value STRING);
LOAD DATA INPATH '/user/hue/Batting.csv' OVERWRITE INTO TABLE temp_batting;
create table batting (player_id STRING, year INT, runs INT);
insert overwrite table batting  
SELECT  
regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 1) player_id,  
regexp_extract(col_value, '^(?:([^,]*)\,?){2}', 1) year,  
regexp_extract(col_value, '^(?:([^,]*)\,?){9}', 1) run  
from temp_batting;

SELECT year, max(runs) FROM batting GROUP BY year;

SELECT a.year, a.player_id, a.runs from batting a  
JOIN (SELECT year, max(runs) runs FROM batting GROUP BY year ) b  
ON (a.year = b.year AND a.runs = b.runs) ;